.. _StateModelImplementation:
.. raw:: html
.. |--| unicode:: U+2013 .. en dash
.. |---| unicode:: U+2014 .. em dash, trimming surrounding whitespace
:trim:
.. This file is part of the OpenCSF eTextbook project. It was
.. auto-generated by scripts from the OpenDSA eTextbook project.
.. See https://OpenCSF.org for more details. OpenCSF is distributed
.. under a Creative Commons Attribution-NonCommercial 4.0 International
.. License (see http://creativecommons.org/licenses/by-nc/4.0/),
.. Copyright (c) 2019-2021 by Michael S. Kirkpatrick. OpenDSA is
.. distributed under an MIT open source license, Copyright (c) 2012-2021
.. by the OpenDSA Project Contributors.
.. avmetadata::
:author: Michael S. Kirkpatrick
:requires:
:satisfies: State Model Extended Example
:topic: State Model Extended Example
Extended Example: Text Parser State Machine
===========================================
.. _StateParser:
.. figure:: Images/CSF-Images.1.12.png
:align: right
:width: 95%
:figwidth: 55%
:alt: A UML state model for a simple parser
A UML state model for a simple parser
In this Extended Example, we will illustrate how to use a state machine as a parser. This is a very
simplified version of how compilers read in and parse source code. :num:`Figure #StateParser`
shows the state model that we will be using to implement the parser. This parser recognizes only the
strings ``"int"`` and ``"integer"``, rejecting everything else. Both strings are expected to end
with a space, leading to the Accept final state.
To reduce the size of the model, the character-recognition events (``CHAR_*``) are fired when the
corresponding character is read from the input in the correct order. For instance, parsing the
string ``"int"`` would fire three events in sequence: ``CHAR_i``, ``CHAR_nt``, then ``CHAR_nt``. On
the other hand, parsing string ``"itn"`` would fire the events in sequence: ``CHAR_i`` then
``CHAR_other``, because the ``'t'`` would not be in the correct order. A more complete model would
use distinct events for each character, as well as separate states to represent the full ordering;
that is, ``'i'`` would lead to a one state, then ``'n'`` would lead to a second, and ``'t'`` would
to a third. However, such a model is more difficult to include in the space constraints of the page
here.
First, we complete the generic FSM handler, which we described in `Code Listing 1.1 <#cl1-1>`_ and `1.2 <#cl1-2>`_. The key
difference here is that there are some additional fields added to the ``fsm_t`` declaration. As each
character of the input is processed, it is added to a buffer that keeps track of the string in
progress. Also, to assist the controller in determining which events are valid due to the ordering
of the characters, there are ``next_valid`` and ``next_event`` fields. For instance, when the FSM
first enters the ``Int_Or_Integer`` state, only the character ``'i'`` has been processed.
Consequently the ``next_valid`` character would be ``'n'`` and the ``next_event`` would be
``CHAR_nt``. However, once the ``'n'`` is read, the ``next_valid`` would change to ``'t'``
accordingly. The ``final`` field is used by the controlling driver program to determine if the
string was read completely.
.. codeinclude:: IntroConcSys/ExtEx-1.1.c
:linenos: true
Next, we extend `Code Listing 1.2 <#cl1-2>`_ so that it can also perform entry activities on state changes.
.. codeinclude:: IntroConcSys/ExtEx-1.2.c
:linenos: true
The ``"parse.h"`` header file defines the interface to this specific FSM. To control the parser FSM,
a driver program would need access to the event names (defined by the ``parse_event_t`` type) and
the name of the initialization function.
.. codeinclude:: IntroConcSys/ExtEx-1.3.h
:linenos: true
The next piece of code extends `Code Listing 1.3 <#cl1-3>`_ with the information needed for the parser
structure. As before, there's an initialization function (``parse_init()``) that sets up the
internals of the ``fsm_t``. The ``append_character()`` function keeps track of how many times each
particular state has been visited and what character is expected next. Again, a more verbose FSM
would not need this information. Finally, the ``accept()`` and ``reject()`` functions implement the
transition effects.
.. codeinclude:: IntroConcSys/ExtEx-1.4.c
:linenos: true
The final piece of code for this parser is the main driver program. This program reads in the
command-line arguments and attempts to parse each one as beginning with either ``"int"`` or
``"integer"``. For each provided input, this controller will use ``parse_init()`` to create a new
FSM instance. The code in lines 37 – 48 start by checking for a space (indicating the first token
of the input has been read) or for the expected next character (based on the information stored in
the FSM). Based on these checks, line 52 fires the event and the pointer moves on to the next
character. If all characters have been read, there are three possibilities: 1) a space was found and
the match was a success 2) the match already failed, or 3) the input has matched a prefix of
``"int"`` or ``"integer"``, but there was no space to terminate the string. To resolve case 3, we
just need to check the strings. [#f6]_
.. codeinclude:: IntroConcSys/ExtEx-1.5.c
:linenos: true
.. [#f6] One may object at this point and ask why we didn't just compare the strings from the
beginning and avoid all of the FSM complexity. The answer is that this example is one miniscule
component of a much larger system: a compiler. The state models used for compilers are
significantly larger and distinguish keywords (such as ``"int"``) from programmer-defined variable
names (such as ``"interest_rate"``). Simple string comparison is not a feasible approach to such
complex work.