1.7. Extended Example: Text Parser State Machine¶

Figure 1.7.1: A UML state model for a simple parser

In this Extended Example, we will illustrate how to use a state machine as a parser. This is a very simplified version of how compilers read in and parse source code. Figure 1.7.1 shows the state model that we will be using to implement the parser. This parser recognizes only the strings "int" and "integer", rejecting everything else. Both strings are expected to end with a space, leading to the Accept final state.

To reduce the size of the model, the character-recognition events (CHAR_*) are fired when the corresponding character is read from the input in the correct order. For instance, parsing the string "int" would fire three events in sequence: CHAR_i, CHAR_nt, then CHAR_nt. On the other hand, parsing string "itn" would fire the events in sequence: CHAR_i then CHAR_other, because the 't' would not be in the correct order. A more complete model would use distinct events for each character, as well as separate states to represent the full ordering; that is, 'i' would lead to a one state, then 'n' would lead to a second, and 't' would to a third. However, such a model is more difficult to include in the space constraints of the page here.

First, we complete the generic FSM handler, which we described in Code Listing 1.1 and 1.2. The key difference here is that there are some additional fields added to the fsm_t declaration. As each character of the input is processed, it is added to a buffer that keeps track of the string in progress. Also, to assist the controller in determining which events are valid due to the ordering of the characters, there are next_valid and next_event fields. For instance, when the FSM first enters the Int_Or_Integer state, only the character 'i' has been processed. Consequently the next_valid character would be 'n' and the next_event would be CHAR_nt. However, once the 'n' is read, the next_valid would change to 't' accordingly. The final field is used by the controlling driver program to determine if the string was read completely.

#ifndef __statemodel_h__
#define __statemodel_h__

#include <stdbool.h>
#include <sys/types.h>

/* States and events should just be integers */
typedef int state_t;
typedef int event_t;

/* Dummy struct for circular typedefs */
struct fsm;

/* Function pointer type declaration for effects */
typedef void (*action_t) (struct fsm *);

#define BUFFER_LENGTH 10

/* Generic FSM struct that could have other fields, as well */
typedef struct fsm {
  state_t state;  /* current state */
  size_t nevents; /* number of events for this FSM */

  /* pointer to the FSM's transition function */
  state_t (*transition) (struct fsm *, event_t, action_t *, action_t *);

  /* helper fields to keep track of the current input and the
     string built so far */
  char input;
  char buffer[BUFFER_LENGTH];
  size_t length;

  bool final; /* set on accept or reject */

  /* special handling: since there are self-transitions that are
     valid a limited number of times, use these fields to keep
     track of the next event to fire based on expected input */
  char next_valid;
  event_t next_event;
} fsm_t;

/* Invoke an event handler for a given FSM */
void handle_event (fsm_t *fsm, event_t event);

#endif

Next, we extend Code Listing 1.2 so that it can also perform entry activities on state changes.

#include <stdio.h>
#include <stdlib.h>

#include "statemodel.h"

void
handle_event (fsm_t *fsm, event_t event)
{
  /* Confirm the event is valid for the given FSM */
  if (event >= fsm->nevents)
    return;

  /* Use the FSM's lookup tables; if next is -1, the event is not
     Defined for the current state */
  action_t effect = NULL;
  action_t entry = NULL;
  state_t next = fsm->transition (fsm, event, &effect, &entry);
  if (next == -1)
    return;

  /* Perform the effect (if defined) */
  if (effect != NULL)
    effect (fsm);

  /* Perform the new state's entry action (if defined) */
  if (entry != NULL)
    entry (fsm);

  /* Change the state */
  fsm->state = next;
}

The "parse.h" header file defines the interface to this specific FSM. To control the parser FSM, a driver program would need access to the event names (defined by the parse_event_t type) and the name of the initialization function.

#ifndef __parse_h__
#define __parse_h__

#include "statemodel.h"

/* Valid events for this type of FSM */
typedef enum {
  CHAR_i, CHAR_nt, CHAR_e, CHAR_ger, CHAR_other, SPACE, NIL
} parse_event_t;

fsm_t *parse_init (void);

#endif

The next piece of code extends Code Listing 1.3 with the information needed for the parser structure. As before, there’s an initialization function (parse_init()) that sets up the internals of the fsm_t. The append_character() function keeps track of how many times each particular state has been visited and what character is expected next. Again, a more verbose FSM would not need this information. Finally, the accept() and reject() functions implement the transition effects.

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include "statemodel.h"
#include "parser.h"

/* Internal type definition of states */
typedef enum { ENTER, IOI, IO, ACCEPT, FAIL, NST } parse_t;
#define NUM_STATES (NST+1)
#define NUM_EVENTS (NIL+1)

/* Prototypes for actions/effects and transition function */
static void append_character (fsm_t *);
static void accept (fsm_t *);
static void reject (fsm_t *);
static state_t parse_transition (fsm_t *, event_t, action_t *, action_t *);

/* Return an FSM that links to these internals */
fsm_t *
parse_init (void)
{
  fsm_t *fsm = calloc (1, sizeof (fsm_t));
  fsm->nevents = NUM_EVENTS;
  fsm->state = ENTER;
  fsm->transition = parse_transition;

  /* Set up internal fields for keeping track of characters */
  fsm->input = 0;
  memset (fsm->buffer, 0, sizeof (fsm->buffer));
  fsm->length = 0;

  fsm->next_valid = 'i';
  fsm->next_event = CHAR_i;
  fsm->final = false;

  return fsm;
}

/* Lookup table for transitions; row=state, column=event */
static parse_t const _transition[NUM_STATES][NUM_EVENTS] = {
  // C_i C_nt C_e  C_ger C_oth SPACE
  { IOI, NST, NST, NST,  FAIL, NST    }, // Enter
  { NST, IOI, IO,  NST,  FAIL, ACCEPT }, // Int_Or_Integer
  { NST, NST, NST, IO,   FAIL, ACCEPT }, // Integer_Only
  { NST, NST, NST, NST,  NST,  NST    }, // Accept
  { NST, NST, NST, NST,  NST,  NST    }, // Fail
};

/* Lookup table for effects; row=state, column=event */
static action_t const _effect[NUM_STATES][NUM_EVENTS] = {
  // C_i  C_nt  C_e   C_ger C_oth   SPACE
  { NULL, NULL, NULL, NULL, reject, NULL   }, // Enter
  { NULL, NULL, NULL, NULL, reject, accept }, // Int_Or_Integer
  { NULL, NULL, NULL, NULL, reject, accept }, // Integer_Only
  { NULL, NULL, NULL, NULL, NULL,   NULL   }, // Accept
  { NULL, NULL, NULL, NULL, NULL,   NULL   }, // Fail
};

/* Lookup table for state entry actions */
static action_t const _entry[NUM_STATES] = {
  NULL, append_character, append_character, NULL, NULL
};

/* Given FSM instance and event, perform the table lookups */
static state_t
parse_transition (fsm_t *fsm, event_t event, action_t *effect, action_t *entry)
{
  /* If the state/event combination is bad, return -1 */
  if (fsm->state >= NST || event >= NIL || _transition[fsm->state][event] == NST)
    return -1;

  /* Look up the effect and transitions in the tables */
  *effect = _effect[fsm->state][event];

  /* Look up the next state in the list of entry events */
  state_t next = _transition[fsm->state][event];
  if (next != NST)
    *entry = _entry[next];

  return next;
}

/* Appends the current character to the buffer */
static void
append_character (fsm_t *fsm)
{
  assert (fsm->length < BUFFER_LENGTH - 1);
  fsm->buffer[fsm->length++] = fsm->input;

  /* Helper inputs to keep track of what the next event will be */
  switch (fsm->input)
    {
    case 'i':
      fsm->next_valid = 'n';
      fsm->next_event = CHAR_nt;
      break;
    case 'n':
      fsm->next_valid = 't';
      fsm->next_event = CHAR_nt;
      break;
    case 't':
      fsm->next_valid = 'e';
      fsm->next_event = CHAR_e;
      break;
    case 'e':
      if (! strcmp (fsm->buffer, "inte"))
        fsm->next_valid = 'g';
      else
        fsm->next_valid = 'r';
      fsm->next_event = CHAR_ger;
      break;
    case 'g':
      fsm->next_valid = 'e';
      fsm->next_event = CHAR_ger;
      break;
    default:
      fsm->next_valid = '\0';
      fsm->next_event = NIL;
    }
}

/* Prints that the string is accepted */
static void
accept (fsm_t *fsm)
{
  printf ("ACCEPT: \"%s\"\n", fsm->buffer);
  fsm->final = true;
}

/* Prints that the string is rejected and at what character */
static void
reject (fsm_t *fsm)
{
  printf ("REJECT: \"%s\" at character '%c'\n", fsm->buffer, fsm->input);
  fsm->final = true;
}

The final piece of code for this parser is the main driver program. This program reads in the command-line arguments and attempts to parse each one as beginning with either "int" or "integer". For each provided input, this controller will use parse_init() to create a new FSM instance. The code in lines 37 – 48 start by checking for a space (indicating the first token of the input has been read) or for the expected next character (based on the information stored in the FSM). Based on these checks, line 52 fires the event and the pointer moves on to the next character. If all characters have been read, there are three possibilities: 1) a space was found and the match was a success 2) the match already failed, or 3) the input has matched a prefix of "int" or "integer", but there was no space to terminate the string. To resolve case 3, we just need to check the strings. [1]

#include <stdio.h>
#include <string.h>

#include "statemodel.h"
#include "parser.h"

int
main (int argc, char *argv[])
{
  /* Inputs to parse. Matches will be exactly "int", "integer",
     or begin with "int " or "integer ". */
  if (argc < 2)
    {
      fprintf (stderr, "Must pass at least one string argument\n");
      fprintf (stderr, "Sample command line:\n");
      fprintf (stderr, "  ./ext int char \"i n t\" integ\n");
      return 1;
    }

  for (int i = 1; i < argc; i++)
    {
      printf ("Processing input: \"%s\"\n", argv[i]);

      /* Create a new parser instance for each input */
      fsm_t *fsm = parse_init ();
      char *walker = argv[i];
      event_t evt = NIL;

      /* Traverse through each character in the input */
      while (*walker != '\0')
        {
          fsm->input = *walker;

          /* If the input is a space, check if the string
             match is successful. Otherwise, check if the
             next character matches what is expected. */
          if (fsm->input == ' ')
            {
              if (! strcmp (fsm->buffer, "int") || ! strcmp (fsm->buffer, "integer"))
                evt = SPACE;
              else
                evt = CHAR_other;
            }
          else if (fsm->input == fsm->next_valid)
            evt = fsm->next_event;
          else
            evt = CHAR_other;

          /* The event should never be NIL because of the
             logic above, but this style is for consistency. */
          if (evt != NIL)
            handle_event (fsm, evt);

          /* If a string is accepted or rejected, we are done. */
          if (fsm->final)
            break;

          /* Move to the next input character. */
          walker++;
        }

      /* At the end of the input. Either we have matched exactly
         "int" or "integer", or we have matched a substring (such
         as "inte"). Substrings should be rejected. Use '$' to
         indicate an END-OF-STRING character. */
      if (!fsm->final)
        {
          fsm->input = '$';
          if (! strcmp (fsm->buffer, "int") || ! strcmp (fsm->buffer, "integer"))
            handle_event (fsm, SPACE);
          else
            handle_event (fsm, CHAR_other);
        }
      printf ("\n");
    }

  return 0;
}

[1] One may object at this point and ask why we didn’t just compare the strings from the beginning and avoid all of the FSM complexity. The answer is that this example is one miniscule component of a much larger system: a compiler. The state models used for compilers are significantly larger and distinguish keywords (such as "int") from programmer-defined variable names (such as "interest_rate"). Simple string comparison is not a feasible approach to such complex work.