2015-01-29
Class 6
Finite Automata (3.6-3.9)

finite automata
  - state machines
  - definition: (Σ, Q, q_0, F, δ)

    Σ    alphabet (set of input symbols)
    Q    states
    q_0  initial (start) state
    F    set of final states
    δ    transition function: (Q, Σ) -> Q

  - transition diagrams (Fig. 3.24)
  - transition tables   (Fig. 3.25)

  - deterministic vs. non-deterministic
    - NFA definition: DFA w/ multiple paths and ε-transitions
        δ: (Q, (Σ ∪ {ε})) -> [Q]
    - ε-closure(s): all states reachable from s via ε-transitions
        see Fig. 3.33

  - purpose: "recognize" sentences in regular languages

    DFA algorithm:
      s := q_0
      for each input c:
        s := δ(s,c)
      return s ∈ F

    NFA algorithm:
      S := ε-closure(q_0)
      for each input c:
        T := {}
        for each s in S:
          T := T ∪ ε-closure(δ(s,c))
        S = T
      return |S ∩ F| > 0
      
equivalence of regexes, NFAs, and DFAs
  - transitions: RE -> NFA -> DFA -> RE

  - RE -> NFA:  inductive construction (Alg. 3.23)
      Fig. 3.40, 3.41, 3.42
      (add new states and ε-transitions)
      constant size increase at each level

  - NFA -> DFA: subset construction (Alg. 3.20)
      Fig. 3.32, 3.33
      (simulate all possible NFA executions)
      possible exponential increase in # of states
        power set of NFA states: 2^|Q|

  - DFA -> RE:  informal concept
      (gradually remove states & label transitions with REs)

efficiencies
  - Fig. 3.48
  - NFAs: quicker to build, slower to accept
      constant expense as regex grows
      simulation must track list of possible states
      (better for one-off applications like grep)
  - DFAs: slower to build, faster to accept
      potentially expensive conversion phase
      only one possible state at any given time
      (better for lexical analysis)

generating lexical analysis routines
  - combine NFAs for all individual patterns
    - add new start state with ε-transitions to start of each NFA
  - heuristics for resolving conflicts
    - choose longest prefix that matches some pattern
    - choose pattern that appears first in the specification
  - approach 1: use NFA directly
    - scan until next state set (T) is empty
    - look backwards through path to find a state set with a final state
  - approach 2: convert to DFA first
    - scan until next state is a dead state
    - look backwards through path to find a final state
  - handling lookahead constructs
    - scan past the pattern until the lookahead context is matched
    - backtrack and only accept pattern

optimizations
  - RE -> DFA direct conversion (Alg. 3.36 & Fig. 3.62)
    - parse RE into syntax tree
    - calculate functions for leaf nodess
      - nullable, firstpos, lastpos, followpos
    - use these functions to build DFA
      - start with state representing firstpos(n_0)
      - add transitions on each input in Σ (and destination states if necessary)
      - new states represent unions of followpos(p)
        - for each p represented in the current state that represents a RE
          syntax node with the desired input

  - DFA minimization
    - basic idea: collapse states that have identical behaviors
    - create initial partitions containing non-final and final states
    - repeatedly split partitions if there is an input that "distinguishes"
          between states in the partition

  - DFA transition table compaction
    - eliminate many duplicate values ("default" table)
    - de-duplicate entries for similar states ("next"/"check" tables)