2015-01-05
Parsing (Ch. 4)

three parser categories:
  - bottom-up
  - top-down
  - universal

error recovery
  - goals:
    - early detection of errors
    - informative error messages
  - types of errors:
    - lexical errors
    - syntactic errors
    - semantic errors
    - logical errors

context-free grammar: {V,Σ,R,S}
  - V: set of non-terminals         "variables"
  - Σ: set of terminals             "tokens"
  - R: relation: V -> (V ∪ Σ)*      "productions"
  - S: s ∈ V                        "start symbol"

derivations and parse trees
  - start with S
  - repeatedly "expand" S using productions from R
    - the "=>" operation in the book
  - each expansion creates a new node in the parse tree
  - leftmost vs. rightmost derivations
  - ambiguity: two or more parse trees for the same sentence

regular languages vs. context-free languages
  - canonical example: verifying matched parentheses
  - "finite automata cannot count"
  - regular languages preferred for lexical analysis
    - modularization
    - simplicity and readability
    - efficiency

grammar modifications
  - eliminating ambiguity
     no general procedure

  - eliminating immediate left recursion
     A  -> Aα | β
       becomes
     A  -> βA'
     A' -> αA' | ε

  - eliminating general left recursion (Alg. 4.19) (ex. 4.20)

  - left factoring (ex. 4.22)
     A  -> αβ_1 | αβ_2
       becomes
     A  -> αA'
     A' -> β_1 | β_2

non-context-free language features:
  - variables must be declared before use
  - number of formal and actual parameters must match

top-down parsing: LL(k)
  - recursive descent routines (Fig. 4.13)
  - processes input left-to-right
  - builds leftmost derivation
  - uses k lookahead tokens
  - FIRST and FOLLOW sets
  - predictive parsing table
  - implicit vs. explicit stack

FIRST(α):  set of terminals that begin strings derived from α
  - defined for terminals and non-terminals

  - rule 1: α is a terminal a
      FIRST(a) = {a}

  - rule 2: α is a non-terminal X
      examine all productions X -> Y_1 Y_2 ... Y_k
        add FIRST(Y_i) if Y_1 ... Y_j => ε, where j = i-1
      FIRST(X) is union of all of the above

  - rule 3: α is a non-terminal X and X -> ε
      FIRST(X) includes ε

FOLLOW(A): set of terminals that can appear to the right of A
           in a sentential form
  - only defined for non-terminals

  - rule 1: FOLLOW(S) includes $

  - rule 2: for every production A -> αBβ
      FOLLOW(B) includes everything in FIRST(β) except ε

  - rule 3: if A -> αB or (A -> αBβ and FIRST(β) contains ε)
      FOLLOW(B) includes everything in FOLLOW(A)

predictive parsing for LL(1) grammars
  - can handle most standard PL constructs
  - requires a carefully-constructed grammar
    - no left recursion or ambiguity
    - must be deterministic with one symbol of lookahead

LL(1) conditions: if A -> α | β, then
  - α and β cannot derive strings starting with the same terminal
  - only one of α and β can derive ε
  - if α => ε then β cannot derive a string starting with a terminal from FOLLOW(A)
  - if β => ε then α cannot derive a string starting with a terminal from FOLLOW(A)
  - known as the "pairwise disjoint" test

non-recursive predictive parsing
  - use a stack to track the parsing status (Fig. 4.19)

    def parse (str):
      st = Stack()
      st.push($)
      st.push(S)
      while !st.is_empty():
        x = st.pop()
        a = str[0]
        if x == a:             str = str[1:]
        elsif is_terminal(x):  error()
        elsif M[x,a] == null:  error() 
        else  M[x,a] == (x -> syms):
          for s in syms.reverse: push(s)