2015-01-05 Parsing (Ch. 4) three parser categories: - bottom-up - top-down - universal error recovery - goals: - early detection of errors - informative error messages - types of errors: - lexical errors - syntactic errors - semantic errors - logical errors context-free grammar: {V,Σ,R,S} - V: set of non-terminals "variables" - Σ: set of terminals "tokens" - R: relation: V -> (V ∪ Σ)* "productions" - S: s ∈ V "start symbol" derivations and parse trees - start with S - repeatedly "expand" S using productions from R - the "=>" operation in the book - each expansion creates a new node in the parse tree - leftmost vs. rightmost derivations - ambiguity: two or more parse trees for the same sentence regular languages vs. context-free languages - canonical example: verifying matched parentheses - "finite automata cannot count" - regular languages preferred for lexical analysis - modularization - simplicity and readability - efficiency grammar modifications - eliminating ambiguity no general procedure - eliminating immediate left recursion A -> Aα | β becomes A -> βA' A' -> αA' | ε - eliminating general left recursion (Alg. 4.19) (ex. 4.20) - left factoring (ex. 4.22) A -> αβ_1 | αβ_2 becomes A -> αA' A' -> β_1 | β_2 non-context-free language features: - variables must be declared before use - number of formal and actual parameters must match top-down parsing: LL(k) - recursive descent routines (Fig. 4.13) - processes input left-to-right - builds leftmost derivation - uses k lookahead tokens - FIRST and FOLLOW sets - predictive parsing table - implicit vs. explicit stack FIRST(α): set of terminals that begin strings derived from α - defined for terminals and non-terminals - rule 1: α is a terminal a FIRST(a) = {a} - rule 2: α is a non-terminal X examine all productions X -> Y_1 Y_2 ... Y_k add FIRST(Y_i) if Y_1 ... Y_j => ε, where j = i-1 FIRST(X) is union of all of the above - rule 3: α is a non-terminal X and X -> ε FIRST(X) includes ε FOLLOW(A): set of terminals that can appear to the right of A in a sentential form - only defined for non-terminals - rule 1: FOLLOW(S) includes $ - rule 2: for every production A -> αBβ FOLLOW(B) includes everything in FIRST(β) except ε - rule 3: if A -> αB or (A -> αBβ and FIRST(β) contains ε) FOLLOW(B) includes everything in FOLLOW(A) predictive parsing for LL(1) grammars - can handle most standard PL constructs - requires a carefully-constructed grammar - no left recursion or ambiguity - must be deterministic with one symbol of lookahead LL(1) conditions: if A -> α | β, then - α and β cannot derive strings starting with the same terminal - only one of α and β can derive ε - if α => ε then β cannot derive a string starting with a terminal from FOLLOW(A) - if β => ε then α cannot derive a string starting with a terminal from FOLLOW(A) - known as the "pairwise disjoint" test non-recursive predictive parsing - use a stack to track the parsing status (Fig. 4.19) def parse (str): st = Stack() st.push($) st.push(S) while !st.is_empty(): x = st.pop() a = str[0] if x == a: str = str[1:] elsif is_terminal(x): error() elsif M[x,a] == null: error() else M[x,a] == (x -> syms): for s in syms.reverse: push(s)