2015-01-27 Class 5 Lexing (3.1-3.5.5 calc project - due this Friday (Feb 30) - submit via Canvas scanning vs. lexing - their definition of "scanning" is basically "cleaning" - this distinction is unnecessary token categories - keyword - symbol - operator - punctuation mark - identifier - literals interesting Fortran issue (inset on pg. 113) error recovery - methods: deletion, insertion, replacement, transposition - often too complicated and expensive - most compilers just fail with a formatting error buffering - buffer pairs - sentinels - largely irrelevant language operations (Fig. 3.6) regular expressions - inductive definition - base cases: empty string (ε) single symbol (a ∈ Σ) - recursive cases: alternation (a|b) contatenation (ab) quantification (a*) - precedence (high to low): *, concat, | - extensions positive closure (a+) optional (a?) character classes ([a-z]) finite automata (state machines) - represented using transition diagrams - examples: Figs. 3.13-3.17) flex example - source on server discussion of decaf project - "Lexical Considerations" section in document - difficulty: ANTLR does not differentiate between lexer and parser - similar to issue of compiling lexer separately w/ lex & yacc - probable solution - final lexer+parser due Feb 27 - in-progress due Feb 13 (must see some lexing progress) - Canvas discussions per language next time: finite automata - DFA definition: (Σ, Q, q_0, F, δ) - NFA definition: DFA w/ multiple paths and ε-transitions - regex -> NFA conversion algorithm - NFA -> DFA conversion algorithm - DFA minimization algorithm - automata in lexical analysis