2015-01-15 Class 2 Intro (Ch. 1)

administrivia
- textbook (& errata)
- access to server
- Python: ANTLR, LEPL, yapps

compilers are *translators*
- source -> target
- basically a really complicated function
- target can be a high-level language, low-level/assembly language, or machine code

compiler topics widely relevant in PL, SE, arch, theory, security
- Mike Hicks @ UMD - LockSmith, safe C, language-enforced security policies
- Jeff Foster @ UMD - CQual, type systems/qualifiers, proof generators
- Nate Rosenblum @ UWisc - compiler provenance, forensics, PASTE'10 paper
- LLVM (generalized compiler framework)

compiled vs. interpreted languages
- everything is a translator! it's functions all the way down...

compilers are a sequence of functional phases
- front end (analysis) vs. back end (synthesis)
  - allows for more interoperability
- symbol table and other orthogonal info used by many phases
- common phases: (fig. 1.7)
  - lexical analysis (scanning)
  - syntax analysis (parsing)
  - semantic analysis (checking)
  - code generator (intermediate or target)
  - optimizer (machine dependent or independent)

our Decaf compiler
- source file -> scanner -> token stream
- token stream -> parser -> abstract syntax tree (AST)
- AST -> analyzer -> clean AST
- clean AST -> code gen -> 3-address code
- 3-addr code -> optimizer -> x86 assembly
- x86 assembly -> [[GNU as]] -> x86 executable

compiler construction tools
- we will use lexer and parser generators

models / data structures in compilers
- regular expressions
- finite automata (state machines)
- context-free languages
- push-down automata (parsers)
- trees
- graphs, matrices, linear programs

generating "optimal" code is undecidable
- use decisions and heuristics to pick which parts are worth tackling
- basically an optimization problem

key tension: abstraction vs. speed
- more abstraction: higher read/writeability, cleaner complexity
- less abstraction: closer to hardware, faster code
- compilers must balance this tension

high-performance computing
- explicit instruction-level parallelism (SIMD; e.g., SSE)
- implicit instruction-level parallelism (hidden in hardware)
- explicit program-level parallelism (OpenMP, threads, MPI, etc.)
- implicit program-level parallelism (HARD!)
- register allocation
- memory layout and caching (interacts with all levels)

RISC vs. CISC
- x86 won (at least for now)

compiler error detection
- type checking
- bounds checking
- memory management
- security properties
- sound (reported errors are real - no false positives)
- complete (finds all real errors - no false negatives)
- no "real" compiler is either

PL history
- 1st gen: machine codes, macro instructions
- 2nd gen: assembly languages
- 3rd gen: "low-level" and "high-level" languages
- 4th gen: domain-specific languages

PL categories
- imperative / procedural
- functional / declarative
- object-oriented
- scripting
- domain-specific

PL basics
- static (compile time) vs. dynamic (run time)
- scope: static (lexical) vs. dynamic (runtime or stack-based)
  - also public vs. private vs. protected
  - dynamic scope is also used in OO programming
- environment (name -> location)
  - usually dynamic (static bindings are usually called "globals")
- state (location -> value)
  - usually dynamic (static bindings are usually called "constants")
- name vs. identifier (string of characters) vs. variable (location)
- subprograms:
  - procedure (inlined code) vs. function (input -> output) vs. method (OO)
  - first-class functions, lambdas, and closures
  - blocks (group of declarations and statements)
  - declaration (types and signatures) vs. definition (values and code)
- parameter passing
  - formal (code/definition) vs. actual (passed during call)
  - call-by-value vs. call-by-reference vs. call-by-name
  - aliasing (two names refer to the same location)
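The "source file -> scanner -> token stream" step above can be sketched with regular expressions. This is a minimal illustration, not the actual Decaf scanner; the token classes (NUM, ID, OP, ...) are made up for the example.

```python
import re

# A minimal regex-based scanner sketch: each token class is a regular
# expression, tried at the current position, and the source string
# becomes a flat token stream.
TOKEN_SPEC = [
    ("NUM",  r"\d+"),
    ("ID",   r"[A-Za-z_]\w*"),
    ("OP",   r"[+\-*/=]"),
    ("LPAR", r"\("),
    ("RPAR", r"\)"),
    ("SKIP", r"\s+"),            # whitespace: matched but not emitted
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def scan(source):
    """Turn a source string into a list of (kind, lexeme) tokens."""
    tokens = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise SyntaxError(f"illegal character at {pos}: {source[pos]!r}")
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

print(scan("x = 3 + y1"))
# [('ID', 'x'), ('OP', '='), ('NUM', '3'), ('OP', '+'), ('ID', 'y1')]
```

Note that ordering matters: NUM is tried before ID, so a leading digit can never start an identifier. Lexer generators like ANTLR automate exactly this kind of specification.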
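The notes pair regular expressions with finite automata: every regex corresponds to a state machine. A hand-rolled sketch of a DFA recognizing identifiers (state numbering and character classes are invented for the example):

```python
# DFA for identifiers ([A-Za-z_][A-Za-z0-9_]*).
# States: 0 = start, 1 = accepting; a missing transition means reject.
def classify(ch):
    if ch.isalpha() or ch == "_":
        return "letter"
    if ch.isdigit():
        return "digit"
    return "other"

TRANSITIONS = {
    (0, "letter"): 1,
    (1, "letter"): 1,
    (1, "digit"):  1,
}

def is_identifier(s):
    state = 0
    for ch in s:
        state = TRANSITIONS.get((state, classify(ch)))
        if state is None:
            return False
    return state == 1

assert is_identifier("y1") and not is_identifier("1y")
```

Generated scanners compile all token regexes into one combined automaton like this, which is why scanning runs in a single left-to-right pass.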
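The symbol table mentioned above ("orthogonal info used by many phases") is often a chain of per-scope dictionaries. A toy version, with the class name and the use of type strings as the stored info being assumptions for illustration:

```python
# A symbol-table sketch: one dictionary per lexical scope, each scope
# linked to its enclosing parent; lookup walks outward through scopes.
class SymbolTable:
    def __init__(self, parent=None):
        self.parent = parent
        self.bindings = {}            # name -> info (here: a type string)

    def declare(self, name, info):
        self.bindings[name] = info

    def lookup(self, name):
        scope = self
        while scope is not None:      # search this scope, then enclosing ones
            if name in scope.bindings:
                return scope.bindings[name]
            scope = scope.parent
        return None

glob = SymbolTable()
glob.declare("x", "int")
inner = SymbolTable(parent=glob)
inner.declare("y", "bool")
assert inner.lookup("y") == "bool"
assert inner.lookup("x") == "int"    # found in the enclosing scope
assert glob.lookup("y") is None      # inner names invisible outside
```

The same structure directly models static (lexical) scope from the "PL basics" list: lookup follows the nesting of the program text, not the runtime call stack.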
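The "clean AST -> code gen -> 3-address code" step can be sketched by walking an expression tree and emitting one instruction per operator. The tuple-based AST shape and temporary-naming scheme here are assumptions, not Decaf's actual representation:

```python
# Toy AST -> 3-address code: each interior node (op, lhs, rhs) gets a
# fresh temporary; leaves are constants or variable names.
counter = 0

def fresh_temp():
    global counter
    counter += 1
    return f"t{counter}"

def gen(ast, code):
    """Emit 3-address instructions for ast; return the name holding its value."""
    if isinstance(ast, (int, str)):
        return str(ast)              # leaf: constant or variable
    op, left, right = ast            # interior node
    l = gen(left, code)
    r = gen(right, code)
    t = fresh_temp()
    code.append(f"{t} = {l} {op} {r}")
    return t

code = []
gen(("+", ("*", "a", "b"), 4), code)
print("\n".join(code))
# t1 = a * b
# t2 = t1 + 4
```

The flat list of instructions is what the optimizer then rewrites before assembly is emitted, since "at most one operator per instruction" makes dataflow easy to analyze.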
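The static (lexical) vs. dynamic scope distinction from "PL basics" can be seen concretely in Python, which is lexically scoped: a closure captures the environment where it was defined, not the bindings live at the call site.

```python
count = 100   # under *dynamic* scope, bump() would see this binding instead

def make_counter():
    count = 0                  # binding in make_counter's (lexical) environment
    def bump():
        nonlocal count         # refers to the lexically enclosing 'count'
        count += 1
        return count
    return bump                # a first-class function plus its closure

c = make_counter()
assert (c(), c(), c()) == (1, 2, 3)   # uses make_counter's count, not the global
assert count == 100                   # the global binding is untouched
```

This also illustrates "first-class functions, lambdas, and closures" from the subprograms list: `bump` outlives the call to `make_counter` yet still sees its environment.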
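Aliasing from the parameter-passing list can be demonstrated in Python, whose convention (sometimes called "call by object reference") passes references by value, so a formal and an actual parameter can name the same location:

```python
def append_item(xs):     # formal parameter: xs
    xs.append(99)        # mutates the shared object -> visible to the caller
    xs = [0]             # rebinding the *name* does NOT affect the caller

nums = [1, 2]
append_item(nums)        # actual parameter: nums
assert nums == [1, 2, 99]

a = nums                 # a and nums now alias the same location
a.append(5)
assert nums == [1, 2, 99, 5]
```

The two assertions separate the two ideas in the notes: the environment maps names to locations (rebinding `xs` only changes that map inside the callee), while the state maps locations to values (the `append` mutation is shared through the alias).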