Essential Compiler Front-End Knowledge for Code Visualization
Introduction
The concept and industry applications of code visualization have been covered in previous articles. For an overview, refer to "Analysis of Code Visualization" and check the "Code Visualization" column for more related content. This article outlines the compiler front-end knowledge required for developing code visualization functionality. Due to limitations in expertise, any unclear explanations or errors should be forgiven. For more in-depth formal learning, please refer to the extended reading materials provided at the end.
Compiler Overview
This section focuses on front-end and mid-end theoretical knowledge. The back-end components, which deal with target machine code and specific machine architectures, are rarely used in visualization. This article primarily covers front-end components, with mid-end topics to be addressed in a separate article later.
Compiler Front-End
Lexical Analysis
Theoretical Concepts
Lexical analysis, also known as scanning, involves reading the character stream that constitutes the source program and organizing it into a sequence of meaningful lexemes. A lexeme is the smallest language unit in the source program, such as keywords, identifiers, constants, operators, and separators. For each lexeme, the lexical analyzer produces a corresponding token as output.
Token: <Type Code, Attribute Value>
The core logic of a lexical analyzer is based on Finite Automata, which can be understood as an automatically executing machine with a finite number of states used to map scanned characters to a finite set of possibilities. Types include:
- Nondeterministic Finite Automaton (NFA): In a given state and input symbol, there may exist multiple possible transition states.
- Deterministic Finite Automaton (DFA): In any state and input symbol, there is only one unique transition state.
Practical Implementation
Let's practice by performing lexical analysis on Java source code using ANTLR. ANTLR is an open-source tool that supports generating lexical analyzers and parsers based on rule files. It is implemented in Java itself. On Mac, you can install it using Homebrew or directly use the ANTLR v4 plugin in IntelliJ IDEA. Additionally, the grammars-v4 repository provides many reference rules. Here, we'll use the lexical analysis rules defined for Java8 for practice.
Lexical Rule Definition: Java8Lexer.g4
# Partial content
- Keyword definitions
ABSTRACT : 'abstract';
ASSERT : 'assert';
BOOLEAN : 'boolean';
BREAK : 'break';
BYTE : 'byte';
CASE : 'case';
CATCH : 'catch';
CHAR : 'char';
...
- String literal definition
StringLiteral: '"' StringCharacters? '"';
fragment StringCharacters: StringCharacter+;
fragment StringCharacter: ~["\\r\n] | EscapeSequence;
...
Java Code to be Analyzed
public class HelloWorld {
public static void main(String[] args) {
System.out.println("Hello, World");
}
}
Using ANTLR to Generate and Run the Lexical Analyzer
# 1. Compile lexical rules
antlr Java8Lexer.g4
# 2. Compile the generated Java files (Note: The ANTLR JAR file needs to be set in the CLASSPATH environment variable, otherwise an error will occur)
javac JavaLexer.java
# 3. Use the generated lexical analyzer to get analysis results
grun JavaLexer tokens -tokens ./examples/helloworld.java
Syntactic Analysis (Parsing)
Theoretical Concepts
Syntactic analysis, also known as parsing, is performed after lexical analysis. It organizes tokens into a syntactic structure, typically an Abstract Syntax Tree (AST), which represents the syntactic structure of the source code. The parser needs to analyze the token sequence according to a set of predefined syntax rules. These rules are usually defined in the form of Context-Free Grammar (CFG), where each rule defines how a structure in the language is composed of other structures.
Let's briefly explain CFG here. For deeper learning, you can consult additional resources. A context-free grammar consists of the following four components:
- Non-terminals: These are variables of the grammar that represent sets of strings. They are usually denoted by uppercase letters such as A, B, Expr, etc.
- Terminals: These are the basic symbols of the grammar that form the strings of the language. In programming languages, terminals can be keywords, operators, identifiers, etc. They are usually denoted by loewrcase letters, numbers, or other symbols.
- Production rules: These rules define how non-terminals are replaced by sequences of terminals or other non-terminals. The form of the rules is usually A → B C, meaning that non-terminal A can be replaced by B C.
- Start symbol: This is a special non-terminal used to represent the starting point of the entire language or grammar.
-----Example-------
S → a S b
S → ε
-------Explanation--------
· Non-terminal is S.
· Terminals are a and b.
· There are two production rules: S → a S b means S can be replaced by a S b, and S → ε means S can be replaced by the empty string (ε represents the empty string).
· Start symbol is S.
This grammar generates the language of all balanced strings of a and b, such as: ab, aabb, aaabbb, etc.
The core capability of syntactic analysis is given a grammar G and a sentence s, determine whether s can be derived from G. Implementation methods can be roughly divided into bottom-up and top-down parsing:
- Top-down parsing: The process of building the parse tree starting from the top (i.e., the start symbol). In this method, the parser attempts to find which production the input string can start with and gradually expands these productions until the complete input string is obtained. Top-down parsing is intuitive and easy to implement, especially for simple grammars. However, they may not handle left-recursive grammars and may be less efficient for complex grammars. Common algorithms include "Recursive Descent Parsing" and "LL Parsing".
- Bottom-up parsing: The process of building the parse tree starting from the bottom (i.e., the input string). In this method, the parser attempts to find which parts of the input string correspond to the right side of some production in the grammar and reduces it to the left side of the production, gradually reducing the overall length until it is finally reduced to the start symbol. Bottom-up parsing is usually more powerful than top-down parsing because it can handle more complex grammars, including those that top-down parsers cannot handle. However, their parse tables are often more complex and may be more difficult to implement. Common algorithms include "LR Parsing".
Practical Implementation
After understanding the basic concepts, let's practice by performing syntactic analysis on Java source code using ANTLR. This time, we won't use the syntax rules defined in grammars-v4 because the syntax rules of programming languages are relatively complex, and the resulting AST may have poor readability.
Syntax Rule Definition (Lexical rule definition: CommonLexer.g4; Syntax rule definition: PlayScript.g4)
grammar PlayScript;
import CommonLexer; // Import lexical definitions
/* The following content is added to the head of the generated Java source file, such as package name, import statements, etc. */
@header {
package antlrtest;
}
expression
: assignmentExpression
| expression ',' assignmentExpression
;
assignmentExpression
: additiveExpression
| Identifier assignmentOperator additiveExpression
;
assignmentOperator
: '='
| '*='
| '/='
| '%='
| '+='
| '-='
;
additiveExpression
: multiplicativeExpression
| additiveExpression '+' multiplicativeExpression
| additiveExpression '-' multiplicativeExpression
;
multiplicativeExpression
: primaryExpression
| multiplicativeExpression '*' primaryExpression
| multiplicativeExpression '/' primaryExpression
| multiplicativeExpression '%' primaryExpression
;
Using ANTLR to Generate and Run the Syntax Analyzer
# 1. Compile syntax rules
antlr PlayScript.g4
# 2. Compile the generated Java files (Note: The ANTLR JAR file needs to be set in the CLASSPATH environment variable, otherwise an error will occur)
javac *.java
# 3. Use the generated syntax analyzer to get analysis results (input the expression and use ^D to trigger AST generation)
grun antlrtest.PlayScript expression -gui
age + 10 * 2 + 10
^D
Semantci Analysis
Theoretical Concepts
The semantic analyzer uses information from the syntax tree and symbol table to check if the source program is consistent with the language's defined semantics. It also collects type information and stores it in the syntax tree or symbol table for use in subsequent intermediate code generation. Semantic rules include but are not limited to:
- Type checking: Ensuring that operand types are compatible with operators, for example, not allowing an integer to be assigned to a string variable.
- Variable binding: Ensuring that variable and function declarations match their usage, variables are declared before use, and function calls have the correct number and type of parameters as declared.
- Control flow checking: Ensuring that control flow statements (such as loops and conditional statements) in the program are used legally.
- Uniqueness checking: Ensuring that identifier declarations are unique within the same scope, for example, not allowing two variables with the same name to be declared in the same scope.
- Permission and accessibility checking: Ensuring that access to variables and functions conforms to their declared permissions, for example, private members can only be accessed within their class.
Since each language has its own unique semantic rules and characteristics, such as type systems, scope rules, legal operator overloading, etc., which are language-specific. Therefore, semantic analysis must be designed according to the specifications of each language.
Practical Implementation
Since the implementation of semantic analysis varies greatly between different languages, there is no universal semantic analyzer generation tool. Therefore, let's directly examine the relevant source code in the Java compiler to understand the implementation logic. The semantic analysis source code in javac is located in the com.sun.tools.javac.code and com.sun.tools.javac.comp packages.
The following lists some semantic analysis classes. The source code is not posted here, but can be downloaded and read from langtools:
- Symbol: Represents all language symbols, including variables, methods, classes, etc. These symbols are created during compilation and populated into the symbol table.
- Scope: Manages scopes, storing all symbols within a scope. This is crucial for resolving variable and method names to ensure they are visible and valid in the current context.
- Type: Represents all types in the Java language, including primitive types, class and interface types, array types, etc. javac uses this class during type checking to determine type compatibility and perform type conversions.
- Attr: The core class for attribute analysis, responsible for associating type information and other attributes with syntax tree nodes. It performs tasks such as type checking, method resolution, and variable capture.
- Check: Performs various semantic checks, such as checking for circular inheritance in types, or whether a class implements all methods of its interfaces.
- Resolve: Resolves method calls, type names, and variable names. It looks up the correct symbols in the symbol table and handles complex situations like method overload resolution.
- Annotate: Handles annotation-related semantic analysis, including annotation resolution and application.
- Types: Provides a set of utility methods for type operations, such as determining if one type can be assigned to another, or finding the nearest common ancestor type.
- Flow: Performs control flow analysis, such as checking if variables are assigned before use, or if methods always return values.
- LambdaToMethod: Java 8 introduced Lambda expressions, this class is responsible for converting Lambda expressions to anonymous classes or static methods.
- TransTypes and Lower: These classes handle certain type conversions, including generic bridge methods and auto-boxing/unboxing.
Extended Reading
- Classic books: "Dragon Book", "Tiger Book", "Whale Book"
- CS143 (Stanford University course)
- Principles of Compilers - NetEase Cloud Classroom
- Principles of Compilers - Chinese University MOOC