Understanding PostgreSQL's Parser Architecture
When examining PostgreSQL's source code, one might notice that the conventional Bison parser function yyparse() is replaced by base_yyparse(). This modification is part of PostgreSQL's customized use of Flex and Bison for its SQL parsing needs.
The PostgreSQL parser configuration in gram.y includes several special directives:
%name-prefix="base_yy"
%parse-param {core_yyscan_t yyscanner}
%lex-param {core_yyscan_t yyscanner}
These directives accomplish three key tasks:
1. Renaming all generated functions and variables with the "base_yy" prefix
2. Adding a scanner parameter to the parse function
3. Passing the scanner object to the lexer
The Scanner Object
In a standard reentrant Flex scanner, the scanner object is of type yyscan_t. PostgreSQL uses its own name for this type, core_yyscan_t, in keeping with the prefix that scan.l sets with the directive:
%option prefix="core_yy"
Like yyscan_t, core_yyscan_t is an opaque void * pointer, so the parser can carry the scanner object around without knowing anything about its internals.
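The way an opaque scanner handle travels from the parser into the lexer can be sketched in plain C. This is a hand-written toy, not Bison or Flex output: every toy_* name is hypothetical, the "tokens" are single characters, and the "parser" only counts them.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these names exist in PostgreSQL. */
typedef void *toy_yyscan_t;     /* opaque handle, in the style of yyscan_t */

/* Private scanner state; only the lexer side looks inside it. */
typedef struct ToyScanner {
    const char *input;          /* text being scanned */
    size_t      pos;            /* current offset into input */
} ToyScanner;

/* Lexer: receives the handle (the role %lex-param plays) and casts it back. */
static int toy_yylex(toy_yyscan_t yyscanner)
{
    ToyScanner *s = (ToyScanner *) yyscanner;
    char        c = s->input[s->pos];

    if (c == '\0')
        return 0;               /* 0 signals end of input, as in Bison */
    s->pos++;
    return (unsigned char) c;   /* toy rule: each character is a token */
}

/* Parser: receives the handle (the role %parse-param plays) and only
 * forwards it; it never inspects the scanner's fields. */
static int toy_yyparse(toy_yyscan_t yyscanner, int *ntokens)
{
    assert(yyscanner != NULL);
    *ntokens = 0;
    while (toy_yylex(yyscanner) != 0)
        (*ntokens)++;
    return 0;                   /* 0 means success, like yyparse() */
}

/* Convenience wrapper: the scanner state lives on the caller's stack,
 * so no global variables are involved. */
static int toy_count_tokens(const char *text)
{
    ToyScanner state = { text, 0 };
    int        n;

    toy_yyparse(&state, &n);
    return n;
}
```

Because all state is reached through the handle, two scanners can be active at once without interfering with each other, which is the usual motivation for this design.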
Parser Initialization Flow
The parsing process in PostgreSQL begins with the raw_parser() function in parser.c, which orchestrates the following sequence:
- Initialize the scanner via scanner_init()
- Set up the parser with parser_init()
- Execute the main parsing function base_yyparse()
- Clean up with scanner_finish()
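The sequence above can be sketched as a small C program. It is only an illustration of the init, parse, clean-up shape: the toy_* functions are hypothetical stand-ins with made-up signatures, the "parse" merely counts space-separated words, and a '!' character is treated as a syntax error by assumption.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* All toy_* names are hypothetical; this is not PostgreSQL's code. */
typedef void *toy_yyscan_t;

typedef struct ToyScanState {
    char  *buf;                 /* private copy of the query text */
    size_t pos;                 /* current scan offset */
} ToyScanState;

/* Step 1: allocate scanner state and copy the input (NULL on failure). */
static toy_yyscan_t toy_scanner_init(const char *str)
{
    size_t        len = strlen(str);
    ToyScanState *s = malloc(sizeof(*s));

    if (s == NULL)
        return NULL;
    s->buf = malloc(len + 1);
    if (s->buf == NULL) {
        free(s);
        return NULL;
    }
    memcpy(s->buf, str, len + 1);
    s->pos = 0;
    return s;
}

/* Step 2: reset the parser-side result. */
static void toy_parser_init(int *nwords)
{
    *nwords = 0;
}

/* Step 3: "parse" by counting words; returns nonzero on a toy error. */
static int toy_yyparse(toy_yyscan_t yyscanner, int *nwords)
{
    ToyScanState *s = (ToyScanState *) yyscanner;
    int           in_word = 0;

    for (; s->buf[s->pos] != '\0'; s->pos++) {
        char c = s->buf[s->pos];

        if (c == '!')
            return 1;           /* assumed "syntax error" for the sketch */
        if (c == ' ')
            in_word = 0;
        else if (!in_word) {
            in_word = 1;
            (*nwords)++;
        }
    }
    return 0;
}

/* Step 4: release everything the scanner allocated. */
static void toy_scanner_finish(toy_yyscan_t yyscanner)
{
    ToyScanState *s = (ToyScanState *) yyscanner;

    free(s->buf);
    free(s);
}

/* Orchestrator: returns the word count, or -1 on any failure. */
static int toy_raw_parser(const char *str)
{
    int          nwords;
    int          yyresult;
    toy_yyscan_t yyscanner;

    assert(str != NULL);
    yyscanner = toy_scanner_init(str);
    if (yyscanner == NULL)
        return -1;
    toy_parser_init(&nwords);
    yyresult = toy_yyparse(yyscanner, &nwords);
    toy_scanner_finish(yyscanner);  /* runs on success and failure alike */
    return (yyresult != 0) ? -1 : nwords;
}
```

In this sketch toy_raw_parser("SELECT 1") returns 2, while toy_raw_parser("SELECT !") returns -1. Cleaning up before checking the parse result keeps the error path from leaking the scanner.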
Customized Lexer Implementation
Unlike standard Bison, which calls yylex(), PostgreSQL uses base_yylex() with a modified signature:
extern int base_yylex(YYSTYPE *lvalp, YYLTYPE *llocp, core_yyscan_t yyscanner);
Each of the three parameters in this function comes from a specific Bison directive:
- core_yyscan_t yyscanner comes from %lex-param
- YYSTYPE *lvalp is generated by %pure-parser
- YYLTYPE *llocp comes from %locations
Inside base_yylex(), the function sets semantic values and token locations for the parser. This implementation acts as a filter between the core lexer and the parser, with most cases requiring only a single call to the underlying core_yylex() function.
Lookahead Tokens in PostgreSQL
In certain parsing scenarios, a single lookahead token is insufficient for shift/reduce decisions. PostgreSQL addresses this by implementing special cases where core_yylex() may be called multiple times consecutively.
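A notable example is the WITH_LA token, introduced to handle multiword constructs. When base_yylex() sees WITH, it peeks at the following token and returns WITH_LA instead if that token is TIME or ORDINALITY. This lets the parser distinguish WITH TIME ZONE and WITH ORDINALITY from other uses of the WITH keyword, such as the WITH clause that introduces Common Table Expressions (CTEs).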
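A filter of this kind can be sketched in C as follows. The sketch is a simplification under stated assumptions: the input is a pre-tokenized array rather than SQL text, the ToyToken values and toy_* functions are hypothetical, and only the WITH case is modeled. It returns WITH_LA when WITH is directly followed by TIME or ORDINALITY, and buffers the token it peeked at so that it is not lost.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical token codes for the sketch; TOK_EOF must be 0. */
enum ToyToken {
    TOK_EOF = 0, TOK_WITH, TOK_WITH_LA, TOK_TIME, TOK_ORDINALITY, TOK_OTHER
};

typedef struct ToyLexer {
    const enum ToyToken *tokens;     /* input, terminated by TOK_EOF */
    size_t               next;       /* index of the next unread token */
    bool                 have_lookahead;
    enum ToyToken        lookahead_token;
} ToyLexer;

/* Stand-in for the core lexer: hands out tokens one at a time and
 * keeps returning TOK_EOF once the end is reached. */
static enum ToyToken toy_core_yylex(ToyLexer *lx)
{
    enum ToyToken t = lx->tokens[lx->next];

    if (t != TOK_EOF)
        lx->next++;
    return t;
}

/* Filter between the core lexer and the parser. */
static enum ToyToken toy_base_yylex(ToyLexer *lx)
{
    enum ToyToken cur;

    assert(lx != NULL);

    /* Use the buffered token first, if a previous call peeked ahead. */
    if (lx->have_lookahead) {
        cur = lx->lookahead_token;
        lx->have_lookahead = false;
    } else {
        cur = toy_core_yylex(lx);
    }

    /* Common case: one core-lexer call, token passed through as is. */
    if (cur != TOK_WITH)
        return cur;

    /* WITH needs one extra token of context, so peek and buffer it. */
    lx->lookahead_token = toy_core_yylex(lx);
    lx->have_lookahead = true;

    if (lx->lookahead_token == TOK_TIME ||
        lx->lookahead_token == TOK_ORDINALITY)
        return TOK_WITH_LA;
    return TOK_WITH;
}
```

For the token stream WITH, ORDINALITY the filter yields WITH_LA and then ORDINALITY; for WITH followed by any other token it yields plain WITH. The grammar itself still sees only one token of lookahead, because the extra peek happens entirely inside the filter.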
Keyword Management
PostgreSQL maintains all SQL keywords in src/include/parser/kwlist.h. This header file is included in scan.l and contains the complete list of recognized keywords. When the lexical analyzer encounters an identifier, it calls a lookup routine such as ScanKeywordLookup() to determine whether the identifier matches a keyword, and returns the corresponding keyword token if it does.
The entries in kwlist.h are kept in sorted (ASCII) order, which ordinary sorting utilities make easy to maintain and which helps keep keyword lookup fast during the lexical analysis phase.
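To illustrate why a sorted keyword list makes lookup cheap, here is a minimal C sketch of a case-insensitive keyword check. It uses a plain binary search over a tiny hand-written table; that technique is chosen for the example only, and PostgreSQL's actual lookup code may work differently depending on the version. The table, the token codes, and toy_keyword_lookup() are all hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token codes; TOY_IDENT means "not a keyword". */
enum { TOY_IDENT = 0, TOY_FROM, TOY_SELECT, TOY_WHERE, TOY_WITH };

typedef struct ToyKeyword {
    const char *name;           /* lower-case keyword text */
    int         token;          /* token code returned to the parser */
} ToyKeyword;

/* Must stay sorted by name in ASCII order for the binary search. */
static const ToyKeyword toy_keywords[] = {
    { "from",   TOY_FROM },
    { "select", TOY_SELECT },
    { "where",  TOY_WHERE },
    { "with",   TOY_WITH },
};

/* Returns the keyword's token code, or TOY_IDENT for anything else. */
static int toy_keyword_lookup(const char *ident)
{
    char   lower[32];
    size_t len;
    size_t lo = 0;
    size_t hi = sizeof(toy_keywords) / sizeof(toy_keywords[0]);

    assert(ident != NULL);
    len = strlen(ident);
    if (len >= sizeof(lower))
        return TOY_IDENT;       /* longer than any keyword in the table */

    /* Fold to lower case so that SELECT, Select and select all match. */
    for (size_t i = 0; i < len; i++)
        lower[i] = (char) tolower((unsigned char) ident[i]);
    lower[len] = '\0';

    /* Binary search over the half-open range [lo, hi). */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        int    cmp = strcmp(lower, toy_keywords[mid].name);

        if (cmp == 0)
            return toy_keywords[mid].token;
        if (cmp < 0)
            hi = mid;
        else
            lo = mid + 1;
    }
    return TOY_IDENT;
}
```

With this table, toy_keyword_lookup("SELECT") returns TOY_SELECT and toy_keyword_lookup("orders") returns TOY_IDENT. If an entry were inserted out of order, the search could miss it, which is why the ordering of the list matters.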