PostgreSQL's Parser Implementation with Flex and Bison

Understanding PostgreSQL's Parser Architecture

When examining PostgreSQL's source code, one might notice that the conventional Bison parser function yyparse() is replaced by base_yyparse(). This renaming is part of how PostgreSQL customizes Flex and Bison for its SQL parsing needs.

The PostgreSQL parser configuration in gram.y includes several special directives:

%name-prefix="base_yy"
%parse-param {core_yyscan_t yyscanner}
%lex-param   {core_yyscan_t yyscanner}

These directives accomplish three key tasks:

  1. Renaming all generated functions and variables with the "base_yy" prefix
  2. Adding a scanner parameter to the parse function
  3. Passing the scanner object to the lexer

The Scanner Object

In a standard reentrant Flex scanner, the scanner object is of type yyscan_t. PostgreSQL refers to it by its own type name, core_yyscan_t, and gives the scanner functions generated from scan.l a matching prefix (so yylex() becomes core_yylex()) using the directive:

%option prefix="core_yy"

The core_yyscan_t type is an opaque void * pointer, so code outside the scanner never depends on the scanner's internal state. This provides an additional layer of abstraction in PostgreSQL's parser implementation.
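The opaque-pointer idea can be illustrated with a minimal sketch. All names below (demo_scan_t, demo_scan_state, and the demo_scan_* functions) are hypothetical stand-ins invented for this example; they are not PostgreSQL's actual types or functions.

```c
#include <assert.h>
#include <stdlib.h>

/* Opaque handle in the spirit of core_yyscan_t: callers only see void *. */
typedef void *demo_scan_t;              /* hypothetical name */

/* Private scanner state; callers never touch these fields directly. */
struct demo_scan_state {
    const char *input;                  /* text being scanned */
    size_t      pos;                    /* current offset into input */
};

/* Allocate the state and hand it back as an opaque pointer. */
demo_scan_t demo_scan_init(const char *input)
{
    struct demo_scan_state *st = malloc(sizeof *st);

    if (st == NULL)
        return NULL;
    st->input = input;
    st->pos = 0;
    return st;
}

/* Return the next character, or 0 at end of input. */
int demo_scan_next(demo_scan_t scanner)
{
    struct demo_scan_state *st = scanner;   /* recover the concrete type */

    assert(st != NULL);
    if (st->input[st->pos] == '\0')
        return 0;
    return (unsigned char) st->input[st->pos++];
}

/* Release the state allocated by demo_scan_init(). */
void demo_scan_finish(demo_scan_t scanner)
{
    free(scanner);
}
```

Because the handle is only a void *, the layout of the state struct can change without affecting any code that merely passes the handle around.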

Parser Initialization Flow

The parsing process in PostgreSQL begins with the raw_parser() function in parser.c, which orchestrates the following sequence:

  1. Initialize the scanner via scanner_init()
  2. Set up the parser with parser_init()
  3. Execute the main parsing function base_yyparse()
  4. Clean up with scanner_finish()
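The four steps above can be sketched roughly as follows. This is only an illustration of the call order: every demo_* function is a hypothetical stub written for this example, and the real raw_parser() takes different arguments and returns a parse tree rather than an int.

```c
#include <stddef.h>
#include <string.h>

typedef void *demo_scan_t;              /* hypothetical opaque scanner handle */

/* Records the order in which the stub steps run. */
static char demo_trace[64];

static void demo_log(const char *step)
{
    strcat(demo_trace, step);
}

/* Step 1 stub: the "scanner" here is just the input string itself. */
static demo_scan_t demo_scanner_init(const char *sql)
{
    demo_log("scanner_init;");
    return (void *) sql;
}

/* Step 2 stub: nothing to set up in this sketch. */
static void demo_parser_init(void)
{
    demo_log("parser_init;");
}

/* Step 3 stub: follows Bison's convention of 0 = success, nonzero = failure. */
static int demo_yyparse(demo_scan_t scanner)
{
    const char *sql = scanner;

    demo_log("parse;");
    return sql[0] == '\0' ? 1 : 0;      /* pretend empty input is an error */
}

/* Step 4 stub: cleanup runs whether or not parsing succeeded. */
static void demo_scanner_finish(demo_scan_t scanner)
{
    (void) scanner;
    demo_log("scanner_finish;");
}

/* Orchestrates the four steps in order and returns the parse result. */
int demo_raw_parser(const char *sql)
{
    demo_scan_t scanner;
    int         rc;

    demo_trace[0] = '\0';               /* start a fresh trace */
    scanner = demo_scanner_init(sql);
    demo_parser_init();
    rc = demo_yyparse(scanner);
    demo_scanner_finish(scanner);
    return rc;
}
```

Calling demo_raw_parser("SELECT 1") leaves the four step names in demo_trace in the order listed above.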

Customized Lexer Implementation

Whereas a default Bison parser calls yylex(), PostgreSQL's parser calls base_yylex(), which has an extended signature:

extern int base_yylex(YYSTYPE *lvalp, YYLTYPE *llocp, core_yyscan_t yyscanner);

The three parameters in this function are generated by specific Bison directives:

  • core_yyscan_t yyscanner comes from %lex-param
  • YYSTYPE* lvalp is generated by %pure-parser
  • YYLTYPE* llocp comes from %locations

base_yylex() sets the semantic value and the location of each token for the parser. It acts as a filter between the core lexer and the parser, and in most cases it needs only a single call to the underlying core_yylex() function.
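To make the calling convention concrete, here is a toy sketch of a lexer that reports its semantic value and location through out-parameters, wrapped by a pass-through filter. DEMO_STYPE, DEMO_LTYPE, demo_state and the demo_* functions are assumptions made up for this example; PostgreSQL's real lexer is generated by Flex and is far more involved.

```c
#include <ctype.h>
#include <stddef.h>

typedef union { int ival; } DEMO_STYPE; /* stand-in for YYSTYPE */
typedef int DEMO_LTYPE;                 /* stand-in for YYLTYPE: a byte offset */
typedef void *demo_scan_t;              /* hypothetical opaque scanner handle */

enum { DEMO_EOF = 0, DEMO_ICONST = 258 };

typedef struct {
    const char *buf;                    /* input text */
    size_t      pos;                    /* current offset */
} demo_state;

/* Toy core lexer: integers become DEMO_ICONST, other characters are
 * returned as themselves, and spaces are skipped. */
static int demo_core_yylex(DEMO_STYPE *lvalp, DEMO_LTYPE *llocp,
                           demo_scan_t scanner)
{
    demo_state *st = scanner;

    while (st->buf[st->pos] == ' ')
        st->pos++;
    if (st->buf[st->pos] == '\0')
        return DEMO_EOF;

    *llocp = (DEMO_LTYPE) st->pos;      /* where the token starts */

    if (isdigit((unsigned char) st->buf[st->pos]))
    {
        int v = 0;

        while (isdigit((unsigned char) st->buf[st->pos]))
            v = v * 10 + (st->buf[st->pos++] - '0');
        lvalp->ival = v;                /* semantic value for the parser */
        return DEMO_ICONST;
    }
    return (unsigned char) st->buf[st->pos++];
}

/* Filter in the common case: one call to the core lexer, passed through. */
int demo_base_yylex(DEMO_STYPE *lvalp, DEMO_LTYPE *llocp, demo_scan_t scanner)
{
    return demo_core_yylex(lvalp, llocp, scanner);
}
```

The return value carries the token code, while the value and location travel through the two pointers, which is what allows the parser to avoid global variables.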

Lookahead Tokens in PostgreSQL

In certain parsing scenarios, a single lookahead token is insufficient for shift/reduce decisions. PostgreSQL addresses this with special cases in base_yylex(), where core_yylex() is called a second time to peek at the token that follows.

A notable example is the WITH_LA token, which base_yylex() returns in place of WITH when the next token is TIME or ORDINALITY. Replacing WITH with a distinct token lets the parser tell WITH TIME ZONE and WITH ORDINALITY apart from other uses of the WITH keyword, such as the one that introduces Common Table Expressions (CTEs).
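A minimal sketch of this kind of lookahead filter is shown below. It reads from a pre-tokenized array instead of a real scanner, and the token codes, the demo_filter struct and both functions are hypothetical names chosen for illustration only.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical token codes for this sketch. */
enum demo_token { T_EOF = 0, T_WITH = 300, T_WITH_LA, T_TIME, T_ORDINALITY, T_IDENT };

typedef struct {
    const int *tokens;          /* pre-tokenized input, terminated by T_EOF */
    size_t     next;            /* index of the next unread token */
    bool       have_lookahead;  /* is a peeked token waiting to be returned? */
    int        lookahead_token; /* the peeked token, if any */
} demo_filter;

/* Stand-in for the core lexer: hands out tokens one at a time. */
static int demo_core_lex(demo_filter *f)
{
    int tok = f->tokens[f->next];

    if (tok != T_EOF)
        f->next++;
    return tok;
}

/* Filter: returns T_WITH_LA instead of T_WITH when TIME or ORDINALITY follows. */
int demo_filter_lex(demo_filter *f)
{
    int cur;

    /* Return a previously peeked token before reading new input. */
    if (f->have_lookahead)
    {
        cur = f->lookahead_token;
        f->have_lookahead = false;
    }
    else
        cur = demo_core_lex(f);

    if (cur != T_WITH)
        return cur;             /* common case: no extra lexer call */

    /* Peek one token ahead and keep it for the next call. */
    f->lookahead_token = demo_core_lex(f);
    f->have_lookahead = true;

    if (f->lookahead_token == T_TIME || f->lookahead_token == T_ORDINALITY)
        return T_WITH_LA;
    return T_WITH;
}
```

The peeked token is never discarded: it is buffered and handed to the parser on the following call, so the parser still sees every token exactly once.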

Keyword Management

PostgreSQL maintains all SQL keywords in src/include/parser/kwlist.h. This header file is included in scan.l and contains the complete list of recognized keywords. When the lexical analyzer encounters an identifier, it calls a lookup function to check whether the identifier matches a keyword and, if so, returns the corresponding token.

The entries in kwlist.h are kept in sorted (ASCII) order, which is what makes fast keyword lookup possible during the lexical analysis phase.
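As a rough illustration of why a sorted list helps, the sketch below looks up identifiers in a small sorted table with the standard bsearch() function. The table contents, token codes and function names are invented for this example, and binary search is an assumption of the sketch rather than a description of the lookup strategy in any particular PostgreSQL release.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical token codes for this sketch. */
enum { K_NOT_KEYWORD = -1, K_FROM = 400, K_SELECT, K_WHERE, K_WITH };

typedef struct {
    const char *name;           /* keyword in lower case */
    int         token;          /* token code returned to the parser */
} demo_keyword;

/* Must stay sorted by name in ASCII order, or bsearch() will miss entries. */
static const demo_keyword demo_keywords[] = {
    { "from",   K_FROM },
    { "select", K_SELECT },
    { "where",  K_WHERE },
    { "with",   K_WITH },
};

/* bsearch() comparator: key is a C string, elem is a table entry. */
static int demo_kw_cmp(const void *key, const void *elem)
{
    return strcmp((const char *) key, ((const demo_keyword *) elem)->name);
}

/* Return the keyword's token code, or K_NOT_KEYWORD for a plain identifier. */
int demo_keyword_lookup(const char *ident)
{
    char                lowered[32];
    size_t              len = strlen(ident);
    const demo_keyword *hit;

    if (len >= sizeof lowered)
        return K_NOT_KEYWORD;   /* too long to be any keyword in the table */

    /* Keywords are case-insensitive, so fold to lower case (copies the NUL too). */
    for (size_t i = 0; i <= len; i++)
        lowered[i] = (char) tolower((unsigned char) ident[i]);

    hit = bsearch(lowered, demo_keywords,
                  sizeof demo_keywords / sizeof demo_keywords[0],
                  sizeof demo_keywords[0], demo_kw_cmp);
    return hit ? hit->token : K_NOT_KEYWORD;
}
```

With this table, demo_keyword_lookup("SELECT") yields K_SELECT, while an ordinary name such as "my_table" yields K_NOT_KEYWORD and would be treated as an identifier.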

Tags: PostgreSQL Flex bison parser lexical analysis

Posted on Tue, 23 Jun 2026 16:50:25 +0000 by jstarkweather