Handling Delimited Content in Lexical Analysis

In lexical analysis, separating tokens cleanly is not always straightforward. Consider a language with string literals in which every variable name is a single character:

x = "a"
y = "bc"

An initial grammar attempt might look like this:

use super::{Var, Lit, Eql};

grammar;

pub Var: Var = <r"[x-z]"> => <>.chars().next().unwrap().into();

pub Lit: Lit = "\"" <r"[a-z]*"> "\"" => <>.into();

pub Eql: Eql = <Var> "=" <Lit> => (<>).into();
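
The grammar imports Var, Lit, and Eql from the parent module and builds them with .into(), so those types need matching From implementations. The original does not show them; the following is a minimal sketch, assuming simple tuple structs. The field layout and derives are assumptions made for illustration.

```rust
// Hypothetical AST types. Only the names come from the grammar's
// `use super::{Var, Lit, Eql};` line; the fields are assumed.
#[derive(Debug, PartialEq)]
pub struct Var(pub char);

#[derive(Debug, PartialEq)]
pub struct Lit(pub String);

#[derive(Debug, PartialEq)]
pub struct Eql(pub Var, pub Lit);

// `<>.chars().next().unwrap().into()` needs a char -> Var conversion.
impl From<char> for Var {
    fn from(c: char) -> Self {
        Var(c)
    }
}

// `<>.into()` on the matched text needs a &str -> Lit conversion.
impl<'a> From<&'a str> for Lit {
    fn from(s: &'a str) -> Self {
        Lit(s.to_string())
    }
}

// `(<>).into()` on the (Var, Lit) pair needs a tuple -> Eql conversion.
impl From<(Var, Lit)> for Eql {
    fn from((var, lit): (Var, Lit)) -> Self {
        Eql(var, lit)
    }
}
```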

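This approach fails because the regular expressions for variables and for string content overlap: the lexer tokenizes the input without knowing what the parser expects next, so a character such as x matches both r"[x-z]" and r"[a-z]*". Giving the variable pattern precedence with a match declaration is only a temporary fix. It resolves the conflict between the two patterns, but it doesn't solve the fundamental issue, since a literal such as "x" now has its content lexed as a variable: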

match {
   r"[x-z]"
} else {
   r"[a-z]*",
   _
}
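
To see why precedence alone is not enough, it helps to model how a lexer chooses between the two patterns. The sketch below is a toy model, not LALRPOP's generated lexer: it assumes the usual rule that the longest match wins and that precedence only breaks ties. The names Tok, match_var, match_letters, and next_token are hypothetical.

```rust
// Toy model of a longest-match lexer with priority tie-breaking.
// Illustration only; this is not the code LALRPOP generates.

#[derive(Debug, PartialEq, Clone, Copy)]
pub enum Tok {
    Var,     // stands for r"[x-z]"
    Letters, // stands for r"[a-z]*"
}

/// Length of the prefix matched by r"[x-z]" (0 or 1).
fn match_var(input: &str) -> usize {
    match input.chars().next() {
        Some('x'..='z') => 1,
        _ => 0,
    }
}

/// Length of the prefix matched by r"[a-z]*".
fn match_letters(input: &str) -> usize {
    input.chars().take_while(|c| c.is_ascii_lowercase()).count()
}

/// Choose the token at the start of `input`: the longest match wins,
/// and on a tie the higher-priority `Var` pattern wins.
/// Empty matches are treated as "no token" in this model.
pub fn next_token(input: &str) -> Option<(Tok, usize)> {
    let var = match_var(input);
    let letters = match_letters(input);
    if var == 0 && letters == 0 {
        None
    } else if var >= letters {
        Some((Tok::Var, var))
    } else {
        Some((Tok::Letters, letters))
    }
}
```

Under this model, next_token("a") and next_token("xy") both yield Letters, but next_token("x") yields Var. The content of the literal "x" therefore reaches the parser as a variable token, and the Lit rule cannot match it.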

The proper solution involves treating the entire string (delimiters and content) as a single token:

pub Var: Var = <r"[a-z]"> => <>.chars().next().unwrap().into();

pub Lit: Lit = <l:r#""[a-z ]*""#> => l[1..l.len()-1].into();

pub Eql: Eql = <Var> "=" <Lit> => (<>).into();
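
The action l[1..l.len()-1] slices by byte offset. That is safe here because the token regex guarantees the first and last bytes are ASCII double quotes. A small sketch of the same operation as a standalone function (strip_quotes is a hypothetical name, not part of the grammar):

```rust
/// Drop the first and last byte of a quoted token, mirroring `l[1..l.len()-1]`.
/// Assumes the token matched r#""[a-z ]*""#, so it starts and ends with an
/// ASCII `"`; any other input may panic or return the wrong slice.
pub fn strip_quotes(token: &str) -> &str {
    &token[1..token.len() - 1]
}
```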

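Because the quotes are now part of the token, the lexer no longer confuses string content with variable names, and the content can safely include other characters such as spaces. To handle escape sequences, widen the regular expression so that it also accepts an escaped backslash or an escaped quote inside the string: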

pub Lit: Lit = <l:r#""(\\\\|\\"|[^"\\])*""#> => Lit(apply_string_escapes(&l[1..l.len()-1]).into());

Here, apply_string_escapes is a helper function that replaces each escape sequence in the string content with the character it represents.
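
The original does not show the body of apply_string_escapes. One possible implementation is sketched below, assuming the only escapes are the two the regular expression admits (\\ and \") and that the result is returned as a Cow so unescaped strings are not copied. The signature is an assumption chosen to fit the .into() call in the grammar action.

```rust
use std::borrow::Cow;

/// Hypothetical helper: resolves `\\` and `\"` in the text between the quotes.
/// Returns the input unchanged (borrowed) when it contains no backslash.
pub fn apply_string_escapes(content: &str) -> Cow<'_, str> {
    if !content.contains('\\') {
        return Cow::Borrowed(content);
    }
    let mut out = String::with_capacity(content.len());
    let mut chars = content.chars();
    while let Some(c) = chars.next() {
        if c == '\\' {
            // The token regex only lets a backslash be followed by `\` or `"`,
            // so the escaped character is copied through as-is.
            if let Some(escaped) = chars.next() {
                out.push(escaped);
            }
        } else {
            out.push(c);
        }
    }
    Cow::Owned(out)
}
```

Supporting further escapes such as \n would require changes in two places: the token's regular expression and a mapping step inside this function.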

Tags: lexical-analysis parsing Grammar string-literals regular-expressions

Posted on Sun, 24 May 2026 17:15:18 +0000 by mhoard8110