When working with lexical analysis, clean separations between tokens are not always straightforward. Consider a language with string literals where variables are single characters:
x = "a"
y = "bc"
An initial grammar attempt might look like this:
use super::{Var, Lit, Eql};
grammar;
pub Var: Var = <r"[x-z]"> => <>.chars().next().unwrap().into();
pub Lit: Lit = "\"" <r"[a-z]*"> "\"" => <>.into();
pub Eql: Eql = <Var> "=" <Lit> => (<>).into();
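The grammar imports Var, Lit, and Eql from the parent module and builds each value through `.into()`. A minimal sketch of Rust types that would satisfy those conversions is shown here; the tuple-struct layout and the derives are assumptions made for illustration, not taken from the original code.

```rust
// Hypothetical AST types; field layout and derives are assumptions.
#[derive(Debug, PartialEq)]
pub struct Var(pub char);

#[derive(Debug, PartialEq)]
pub struct Lit(pub String);

#[derive(Debug, PartialEq)]
pub struct Eql(pub Var, pub Lit);

// `Var` is built from the first character of the matched text.
impl From<char> for Var {
    fn from(c: char) -> Self {
        Var(c)
    }
}

// `Lit` is built from the matched string slice.
impl From<&str> for Lit {
    fn from(s: &str) -> Self {
        Lit(s.to_string())
    }
}

// `Eql` is built from the (variable, literal) pair produced by `(<>)`.
impl From<(Var, Lit)> for Eql {
    fn from((v, l): (Var, Lit)) -> Self {
        Eql(v, l)
    }
}
```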
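This grammar fails because the regular expressions for variables and string content overlap: a single character such as x matches both r"[x-z]" and r"[a-z]*". Adding precedence rules with a match declaration provides a temporary fix but doesn't solve the fundamental issue, because the lexer still tokenizes the input without knowing whether it is inside a string: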
match {
    r"[x-z]"
} else {
    r"[a-z]*",
    _
}
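To see why the precedence rule is only a partial fix, the following toy tokenizer imitates a context-free lexer. It is a sketch, not the generated lexer: it assumes longest-match tokenization with ties broken in favour of r"[x-z]", and the `Tok` enum and `tokenize` function are hypothetical names introduced for illustration.

```rust
// Token kinds for a toy, context-free lexer (names are hypothetical).
#[derive(Debug, PartialEq)]
pub enum Tok {
    Quote,
    Eq,
    Var(char),    // stands in for r"[x-z]"
    Word(String), // stands in for r"[a-z]*"
}

/// Tokenizes `input` by longest match, breaking ties in favour of `Var`,
/// which mimics the priority the `match` block gives to r"[x-z]".
pub fn tokenize(input: &str) -> Vec<Tok> {
    let chars: Vec<char> = input.chars().collect();
    let mut toks = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        if c.is_whitespace() {
            i += 1; // whitespace between tokens is skipped
        } else if c == '"' {
            toks.push(Tok::Quote);
            i += 1;
        } else if c == '=' {
            toks.push(Tok::Eq);
            i += 1;
        } else if c.is_ascii_lowercase() {
            // Longest run of lowercase letters starting at `i`.
            let start = i;
            while i < chars.len() && chars[i].is_ascii_lowercase() {
                i += 1;
            }
            let run: String = chars[start..i].iter().collect();
            if run.len() == 1 && ('x'..='z').contains(&c) {
                toks.push(Tok::Var(c)); // tie: the higher-priority regex wins
            } else {
                toks.push(Tok::Word(run));
            }
        } else {
            panic!("unexpected character {:?}", c);
        }
    }
    toks
}
```

Under these assumptions, y = "bc" tokenizes as expected, but y = "x" yields a Var token between the quotes, so a rule that expects string content after the opening quote cannot match. Whitespace inside the quotes is also dropped, since it is skipped between tokens.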
The proper solution is to treat the entire string, delimiters and content together, as a single token:
pub Var: Var = <r"[a-z]"> => <>.chars().next().unwrap().into();
pub Lit: Lit = <l:r#""[a-z ]*""#> => l[1..l.len()-1].into();
pub Eql: Eql = <Var> "=" <Lit> => (<>).into();
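Because the opening quote is now part of the token, a string can no longer be confused with a variable, and the content may include characters such as spaces. To handle escape sequences in strings, extend the regular expression and post-process the matched text: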
pub Lit: Lit = <l:r#""(\\\\|\\"|[^"\\])*""#> => Lit(apply_string_escapes(&l[1..l.len()-1]).into());
Here, apply_string_escapes is a helper that processes the escape sequences in the string content.
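The body of apply_string_escapes is not given here, so the following is only one possible sketch. It assumes that the only escapes are a backslash followed by a single character (as the regular expression permits), that the escaped character is kept verbatim, and that the function returns a `Cow<str>` so unescaped input is not copied; the signature is an assumption.

```rust
use std::borrow::Cow;

/// Hypothetical sketch: replaces each backslash escape with the escaped
/// character. Assumes the lexer regex already rejected a lone trailing backslash.
pub fn apply_string_escapes(content: &str) -> Cow<str> {
    if !content.contains('\\') {
        // Nothing to unescape, so borrow the input instead of allocating.
        return Cow::Borrowed(content);
    }
    let mut out = String::with_capacity(content.len());
    let mut chars = content.chars();
    while let Some(ch) = chars.next() {
        if ch == '\\' {
            // Keep whatever follows the backslash (`\\` -> `\`, `\"` -> `"`).
            if let Some(escaped) = chars.next() {
                out.push(escaped);
            }
        } else {
            out.push(ch);
        }
    }
    Cow::Owned(out)
}
```

The grammar action passes the text between the quotes, `&l[1..l.len()-1]`, to this function and converts the result into the string stored in Lit.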