Syntax Definition
Synopsis
Syntax Definitions allow the definition of parsers for programming languages, domain-specific languages and data formats.
Syntax
start syntax Nonterminal = Alternatives;
lexical Nonterminal = Alternatives;
layout Nonterminal = Alternatives;
keyword Nonterminal = Alternatives;
where Start is either start
or nothing, and Alternatives are one of:
Tags Symbols
Tags left Symbols ❶
Tags right Symbols ❶
Tags non-assoc Symbols ❶
Tags left Name : Symbols ❷
Tags right Name : Symbols ❷
Tags non-assoc Name : Symbols ❷
left ( Alternatives ) ❸
right ( Alternatives ) ❸
non-assoc ( Alternatives ) ❸
Alternatives₁ | Alternatives₂ ❹
Alternatives₁ > Alternatives₂ ❺
Here Associativity is nothing, or one of assoc
, left
, right
or non-assoc
, and Tags are a possibly empty list of Tags.
- ❶ for binary recursive expression operators, Associativity disambiguates and chooses binding to the
left
,right
, ornon-assoc
(demanding brackets) - ❷ a
Name
provides the algebraic constructor name for an abstract syntax definition derived from this grammar - ❸ an Associativity group is not only a short-hand, but also derives the cross product associativity rule between all members of the group.
- ❹ this is the normal BNF-like rule alternative combinator
- ❺ this defines a transitively closed Priority relation for all left- or right-recursive binary or unary expression operators (see Priority )
Description
Rascal supports full context-free grammars for syntax definition. It generates scannerless parsers from these definitions. These parsers produce Parse Trees that can be further processed by Rascal using Concrete Syntax fragments in Patterns and Expressions, or they can be imploded to Algebraic Data Types.
There are four kinds of non-terminals that can be defined with slightly different characteristics.
Syntax non-terminals are general context-free non-terminals. This means that left-recursion, right-recursion, any of the regular expression Symbols and all kinds of Disambiguation can be used to define it. It is important to note that in between the Symbols that define a syntax non-terminal the locally defined layout non-terminal will be interleaved. For example, if you define
layout ML = [\ ]*;
andsyntax A = "a" "a"
, Rascal will modify the definition of A tosyntax A = "a" ML "a";
before generating a parser.Lexical non-terminals are very much like syntax non-terminals. However, the definition of a lexical is not modified with interleaved layout non-terminals. And, the structure of lexicals is not traversed by the visit statement and equality is checked between lexicals by checking the characters (not its structure) for equality.
Layout non-terminals are just like syntax non-terminals as well. However, they are used to preprocess all syntax definitions in the same module scope (see above).
Keyword non-terminals are not like syntax non-terminals. These only allow definition of enumeration of literal symbols and single character classes. Keyword non-terminals play an important role in the semantics of Disambiguation where some disambiguation constructs require finite, non-empty enumeration of strings. The prime example is the definition of reserved keywords.
Each alternative of a syntax definition is defined by a list of Symbols. Each of the Symbols can be labeled or not. The alternative of a defined syntax type may be labeled or not as well. With the label, additional operations are activated on the corresponding parse trees:
- The
is
operator is defined for labeled alternatives (see Operators). - The
has
operator is defined for labeled Symbols in the right-hand side (see Operators). - Action functions can be written to override the construction of a parse tree, using the label of an alternative as the function name
- Implode uses labeled alternatives to map to an Algebraic Data Type
Alternatives can be combined in a single Syntax Definition using the |
, >
and associativity combinators.
The latter two represent Disambiguation constructs that you should read more about. The |
is a short-hand for not having to repeat syntax A =
for every alternative of A
.
Alternatives can be named or not. The names are essential only if:
- you need to implode Parse Trees
- you need to use the
is
expression, as inmyStatement is ifThenElse
instead of using concrete pattern matching. - you want to write Actions that triggers on the construction of the alternative.
However, it is generally a good idea to name your rules even if you do not need them. Note that a name may be reused for different alternatives for a single non-terminal, provided that the lists of symbols for these "overloaded" alternatives use different non-terminal symbols. This implies that alternatives for lexicals generally do not use overloaded names because they are often defined only by regular expressions over terminal Symbols (literals and character classes).
The start modifier identifies the start of a grammar. The effect of a start modifier is that Rascal will generate an extra syntax definition before generating a parser that allows layout to before and after the start non-terminal. For example:
layout L = [\ ]*; start Program = Statement*;`
will produce syntax start[Program] = L Program top L;
.
Note that the start[Program]
type is now available in your program, and Parse Trees assigned to variable of that
type will allow access to the top field.
Examples
The following example makes use of practically all of the Syntax Definition features, except parse actions.
// layout is lists of whitespace characters
layout MyLayout = [\t\n\ \r\f]*;
// identifiers are characters of lowercase alphabet letters,
// not immediately preceded or followed by those (longest match)
// and not any of the reserved keywords
lexical Identifier = [a-z] !<< [a-z]+ !>> [a-z] \ MyKeywords;
// this defines the reserved keywords used in the definition of Identifier
keyword MyKeywords = "if" | "then" | "else" | "fi";
// here is a recursive definition of expressions
// using priority and associativity groups.
syntax Expression
= id: Identifier id
| null: "null"
| left multi: Expression l "*" Expression r
> left ( add: Expression l "+" Expression r
| sub: Expression l "-" Expression r
)
| bracket "(" Expression ")"
;
Benefits
- Modular and compositional.
- No grammar normalization or grammar factoring necessary.
- Generate a parser for any context-free grammar.
- Generated parsers are really fast (for general parsers).
- Powerful disambiguation constructs for common programming language disambiguation patterns.
- Data-dependent (context-sensitive) disambiguation via arbitrary functions.
- Embedding of concrete syntax fragments in Rascal programs
- Syntax Definitions follow the syntax and semantics of Algebraic Data Types quite closely.
Pitfalls
- Grammars may be ambiguous, so read about Disambiguation, Ambiguity Detection and Ambiguity Diagnosis.
- Static grammar checker is not implemented yet.