Symbol
Synopsis
The symbols that can occur in a syntax definition.
Syntax
Nonterminal symbols are identifier names that start with an uppercase letter.
Symbol | Description |
---|---|
Symbol fieldName | Any symbol can be labeled with a field name that starts with a lowercase letter |
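As a minimal sketch of field labels, the fragment below labels two symbols in a rule. The non-terminals `Id`, `Expression` and `Assignment` and the field names `lhs` and `rhs` are hypothetical and chosen only for illustration.

```rascal
// Hypothetical helper definitions so the fragment stands on its own.
lexical Id = [a-z]+ !>> [a-z];
syntax Expression = Id;

// Each labeled symbol is followed by a lowercase field name.
syntax Assignment = Id lhs ":=" Expression rhs;
```

Given a parse tree `a` of type `Assignment`, the labeled parts should then be reachable by name, for example `a.lhs` and `a.rhs`.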
The following literal symbols and character classes are defined:
Symbol | Description |
---|---|
"stringliteral" | Literal string |
'stringliteral' | Case-insensitive literal string |
[range₁ range₂ ... ] | Character class |
The following operations on character classes can be composed arbitrarily:
Class | Description |
---|---|
!Class | Complement of Class with respect to the Unicode universe of characters |
Class₁ - Class₂ | Difference of character classes Class₁ and Class₂ |
Class₁ \|\| Class₂ | Union of character classes Class₁ and Class₂ |
Class₁ && Class₂ | Intersection of character classes Class₁ and Class₂ |
(Class) | Brackets for defining application order of class operators |
The following regular expressions can be constructed over Symbols:
Symbol | Description |
---|---|
Symbol? | Optional Symbol |
Symbol+ | Non-empty list of Symbols |
Symbol* | Possibly empty list of Symbols |
{Symbol₁ Symbol₂}+ | Non-empty list of Symbol₁ separated by Symbol₂ |
{Symbol₁ Symbol₂}* | Possibly empty list of Symbol₁ separated by Symbol₂. |
(Symbol₁ Symbol₂ ... ) | Embedded sequence of symbols |
(Symbol₁ \| Symbol₂ \| ... ) | Embedded choice of alternative symbols |
() | The anonymous non-terminal for the language with the empty string |
Inline conditions (Disambiguations) can be added to symbols to constrain their acceptability:
Disambiguation | Description |
---|---|
Symbol $ | Symbol ends at end of line or end of file |
^Symbol | Symbol starts at the beginning of a line |
Symbol @ ColumnIndex | Symbol starts at a certain column index |
Symbol₁ >> Symbol₂ | Symbol₁ must be (directly) followed by Symbol₂ |
Symbol₁ !>> Symbol₂ | Symbol₁ must not be (directly) followed by Symbol₂ |
Symbol₁ << Symbol₂ | Symbol₂ must be (directly) preceded by Symbol₁ |
Symbol₁ !<< Symbol₂ | Symbol₂ must not be (directly) preceded by Symbol₁ |
Symbol₁ \ Symbol₂ | Symbol₁ must not be in the language defined by Symbol₂ |
Symbols can be composed arbitrarily.
Types
Every non-terminal symbol is a type.
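A small sketch of what this means in practice: a non-terminal name can be used wherever a type is expected, for instance as a function parameter type. The module name `Demo`, the non-terminal `Id` and the function `nameLength` are hypothetical.

```rascal
module Demo

import String;

// A hypothetical lexical non-terminal.
lexical Id = [a-z]+ !>> [a-z];

// Id is used here as an ordinary parameter type.
// Interpolating the parse tree into a string yields the text it covers.
int nameLength(Id name) = size("<name>");
```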
Description
The basic symbols are the non-terminal name and the labeled non-terminal name. These refer to the names defined by a Syntax Definition. You can use any defined non-terminal name in any other definition (lexical in syntax, syntax in lexical, etc.).
Then we have literals and character classes to define the terminals of a grammar.
When you use a literal such as `"begin"`, Rascal will produce a definition for it down to the character level before generating a parser: `syntax "begin" = [b][e][g][i][n];`. This effect will be visible in the Parse Trees produced by the parser. For case-insensitive literals you will see a similar effect; the use of `'begin'` produces `syntax 'begin' = [bB][eE][gG][iI][nN];`.
Character classes have the same escaping conventions as characters in a String literal, with two additions: spaces and newlines are meaningless inside a class and must be escaped to be included, and the `[` and `]` brackets as well as the dash `-` need escaping. For example, one writes `[\[ \] \ \n\-]` for a class that includes the open and close square brackets, a space, a newline and a dash. Character classes support ranges, as in `[a-zA-Z0-9]`. Please note about character classes that:
- The operations on character classes are executed before parser generation time. You will not find an explicit representation of these operations in Parse Trees, but rather their net effect as resulting character classes.
- Character classes are also ordered by Rascal, and overlapping ranges are merged before parsers are generated. Equality between character classes is checked after this canonicalization.
- Although all Symbols are type constructors, the character class operators are not allowed in types.
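The class operators from the Syntax section can be sketched as follows. All non-terminal names are hypothetical, and the comments describe the net character class one would expect after the operations are evaluated, following the description above.

```rascal
lexical Letter    = [a-z] || [A-Z];          // union: the same characters as [a-zA-Z]
lexical Consonant = [a-z] - [aeiou];         // difference: lowercase letters without vowels
lexical LowerHex  = [0-9a-z] && [0-9a-f];    // intersection: the same characters as [0-9a-f]
lexical NotDigit  = ![0-9];                  // complement: any character except a digit
lexical NoVowel   = !([aeiou] || [AEIOU]);   // brackets fix the order of application
lexical Merged    = [a-cb-f];                // overlapping ranges: canonicalized to [a-f]
```

Because the operators are resolved before parser generation, `LowerHex` and a directly written `[0-9a-f]` should be indistinguishable in the resulting parse trees.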
The other symbols either generate parts of a grammar for you, or they constrain the rules of the grammar so that it generates a smaller set of trees (see Disambiguations).
The generative symbols are referred to as the regular symbols. These are like named non-terminals, except that they are defined implicitly and interpreted by the parser generator to produce a parser that can recognize a symbol optionally, iteratively, alternatively, sequentially, etc. You also need to know this about the regular symbols:
- In Parse Trees you will find special nodes for the regular expression symbols that hide how these were recognized.
- Patterns using Concrete Syntax have special semantics for the regular symbols (list matching, separator handling, ignoring layout, etc.).
- Regular symbols are not allowed in keyword Syntax Definitions.
- Depending on their occurrence in a lexical, syntax or layout Syntax Definition, the semantics of regular symbols changes. In the syntax context, layout non-terminals will be woven into the regular symbol, but not in the lexical and layout contexts. For example, a `Symbol*` in a syntax definition such as `syntax X = A*;` will be processed to `syntax X = {A Layout}*;`. Similarly, `syntax X = {A B}+;` will be processed to `syntax X = {A (Layout B Layout)}+;`.
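The difference between the contexts can be sketched with one list in a syntax definition and the same list in a lexical definition. The names `Layout`, `A`, `X` and `Y` are hypothetical, and the whitespace definition is only one possible choice.

```rascal
// Whitespace as layout; the follow restriction makes it consume as much as possible.
layout Layout = [\ \t\n\r]* !>> [\ \t\n\r];

lexical A = "a";

// Syntax context: Layout is woven between the elements,
// so whitespace between the a's is expected to be accepted.
syntax X = A*;

// Lexical context: nothing is woven in,
// so the a's are expected to be directly adjacent.
lexical Y = A*;
```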
The constraint symbols exist specifically to deal with the fact that Rascal does not generate a scanner. There are no a priori disambiguation rules such as prefer keywords or longest match. Instead, you should use the constraint symbols to define the effect of keyword reservation and longest match.
- It is important to note that these constraints work on a character-by-character level in the input stream. So, a follow constraint such as `A >> [a-z]` means that the character immediately following a recognized A must be in the range `[a-z]`.
- Read more on the constraint symbols via Disambiguations.
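As a sketch of how the two scanner heuristics can be spelled out by hand, the fragment below combines a follow restriction with a keyword exclusion. The names `Reserved` and `Id` and the keyword set are hypothetical.

```rascal
// The reserved words, collected in a keyword definition.
keyword Reserved = "if" | "else" | "fi";

// !>> [a-z] : an Id may not be directly followed by another letter (longest match).
// \ Reserved: an Id may not be one of the reserved words (keyword reservation).
lexical Id = ([a-z]+ !>> [a-z]) \ Reserved;
```

The brackets make explicit that the follow restriction applies to the list of letters and the exclusion applies to the restricted result.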
Examples
A character class that defines all alphanumeric characters:
lexical AlphaNumeric = [a-zA-Z0-9];
A character class that defines anything except quotes:
lexical AnythingExceptQuote = ![\"];
An identifier class with longest match (cannot be followed immediately by [a-z]):
lexical Id = [a-z]+ !>> [a-z];
An identifier class with longest match and first match (cannot be preceded or followed by [a-z]):
lexical Id = [a-z] !<< [a-z]+ !>> [a-z];
An identifier class with some reserved keywords and longest match:
lexical Id = [a-z]+ !>> [a-z] \ "if" \ "else" \ "fi";
An optional else branch coded using sequence and optional symbols:
syntax Statement = "if" Expression "then" Statement ("else" Statement)? "fi";
A block of statements separated by semicolons:
syntax Statement = "{" {Statement ";"}* "}";
A declaration with an embedded list of alternative modifiers and a list of typed parameters:
syntax Declaration = ("public" | "private" | "static" | "final")* Type Id "(" {(Type Id) ","}* ")" Statement;
Benefits
- The symbol language is very expressive and can lead to short definitions of complex syntactic constructs.
- There is no built-in longest match for iterators, which makes syntax definitions open to languages that do not have longest match.
- There is no built-in keyword preference or reservation, which makes syntax definitions open to language composition and legacy languages.
Pitfalls
- Nesting too many symbols can make definitions hard to understand.
- Nesting too many symbols also makes pattern matching and term construction more complex. Extra non-terminals and rules with meaningful names can make a language specification more manageable.
- The lack of automatic longest match and prefer keyword heuristics (you have to define them yourself) sometimes leads to unexpected ambiguity. See Disambiguation.
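To illustrate the advice on nesting, here is one possible way to give the embedded choice and sequence of the Declaration example their own named non-terminals. `Modifier` and `Parameter` are hypothetical names, and the definitions of `Id`, `Type` and `Statement` are placeholder stubs added only to keep the fragment self-contained.

```rascal
// Placeholder stubs, not part of the original example.
lexical Id = [a-z]+ !>> [a-z];
syntax Type = "int" | "bool";
syntax Statement = "{" "}";

// The embedded choice and the embedded sequence get meaningful names.
syntax Modifier  = "public" | "private" | "static" | "final";
syntax Parameter = Type Id;

syntax Declaration = Modifier* Type Id "(" {Parameter ","}* ")" Statement;
```

With named non-terminals, patterns can refer to `Modifier` and `Parameter` directly instead of matching on anonymous embedded symbols.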