Skip to main content

Syntax Definition and Parsing

rascal-0.40.17

Synopsis

Syntax definition and parser generation for new languages.

Description

All source code analysis projects need to extract information directly from the source code. There are two main approaches to this:

  • Lexical information: Use regular expressions to extract useful, but somewhat superficial, flat, information. This can be achieved using regular expression patterns, see Regular Expression Patterns.
  • Structured information: Use syntax analysis to extract the complete, nested, structure of the source code in the form of a syntax tree. Rascal can directly manipulate the parse trees, but it also enables user-defined mappings from parse tree to abstract syntax tree.

Using Syntax Definitions you can define the syntax of any (programming) language. Then Rascal will:

Examples

Let's use the Exp language as example. It contains the following elements:

  • Integer constants, e.g., 123.
  • A multiplication operator, e.g., 3*4.
  • An addition operator, e.g., 3+4.
  • Multiplication is left-associative and has precedence over addition.
  • Addition is left-associative.
  • Parentheses can be used to override the precedence of the operators.

Here are some examples:

  • 123
  • 2+3+4
  • 2+3*4
  • (2+3)*4

The EXP language can be defined as follows:

lexical IntegerLiteral = [0-9]+; 

start syntax Exp
= IntegerLiteral
| bracket "(" Exp ")"
> left Exp "*" Exp
> left Exp "+" Exp
;

layout Whitespace
= [\ \t\n]*;

Now you may parse and manipulate programs in the EXP language. Let's demonstrate parsing an expression:

rascal>import ParseTree;
ok
rascal>parse(#start[Exp], "2+3*4");
start[Exp]: (start[Exp]) `2+3*4`
rascal>t = [Exp] "1+2*3";
Exp: (Exp) `1+2*3`
rascal>import vis::Text;
ok
rascal>import IO;
ok
rascal>println(prettyTree(t))
Exp = Exp "+" Exp
├─ Exp = IntegerLiteral
│ └─ IntegerLiteral = [0-9]+
│ └─ [0-9]+
│ └─ 1
└─ Exp = Exp "*" Exp
├─ Exp = IntegerLiteral
│ └─ IntegerLiteral = [0-9]+
│ └─ [0-9]+
│ └─ 2
└─ Exp = IntegerLiteral
└─ IntegerLiteral = [0-9]+
└─ [0-9]+
└─ 3
ok

First we import the syntax definition and the link:/Libraries/Prelude-ParseTree[ParseTree] module that provides the parsing functionality. Finally, we parse 2+3*4 using the start symbol Exp.

See the EXP demo for a more extensive presentation of the EXP language and Languages for other language examples.

Benefits

  • Rascal grammars are relatively easy to read and write (unfortunately, writing grammars will never become simple).
  • Parser generation is completely implicit.
  • Given a syntax definition, it can be used immediately for parsing.

Pitfalls

  • Grammars can be ambiguous and then Rascal produces multiple parse trees instead of one.