Indentation
Nim is an indentation-based language, like Python. This means that the structure of the code is determined by the indentation level of the lines, and that whitespace is significant. Consider the following example:
proc foo() = # ──────────────────┐0
for i in 0 ..< 10: # ─────────────┐1 │
echo i # ────────┐2 │ │
if i mod 2 == 0: # │ │ │
echo "even" # ── 3 │ │ │
else: # │ │ │
echo "odd" # ── 3 ◄──┘ ◄──┘ ◄──┘
There are four levels of indentation here. The first level is the global scope, which is at indentation 0. Block-defining elements like proc
, for
, and if
statements increase the indentation level. Typically, Nim code uses 2 spaces for indentation, but this is not a strict requirement. And although not recommended, we can use different number of spaces for different levels of indentation, as long as the indentation level within the same block is consistent.
As one might expect, parsing indentation-based languages is more difficult than parsing languages with explicit block delimiters like curly braces. We need to keep track of the indentation level of each line, and we often need to compare the indentation level of a line with the previous lines to determine the structure of the code.
There are two places where we can track indentation: in the lexer or in the parser.
- If we track indentation in the lexer, we need to recognize the indentation level of each line and emit tokens that represent indentation changes, as well as for staying in the same indentation level. These tokens are then consumed by the parser grammar to delimit blocks.
- If we track indentation in the parser, we to be able to access the length of the leading whitespace of the first token of each line. This is not always possible with lexer generators, as they typically don't provide metadata about the tokens they emit. Also, using a parser generator like Grammar-Kit makes it even more challenging to define rules based on the leading whitespace of a token.
While there is a way in Grammar-Kit to write custom utility functions that can be used as grammar rules (which we can use to access the underlying text and calculate the indentation levels), it is not a straightforward process. So, I'm going to take the route of tracking indentation in the lexer. This comes with a different problem though: leading whitespace becomes part of the token stream, and is not ignored by the parser anymore. The grammar rules have to account for cases where indentation tokens are present, when we wish to treat them as whitespace (e.g. at the end of the file).
Indentation Tokens
We will need three kinds of indentation tokens: IND
(for increasing indentation), DED
(for decreasing indentation), and EQD
(for staying at the same indentation). The reason for the EQD
token is that we will use it as a separator between successive statements, much the same way as a semicolon in C-like languages.
Let's use these tokens with the example above and see what we need to emit at indentation token(s) we need to emit at each line:
proc foo() =
for i in 0 ..< 10: # IND
echo i # IND
if i mod 2 == 0: # EQD
echo "even" # IND
else: # DED, EQD
echo "odd" # IND
# back to top level # DED, DED, DED
Most of the lines should be easy to understand. But let's look closer at lines 6 and 8 in particular, where we emit multiple tokens. On line 6, the first part of the if
/else
block ends, and we go back to the same level of the if
statement. So we emit both a DED
token (to close the if
block) and a EQD
token (to separate the if
block from the else
block). On line 8, we close the else
block and go back to the top level, which requires emitting three DED
tokens to close the else
block, the for
block, and the proc
block.
Lexer State Machine
We'll need to introduce a new BEGIN_LINE
state in our lexer, and switch to it once we encounter a new line. We'll also need a stack to keep track of the indentation levels we are in. The stack is initialized with indentation level 0. The following diagram shows the state machine we need to implement.
Here's how it works:
- We start in the
YYINITIAL
state (the default lexer state). - Since we always start at the beginning of a line, we switch to the
BEGIN_LINE
state. - In the
BEGIN_LINE
state, we treat any empty lines as whitespace. - In the
BEGIN_LINE
state, we use a regex to match the leading whitespace of the line and get its length. This is the indentation level of the line (which could be zero). - We have three cases:
- If the indentation level is greater than the top of the stack, we emit an
IND
token, push the new indentation level to the stack, and switch to theDEFAULT
state. - If the indentation level is equal to the top of the stack, we emit an
EQD
token, and switch back to theDEFAULT
state (we don't modify the stack). - The third case handles decrease in indentation, and is more involved than the other two. That's because the decrease in indentation could be insufficient relative to the parent block. So, the rule is:
- If there's more than one level of indentation on the stack, and the current indentation level is less than or equal to the second-to-top level of the stack, we emit a
DED
token, pop the stack, but we do not switch back to theDEFAULT
state just yet. Instead, we stay in theBEGIN_LINE
state, push back the whitespace text to the lexer, and let the lexer reprocess the line. This allows us to emit the correct number ofDED
tokens until we reach the correct indentation level.
- If there's more than one level of indentation on the stack, and the current indentation level is less than or equal to the second-to-top level of the stack, we emit a
- If none of the above cases are met, then we have a case where the decrease in indentation is insufficient, and we emit an
INVALID_IND
token.
- If the indentation level is greater than the top of the stack, we emit an
The logic is a bit intricate, especially for the decrease in indentation case, but it's the price we're willing to pay to keep the parser simple. Let's go ahead and add a processIndentation
method to our lexer so that we can call it while in the BEGIN_LINE
state.
// src/main/kotlin/khaledh/nimjet/lexer/Nim.flex
...
// lexer class code
%{
private Stack<Integer> indentStack = new Stack<>();
%}
%init{
// initial indentation level is 0
indentStack.push(0);
%init}
%{
private IElementType processIndent() {
int currIndent = yylength();
if (currIndent > indentStack.peek()) {
// new indent level
indentStack.push(currIndent);
yybegin(DEFAULT);
return NimToken.IND;
}
if (currIndent == indentStack.peek()) {
// same indent level
yybegin(DEFAULT);
return NimToken.EQD;
}
if (indentStack.size() > 1 && currIndent <= indentStack.get(indentStack.size() - 2)) {
// We can only dedent one level at a time, so don't switch back to DEFAULT just yet,
// and keep returning DED tokens as long as there's more dedent levels.
indentStack.pop();
yypushback(yylength());
return NimToken.DED;
}
// invalid indentation
return NimToken.INVALID_IND;
}
%}
// lexer states
%state DEFAULT
%state BEGIN_LINE
// macros
EOL = \r\n|\r|\n
%%
<YYINITIAL> [^] { yypushback(1); yybegin(BEGIN_LINE); }
<DEFAULT> {
...
{EOL} { yybegin(BEGIN_LINE); return TokenType.WHITE_SPACE; }
...
}
<BEGIN_LINE> {
[ \t]*{EOL} { return TokenType.WHITE_SPACE; /* skip empty lines */ }
[ \t]* { return processIndent(); }
}
When I tested this on a code sample, the IDE threw an error saying "Lexer is not progressing after calling advance()"
. At first, I was puzzled by this, so I fired up the debugger and traced the code. It turns out that the IDE tries to validate that the lexer is making progress by ensuring that it doesn't produce the same token multiple times in a row at the same location while in the same state. Unfortunately, that's exactly what we're trying to do when we need to emit multiple DED
tokens in a row when the indentation decreases more than one level.
To work around this issue, I created two identical copies of the BEGIN_LINE
state: BEGIN_LINE
and BEGIN_LINE_2
, and updated the DED
case to toggle between the two states. It's a hack, but it works.
// src/main/kotlin/khaledh/nimjet/lexer/Nim.flex
...
// lexer class code
%{
...
private IElementType processIndent() {
...
if (indentStack.size() > 1 && currIndent <= indentStack.get(indentStack.size() - 2)) {
// We can only dedent one level at a time, so don't switch back to DEFAULT just yet,
// and keep returning DED tokens as long as there's more dedent levels.
//
// Also, IntelliJ's lexer validation doesn't like returning the same token multiple
// times in a row at the same location while in the same state (throws an exception
// about "Lexer is not progressing"), so as a workaround we toggle between two
// identical states to avoid this issue.
int nextState = yystate() == BEGIN_LINE ? BEGIN_LINE_2 : BEGIN_LINE;
yybegin(nextState);
indentStack.pop();
yypushback(yylength());
return NimToken.DED;
}
}
%}
// lexer states
%state BEGIN_LINE BEGIN_LINE_2
...
%%
// We use two identical states to avoid a lexer validation issue; see processIndent.
<BEGIN_LINE, BEGIN_LINE_2>
[ \t]*{EOL} { return TokenType.WHITE_SPACE; }
[ \t]* { return processIndent(); }
}
This keeps the lexer validation happy, and we can now emit multiple DED
tokens in a row from the same location.
Parsing Top-Level Code
Now that we have the indentation tokens in the token stream, we need to modify our parser to use them to delimit blocks. The first obvious change is to delimit the top-level statement list with EQD
tokens, since all top-level statements should be at indentation level 0. Let's modify the StmtList
rule to account for this.
// src/main/kotlin/khaledh/nimjet/parser/Nim.bnf
{
...
}
Module ::= !<<eof>> StmtList
StmtList ::= Stmt (EQD Stmt)*
Stmt ::= LetSection
| Command
LetSection ::= LET IdentDecl EQ STRING_LIT
Command ::= IdentRef IdentRef
IdentDecl ::= IDENT
IdentRef ::= IDENT
Let's generate the parser (Cmd+Shit+G
) and try it out.
Although there are no visible changes in the PSI tree, the code is parsed correctly as expected. Unfortunately we don't see the EQD
token element in the tree as I was expecting. But when I debug the lexer I see it emitting the token. My assumption is that the PSI tree builder doesn't allow multiple elements at the same location, as in this case where the EQD
token occupies the same start location as the second Stmt
element, and uses the last element (in sibling order) as the element for that location. It's a bit annoying not to be able to verify the existence of token, but it doesn't impact the correctness of the parsed tree. In fact, it could be considered a good thing, since, conceptually, indentation elements should be considered whitespace, which normally doesn't show up in the tree.
This works, but there's a few issues we need to address. Let's introduce an empty line at the beginning of the file.
We get an error saying that <stmt list>
was expected at the beginning of the second line. The reason is that the lexer emitted an EQD
token for the empty line, which is not expected by the parser at that location. We can easily fix this by skipping whitespace at the beginning of the file. Instead of introducing more rules to handle this, we can just use a flag firstNonEmptyLine
(defaults to true
), to decide whether we're processing the first line or not, and act accordingly.
// src/main/kotlin/khaledh/nimjet/lexer/Nim.flex
...
%{
private Stack<Integer> indentStack = new Stack<>();
private Boolean firstNonEmptyLine = true;
%}
...
%{
private IElementType processIndent() {
int currIndent = yylength();
// we don't want to emit EQD for the first non-empty line
if (firstNonEmptyLine && currIndent == 0) {
firstNonEmptyLine = false;
yybegin(DEFAULT);
return TokenType.WHITE_SPACE;
}
...
%}
This should fix the issue with the first non-empty line. Let's test it out.
Great! Let's also test the case where the first non-empty line has leading whitespace.
We get an error as expected! Our lexer and parser can now recognize and handle top-level indentation correctly.
Parsing Blocks
Now that we have the top-level statement list working, let's move on to parsing blocks with actual indentation. The simplest construct in Nim that uses indentation is the block
statement, which contains an indented StmtList
inside it, including nested block
statements.
Let's start by adding a BLOCK
token to our lexer (and define the corresponding token in NimToken
).
// src/main/kotlin/khaledh/nimjet/lexer/Nim.flex
...
<DEFAULT> {
...
"let" { return NimToken.LET; }
"block" { return NimToken.BLOCK; }
...
}
Next, let's add a new rule for BlockStmt
in our parser.
// src/main/kotlin/khaledh/nimjet/parser/Nim.bnf
...
BlockStmt ::= BLOCK COLON IND StmtList DED
The rule is simple: a block
statement starts with the block
keyword, followed by a colon, and a StmtList
enclosed between IND
and DED
tokens. Let's generate the parser and test it out.
We get an error saying that the parser was expecting either a DED
or EQD
at the end. That's because the file ends right after the last statement, and so there are no indentation tokens to close the block. This is a problem we cannot solve at the lexer level, unfortunately. What we can do, is to allow blocks to end with either a DED
token or the <<eof>>
special marker. Since we're going to need this for other rules, let's introduce a private rule DED_OR_EOF
that matches either a DED
token or the end of file, and use it in places where we expect a DED
token. A private rule means it doesn't get a dedicated node in the PSI tree.
// src/main/kotlin/khaledh/nimjet/parser/Nim.bnf
...
BlockStmt ::= BLOCK COLON IND StmtList ded_or_eof
private ded_or_eof ::= DED | <<eof>>
This solves the problem of blocks ending at the end of the file. Let's test another scenario where we create a new line that has the same indentation level inside the block, but that doesn't contain a statement, i.e. the end of file comes right after the leading whitespace.
We get an error that says <stmt>
expected. This is an issue similar to the one above. The lexer had emitted an EQD
token at that location, and so the parser expects a statement to come after it. While the behaviour here is correct, and can be fixed by expecting the user to remove the leading whitespace, it's not a good experience. Users expect empty lines anywhere in the file to be ignored.
We can do something similar to allowing blocks to end with <<eof>>
by modifying the StmtList
rule to make the Stmt
instance that comes after EQD
optional, but this situation might come up in other places in the future as well. So it's better to solve it at the lexer level by treating a line with only whitespace at the end of the file as whitespace.
Unfortunately, there's no official way to match the end of file as part of a regex in JFlex. There's an <<EOF>>
rule that matches the end of file, but it can only be used alone, and not as part of a regex. If we try to use it alone, it would be too late, since the leading whitespace according to our indentation rules, and we may have already emitted an indentation token. So, I'm going to rely on a private variable in the generated lexer class called zzEndRead
to get the end of file position (I know, I shouldn't use unofficial features, but I have no other way), and use it to determine if the end of the current line is at the end of the file. If it is, we return a whitespace token.
// src/main/kotlin/khaledh/nimjet/lexer/Nim.flex
%{
...
private IElementType processIndent() {
int currIndent = yylength();
// handle a line with only whitespace at the end of the file
if (getTokenEnd() == zzEndRead) {
return TokenType.WHITE_SPACE;
}
...
}
...
This should solve the issue, treating the last line as whitespace if it's empty. The parser should be happy, since there's no extra indentation tokens at the end of the file.
With this modification in place, we can actually revisit the situation that required us to add the ded_or_eof
rule to handle the end of file in blocks. We can now add a lexer rule to match the EOF in any state, and pop all the remaining indentation levels from the stack, emitting an DED
for each of them. The reason we couldn't do this before is that, in the case where the last line contains only whitespace, we would have emitted a EQD
first, followed by the DED
tokens from the stack, which would have been incorrect. But now that we ignore the last line if it's empty, we can safely emit those DED
tokens once we encounter the end of file.
Let's add a new method processEof()
to our lexer to handle this, and call it when we encounter the end of file from any state.
%{
...
private IElementType processEof() {
// return DED tokens (one at a time) for all remaining indent levels
if (indentStack.size() > 1) {
indentStack.pop();
return NimToken.DED;
}
return null;
}
%}
...
<<EOF>> { return processEof(); }
processEof()
will be called multiple times (lexer stays at EOF) until the stack is empty, at which point we return null
to indicate that there are no more tokens.
If we test this we run into the same issue as before, where the lexer doesn't progress because it's returning the same token (DED
) multiple times in a row at the same location. So, we need to use the same trick as before, and toggle between two identical states to avoid this issue.
%{
...
private IElementType processEof() {
// return DED tokens (one at a time) for all remaining indent levels
if (indentStack.size() > 1) {
indentStack.pop();
int nextState = yystate() == AT_EOF ? AT_EOF_2 : AT_EOF;
yybegin(nextState);
return NimToken.DED;
}
return null;
}
%}
...
%states AT_EOF AT_EOF_2
%%
...
<<EOF>> { yybegin(AT_EOF); }
<AT_EOF, AT_EOF_2> <<EOF>> { return processEof(); }
This should handle the end of file issue correctly, emitting the correct number of DED
tokens to close any open blocks.
Now, let's remove the ded_or_eof
rule from the BlockStmt
rule, and revert it back to its simpler form.
BlockStmt ::= BLOCK COLON IND StmtList ded_or_eof
BlockStmt ::= BLOCK COLON IND StmtList DED
private ded_or_eof ::= DED | <<eof>>
Much better!
I believe this takes care of all the indentation issues. Not bad; for a few dozen lines of lexer code we have a working indentation-based parser. We can now focus on adding more grammar rules that build on top of this.