The first step in a modern compiler is the so-called Lexical Analysis. This is the piece of software that takes the source code as input (i.e., the code that must be compiled) and generates a list of tokens.
The Lexical Analyzer in a Compiler performs two tasks:
- Divide the input string (that’s the source code) into its units, such as variables, operators, keywords, etc. These are called lexemes.
- Assign each lexeme to its class. Classes are, for instance, identifier, keyword, operator, etc.
This means that each Token in the final list (generated by the Lexical Analyzer) is a pair <lexeme, class>.
In practical terms, and simplifying just a little bit, the first step corresponds to splitting the source code into pieces according to blank spaces, line feeds, tabs and whitespace in general. When that’s done, each of the extracted pieces must be assigned to a class, and this is what the second step does.
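As a naive illustration of the first step, you could simply split the source on whitespace. Here is a sketch in Python (note how it fails to separate the parentheses, which is why real Lexical Analyzers need smarter rules):
source = 'if (x == 3)'
print(source.split())  # ['if', '(x', '==', '3)'] -- '(x' and '3)' are not single lexemes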
The second step within Lexical Analysis is performed using Regular Expressions, so that the sequence of characters that forms each lexeme is matched with a token class.
The easiest case you can imagine is when the lexeme “if” is extracted in the first step, and then in the second step this lexeme is matched with the regular expression
keyword : "if" | "else" | "for" | "while" | "return" | ... // many more
Which you can implement in Python as
import re
pattern = r'^(if|else|for|while|return)$'
st = 'if'
print(bool(re.match(pattern, st))) # >> True
And so, the Lexical Analyzer will know that “keyword” is the class for “if”. Now let’s see a more useful example.
Example
Let’s consider this piece of code.
if (x == 3)
printf("x is three");
First, we must put ourselves into the Lexical Analyzer shoes. As input, we receive the following string
if (x == 3)\n printf("x is three");\n
Newlines are encoded as the character \n. Let’s ignore for the moment all issues concerning encoding, and assume everything is Unicode and that the Analyzer is aware of that. A reasonable output will be the following list of tokens.
<"if" , keyword>
<"(" , openPar>
<"x" , identifier>
<"==" , operator>
<"3" , integer>
<")" , closedPar>
<"printf" , keyword>
...
To produce this output, the Lexical Analyzer must:
- Split the source code into units (and this will be a list of strings such as “if”, “(”, “x”, …).
- Assign each of these strings to a class.
Here is where Regular Expressions come into play: the act of assigning strings to a class is done with a pattern-matching operation, where each string is matched to a regular expression.
Conceptually, this is similar to matching the string “if” with the pattern if|while|for|….
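To make the two steps concrete, here is a minimal tokenizer sketch in Python. It assumes the token classes used in the example above (keyword, identifier, integer, operator, openPar, closedPar); a production Lexical Analyzer would of course handle much more (string literals, error reporting, and so on):
import re
# One named group per token class; the order of the list matters.
token_spec = [
    ('keyword',    r'\b(?:if|else|for|while|return)\b'),
    ('identifier', r'[A-Za-z_]\w*'),
    ('integer',    r'\d+'),
    ('operator',   r'==|=|\+|-'),
    ('openPar',    r'\('),
    ('closedPar',  r'\)'),
    ('skip',       r'\s+'),  # whitespace separates lexemes, no token emitted
]
pattern = '|'.join(f'(?P<{name}>{regex})' for name, regex in token_spec)

def tokenize(source):
    for m in re.finditer(pattern, source):
        if m.lastgroup != 'skip':
            yield (m.group(), m.lastgroup)

print(list(tokenize('if (x == 3)')))
# [('if', 'keyword'), ('(', 'openPar'), ('x', 'identifier'),
#  ('==', 'operator'), ('3', 'integer'), (')', 'closedPar')]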
Regarding the above example, you should have at least one question. What happens to whitespaces?
The answer is: it depends on the language. For example, in Python white spaces are quite an important part of the language, and when I checked the official Python documentation about Lexical Analysis I discovered that there are special tokens called NEWLINE, INDENT and DEDENT. If you don’t know Python, just know that correct indentation is mandatory.
Take-home messages
The examples contain a few concepts that are extremely important to remember.
First, and foremost, during the Lexical Analysis everything is a character (or a string). The Lexical Analyzer receives a stream of characters and has to figure out all the rest by itself (tokens, classes, etc.).
Secondly, the objective of the Lexical Analysis is to generate a list of pairs <lexeme, class>. Formally, each of these pairs is called a Token. A lexeme is a piece of the source code (that is, a substring of the entire source code) that is matched to one of the classes defined by the language’s grammar.
Thirdly, lexemes are usually extracted from the source code according to rules that are specific to the language. Such rules are typically built around white spaces. After that, each lexeme is assigned to a class, using regular expressions.
One more point worth remembering is that a lexeme can match more than one class. To know what class is right in that particular place of the source code, the Lexical Analyzer needs to look a bit further down in the source code. This is called lookahead. Lookahead is BAD, but necessary. One objective when designing a new language is to minimize the lookahead needed.
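To see why lookahead is needed, consider a toy sketch in Python (the helper below is hypothetical, for illustration only): when the lexer reads the character =, it cannot emit a token yet, because the lexeme might be = (assignment) or == (comparison). It must peek one character ahead.
def next_token(source, i):
    if source[i] == '=':
        # Lookahead of one character: is this '=' or '=='?
        if i + 1 < len(source) and source[i + 1] == '=':
            return ('==', 'operator'), i + 2
        return ('=', 'operator'), i + 1
    raise ValueError('unexpected character')

print(next_token('==3', 0))  # (('==', 'operator'), 2)
print(next_token('=3', 0))   # (('=', 'operator'), 1)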
Finally, there is the question of why regular expressions can’t be used to express the whole programming language, instead of the lexeme-matching part only. The reason is that regular expressions are not powerful enough for that. Their expressivity is limited (quite weak, actually), and therefore you could not build a real programming language with them. Programming languages are built with a different tool, Context-Free Grammars.
So, regular expressions are not powerful enough. What does that mean?
Regular Expressions, Finally!
Let’s start with two points that are very important to understand the whole subject, but that are often postponed (or omitted, even) in many books and classes.
- When dealing with programming languages, we work inside an Alphabet. An Alphabet is a set of characters. For example, {“0”, “1”} is called the binary alphabet. Once we define our alphabet, everything we write in the source code can only be made of sequences of characters taken from that alphabet.
- Regular expressions define a language over the alphabet. A language is simply a subset of all possible strings that you can form with that alphabet.
Let’s make the concepts more concrete with one quick example.
Consider the binary alphabet {“0”, “1”}. What are the strings that you can write using this alphabet? They are all sequences of characters that, in each position, have either “0” or “1”. So, “0101010” is a valid string over this alphabet, and so are “01110001” and “1”.
When you define a regular expression, let’s say
Ones = 1*
then you are also defining the Regular Language called “Ones”. The “*” symbol is used to say “take what’s before this symbol and use it zero or more times”.
The difference between Regular Expressions and Regular Languages may seem subtle, so let’s clarify it.
The language Ones is both:
- A set of strings,
- A subset of all strings that you could form with the alphabet.
In particular, Ones is the set of strings, of any length, made by “1” characters only. So, “11” is a valid string in this language, and so is “1111”. But “1011” is not valid. It’s clear that the strings valid in this language form a subset of all strings you can build with the binary alphabet.
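You can check this behavior with Python’s re module (a quick sketch; note that the star allows zero repetitions, so the empty string belongs to Ones too):
import re
ones = re.compile(r'^1*$')
print(bool(ones.match('1111')))  # True
print(bool(ones.match('1011')))  # False
print(bool(ones.match('')))      # True: the empty sentence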
Let’s see another, more realistic example. ASCII is a well-known Alphabet. You can form a lot of languages with this Alphabet. For instance, the set of all valid C programs is a language over ASCII (although not a regular language because, as I said before, you can’t express something as complex as C code with regular expressions only).
Before you continue reading, fix these terms in your working memory:
- Alphabet (a set of characters).
- Language (a subset of all strings that you can form mixing characters in the Alphabet).
- Regular Language (a language that can be expressed with regular expressions).
- Sentence (a valid string in the language).
At this point you have already learnt what Regular Expressions do: they express constraints in the way you can mix characters from an Alphabet. And therefore they restrict the way you can build strings from that Alphabet.
That’s what Regular Expressions are: constraints. And as such, they limit the choices that you can make in the Alphabet to form sentences (strings).
Looking from another angle, Regular Expressions are also a way to check whether a given sentence is a valid string for that language. And, as you know by this point, this is why they are used in Compilers.
How to define a Regular Expression
Now that you have grasped all the concepts, let’s say you want to write down your own list of regular expressions describing the classes of the new programming language you are building.
There are a few things you have to add to your knowledge.
First of all, just to make the formalism funnier, you should use in your notes the meaning function $L$, which takes a regular expression and gives the set of valid sentences.
Then you should be aware of two basic cases.
- The $\epsilon$ regular expression. It generates a set with the empty sentence only, so $L(\epsilon) = \{\text{“”}\}$.
- The single-char regular expression. For each character c in the alphabet, $L(\text{‘c’}) = \{\text{“c”}\}$, which means that the regular expression defined by one character simply defines the one sentence given by that same character.
To make the second point more concrete, imagine that it is equivalent to writing the following regexp in Python:
import re
pattern = r'^c$'
re.match(pattern, some_string)
# some_string == 'c' would be an easier test!
One additional piece of notation you will have to know is that the + symbol, when used between two regular expressions, means OR. So c+d is a regular expression that says “accept either the single character c or the single character d, nothing else”. It’s not uncommon to use the pipe symbol, |, for the same purpose. You’d have in such a case c | d.
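In practical regex syntax (a sketch with Python’s re module), the formal c+d corresponds to the alternation c|d:
import re
pattern = re.compile(r'^(c|d)$')
for s in ['c', 'd', 'cd', 'e']:
    print(s, bool(pattern.match(s)))  # c True, d True, cd False, e False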
Along with the two basic cases, you should be aware of three compound operations that can be used to form more complex regular expressions. Given two regular expressions, A and B, we have:
- $L(A+B) = L(A) \cup L(B)$.
- $L(AB) = \{ab\, |\, a \in L(A),\, b \in L(B)\}$.
- $L(A^*) = \bigcup_{i\geq0}L(A^i)$, where $A^i = AA\cdots A$ (that is, $A$ repeated $i$ times).
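These three operations are easy to mimic in Python if you think of languages as sets of strings (a sketch of mine; the star must be truncated at some maximum length, since $L(A^*)$ is an infinite set):
def union(A, B):
    return A | B                          # L(A+B): set union

def concat(A, B):
    return {a + b for a in A for b in B}  # L(AB): all concatenations

def star(A, max_len=4):
    # Approximates L(A*) by keeping only strings up to max_len characters.
    result, current = {''}, {''}
    while True:
        current = {s for s in concat(current, A) if len(s) <= max_len}
        if current <= result:
            return result
        result |= current

bit = union({'0'}, {'1'})            # L(0 + 1)
print(sorted(star(bit, max_len=2)))  # ['', '0', '00', '01', '1', '10', '11']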
As surprising as it may sound, the two basic rules and the three compound operations are all you need to form regular expressions of any complexity. Let’s see a couple of examples, and everything will be clear.
Example
A few lines above I spoke about “all sentences you can form with the binary alphabet”. I was being very imprecise, and not formal at all in my description.
So, how can you catch all strings that can be generated with the binary alphabet, in one single regular expression?
What about the following one?
(1 + 0)*
Let’s analyze this regexp. By definition of the * operation we have
$$ (1 + 0)^* = \bigcup_{i\geq 0}(1 + 0)^i $$
Let’s make clear what we already know:
- The exponent notation means repeating the regular expression, in this case $(1+0)$, $i$ times.
- $1+0$ means: take the character 1 or the character 0 (same as 0|1).
So when $i$ equals $2$ the original regular expression is equivalent to
(1+0)(1+0)
that means: in the first position take either $1$ or $0$; in the second position take either $1$ or $0$.
Since the star-operation runs for $i$ from zero to infinity, this means that we are taking strings of any length (from zero to infinity) where at each position we make the choice given by the regexp (1+0). And that’s exactly the set of all strings you can form with the binary alphabet.
In fact, if you take all the characters in the alphabet, put the + notation between them, and then apply the star-operation, you will have the whole set of possible strings that can be formed with those characters.
For instance, if the Alphabet is $\{0, 1, 2\}$ then
(0 + 1 + 2)*
will be the set of every possible sentence you can make with those three characters.
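In practical regex syntax this becomes, for instance in Python (a sketch):
import re
pattern = re.compile(r'^(0|1|2)*$')
print(bool(pattern.match('0120210')))  # True
print(bool(pattern.match('')))         # True: zero repetitions
print(bool(pattern.match('0123')))     # False: '3' is not in the alphabet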
Example
A famous example, used in many textbooks and classes, is about a real programming language, Pascal, and how floating point numbers are defined there. Let’s check it out.
num : digits opt_fraction opt_exponent
digits : digit+
digit : '0' + '1' + '2' + '3' + '4' + '5' + '6' + '7' + '8' + '9'
opt_fraction : ('.' digits) + eps
opt_exponent : ('E' + ('+' + '-' + eps) digits) + eps
The whole definition is made of five regular expressions. Let’s understand each of them.
The first thing you should notice is that the definitions are nested into one another. You can read the whole thing this way:
- A number is made of digits followed by an optional fraction followed by an optional exponent.
- Digits are made of one or more digit characters.
- Each digit is one character among ten choices (the ten digits, indeed).
- The optional fraction is made of a dot . followed by some digits.
- The optional exponent is made of the character E, optionally followed by one character among + and -, followed by digits.
If you stop for a moment and think about this definition, you will see that it includes things like
1137
173.18319
14.189E43
19E-10
These are all valid num, according to the regular expressions above. We would say that these strings are accepted: they belong to the regular language defined by the regular expressions.
It is worth mentioning one more point. You might have noticed the + eps part in the regular expressions above. You already know that + means OR (in the sense of the logically inclusive operator), hence it should be easy to see that + eps means “either what comes before, or the eps-regexp”. Since eps is the empty regular expression, this whole notation means that the argument is optional. It is not a coincidence that they are called optional fraction and optional exponent!
In modern regular expression tools there is an easier way to express optional strings: adding a trailing ? character. So, the fourth regular expression above would become
opt_fraction : ('.' digits)?
with the same meaning as before.
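To double-check the definition, here is a sketch of the same num translated into Python’s regex syntax, using the trailing ? for the two optional parts (the translation is mine, not Pascal’s official grammar):
import re
digits = r'[0-9]+'
num = re.compile(rf'^{digits}(\.{digits})?(E[+-]?{digits})?$')
for s in ['1137', '173.18319', '14.189E43', '19E-10', '14.']:
    print(s, bool(num.match(s)))  # the first four are valid num, '14.' is not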
Two more important details
I believe you have two questions at this point.
- The list of regular expressions is typically very long. What happens if a sentence matches more than one regular expression in this list?
- On the other hand, what happens if a sentence does not match any regular expression in the list?
The answer to the first question is rather simple: the order matters. More precisely, the first regular expression (from the top to the bottom) that is matched by the sentence is considered the right one.
The second question requires a bit of thinking. First of all, what does it mean that none of the regular expressions is matched? It means there’s a mistake in the code, like a typo on the keyboard, such as wihle instead of while.
In general there can be all sorts of mistakes in the code. For a compiler, a good error-handling procedure is extremely important.
Taking advantage of the top-down priority rule that I said before, a very common trick is to place at the bottom of the regular expression lists a few more regular expressions that try to catch possible mistakes made by the programmer. Since they are at the bottom, they will only be used when none of the “real” regular expressions is matched, and then the error can be handled in a custom way.
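Here is a sketch of both ideas in Python: the top-to-bottom priority rule, plus a catch-all error pattern at the bottom (class names and patterns are illustrative, not taken from any real compiler):
import re
rules = [
    ('keyword',    r'if|else|for|while|return'),
    ('identifier', r'[A-Za-z_]\w*'),
    ('integer',    r'\d+'),
    ('error',      r'.+'),  # matches anything: used only when all else fails
]

def classify(lexeme):
    for name, regex in rules:  # the first match, from the top, wins
        if re.fullmatch(regex, lexeme):
            return name

print(classify('if'))   # keyword: listed before identifier, so it wins
print(classify('x3'))   # identifier
print(classify('3x$'))  # error: none of the "real" patterns matched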
What’s next
Lexical Analysis is just the first step in writing a compiler for a programming language. The remaining steps are Parsing, Semantic Analysis, Optimization (optional) and Code Generation.
A good understanding of regular expressions is necessary not only to implement a Lexical Analyzer, but also to get a better understanding of how programming languages work and of the algorithms behind compilers and interpreters.
Programming languages are the tools programmers and software developers use the most, and that is what makes a good understanding of them so important. If you know how they work inside, you are likely to write better code!
References
Python Lexical Analysis documentation page.
Compilers: Principles, Techniques, and Tools (the “Dragon Book”). Wikipedia page of the book.
Alex Aiken’s class Compilers at Stanford University. Available as a free MOOC online.