diff options
Diffstat (limited to 'docs/main/tutorial/LangImpl2.html')
-rw-r--r-- | docs/main/tutorial/LangImpl2.html | 1233 |
1 files changed, 1233 insertions, 0 deletions
diff --git a/docs/main/tutorial/LangImpl2.html b/docs/main/tutorial/LangImpl2.html new file mode 100644 index 0000000..9c13b48 --- /dev/null +++ b/docs/main/tutorial/LangImpl2.html @@ -0,0 +1,1233 @@ +<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" + "http://www.w3.org/TR/html4/strict.dtd"> + +<html> +<head> + <title>Kaleidoscope: Implementing a Parser and AST</title> + <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> + <meta name="author" content="Chris Lattner"> + <link rel="stylesheet" href="../llvm.css" type="text/css"> +</head> + +<body> + +<div class="doc_title">Kaleidoscope: Implementing a Parser and AST</div> + +<ul> +<li><a href="index.html">Up to Tutorial Index</a></li> +<li>Chapter 2 + <ol> + <li><a href="#intro">Chapter 2 Introduction</a></li> + <li><a href="#ast">The Abstract Syntax Tree (AST)</a></li> + <li><a href="#parserbasics">Parser Basics</a></li> + <li><a href="#parserprimexprs">Basic Expression Parsing</a></li> + <li><a href="#parserbinops">Binary Expression Parsing</a></li> + <li><a href="#parsertop">Parsing the Rest</a></li> + <li><a href="#driver">The Driver</a></li> + <li><a href="#conclusions">Conclusions</a></li> + <li><a href="#code">Full Code Listing</a></li> + </ol> +</li> +<li><a href="LangImpl3.html">Chapter 3</a>: Code generation to LLVM IR</li> +</ul> + +<div class="doc_author"> + <p>Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a></p> +</div> + +<!-- *********************************************************************** --> +<div class="doc_section"><a name="intro">Chapter 2 Introduction</a></div> +<!-- *********************************************************************** --> + +<div class="doc_text"> + +<p>Welcome to Chapter 2 of the "<a href="index.html">Implementing a language +with LLVM</a>" tutorial. This chapter shows you how to use the lexer, built in +<a href="LangImpl1.html">Chapter 1</a>, to build a full <a +href="http://en.wikipedia.org/wiki/Parsing">parser</a> for +our Kaleidoscope language. Once we have a parser, we'll define and build an <a +href="http://en.wikipedia.org/wiki/Abstract_syntax_tree">Abstract Syntax +Tree</a> (AST).</p> + +<p>The parser we will build uses a combination of <a +href="http://en.wikipedia.org/wiki/Recursive_descent_parser">Recursive Descent +Parsing</a> and <a href= +"http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence +Parsing</a> to parse the Kaleidoscope language (the latter for +binary expressions and the former for everything else). Before we get to +parsing though, lets talk about the output of the parser: the Abstract Syntax +Tree.</p> + +</div> + +<!-- *********************************************************************** --> +<div class="doc_section"><a name="ast">The Abstract Syntax Tree (AST)</a></div> +<!-- *********************************************************************** --> + +<div class="doc_text"> + +<p>The AST for a program captures its behavior in such a way that it is easy for +later stages of the compiler (e.g. code generation) to interpret. We basically +want one object for each construct in the language, and the AST should closely +model the language. In Kaleidoscope, we have expressions, a prototype, and a +function object. We'll start with expressions first:</p> + +<div class="doc_code"> +<pre> +/// ExprAST - Base class for all expression nodes. +class ExprAST { +public: + virtual ~ExprAST() {} +}; + +/// NumberExprAST - Expression class for numeric literals like "1.0". +class NumberExprAST : public ExprAST { + double Val; +public: + NumberExprAST(double val) : Val(val) {} +}; +</pre> +</div> + +<p>The code above shows the definition of the base ExprAST class and one +subclass which we use for numeric literals. The important thing to note about +this code is that the NumberExprAST class captures the numeric value of the +literal as an instance variable. This allows later phases of the compiler to +know what the stored numeric value is.</p> + +<p>Right now we only create the AST, so there are no useful accessor methods on +them. It would be very easy to add a virtual method to pretty print the code, +for example. Here are the other expression AST node definitions that we'll use +in the basic form of the Kaleidoscope language: +</p> + +<div class="doc_code"> +<pre> +/// VariableExprAST - Expression class for referencing a variable, like "a". +class VariableExprAST : public ExprAST { + std::string Name; +public: + VariableExprAST(const std::string &name) : Name(name) {} +}; + +/// BinaryExprAST - Expression class for a binary operator. +class BinaryExprAST : public ExprAST { + char Op; + ExprAST *LHS, *RHS; +public: + BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs) + : Op(op), LHS(lhs), RHS(rhs) {} +}; + +/// CallExprAST - Expression class for function calls. +class CallExprAST : public ExprAST { + std::string Callee; + std::vector<ExprAST*> Args; +public: + CallExprAST(const std::string &callee, std::vector<ExprAST*> &args) + : Callee(callee), Args(args) {} +}; +</pre> +</div> + +<p>This is all (intentionally) rather straight-forward: variables capture the +variable name, binary operators capture their opcode (e.g. '+'), and calls +capture a function name as well as a list of any argument expressions. One thing +that is nice about our AST is that it captures the language features without +talking about the syntax of the language. Note that there is no discussion about +precedence of binary operators, lexical structure, etc.</p> + +<p>For our basic language, these are all of the expression nodes we'll define. +Because it doesn't have conditional control flow, it isn't Turing-complete; +we'll fix that in a later installment. The two things we need next are a way +to talk about the interface to a function, and a way to talk about functions +themselves:</p> + +<div class="doc_code"> +<pre> +/// PrototypeAST - This class represents the "prototype" for a function, +/// which captures its name, and its argument names (thus implicitly the number +/// of arguments the function takes). +class PrototypeAST { + std::string Name; + std::vector<std::string> Args; +public: + PrototypeAST(const std::string &name, const std::vector<std::string> &args) + : Name(name), Args(args) {} +}; + +/// FunctionAST - This class represents a function definition itself. +class FunctionAST { + PrototypeAST *Proto; + ExprAST *Body; +public: + FunctionAST(PrototypeAST *proto, ExprAST *body) + : Proto(proto), Body(body) {} +}; +</pre> +</div> + +<p>In Kaleidoscope, functions are typed with just a count of their arguments. +Since all values are double precision floating point, the type of each argument +doesn't need to be stored anywhere. In a more aggressive and realistic +language, the "ExprAST" class would probably have a type field.</p> + +<p>With this scaffolding, we can now talk about parsing expressions and function +bodies in Kaleidoscope.</p> + +</div> + +<!-- *********************************************************************** --> +<div class="doc_section"><a name="parserbasics">Parser Basics</a></div> +<!-- *********************************************************************** --> + +<div class="doc_text"> + +<p>Now that we have an AST to build, we need to define the parser code to build +it. The idea here is that we want to parse something like "x+y" (which is +returned as three tokens by the lexer) into an AST that could be generated with +calls like this:</p> + +<div class="doc_code"> +<pre> + ExprAST *X = new VariableExprAST("x"); + ExprAST *Y = new VariableExprAST("y"); + ExprAST *Result = new BinaryExprAST('+', X, Y); +</pre> +</div> + +<p>In order to do this, we'll start by defining some basic helper routines:</p> + +<div class="doc_code"> +<pre> +/// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current +/// token the parser is looking at. getNextToken reads another token from the +/// lexer and updates CurTok with its results. +static int CurTok; +static int getNextToken() { + return CurTok = gettok(); +} +</pre> +</div> + +<p> +This implements a simple token buffer around the lexer. This allows +us to look one token ahead at what the lexer is returning. Every function in +our parser will assume that CurTok is the current token that needs to be +parsed.</p> + +<div class="doc_code"> +<pre> + +/// Error* - These are little helper functions for error handling. +ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;} +PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; } +FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; } +</pre> +</div> + +<p> +The <tt>Error</tt> routines are simple helper routines that our parser will use +to handle errors. The error recovery in our parser will not be the best and +is not particular user-friendly, but it will be enough for our tutorial. These +routines make it easier to handle errors in routines that have various return +types: they always return null.</p> + +<p>With these basic helper functions, we can implement the first +piece of our grammar: numeric literals.</p> + +</div> + +<!-- *********************************************************************** --> +<div class="doc_section"><a name="parserprimexprs">Basic Expression + Parsing</a></div> +<!-- *********************************************************************** --> + +<div class="doc_text"> + +<p>We start with numeric literals, because they are the simplest to process. +For each production in our grammar, we'll define a function which parses that +production. For numeric literals, we have: +</p> + +<div class="doc_code"> +<pre> +/// numberexpr ::= number +static ExprAST *ParseNumberExpr() { + ExprAST *Result = new NumberExprAST(NumVal); + getNextToken(); // consume the number + return Result; +} +</pre> +</div> + +<p>This routine is very simple: it expects to be called when the current token +is a <tt>tok_number</tt> token. It takes the current number value, creates +a <tt>NumberExprAST</tt> node, advances the lexer to the next token, and finally +returns.</p> + +<p>There are some interesting aspects to this. The most important one is that +this routine eats all of the tokens that correspond to the production and +returns the lexer buffer with the next token (which is not part of the grammar +production) ready to go. This is a fairly standard way to go for recursive +descent parsers. For a better example, the parenthesis operator is defined like +this:</p> + +<div class="doc_code"> +<pre> +/// parenexpr ::= '(' expression ')' +static ExprAST *ParseParenExpr() { + getNextToken(); // eat (. + ExprAST *V = ParseExpression(); + if (!V) return 0; + + if (CurTok != ')') + return Error("expected ')'"); + getNextToken(); // eat ). + return V; +} +</pre> +</div> + +<p>This function illustrates a number of interesting things about the +parser:</p> + +<p> +1) It shows how we use the Error routines. When called, this function expects +that the current token is a '(' token, but after parsing the subexpression, it +is possible that there is no ')' waiting. For example, if the user types in +"(4 x" instead of "(4)", the parser should emit an error. Because errors can +occur, the parser needs a way to indicate that they happened: in our parser, we +return null on an error.</p> + +<p>2) Another interesting aspect of this function is that it uses recursion by +calling <tt>ParseExpression</tt> (we will soon see that <tt>ParseExpression</tt> can call +<tt>ParseParenExpr</tt>). This is powerful because it allows us to handle +recursive grammars, and keeps each production very simple. Note that +parentheses do not cause construction of AST nodes themselves. While we could +do it this way, the most important role of parentheses are to guide the parser +and provide grouping. Once the parser constructs the AST, parentheses are not +needed.</p> + +<p>The next simple production is for handling variable references and function +calls:</p> + +<div class="doc_code"> +<pre> +/// identifierexpr +/// ::= identifier +/// ::= identifier '(' expression* ')' +static ExprAST *ParseIdentifierExpr() { + std::string IdName = IdentifierStr; + + getNextToken(); // eat identifier. + + if (CurTok != '(') // Simple variable ref. + return new VariableExprAST(IdName); + + // Call. + getNextToken(); // eat ( + std::vector<ExprAST*> Args; + if (CurTok != ')') { + while (1) { + ExprAST *Arg = ParseExpression(); + if (!Arg) return 0; + Args.push_back(Arg); + + if (CurTok == ')') break; + + if (CurTok != ',') + return Error("Expected ')' or ',' in argument list"); + getNextToken(); + } + } + + // Eat the ')'. + getNextToken(); + + return new CallExprAST(IdName, Args); +} +</pre> +</div> + +<p>This routine follows the same style as the other routines. (It expects to be +called if the current token is a <tt>tok_identifier</tt> token). It also has +recursion and error handling. One interesting aspect of this is that it uses +<em>look-ahead</em> to determine if the current identifier is a stand alone +variable reference or if it is a function call expression. It handles this by +checking to see if the token after the identifier is a '(' token, constructing +either a <tt>VariableExprAST</tt> or <tt>CallExprAST</tt> node as appropriate. +</p> + +<p>Now that we have all of our simple expression-parsing logic in place, we can +define a helper function to wrap it together into one entry point. We call this +class of expressions "primary" expressions, for reasons that will become more +clear <a href="LangImpl6.html#unary">later in the tutorial</a>. In order to +parse an arbitrary primary expression, we need to determine what sort of +expression it is:</p> + +<div class="doc_code"> +<pre> +/// primary +/// ::= identifierexpr +/// ::= numberexpr +/// ::= parenexpr +static ExprAST *ParsePrimary() { + switch (CurTok) { + default: return Error("unknown token when expecting an expression"); + case tok_identifier: return ParseIdentifierExpr(); + case tok_number: return ParseNumberExpr(); + case '(': return ParseParenExpr(); + } +} +</pre> +</div> + +<p>Now that you see the definition of this function, it is more obvious why we +can assume the state of CurTok in the various functions. This uses look-ahead +to determine which sort of expression is being inspected, and then parses it +with a function call.</p> + +<p>Now that basic expressions are handled, we need to handle binary expressions. +They are a bit more complex.</p> + +</div> + +<!-- *********************************************************************** --> +<div class="doc_section"><a name="parserbinops">Binary Expression + Parsing</a></div> +<!-- *********************************************************************** --> + +<div class="doc_text"> + +<p>Binary expressions are significantly harder to parse because they are often +ambiguous. For example, when given the string "x+y*z", the parser can choose +to parse it as either "(x+y)*z" or "x+(y*z)". With common definitions from +mathematics, we expect the later parse, because "*" (multiplication) has +higher <em>precedence</em> than "+" (addition).</p> + +<p>There are many ways to handle this, but an elegant and efficient way is to +use <a href= +"http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence +Parsing</a>. This parsing technique uses the precedence of binary operators to +guide recursion. To start with, we need a table of precedences:</p> + +<div class="doc_code"> +<pre> +/// BinopPrecedence - This holds the precedence for each binary operator that is +/// defined. +static std::map<char, int> BinopPrecedence; + +/// GetTokPrecedence - Get the precedence of the pending binary operator token. +static int GetTokPrecedence() { + if (!isascii(CurTok)) + return -1; + + // Make sure it's a declared binop. + int TokPrec = BinopPrecedence[CurTok]; + if (TokPrec <= 0) return -1; + return TokPrec; +} + +int main() { + // Install standard binary operators. + // 1 is lowest precedence. + BinopPrecedence['<'] = 10; + BinopPrecedence['+'] = 20; + BinopPrecedence['-'] = 20; + BinopPrecedence['*'] = 40; // highest. + ... +} +</pre> +</div> + +<p>For the basic form of Kaleidoscope, we will only support 4 binary operators +(this can obviously be extended by you, our brave and intrepid reader). The +<tt>GetTokPrecedence</tt> function returns the precedence for the current token, +or -1 if the token is not a binary operator. Having a map makes it easy to add +new operators and makes it clear that the algorithm doesn't depend on the +specific operators involved, but it would be easy enough to eliminate the map +and do the comparisons in the <tt>GetTokPrecedence</tt> function. (Or just use +a fixed-size array).</p> + +<p>With the helper above defined, we can now start parsing binary expressions. +The basic idea of operator precedence parsing is to break down an expression +with potentially ambiguous binary operators into pieces. Consider ,for example, +the expression "a+b+(c+d)*e*f+g". Operator precedence parsing considers this +as a stream of primary expressions separated by binary operators. As such, +it will first parse the leading primary expression "a", then it will see the +pairs [+, b] [+, (c+d)] [*, e] [*, f] and [+, g]. Note that because parentheses +are primary expressions, the binary expression parser doesn't need to worry +about nested subexpressions like (c+d) at all. +</p> + +<p> +To start, an expression is a primary expression potentially followed by a +sequence of [binop,primaryexpr] pairs:</p> + +<div class="doc_code"> +<pre> +/// expression +/// ::= primary binoprhs +/// +static ExprAST *ParseExpression() { + ExprAST *LHS = ParsePrimary(); + if (!LHS) return 0; + + return ParseBinOpRHS(0, LHS); +} +</pre> +</div> + +<p><tt>ParseBinOpRHS</tt> is the function that parses the sequence of pairs for +us. It takes a precedence and a pointer to an expression for the part that has been +parsed so far. Note that "x" is a perfectly valid expression: As such, "binoprhs" is +allowed to be empty, in which case it returns the expression that is passed into +it. In our example above, the code passes the expression for "a" into +<tt>ParseBinOpRHS</tt> and the current token is "+".</p> + +<p>The precedence value passed into <tt>ParseBinOpRHS</tt> indicates the <em> +minimal operator precedence</em> that the function is allowed to eat. For +example, if the current pair stream is [+, x] and <tt>ParseBinOpRHS</tt> is +passed in a precedence of 40, it will not consume any tokens (because the +precedence of '+' is only 20). With this in mind, <tt>ParseBinOpRHS</tt> starts +with:</p> + +<div class="doc_code"> +<pre> +/// binoprhs +/// ::= ('+' primary)* +static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) { + // If this is a binop, find its precedence. + while (1) { + int TokPrec = GetTokPrecedence(); + + // If this is a binop that binds at least as tightly as the current binop, + // consume it, otherwise we are done. + if (TokPrec < ExprPrec) + return LHS; +</pre> +</div> + +<p>This code gets the precedence of the current token and checks to see if if is +too low. Because we defined invalid tokens to have a precedence of -1, this +check implicitly knows that the pair-stream ends when the token stream runs out +of binary operators. If this check succeeds, we know that the token is a binary +operator and that it will be included in this expression:</p> + +<div class="doc_code"> +<pre> + // Okay, we know this is a binop. + int BinOp = CurTok; + getNextToken(); // eat binop + + // Parse the primary expression after the binary operator. + ExprAST *RHS = ParsePrimary(); + if (!RHS) return 0; +</pre> +</div> + +<p>As such, this code eats (and remembers) the binary operator and then parses +the primary expression that follows. This builds up the whole pair, the first of +which is [+, b] for the running example.</p> + +<p>Now that we parsed the left-hand side of an expression and one pair of the +RHS sequence, we have to decide which way the expression associates. In +particular, we could have "(a+b) binop unparsed" or "a + (b binop unparsed)". +To determine this, we look ahead at "binop" to determine its precedence and +compare it to BinOp's precedence (which is '+' in this case):</p> + +<div class="doc_code"> +<pre> + // If BinOp binds less tightly with RHS than the operator after RHS, let + // the pending operator take RHS as its LHS. + int NextPrec = GetTokPrecedence(); + if (TokPrec < NextPrec) { +</pre> +</div> + +<p>If the precedence of the binop to the right of "RHS" is lower or equal to the +precedence of our current operator, then we know that the parentheses associate +as "(a+b) binop ...". In our example, the current operator is "+" and the next +operator is "+", we know that they have the same precedence. In this case we'll +create the AST node for "a+b", and then continue parsing:</p> + +<div class="doc_code"> +<pre> + ... if body omitted ... + } + + // Merge LHS/RHS. + LHS = new BinaryExprAST(BinOp, LHS, RHS); + } // loop around to the top of the while loop. +} +</pre> +</div> + +<p>In our example above, this will turn "a+b+" into "(a+b)" and execute the next +iteration of the loop, with "+" as the current token. The code above will eat, +remember, and parse "(c+d)" as the primary expression, which makes the +current pair equal to [+, (c+d)]. It will then evaluate the 'if' conditional above with +"*" as the binop to the right of the primary. In this case, the precedence of "*" is +higher than the precedence of "+" so the if condition will be entered.</p> + +<p>The critical question left here is "how can the if condition parse the right +hand side in full"? In particular, to build the AST correctly for our example, +it needs to get all of "(c+d)*e*f" as the RHS expression variable. The code to +do this is surprisingly simple (code from the above two blocks duplicated for +context):</p> + +<div class="doc_code"> +<pre> + // If BinOp binds less tightly with RHS than the operator after RHS, let + // the pending operator take RHS as its LHS. + int NextPrec = GetTokPrecedence(); + if (TokPrec < NextPrec) { + <b>RHS = ParseBinOpRHS(TokPrec+1, RHS); + if (RHS == 0) return 0;</b> + } + // Merge LHS/RHS. + LHS = new BinaryExprAST(BinOp, LHS, RHS); + } // loop around to the top of the while loop. +} +</pre> +</div> + +<p>At this point, we know that the binary operator to the RHS of our primary +has higher precedence than the binop we are currently parsing. As such, we know +that any sequence of pairs whose operators are all higher precedence than "+" +should be parsed together and returned as "RHS". To do this, we recursively +invoke the <tt>ParseBinOpRHS</tt> function specifying "TokPrec+1" as the minimum +precedence required for it to continue. In our example above, this will cause +it to return the AST node for "(c+d)*e*f" as RHS, which is then set as the RHS +of the '+' expression.</p> + +<p>Finally, on the next iteration of the while loop, the "+g" piece is parsed +and added to the AST. With this little bit of code (14 non-trivial lines), we +correctly handle fully general binary expression parsing in a very elegant way. +This was a whirlwind tour of this code, and it is somewhat subtle. I recommend +running through it with a few tough examples to see how it works. +</p> + +<p>This wraps up handling of expressions. At this point, we can point the +parser at an arbitrary token stream and build an expression from it, stopping +at the first token that is not part of the expression. Next up we need to +handle function definitions, etc.</p> + +</div> + +<!-- *********************************************************************** --> +<div class="doc_section"><a name="parsertop">Parsing the Rest</a></div> +<!-- *********************************************************************** --> + +<div class="doc_text"> + +<p> +The next thing missing is handling of function prototypes. In Kaleidoscope, +these are used both for 'extern' function declarations as well as function body +definitions. The code to do this is straight-forward and not very interesting +(once you've survived expressions): +</p> + +<div class="doc_code"> +<pre> +/// prototype +/// ::= id '(' id* ')' +static PrototypeAST *ParsePrototype() { + if (CurTok != tok_identifier) + return ErrorP("Expected function name in prototype"); + + std::string FnName = IdentifierStr; + getNextToken(); + + if (CurTok != '(') + return ErrorP("Expected '(' in prototype"); + + // Read the list of argument names. + std::vector<std::string> ArgNames; + while (getNextToken() == tok_identifier) + ArgNames.push_back(IdentifierStr); + if (CurTok != ')') + return ErrorP("Expected ')' in prototype"); + + // success. + getNextToken(); // eat ')'. + + return new PrototypeAST(FnName, ArgNames); +} +</pre> +</div> + +<p>Given this, a function definition is very simple, just a prototype plus +an expression to implement the body:</p> + +<div class="doc_code"> +<pre> +/// definition ::= 'def' prototype expression +static FunctionAST *ParseDefinition() { + getNextToken(); // eat def. + PrototypeAST *Proto = ParsePrototype(); + if (Proto == 0) return 0; + + if (ExprAST *E = ParseExpression()) + return new FunctionAST(Proto, E); + return 0; +} +</pre> +</div> + +<p>In addition, we support 'extern' to declare functions like 'sin' and 'cos' as +well as to support forward declaration of user functions. These 'extern's are just +prototypes with no body:</p> + +<div class="doc_code"> +<pre> +/// external ::= 'extern' prototype +static PrototypeAST *ParseExtern() { + getNextToken(); // eat extern. + return ParsePrototype(); +} +</pre> +</div> + +<p>Finally, we'll also let the user type in arbitrary top-level expressions and +evaluate them on the fly. We will handle this by defining anonymous nullary +(zero argument) functions for them:</p> + +<div class="doc_code"> +<pre> +/// toplevelexpr ::= expression +static FunctionAST *ParseTopLevelExpr() { + if (ExprAST *E = ParseExpression()) { + // Make an anonymous proto. + PrototypeAST *Proto = new PrototypeAST("", std::vector<std::string>()); + return new FunctionAST(Proto, E); + } + return 0; +} +</pre> +</div> + +<p>Now that we have all the pieces, let's build a little driver that will let us +actually <em>execute</em> this code we've built!</p> + +</div> + +<!-- *********************************************************************** --> +<div class="doc_section"><a name="driver">The Driver</a></div> +<!-- *********************************************************************** --> + +<div class="doc_text"> + +<p>The driver for this simply invokes all of the parsing pieces with a top-level +dispatch loop. There isn't much interesting here, so I'll just include the +top-level loop. See <a href="#code">below</a> for full code in the "Top-Level +Parsing" section.</p> + +<div class="doc_code"> +<pre> +/// top ::= definition | external | expression | ';' +static void MainLoop() { + while (1) { + fprintf(stderr, "ready> "); + switch (CurTok) { + case tok_eof: return; + case ';': getNextToken(); break; // ignore top-level semicolons. + case tok_def: HandleDefinition(); break; + case tok_extern: HandleExtern(); break; + default: HandleTopLevelExpression(); break; + } + } +} +</pre> +</div> + +<p>The most interesting part of this is that we ignore top-level semicolons. +Why is this, you ask? The basic reason is that if you type "4 + 5" at the +command line, the parser doesn't know whether that is the end of what you will type +or not. For example, on the next line you could type "def foo..." in which case +4+5 is the end of a top-level expression. Alternatively you could type "* 6", +which would continue the expression. Having top-level semicolons allows you to +type "4+5;", and the parser will know you are done.</p> + +</div> + +<!-- *********************************************************************** --> +<div class="doc_section"><a name="conclusions">Conclusions</a></div> +<!-- *********************************************************************** --> + +<div class="doc_text"> + +<p>With just under 400 lines of commented code (240 lines of non-comment, +non-blank code), we fully defined our minimal language, including a lexer, +parser, and AST builder. With this done, the executable will validate +Kaleidoscope code and tell us if it is grammatically invalid. For +example, here is a sample interaction:</p> + +<div class="doc_code"> +<pre> +$ <b>./a.out</b> +ready> <b>def foo(x y) x+foo(y, 4.0);</b> +Parsed a function definition. +ready> <b>def foo(x y) x+y y;</b> +Parsed a function definition. +Parsed a top-level expr +ready> <b>def foo(x y) x+y );</b> +Parsed a function definition. +Error: unknown token when expecting an expression +ready> <b>extern sin(a);</b> +ready> Parsed an extern +ready> <b>^D</b> +$ +</pre> +</div> + +<p>There is a lot of room for extension here. You can define new AST nodes, +extend the language in many ways, etc. In the <a href="LangImpl3.html">next +installment</a>, we will describe how to generate LLVM Intermediate +Representation (IR) from the AST.</p> + +</div> + +<!-- *********************************************************************** --> +<div class="doc_section"><a name="code">Full Code Listing</a></div> +<!-- *********************************************************************** --> + +<div class="doc_text"> + +<p> +Here is the complete code listing for this and the previous chapter. +Note that it is fully self-contained: you don't need LLVM or any external +libraries at all for this. (Besides the C and C++ standard libraries, of +course.) To build this, just compile with:</p> + +<div class="doc_code"> +<pre> + # Compile + g++ -g -O3 toy.cpp + # Run + ./a.out +</pre> +</div> + +<p>Here is the code:</p> + +<div class="doc_code"> +<pre> +#include <cstdio> +#include <cstdlib> +#include <string> +#include <map> +#include <vector> + +//===----------------------------------------------------------------------===// +// Lexer +//===----------------------------------------------------------------------===// + +// The lexer returns tokens [0-255] if it is an unknown character, otherwise one +// of these for known things. +enum Token { + tok_eof = -1, + + // commands + tok_def = -2, tok_extern = -3, + + // primary + tok_identifier = -4, tok_number = -5 +}; + +static std::string IdentifierStr; // Filled in if tok_identifier +static double NumVal; // Filled in if tok_number + +/// gettok - Return the next token from standard input. +static int gettok() { + static int LastChar = ' '; + + // Skip any whitespace. + while (isspace(LastChar)) + LastChar = getchar(); + + if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]* + IdentifierStr = LastChar; + while (isalnum((LastChar = getchar()))) + IdentifierStr += LastChar; + + if (IdentifierStr == "def") return tok_def; + if (IdentifierStr == "extern") return tok_extern; + return tok_identifier; + } + + if (isdigit(LastChar) || LastChar == '.') { // Number: [0-9.]+ + std::string NumStr; + do { + NumStr += LastChar; + LastChar = getchar(); + } while (isdigit(LastChar) || LastChar == '.'); + + NumVal = strtod(NumStr.c_str(), 0); + return tok_number; + } + + if (LastChar == '#') { + // Comment until end of line. + do LastChar = getchar(); + while (LastChar != EOF && LastChar != '\n' && LastChar != '\r'); + + if (LastChar != EOF) + return gettok(); + } + + // Check for end of file. Don't eat the EOF. + if (LastChar == EOF) + return tok_eof; + + // Otherwise, just return the character as its ascii value. + int ThisChar = LastChar; + LastChar = getchar(); + return ThisChar; +} + +//===----------------------------------------------------------------------===// +// Abstract Syntax Tree (aka Parse Tree) +//===----------------------------------------------------------------------===// + +/// ExprAST - Base class for all expression nodes. +class ExprAST { +public: + virtual ~ExprAST() {} +}; + +/// NumberExprAST - Expression class for numeric literals like "1.0". +class NumberExprAST : public ExprAST { + double Val; +public: + NumberExprAST(double val) : Val(val) {} +}; + +/// VariableExprAST - Expression class for referencing a variable, like "a". +class VariableExprAST : public ExprAST { + std::string Name; +public: + VariableExprAST(const std::string &name) : Name(name) {} +}; + +/// BinaryExprAST - Expression class for a binary operator. +class BinaryExprAST : public ExprAST { + char Op; + ExprAST *LHS, *RHS; +public: + BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs) + : Op(op), LHS(lhs), RHS(rhs) {} +}; + +/// CallExprAST - Expression class for function calls. +class CallExprAST : public ExprAST { + std::string Callee; + std::vector<ExprAST*> Args; +public: + CallExprAST(const std::string &callee, std::vector<ExprAST*> &args) + : Callee(callee), Args(args) {} +}; + +/// PrototypeAST - This class represents the "prototype" for a function, +/// which captures its name, and its argument names (thus implicitly the number +/// of arguments the function takes). +class PrototypeAST { + std::string Name; + std::vector<std::string> Args; +public: + PrototypeAST(const std::string &name, const std::vector<std::string> &args) + : Name(name), Args(args) {} + +}; + +/// FunctionAST - This class represents a function definition itself. +class FunctionAST { + PrototypeAST *Proto; + ExprAST *Body; +public: + FunctionAST(PrototypeAST *proto, ExprAST *body) + : Proto(proto), Body(body) {} + +}; + +//===----------------------------------------------------------------------===// +// Parser +//===----------------------------------------------------------------------===// + +/// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current +/// token the parser is looking at. getNextToken reads another token from the +/// lexer and updates CurTok with its results. +static int CurTok; +static int getNextToken() { + return CurTok = gettok(); +} + +/// BinopPrecedence - This holds the precedence for each binary operator that is +/// defined. +static std::map<char, int> BinopPrecedence; + +/// GetTokPrecedence - Get the precedence of the pending binary operator token. +static int GetTokPrecedence() { + if (!isascii(CurTok)) + return -1; + + // Make sure it's a declared binop. + int TokPrec = BinopPrecedence[CurTok]; + if (TokPrec <= 0) return -1; + return TokPrec; +} + +/// Error* - These are little helper functions for error handling. +ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;} +PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; } +FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; } + +static ExprAST *ParseExpression(); + +/// identifierexpr +/// ::= identifier +/// ::= identifier '(' expression* ')' +static ExprAST *ParseIdentifierExpr() { + std::string IdName = IdentifierStr; + + getNextToken(); // eat identifier. + + if (CurTok != '(') // Simple variable ref. + return new VariableExprAST(IdName); + + // Call. + getNextToken(); // eat ( + std::vector<ExprAST*> Args; + if (CurTok != ')') { + while (1) { + ExprAST *Arg = ParseExpression(); + if (!Arg) return 0; + Args.push_back(Arg); + + if (CurTok == ')') break; + + if (CurTok != ',') + return Error("Expected ')' or ',' in argument list"); + getNextToken(); + } + } + + // Eat the ')'. + getNextToken(); + + return new CallExprAST(IdName, Args); +} + +/// numberexpr ::= number +static ExprAST *ParseNumberExpr() { + ExprAST *Result = new NumberExprAST(NumVal); + getNextToken(); // consume the number + return Result; +} + +/// parenexpr ::= '(' expression ')' +static ExprAST *ParseParenExpr() { + getNextToken(); // eat (. + ExprAST *V = ParseExpression(); + if (!V) return 0; + + if (CurTok != ')') + return Error("expected ')'"); + getNextToken(); // eat ). + return V; +} + +/// primary +/// ::= identifierexpr +/// ::= numberexpr +/// ::= parenexpr +static ExprAST *ParsePrimary() { + switch (CurTok) { + default: return Error("unknown token when expecting an expression"); + case tok_identifier: return ParseIdentifierExpr(); + case tok_number: return ParseNumberExpr(); + case '(': return ParseParenExpr(); + } +} + +/// binoprhs +/// ::= ('+' primary)* +static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) { + // If this is a binop, find its precedence. + while (1) { + int TokPrec = GetTokPrecedence(); + + // If this is a binop that binds at least as tightly as the current binop, + // consume it, otherwise we are done. + if (TokPrec < ExprPrec) + return LHS; + + // Okay, we know this is a binop. + int BinOp = CurTok; + getNextToken(); // eat binop + + // Parse the primary expression after the binary operator. + ExprAST *RHS = ParsePrimary(); + if (!RHS) return 0; + + // If BinOp binds less tightly with RHS than the operator after RHS, let + // the pending operator take RHS as its LHS. + int NextPrec = GetTokPrecedence(); + if (TokPrec < NextPrec) { + RHS = ParseBinOpRHS(TokPrec+1, RHS); + if (RHS == 0) return 0; + } + + // Merge LHS/RHS. + LHS = new BinaryExprAST(BinOp, LHS, RHS); + } +} + +/// expression +/// ::= primary binoprhs +/// +static ExprAST *ParseExpression() { + ExprAST *LHS = ParsePrimary(); + if (!LHS) return 0; + + return ParseBinOpRHS(0, LHS); +} + +/// prototype +/// ::= id '(' id* ')' +static PrototypeAST *ParsePrototype() { + if (CurTok != tok_identifier) + return ErrorP("Expected function name in prototype"); + + std::string FnName = IdentifierStr; + getNextToken(); + + if (CurTok != '(') + return ErrorP("Expected '(' in prototype"); + + std::vector<std::string> ArgNames; + while (getNextToken() == tok_identifier) + ArgNames.push_back(IdentifierStr); + if (CurTok != ')') + return ErrorP("Expected ')' in prototype"); + + // success. + getNextToken(); // eat ')'. + + return new PrototypeAST(FnName, ArgNames); +} + +/// definition ::= 'def' prototype expression +static FunctionAST *ParseDefinition() { + getNextToken(); // eat def. + PrototypeAST *Proto = ParsePrototype(); + if (Proto == 0) return 0; + + if (ExprAST *E = ParseExpression()) + return new FunctionAST(Proto, E); + return 0; +} + +/// toplevelexpr ::= expression +static FunctionAST *ParseTopLevelExpr() { + if (ExprAST *E = ParseExpression()) { + // Make an anonymous proto. + PrototypeAST *Proto = new PrototypeAST("", std::vector<std::string>()); + return new FunctionAST(Proto, E); + } + return 0; +} + +/// external ::= 'extern' prototype +static PrototypeAST *ParseExtern() { + getNextToken(); // eat extern. + return ParsePrototype(); +} + +//===----------------------------------------------------------------------===// +// Top-Level parsing +//===----------------------------------------------------------------------===// + +static void HandleDefinition() { + if (ParseDefinition()) { + fprintf(stderr, "Parsed a function definition.\n"); + } else { + // Skip token for error recovery. + getNextToken(); + } +} + +static void HandleExtern() { + if (ParseExtern()) { + fprintf(stderr, "Parsed an extern\n"); + } else { + // Skip token for error recovery. + getNextToken(); + } +} + +static void HandleTopLevelExpression() { + // Evaluate a top-level expression into an anonymous function. + if (ParseTopLevelExpr()) { + fprintf(stderr, "Parsed a top-level expr\n"); + } else { + // Skip token for error recovery. + getNextToken(); + } +} + +/// top ::= definition | external | expression | ';' +static void MainLoop() { + while (1) { + fprintf(stderr, "ready> "); + switch (CurTok) { + case tok_eof: return; + case ';': getNextToken(); break; // ignore top-level semicolons. + case tok_def: HandleDefinition(); break; + case tok_extern: HandleExtern(); break; + default: HandleTopLevelExpression(); break; + } + } +} + +//===----------------------------------------------------------------------===// +// Main driver code. +//===----------------------------------------------------------------------===// + +int main() { + // Install standard binary operators. + // 1 is lowest precedence. + BinopPrecedence['<'] = 10; + BinopPrecedence['+'] = 20; + BinopPrecedence['-'] = 20; + BinopPrecedence['*'] = 40; // highest. + + // Prime the first token. + fprintf(stderr, "ready> "); + getNextToken(); + + // Run the main "interpreter loop" now. + MainLoop(); + + return 0; +} +</pre> +</div> +<a href="LangImpl3.html">Next: Implementing Code Generation to LLVM IR</a> +</div> + +<!-- *********************************************************************** --> +<hr> +<address> + <a href="http://jigsaw.w3.org/css-validator/check/referer"><img + src="http://jigsaw.w3.org/css-validator/images/vcss" alt="Valid CSS!"></a> + <a href="http://validator.w3.org/check/referer"><img + src="http://www.w3.org/Icons/valid-html401" alt="Valid HTML 4.01!"></a> + + <a href="mailto:sabre@nondot.org">Chris Lattner</a><br> + <a href="http://llvm.org">The LLVM Compiler Infrastructure</a><br> + Last modified: $Date$ +</address> +</body> +</html> |