This project requires the implementation of the query compiler. A database query compiler takes as
input a SQL query and generates as output an execution tree consisting of relational algebra operators. To
accomplish this task, it makes heavy use of the database catalog in order to identify table and attribute
properties. There are two main components in the query compiler: the query parser and the query optimizer.
The query parser transforms the text of a SQL query into an abstract representation that extracts the defining
elements required for the execution of the query. Such elements are represented by the tables appearing in
the query, the attributes, the predicates, the aggregates, etc. As with any programming language parser,
there are two components in a SQL query parser: the lexer and the parser. The lexer employs lexical analysis
over the SQL text in order to identify the language tokens, such as keywords, e.g., SELECT, FROM, WHERE, etc.,
names of tables and attributes, and constants. The parser enforces that the SQL query obeys the syntax or
structure of the language, e.g., SELECT-FROM-WHERE is a correct query, while SELECT-WHERE is not correct.
You do not have to implement a SQL query parser from scratch because we have already implemented one
for you. The query parser is implemented using the well-known packages lex (flex) and yacc (bison).
This parser supports a subset of SQL, given by the following rules:
• SELECT SelectAtts FROM Tables WHERE AndList
• SELECT SelectAtts FROM Tables WHERE AndList GROUP BY Atts
where only AND conditions involving at least one attribute are allowed in the WHERE clause. The only operators
supported in WHERE predicates are <, >, and =. As in standard SQL, Tables is a list of tables (no aliases are
allowed, thus a table can appear only a single time) and Atts is a list of grouping attributes. SelectAtts
is defined as follows:
• SelectAtts := Function ’,’ Atts | Function | Atts | DISTINCT Atts
where Function is SUM of any arithmetic expression involving attributes and constants and Atts is a list
of attributes. While three types of attributes are supported in the system (Integer, Float, String),
Function can operate only over Integer and Float.
The exhaustive list of tokens supported by the language is given in file QueryLexer.l, while all the
syntactical constructs are included in file QueryParser.y. It is recommended that you familiarize with these
files and not change them for the time being. There is plenty of documentation available online if you search
for “lex” and “yacc” in Google.
The output of the parser is composed of multiple lists populated with the principal elements of the query.
They are defined in QueryParser.y (the type of these lists is defined in file ParseTree.h) and included
below for reference:
struct FuncOperator* finalFunction; // the aggregate function
struct TableList* tables; // the list of tables in the query
struct AndList* predicate; // the predicate in WHERE
struct NameList* groupingAtts; // grouping attributes
struct NameList* attsToSelect; // the attributes in SELECT
int distinctAtts; // 1 if there is a DISTINCT in a non-aggregate query
For this phase of the project, these will represent your only interaction with the query parser. You do not
need to worry about the details of how everything is implemented and how it works. Nonetheless, feel free
to get your hands dirty and figure out all the details.
The query compiler takes as input the data structures produced by the parser and transforms them into
a query execution tree of relational algebra operators. While it is the job of the compiler to figure out
and configure the necessary operators appropriately, the query optimizer is responsible for determining the
structure of the tree, specifically the order of the joins. File RelOp.h contains the definition of all the
relational operators considered in this project, together with their member attributes. The main part of this
project stage consists in the creation of instances of these operators from the data structures produced by the
query parser. Essentially, you have to implement the constructor of each operator class and configure it with
the corresponding elements, extracted from the data structures produced by the parser. In the following, we
describe the elements of each type of operator:
• Scan (Schema& schema, DBFile& file) manages the access to the data file. There is a Scan instance for every base table in the SQL query. It has to be initialized with the schema of the records in
the file and a DBFile object that provides access to the records. Both these can be instantiated from
the metadata in the catalog.
• Select (Schema& schema, CNF& predicate, Record& constants, RelationalOp* producer)
implements the selection relational algebra operator. It has to be initialized with the schema of the
records taken as input, the selection predicate, the constants appearing in the predicate, and the
operator producing the tuples. Class CNF in file Comparison.h implements functionality to handle
predicates from WHERE. A CNF object consists of one or multiple Comparison objects connected by AND,
where a Comparison is a predicate of the form A op B. op can be one of <, >, and =. At least
one of A and B has to be an attribute in schema. The Record constants contains all the constants
appearing in comparisons in the CNF. Method ExtractCNF (AndList& parseTree, Schema& schema,
Record& literal) in class CNF extracts the CNF and constants corresponding to a given schema from
the representation of the WHERE clause produced by the parser.
• Project (Schema& schemaIn, Schema& schemaOut, int numAttsInput, int numAttsOutput,
int* keepMe, RelationalOp* producer) implements the projection operator. It has to be configured with the schema of the input records, the schema of the output records, the attributes to be
kept, and the producer operator. The projected attributes are specified by their position in the input
schema. They are grouped in the array keepMe of size numAttsOutput.
• Join (Schema& schemaLeft, Schema& schemaRight, Schema& schemaOut, CNF& predicate,
RelationalOp* left, RelationalOp* right) implements the join operator. It has to be configured with the schemas and the producing operators, the output schema (which is just the union of the
input schemas), and the join predicate (represented as a CNF). In this case, each Comparison in the CNF
has to contain one attribute from the left schema and one attribute from the right schema. Method
ExtractCNF (AndList& parseTree, Schema& leftSchema, Schema& rightSchema) from class CNF
extracts the join predicate corresponding to two schemas from the WHERE clause.
• DuplicateRemoval (Schema& schema, RelationalOp* producer) implements the DISTINCT operator. It has to be initialized with the schema of the input records and the producer operator.
• Sum (Schema& schemaIn, Schema& schemaOut, Function& compute, RelationalOp* producer)
implements the aggregate operator. The schema of the input records, the schema of the output record,
the function applied to attributes in the input records, and the producer operator have to be supplied.
The output schema consists of a single attribute sum. Class Function in file Function.h implements
the functions that can appear in the SELECT clause. Method GrowFromParseTree (FuncOperator*
parseTree, Schema& mySchema) creates a Function object from the representation produced by the
• GroupBy (Schema& schemaIn, Schema& schemaOut, OrderMaker& groupingAtts, Function&
compute, RelationalOp* producer) implements the GROUP BY AGGREGATE operator. The output
Project 2 2
schema contains the aggregate attribute sum on the first position, followed by the grouping attributes.
sum is computed based on Function. Class OrderMaker in file Comparison.h defines the grouping
attributes. OrderMaker objects can be built from a schema and the positions of the attributes to
group on (see the constructor OrderMaker(Schema& schema, int* atts, int atts no)).
• WriteOut (Schema& schema, string& outFile, RelationalOp* producer) is an operator for
printing the output to a file, passed as parameter. It has to know the schema of the records and
the output file (which can be an arbitrary file). Class Record contains a method print that you are
supposed to use.
How to build the query execution tree? We follow a bottom-up approach, starting from the Scan
operators and going up to the root of the tree. Class QueryExecutionTree in file RelOp.h is in charge of
managing the query execution tree. This is done exclusively through the root of the tree. A Scan operator
is created for every table in the FROM clause. Next, a Select operator is created for every Scan operator
for which there exists a non-empty CNF, i.e., a selection predicate. This procedure is known as push-down
selection. Next comes the most difficult part: determining the order in which to join the tables. This is the
job of the query optimizer, discussed in a separate section. Once the order is determined, the corresponding
Join operators are created and the tree is completed. Based on the type of the query, multiple paths can
be taken from here:
• If the query is a standard SELECT-FROM-WHERE, a Project operator is appended at the root. In the
case of DISTINCT, a DuplicateRemoval operator is further inserted at the root.
• If the query is an aggregate query, a Sum operator is inserted at the root.
• If the query is a standard SELECT-FROM-WHERE-GROUPBY, a GroupBy operator is appended at the root.
The final step is to append a WriteOut operator at the root of the query execution tree. All our execution
trees will have a WriteOut at the root.
The query optimizer is in charge to determine the optimal query execution plan. There are two aspects it
has to consider: the join order and the specific implementation for each operator. In our case, we consider
only the join order, since we have not discussed about possible operator implementations yet. The first
thing to settle upon when we talk about optimality is the cost function. The standard metric used in query
optimization is the total size of the intermediate relations produced during query execution. Remember that
every relational operator takes as input one or more tables, i.e., relations, and generates as output another
table. We simplify the process and define as our cost function the total number of tuples produced
by the intermediate operators in a query execution tree. The job of the optimizer is then to find
that query execution tree that processes the least number of tuples, without actually executing the query.
The optimizer only estimates this number for all the considered plans and chooses the query plan with the
minimal cost. It uses statistics on table and attribute cardinality and number of distinct elements in order
to compute the estimates. These statistics are stored in the database catalog and updated by a separate
statistics maintenance process executed at certain time intervals. The details of how these statistics are
propagated through the different relational operators are discussed in class and presented in detail in the
textbook (Section 16.4). You can use them as given or come up with your own formulae.
How to compute the optimal join order? The query optimizer considers all possible trees that can
be generated for a given number of tables. The number of tuples in these tables is obtained by applying
the eventual selection predicates to the corresponding base tables. Starting with a node for each table, the
optimal join order is computed by considering all possible joins of 2, 3, etc. tables. The cost of a join of k
Project 2 3
tables can be easily computed from the cost of joining k − 1 tables. If only the best cost for joining a set of
k tables is maintained, a dynamic programming algorithm can be devised. Such an algorithm is presented
in Section 16.6.4 (and thoroughly discussed in class). Your job is to implement it. Once the optimal join
order is computed, you create an instance of struct OptimizationTree defined in file QueryOptimizer.h
and return it to the query compiler, which will then generate the corresponding Join operators.
• You have to get familiarized with all the code provided for this project phase. This includes the lexer
and parser, relational algebra operators, and all their components.
• You have to implement all the methods in file RelOp.cc. They include the constructors/destructors
for all the relational algebra operators and the print methods. You also have to implement the
operator<< in class QueryExecutionTree for printing a complete query execution tree. This calls the
print method for every operator.
• You have to implement all the methods in file QueryCompiler.cc and QueryOptimizer.cc, respectively.
• File main.cc already implements the basic functionality of an application that uses the query compiler.
It reads a SQL query and generates the optimal query execution tree. It does not do anything at this
time because the components are not implemented yet. That is your job. The prompt expects you to
input a query in the SQL subset supported by the parser. Finish the query with CTRL+D. Alternatively,
you can redirect the query from an input file with <. The parser checks the lexical and syntactic
correctness of the input query and prints a message on the screen. Try this executable multiple times
to familiarize yourself with the syntax of the supported query language. You are required to extend
main.cc such that it does not finish after a single query. It should work as follows. It asks the user
for a query. When a query is input, the compiler is called on it and the optimal query execution tree
is printed. Then, the prompt is returned to the user. A new query can be inserted. If the user inputs
“exit”, the program finishes.
• File catalog.txt contains the statistics to be used in the query optimizer. These values are from a
TPC-H instance scale 1 (1 GB). Add them to your SQLite database catalog.
• Folder Queries contains a series of 10 SQL queries to test your compiler/optimizer. They are syntactically correct.
• Required packages: flex, flex-doc, bison, bison-doc.
• Thinking in C++ : http://mindview.net/Books/DownloadSites
Project 2 4