# Assignment 7 Testing Effectiveness of Query Optimization; Object-Relational Database Programming Key-Value Databases and Graph Databases (Draft) solved

\$35.00

## Description

5/5 - (1 vote)

This assignment focuses on problems related to Lecture 9 and Lectures 18
through 22.
• Lecture 18: Algorithms for RA operations
• Lecture 19: Query processing and query plans
• Lecture 20: Object-relational database programming
• Lecture 20: Key-value stores. NoSQL in MapReduce style
• Lecture 21: Key-value stores; NoSQL in Spark style
• Lecture 22: Graph databases
Other lectures that a relevant for this assignment are Lectures 8, 13, and 14:
• Lecture 8: Translating Pure SQL queries into RA expressions
• Lecture 9: Query optimization
• Lecture 13: Object-Relational databases and queries
• Lecture 14: Nested Relational, Semi-structured Databases, Document
Databases
This assignment has problems that are required to be solved. Others, identified as such, are practice problems that you should attempt since the serve as
preparation for the final exam.
Turn in a single assignment7.sql file that contains the PostgreSQL code of
the solutions for the problem that require such code. (Do not include solutions
for the practice problems.) Also turn in a assignment7.txt file that contains
all the output associated with the problems in this assignment. For all the other
problems, submit a single assignment7.pdf file with your solutions.
1
1 Analysis of Queries Using Query Plans
Consider Lecture 19 on Query Processing: Query Planning in (Object) Relational Systems. Consider the analysis, using query plans, for the SOME quantifier.
1. Assume the relation schemas P(x), Q(x), R(x, y) and S(x, z).
Consider the NOT ALL generalized query
{(p.x, q.x)|P(p) ∧ Q(p) ∧ R(p.x) 6⊃ S(p.x)}
where
R(p.x) = {r.y | R(r) ∧ r.x = p.x}
S(q.x) = {s.z | S(s) ∧ s.x = q.x}
Consider Lecture 19 on Query Processing: Query Planning in (Object)
Relational Systems and in particular the analysis, using query plans, for
the SOME generalized quantifier.
Now to the problem. In analogy with the analysis for the SOME generalized
quantifier, do an analysis for the NOT ALL generalized quantifier.
2 Experiments to Test the Effectiveness of Query
Optimization
In the following problems, you will conduct experiments in PostgreSQL to gain
insight into whether or not query optimization can be effective. In other words,
can it be determined experimentally if optimizing an SQL or an RA expression
improves the time (and space) complexity of query evaluation? Additionally,
can it be determined if the PostgreSQL query optimizer attains the same (i.e.,
better or worse) optimization as optimization by hand. Recall that in SQL you
can specify each RA expression as an RA SQL query. This implies that each of
the optimization rules for RA can be applied directly to queries formulated in
RA SQL.
In the following problems you will need to generate artificial data of increasing size and measure the time of evaluating non-optimized and optimized
queries. The size of this data can be in the ten or hundreds of thousands of
tuples. This is necessary because on very small data is it is not possible to gain
sufficient insights into the quality (or lack of quality) of optimization. You can
use the data generation functions that were developed in Assignment 6. Additionally, you are advised to examine the query plans generated by PostgreSQL.
For the problems in this assignments, we will use three relations:1
P(a int)
R(a int, b int)
S(b int)
1A typical case could be where P is Person, R is Knows, and S is the set of persons with the
Databases skill. Another case could where P is the set of persons who work for Amazon, R is
personSkill and S is the set of skills of persons who live in Bloomington. Etc.
2
To generate P or S, you should use the function SetOfIntegers which generate a set of up to n randomly selected integers in the range [l, u]:
create or replace function SetOfIntegers(n int, l int, u int)
returns table (x int) as
$$select floor(random() * (u-l+1) + l)::int as x from generate_series(1,n) group by (x) order by 1;$$ language sql;
To generate R, you should use the function BinaryRelationOverIntegers
which generates up to n randomly selected pairs with first components in the
range [l1, u1] and second components in the range [l2, u2]:
create or replace function BinaryRelationOverIntegers(n int, l_1 int, u_1 int, l_2 int, u_2 int)
returns table (x int, y int) as
$$select floor(random() * (u_1-l_1+1) + l_1)::int as x, floor(random() * (u_2-l_2+1) + l_2)::int as y from generate_series(1,n) group by (x,y) order by 1,2;$$ language sql;
Example 1 Consider the query Q1
select distinct r1.a
from R r1, R r2
where r1.b = r2.a;
This query can be translated and optimized to the query Q2
select distinct r1.a
from R r1 natural join (select distinct r2.a as b from R r2) r2;
Image that you have generated a relation R. Then when you execute
explain analyze
select distinct r1.a
from R r1, R r2
where r1.b = r2.a;
the system will return its query plan as well as the execution time to evaluate
Q1 measured in ms. And, when you execute
explain analyze
select distinct r1.a
from R r1 natural join (select distinct r2.a as b from R r2) r2;
the system will return its query plan as well as the execution time to evaluate
Q2 measured in ms. This permits us to compare the non-optimized query Q1
3
with the optimized query Q2 for various differently-sized relations R. Here are
some of these comparisons for various differently-sized random relations R. In
this table, R was generated with lower and upper bounds l1 = l2 = 1000 and
u1 = u2 = 1000.
2
R Q1 (in ms) Q2 (in ms)
104 27.03 7.80
105 3176.53 58.36
106 69251.58 400.54
Notice the significant difference between the execution times of the non-optimized
query Q1 and the optimized query Q2. So clearly, optimization works on query
Q1.
Incidentally, below are the query plans for Q1 and Q2. Examining these
query plans should reveal why Q1 runs much slower than Q2. (Why?)
QUERY PLAN for Q1
————————————
HashAggregate
Group Key: r1.a
-> Hash Join
Hash Cond: (r1.b = r2.a)
-> Seq Scan on r r1
-> Hash
-> Seq Scan on r r2
QUERY PLAN for query Q2
——————————————
HashAggregate
Group Key: r1.a
-> Hash Join
Hash Cond: (r1.b = r2.a)
-> Seq Scan on r r1
-> Hash
-> HashAggregate
Group Key: r2.a
-> Seq Scan on r r2
2All the experiments where done on a MacMini.
4
We now turn to the problems for this section.
2. Consider query Q3
select distinct p.a
from P p, R r1, R r2, R r3, S s
where p.a = r1.a and r1.b = r2.a and r2.b = r3.a and r.b = S.b;
Intuitively, if we view R as a graph, and P and S as node types (properties),
then Q3 determines each P-node in the graph from which there emanates
a path of length 3 that ends at a S-node.3
I.e., a P-node n0 is in the
answer if there exists sequence of nodes (n0, n1, n2, n3) such that (n0, n1),
(n1, n2), and (n2, n3) are edges in R and n3 is a S-node.
(a) Translate and optimize this query and call it Q4. Then write Q4 as
an RA SQL query just as was done for query Q2 in Example 1.
(b) Compare queries Q3 and Q4 in a similar way as we did for Q1 and
Q2 in Example 1.
You should experiment with different sizes for R. Incidentally, these
relations do not need to use the same parameters as those shown in
the above table for Q1 and Q2 in Example 1.
(c) What conclusions do you draw from the results of these experiments
regarding the effectiveness of query optimization in PostgreSQL and/or
by hand?
3Such a query is typical in Graph Databases.
5
3. Consider the Pure SQL Q5 which is an formulation of a variation of the
not subset (not only) set semijoin query
{p.a | P(p) ∧ R(p.a) 6⊆ S}
where
R(p.a) = {r.b | R(r) ∧ r.a = p.a}.
select p.a
from P p
where exists (select 1
from R r
where r.a = p.a and
not exists (select 1 from S s where r.b = s.b));
(a) Translate and optimize this query and call it Q6. Then write Q6 as
an RA SQL query just as was done for Q2 in Example 1.
(b) An alternative way to write a query equivalent with Q5 is as the
object-relational query
with nestedR as (select P.a, array_agg(R.b) as bs
from P natural join R
group by (P.a)),
Ss as (select array(select b from S) as bs)
select a
from nestedR
select p.value->’a’ as a, p.value->’b’ as b, p.value->’c’ as c
from encodingofR p;
a | b | c
—–+—+—
1 | 2 | 3
4 | 5 | 6
1 | 2 | 4
An important aspect of this encoding strategy is that it is possible to put
multiple relations, possible with different schemas and arities, into the same
key-value store. Besides R, let us also consider a binary relation S(a,d).
create table S (a int, d int);
insert into S values (1,2), (5,6), (2,1), (2,3);
table S;
8PostgreSQL support both json and jsonb objects. For this assignment, you should use the
jsonb object type since it comes with more functionality and offers more efficient computation.
9Notice that this strategy works in general for any relation, independent of the number of
attributes of the relation.
16
a | d
—–+
1 | 2
5 | 6
2 | 1
2 | 3
(4 rows)
We can now encode both R and S into a single key-value store encodingofRandS
as follows:
create table encodingofRandS(key text, value jsonb);
insert into encodingofRandS
select ’R’ as key, jsonb_build_object(’a’, r.a, ’b’, r.b, ’c’, r.c) as value
from R r
union
select ’S’ as key, jsonb_build_object(’a’, s.a, ’d’, s.d) as value
from S s
order by 1, 2;
table encodingofRandS;
key | value
—–+————————–
R | {“a”: 1, “b”: 2, “c”: 3}
R | {“a”: 1, “b”: 2, “c”: 4}
R | {“a”: 4, “b”: 5, “c”: 6}
S | {“a”: 1, “d”: 2}
S | {“a”: 2, “d”: 1}
S | {“a”: 2, “d”: 3}
S | {“a”: 5, “d”: 6}
(7 rows)
Furthermore, we can decode this key-value store using 2 object-relational SQL
queries and recover R and S.
select p.value->’a’ as a, p.value->’b’ as b, p.value->’c’ as c
from encodingofRandS p
where p.key = ’R’;
a | b | c
—–+—+—
1 | 2 | 3
4 | 5 | 6
1 | 2 | 4
(3 rows)
select p.value->’a’ as a, p.value->’d’ as d
from encodingofRandS p
where p.key = ’S’;
a | d
—–+
1 | 2
5 | 6
2 | 1
2 | 3
(4 rows)
17
Example 2 Consider the following problem. Write, in PostgreSQL, a basic
MapReduce program, i.e., a mapper function and a reducer function, as well as
a 3-phases simulation that implements the set intersection of two unary relations
R(a) and S(a), i.e., the relation R ∩ S. You can assume that the domain of the
attribute ‘a’ is integer.
— EncodingOfRandS;
drop table R; drop table S;
create table R(a int);
insert into R values (1),(2),(3),(4);
create table S(a int);
insert into S values (2),(4),(5);
drop table EncodingOfRandS;
create table EncodingOfRandS(key text, value jsonb);
insert into EncodingOfRandS
select ’R’ as key, jsonb_build_object(’a’, r.a) as value
from R r
union
select ’S’ as key, jsonb_build_object(’a’, s.a) as value
from S s
order by 1;
table EncodingOfRandS;
key | value
—–+———-
R | {“a”: 1}
R | {“a”: 4}
R | {“a”: 2}
R | {“a”: 3}
S | {“a”: 4}
S | {“a”: 5}
S | {“a”: 2}
(7 rows)
— mapper function
CREATE OR REPLACE FUNCTION mapper(key text, value jsonb)
RETURNS TABLE(key jsonb, value text) AS
$$SELECT value, key;$$ LANGUAGE SQL;
— reducer function
CREATE OR REPLACE FUNCTION reducer(key jsonb, valuesArray text[])
RETURNS TABLE(key text, value jsonb) AS
$$SELECT ’R intersect S’::text, key WHERE ARRAY[’R’,’S’] <@ valuesArray;$$ LANGUAGE SQL; -- 3-phases simulation of MapReduce Program followed by a decoding step WITH Map_Phase AS ( SELECT m.key, m.value FROM encodingOfRandS, LATERAL(SELECT key, value FROM mapper(key, value)) m 18 ), Group_Phase AS ( SELECT key, array_agg(value) as value FROM Map_Phase GROUP BY (key) ), Reduce_Phase AS ( SELECT r.key, r.value FROM Group_Phase, LATERAL(SELECT key, value FROM reducer(key, value)) r ) SELECT p.value->’a’ as a FROM Reduce_Phase p
order by 1;
a

2
4
(2 rows)
We now turn to the problems for this section.
13. Practice problem–not graded. Write, in PostgreSQL, a basic MapReduce program, i.e., a mapper function and a reducer function, as well
as a 3-phases simulation that implements the symmetric difference of two
unary relations R(a) and S(a), i.e., the relation (R−S)∪(S−R). You can
assume that the domain of the attribute ‘a’ is integer.
14. Write, in PostgreSQL, a basic MapReduce program, i.e., a mapper function and a reducer function, as well as a 3-phases simulation that implements the semijoin of two relations R(A,B) and S(A,B,C), i.e., the relation
R n S. You can assume that the domains of A, B, and C are integer.
Use the encoding and decoding methods described above.
15. Practice problem–not graded. Write, in PostgreSQL, a basic MapReduce program, i.e., a mapper function and a reducer function, as well as a
3-phases simulation that implements the natural join R ./ S of two relations R(A, B) and S(B,C). You can assume that the domains of A, B, and
C are integer. Use the encoding and decoding methods described above.
16. Write, in PostgreSQL, a basic MapReduce program, i.e., a mapper function and a reducer function, as well as a 3-phases simulation that implements the SQL query
SELECT r.A, array_agg(r.B), sum(r.B)
FROM R r
GROUP BY (r.A)
HAVING COUNT(r.B) < 3; Here R is a relation with schema (A, B). You can assume that the domains of A and B are integers. Use the encoding and decoding methods described above. 19 We now turn to some problems that relate to query processing in Spark. Note that in Spark it is possible to operate on multiple key-value stores. 17. Let R(K, V ) and S(K, W) be two binary key-value pair relations. You can assume that the domains of K, V , and W are integers. Consider the cogroup transformation R.cogroup(S) introduced in the lecture on Spark. (a) Define a PostgreSQL view coGroup that computes a complex-object relation that represent the co-group transformation R.cogroup(S). Show that this view works. (b) Write a PostgreSQL query that use this coGroup view to compute the semi join R n S, in other words compute the relation R ./ πK(S). (c) Write a PostgreSQL query that uses this coGroup view to implement the SQL query SELECT distinct r.K as rK, s.K as sK FROM R r, S s WHERE NOT ARRAY(SELECT r1.V FROM R r1 WHERE r1.K = r.K) && ARRAY(SELECT s1.W FROM S s1 WHERE s1.K = s.K); 18. Practice problem–not graded. Let A(x) and B(x) be the schemas to represent two set of integers A and B. Consider the cogroup transformation introduced in the lecture on Spark. Using an approach analogous to the one in Problem 17 solve the following problems:10 (a) Write a PostgreSQL query that uses the cogroup transformation to compute A ∩ B. (b) Write a PostgreSQL query that uses the cogroup operator to compute the symmetric difference of A and B, i.e., the expression (A − B) ∪ (B − A). 10An important aspect of this problem is to represent A and B as a key-value stores. 20 5 Graph query languages Each of the following problems is a practice problem. 19. Practice Problem–not graded. Consider the database schema Person, Company, companyLocation, Knows, jobSkill, and personSkill. (a) Specify an Entity-Relationship Diagram that models this database schema. (b) Specify the node and relationship types of a Property Graph for this database schema. In addition, specify the properties, if any, associated with each such type. 20. Practice Problem–not graded. Using the Property Graph model in Problem 19b, formulate the following queries in the Cypher query language: (a) Find the types of the relationships associated with Person nodes. (b) Find each person (node) whose name is ‘John’ and has a salary that is at least 50000. (c) Find each jobSkill (node) that is the job skill of a person who knows a person who works for ’Amazon’ and who has a salary that is at least 50000. (d) Find each person (node) who knows directly or indirectly (i.e., recursively) another person who works for Amazon. (e) Find for each company node, that node along with the number of persons who work for that company and who have both the Databases and Networks job skills. 21