## Description

For this assignment, you will need the material covered in the lectures

• Lecture 15: External Merge Sorting

• Lecture 16: Indexing

• Lecture 17: B+ trees and Hashing

In addition, you should read the following sections in Chapter 14 and 15 in the

textbook Database Systems The Complete Book by Garcia-Molina, Ullman, and

Widom:

• Section 14.1: Index-structure Basics

• Section 14.2: B-Trees

• Section 14.3: Hash Tables

• Section 14.7: Bitmap Indexes

• Section 15.8.1: Multipass Sort-Based Algorithms

In the file performingExperiments.sql supplied with this assignment, we

have include several PostgreSQL functions that should be useful for running

experiments. Of course, you may wish to write your own functions and/or

adopt the functions in this .sql to suite the required experiments for the various

problems in this assignment.

The problems that need to be included in the assignment6.sql are marked

with a blue bullet •. The problems that need to be included in the assignment6.pdf

are marked with a red bullet •. (You should aim to use Latex to construct your

.pdf file.) In addition, submit a file assignment6Code.sql that contains all the

sql code that you developed for this assignment.

Practice problems should not be submitted.

1

1 Data generation

PostgreSQL functions and clauses In this assignment there will be a need

to do simulations with various-size relations. Many of these relations will have

randomized data. PostgreSQL provides useful functions and clauses that make

such relations:

random() returns a random real number between 0 and 1 using the uniform distribution

floor(random() ∗ (u − l + 1) + l) :: int returns a random integer in the range [l, u] using the uniform distribution

generate series(m, n) generates the set of integers in the range [m, n]

generate series(m, n, k) generates the set of integers in the range [m, n]) separated by step-size k

order by random() randomly orders the tuples that are the result of a query

limit(n) returns the first n tuples from the result of a query

offset(n) returns all but the first n tuples from the result of a query

limit(m) offset(n) returns the first m tuples from all but the first n tuples from the result of a query

row number() is a window function that assigns a sequential integer to each row in a result set

vacuum is a garbage collection function to clean and reclaim secondary memory space

For more detail, consult the manual pages

https://www.postgresql.org/docs/13/functions-math.html

https://www.postgresql.org/docs/current/functions-srf.html

https://www.postgresql.org/docs/current/queries-limit.html

https://www.postgresql.org/docs/8.4/functions-window.html

https://www.postgresql.org/docs/9.5/routine-vacuuming.html

In the file performingExperiments.sql supplied with this assignment, we

have include several PostgreSQL functions that should be useful for running

experiments. Of course, you may wish to write your own functions and/or

adopt the functions in this .sql to suite the required experiments for the various

problems in this assignment.

Generating sets To generate a set, i.e., a unary relation, of n randomly

selected integers in the range [l, u], you can use the following function:1

create or replace function SetOfIntegers(n int, l int, u int)

returns table (x int) as

$$

select floor(random() * (u-l+1) + l)::int from generate_series(1,n);

$$ language sql;

Example 1 To generate a unary relation with 3 randomly selected integers in

the range 5 to 10, do the following:

select x from SetofIntegers(3,5,10);

Of course, running this query multiple times, result in different sets.

1Typically the function SetOfIntegers returns a bag (multiset) but this is fine for this

assignment. In case we want a set, we can always eliminate duplicates.

2

Generating binary relations The idea behind generating a set can be generalized to that for the generation of a binary relation.2 To generate a binary

relation of n randomly selected pairs of integers (x, y) such x ∈ [l1, u1] and

y ∈ [l2, u2], you can use the following function:

create or replace function

BinaryRelationOverIntegers(n int, l_1 int, u_1 int, l_2 int, u_2 int)

returns table (x int, y int) as

$$

select floor(random() * (u_1-l_1+1) + l_1)::int as x,

floor(random() * (u_2-l_2+1) + l_2)::int as y

from generate_series(1,n);

$$ language sql;

Example 2 To generate a binary relation with 20 randomly selected pairs with

first components in the range [3, 8] and second components in the range [2, 11],

do the following:

select x, y from BinaryRelationOverIntegers(20,3,8,2,11);

Generating functions A relation generated by BinaryRelationOverIntegers

is in general not a function since it is possible that the relation has pairs (x, y1)

and (x, y2) with y1 6= y2. To create a (partial) function f : [l1, u1] → [l2, u2] of

n randomly selected function pairs, we can use the following function:

create or replace function

FunctionOverIntegers(n int, l_1 int, u_1 int, l_2 int, u_2 int)

returns table (x int, y int) as

$$

select x, floor(random() * (u_2-l_2+1) + l_2)::int as y

from generate_series(l_1,u_1) x order by random() limit(n);

$$ language sql;

Example 3 To generate a partial function [1, 20] → [3, 8] of 15 randomly selection function pairs, do the following:3

select x, y from FunctionOverIntegers(15,1,20,3,8);

2Clearly, all of this can be generalized to higher-arity relations.

3When using this function, it is customary to use n such that n ∈ [0, u1 − l1 + 1].

3

Generating relations with categorical (non-numeric) data Thus far,

the sets, binary relations, and functions have all just involved integer ranges.

It is possible to include ranges that have different types including categorical

data such as text strings. The technique to accomplish this is to first associate

with a categorical range an integer range that associate with each element in

the categorical range a unique value of the integer range. The next example

illustrates this.

Example 4 Consider the jobSkill relation and assume that it contents is

skill

AI

Databases

Networks

OperatingSystems

Programming

Suppose that we want to generate a personSkill(pid, skill) relation. Let

us assume that the pid’s are integers in the range [1, m].

There are 5 skills in the jobSkill and it is therefore natural to associate

with each skill a separate integer (index value) in the range [1, 5]. This can be

done with a query involving the row number() window function:

select skill, row number() over (order by skill) as index

from Skill;

The result is the following relation:

skill index

AI 1

Databases 2

Networks 3

OperatingSystems 4

Programming 5

Using this technique, we can write a PostgreSQL function that generates a

personSkill relation with n randomly selected (pid, skill) tuples, with pid’s

in the range [l, u]:

create or replace function GeneratepersonSkillRelation(n int, l int, u int)

returns table (pid int, skill text) as

$$

with skillNumber(skill, index) as (select skill, row_number() over (order by skill)

from Skill),

pS as (select x, y

from BinaryRelationOverIntegers(n,l,u,1, (select count(1) from Skill)::int))

select x as pid, skill

from pS join skillNumber on y = index

group by (x, skill) order by 1,2;

$$ language sql;

4

In this function, the skillNumber view associates with each job skill an

integer index in the range [1, |Skill|]. The pS view is a randomly generated

binary relation with n tuples, with pid’s in the range [l, u], and skill numbers

in the range [1, |Skill|]. The join operation associates the numeric range with

the Skill range. The ‘group by (x, skill) order by 1,2’ clause eliminates

duplicate tuples and orders the result.

The query

select * from GeneratepersonSkillRelation(10,1,15);

may make the personSkill relation:

pid skill

1 AI

2 Programming

3 Databases

4 Databases

6 Networks

6 OperatingSystems

6 Programming

9 Databases

14 Databases

14 Networks

Problems We now turn to the problems in this section.

1. • Given a discrete probability mass function P, as specified in a relation

P(outcome: int, probability: float), over a range of possible outcomes [u2, l2], design a PostgreSQL function

RelationOverProbabilityFunction(n, l1, u1, l2, u2)

that generates a relation of up to n pairs (x, y) such that

• x is uniformly selected in the range [l1, u1]; and

• y is selected in accordance with the probability mass function P in

the range [l2, u2].

An example of a possible P as stored in relation P is as follows:4

P

outcome probability

1 0.25

2 0.10

3 0.40

4 0.10

5 0.15

4Notice that the sum of the probabilities in the probability column in P sum to 1.

5

Note that when P is the uniform probability mass function, then

RelationOverProbabilityFunction

and

BinaryRelationOverIntegers

are the same binary relation producing functions.

Hint: For insight into this problem, consult the method of Inverse Transform Sampling for discrete probability mass functions.

Test your function for the cases listed in the Assignment-Script-2021-Fall-assignment6.sql

file.

2. Practice Problem-not graded.

Use the technique in Problem 1 and the method for generating categorical

data discussed above to write a PostgreSQL function that generates a

personSkill relation, given a probability mass function over the Skill

relation.

Your function should work for any Skill relation and any probability

distribution defined over it.

Test your function for the cases listed in the Assignment-Script-2021-Fall-assignment6.sql

file.

6

2 Sorting

We have learned about external sorting. The problems in this section are designed to look into this sorting method as it implemented in PostgreSQL.

3. • Create successively larger sets of n randomly selected integers in the

range [1, n]. You can do this using the following function and with the

following experiment.5

create or replace function makeS (n integer)

returns void as

$$

begin

drop table if exists S;

create table S (x int);

insert into S select * from SetOfIntegers(n,1,n);

end;

$$ language plpgsql;

This function generates a bag S of size n, with randomly select integers

in the range [1, n]. Now consider the following SQL statements:

select makeS(10);

explain analyze select x from S;

explain analyze select x from S order by 1;

• The ‘select makeS(10)’ statement makes a bag S with 10 elements;

• The ‘explain analyze select x from S’ statement provides the

query plan and execution time in milliseconds (ms) for a simple scan

of S;

• The ‘explain analyze select x from S order by 1’ statement provides the query plan and execution time in milliseconds (ms) for sorting S.

6

QUERY PLAN

——————————————————————————————————

Sort (cost=179.78..186.16 rows=2550 width=4) (actual time=0.025..0.026 rows=10 loops=1)

Sort Key: x

Sort Method: quicksort Memory: 25kB

-> Seq Scan on s (cost=0.00..35.50 rows=2550 width=4) (actual time=0.004..0.005 rows=10 loops=1)

Planning Time: 0.069 ms

Execution Time: 0.034 ms

Now construct the following timing table:7

5You should make it a habit to use the PostgreSQL vacuum function to perform garbage

collection between experiments.

6Recall that 1ms is 1

1000 second.

7

It is possible that you may not be able to run the experiments for the largest S. If that is

the case, just report the results for the smaller sizes.

7

size n of relation S avg execution time to scan S (in ms) avg execution time to sort S (in ms)

101

102

103

104

105

106

107

108

(a) What are your observations about the query plans for the scanning

and sorting of such differently sized bags S?

(b) What do you observe about the execution time to sort S as a function

of n?

(c) Does this conform with the formal time complexity of (external) sorting? Explain.

(d) It is possible to set the working memory of PostgreSQL using the set

work mem command. For example, set work mem = ’16MB’ sets the

working memory to 16MB.8 The smallest possible working memory

in postgreSQL is 64kB and the largest depends on you computer

memory size. But you can try for example 1GB.

Repeat question 3a for memory sizes 64kB and 1GB and report your

observations.

(e) Now create a relation indexedS(x integer) and create a Btree index on indexedS and insert into indexedS the sorted relation S.

9

create table indexedS (x integer);

create index on indexedS using btree (x);

insert into indexedS select x from S order by 1;

Then run the range query

select x from indexedS where x between 1 and n;

where n denotes the size of S.

Then construct the following table which contains (a) the average

execution time to build the btree index and (2) the average time to

run the range query.

8The default working memory size is 4MB.

9For information about indexes in PostgreSQL consult the manual page

https://www.postgresql.org/docs/13/indexes.html.

8

size n of relation S avg execution time to create index indexedS avg execution time to sort indexedS (in ms)

101

102

103

104

105

106

107

108

What are your observations about the query plans and execution

times to create indexedS and the execution times for sorting the

differently sized bags indexedS? Compare your answer with those

for the above sorting problems.

4. Practice problem-not graded.

Typically, the makeS function returns a bag instead of a set. In the problems in this section, you are to conduct experiments to measure the execution times to eliminate duplicates.

(a) Write a SQL query that uses the DISTINCT clause to eliminate duplicates in S and report you results in a table such as that in Problem 3a.

(b) Write a SQL query that uses the GROUP BY clause to eliminate duplicates in S and report you results in a table such as that in Problem 3a.

(c) Compare and contrast the results you have obtained in problems 4a

and 4b. Again, consider using explain analyze to look at query

plans.

9

3 Indexes and Indexing

Indexes on data (1) permit faster lookup on data items and (2) may speed

up query processing on such data. Speedups can be substantial. The purpose

of the problems in this section are to explore this. Some other problems in

this section are designed to understand the workings of the B

+-tree and the

extensible hashing data structures.

Discussion PostgreSQL permit the creation of a variety of indexes on tables.

We will review such index creation and examine their impact on data lookup

and query processing. For more details, see the PostgreSQL manual:

https://www.postgresql.org/docs/13/indexes.html

Example 5 The following SQL statements create indexes on columns or combinations of columns of the personSkill relation.10 Notice that there are

2

arity(personSkill) − 1 = 22 − 1 = 3

such possible indexes.

create index pid_index on personSkill (pid); — index on pid attribute

create index skill_index on personSkill (skill); — index on skill attribute

create index pid_skill_index on personSkill (pid,skill); — index (pid, skill)

Example 6 It is possible to declare the type of index: btree or hash. When

no index type is specified, the default is btree. If instead of a Btree, a hash

index is desired, then it is necessary to specify a using hash qualifier:

create index pid_hash on personSkill using hash (pid); — hash index on pid attribute

Example 7 It is possible to create an index on a relation based on a scalar

expression or a function defined over the attributes of that relation. Consider

the following (immutable) function which computes the number of skills of a

person:

create or replace function numberOfSkills(p integer) returns integer as

$$

select count(1)::int

from personSkill

where pid = p;

$$ language SQL immutable;

10Incidentally, when a primary key is declared when a relation is created, PostgreSQL will

create a btree index on this key for the relation.

10

Then the following is an index defined on the numberOfSkills values of

persons:

create index numberOfSkills_index on personSkill (numberOfSkills(pid));

Such an index is useful for queries that use this function such as

select pid, skill from personSkill where numberOfSkills(pid) > 2;

We now turn to the problems in this section.

5. • Consider a relation Student(sid text, sname text, major, year).

A tuple (s, n, m, y) is in Student when s is the sid of a student and n, m,

and y are that student’s name, major, and birth year. Further, consider

a relation Enroll(sid text, cno text, grade text). A triple (s, c, g)

is in Enroll when the student with sid s was enrolled in the course with

cname c and obtained a letter grade g in that course.

We are interested in answering queries of the form

select sid, sname, major, byear

from Student

where sid in (select sid

from Enroll sid

where cno = c [and|or|and not] grade = g);

Here c denotes a course name and g denotes a letter grade.

Read Section 14.1.7 ‘Indirection in Secondary Indexes’ in your textbook

Database Systems The Complete Book by Garcia-Molina, Ullman, and

Widom. Of particular interest are (a) the concept of buckets (Figure 14.7)

which are sets of tids and (b) the technique of performing set operations

(like intersections) on relevant buckets (Figure 14.8) to answer queries of

the form as shown above.

The goal of this problem is to use object-relational SQL to simulate these

concepts. To make things more concrete, consider the following Student

and Enroll relations:

Student:

sid | sname | major | byear

——+——–+———+——-

s100 | Eric | CS | 1988

s101 | Nick | Math | 1991

s102 | Chris | Biology | 1977

s103 | Dinska | CS | 1978

s104 | Zanna | Math | 2001

s105 | Vince | CS | 2001

Enroll:

sid | cno | grade

——+——+——-

s100 | c200 | A

s100 | c201 | B

11

s100 | c202 | A

s101 | c200 | B

s101 | c201 | A

s101 | c202 | A

s101 | c301 | C

s101 | c302 | A

s102 | c200 | B

s102 | c202 | A

s102 | c301 | B

s102 | c302 | A

s103 | c201 | A

s104 | c201 | D

Now consider associating a tuple id (tid) with each tuple in Enroll:

tid | sid | cno | grade

—–+——+——+——-

1 | s100 | c200 | A

2 | s100 | c201 | B

3 | s100 | c202 | A

4 | s101 | c200 | B

5 | s101 | c201 | A

6 | s101 | c202 | A

7 | s101 | c301 | C

8 | s101 | c302 | A

9 | s102 | c200 | B

10 | s102 | c202 | A

11 | s102 | c301 | B

12 | s102 | c302 | A

13 | s103 | c201 | A

14 | s104 | c201 | D

Use object-relational SQL to construct two secondary indexes indexOnCno

and indexOnGrade on the Enroll relation. These indexes should be stored

in two separate relations which you can conveniently call indexOnCno and

indexOnGrade, respectively. two object-relational views in a manner that

simulates the situation illustrated in Figure 14.8. In particular, do not

use the ‘create index’ mechanism of SQL to construct these indexes.

Then, using the indexOnCno and indexOnGrade views and the technique of

intersecting buckets, write a function FindStudents(booleanOperation

text, cno text, grade text) that can be used to answer queries of the

form as shown above. (Here the booleanOperation is a string which can

be ’and’, ’or’, or ’and not’.)

For example, the query

select * from FindStudents(’and’, ’c202’, ’A’);

should return the same result as that of the query

select sid, sname, major, byear

from Student

where sid in (select sid

from Enroll sid

where cno = ’c202’ and grade = ’A’);

12

Test your solution for the following cases on the Student and Enroll

relation given for this problem:

(a) select * from FindStudents(’and’, ’c202’, ’A’);

(b) select * from FindStudents(’or’, ’c202’, ’A’);

(c) select * from FindStudents(’and not’, ’c202’, ’A’);

6. Practice problem–not graded. Read Section 14.7 ‘Bitmap Indexes’ in

your textbook Database Systems The Complete Book by Garcia-Molina,

Ullman, and Widom. In particular, look at Example 14.39 for an example

of a bitmap index for a secondary index.

Next, revisit Problem 5. There, we considered two secondary indexes

indexOnCno and indexOnGrade. We will now consider the corresponding

bitmap indexes bitmapIndexOnCno and bitmapIndexOnGrade:

bitmapIndexOnCno

cno | bit-vector

——+—————-

c200 | 10010000100000

c201 | 01001000000011

c202 | 00100100010000

c301 | 00000010001000

c302 | 00000001000100

and

bitmapIndexOnGrade

grade | bit-vector

——-+—————-

A | 10101101010110

B | 01010000101000

C | 00000010000000

D | 00000000000001

Use object-relational SQL to construct two secondary indexes bitmapIndexOnCno

and bitmapIndexOnGrade as two object-relational relations in a manner

that simulates the situation just illustrated above.

Then, using the bitmapIndexOnCno and bitmapIndexOnGrade relations

and the technique of forming the bitmap-AND, bitmap-OR, and bitmap-AND

NOT of two bit-vectors, write a function FindStudents(booleanOperation

text, cno text, grade text) that can be used to answer queries of the

form as shown in Problem 5.

For example, the query

select * from FindStudents(’and’, ’c202’, ’A’);

should return the same result as that of the query

select sid, sname, major, byear

from Student

where sid in (select sid

from Enroll sid

where cno = ’c202’ and grade = ’A’);

13

Test your solution for the following cases on the Student and Enroll

relation given for this problem:

(a) select * from FindStudents(’and’, ’c202’, ’A’);

(b) select * from FindStudents(’or’, ’c202’, ’A’);

(c) select * from FindStudents(’and not’, ’c202’, ’A’);

7. • Consider the following parameters:

block size = 4096 bytes

block-address size = 9 bytes

block access time (I/O operation) = 10 ms (micro seconds)

record size = 150 bytes

record primary key size = 8 bytes

Assume that there is a B+-tree, adhering to these parameters, that indexes

1 billion (109

) records on their primary key values.

Provide answers to the following problems and show the intermediate computations leading to your answers.

(a) Specify (in ms) the minimum time to determine if there is a record

with key k in the B+-tree. (In this problem, you can not assume that

a key value that appears in an non-leaf node has a corresponding

record with that key value.)

(b) Specify (in ms) the maximum time to insert a record with key k in

the B+-tree assuming that this record was not already in the data

file.

(c) How large must main memory be to hold the first two levels of the

B

+-tree? How about the first three levels?

14

8. • Consider the following B+-tree of order 2 that holds records with keys

2, 8, and 11.11

(a) Show the contents of your B+-tree after inserting records with keys

4, 10, 12, 1, 7, and 5 in that order.

Strategy for splitting leaf nodes: when a leaf node n needs to be

split into two nodes n1 and n2 (where n1 will point to n2), then use

the rule that an even number of keys in n is moved into n1 and an

odd number of keys is moved into n2. So if n becomes conceptually

k1|k2|k3 then n1 should be k1|k2 and n2 should be k3| and

n1 → n2.

(b) Starting from your answer in question 8a, show the contents of your

B+-tree after deleting records with keys 12, 2, and 11, in that order.

(c) Starting from your answer in question 8b, show the contents of your

B+-tree after deleting records with keys 5, 1, and 4, in that order.

11Recall that (a) an internal node of a B+-tree of order 2 can have either 1 or 2 keys values,

and 2 or 3 sub-trees, and (b) a leaf node can have either 1 or 2 key values.

15

9. • Consider an extensible hashing data structure wherein (1) the initial

global depth is set at 1 and (2) all directory pointers point to the same

empty bucket which has local depth 0. So the hashing structure looks

like this: Assume that a bucket can hold at most two records.

(a) Show the state of the hash data structure after each of the following

insert sequences:12

i. records with keys 6 and 7.

ii. records with keys 1 and 2.

iii. records with keys 9 and 4.

iv. records with keys 8 and 3.

(b) Starting from the answer you obtained for Question 9a, show the

state of the hash data structure after each of the following delete

sequences:

i. records with keys 9 and 6.

ii. records with keys 0 and 1.

iii. records with keys 1 and 8.

As in the text book, the bit strings are interpreted starting from their

left-most bit and continuing to the right subsequent bits.

12You should interpret the key values as bit strings of length 4. So for example, key value

7 is represented as the bit string 0111 and key value 2 is represented as the bit string 0010.

16

The goal in the next problems is to study the behavior of indexes for various

different sized instances13 of the Person, worksFor, and Knows relations and for

various queries:

10. • Create an appropriate index on the worksFor relation that speedups the

lookup query

select pid from worksFor where cname = c;

Here c is a company name.

Illustrate this speedup by finding the execution times for this query for

various sizes of the worksFor relation.

11. • Create an appropriate index on the worksFor relation that speedups the

range query

select pid, cname

from worksFor

where s1 <= salary and salary <= s2;
Here s1 and s2 are two salaries with s1 < s2.
Illustrate this speedup by finding the execution times for this query for
various sizes of the worksFor relation.
12. • Create indexes on the worksFor relation that speedup the multiple conditions query
select pid
from worksFor
where salary = s and cname = c;
Here s is a salary and c is a company name.
Illustrate this speedup by finding the execution times for this query for
various sizes of the worksFor relation.
13. • Create indexes on the appropriate relations that speedup the semi-join
[anti semi-join] query
select pid, pname
from Person
where pid [not] in (select pid from worksFor where cname = c);
13(Three different sizes, small, medium, large suffice for your experiment.
17
Here c is a company name.
Illustrate this speedup by finding the execution times for this query for
various sizes of the Person and worksFor relations.
14. Practice problem–not graded.
Create indexes that speedup the path-discovery query
select distinct k1.pid1, k3.pid2
from knows k1, knows k2, knows k3
where k1.pid2 = k2.pid1 and k2.pid2 = k3.pid1;
Illustrate this speedup by finding the execution times for this query for
various sizes of the Knows relation.
18