## Description

Overview

Assume that you are working in the data mining department of a civil registry and you need to

process big data that includes statistical information. Before applying data mining techniques

to the data, you need to apply some preprocessing operations. You are provided a data set

(population_by_zip2010.csv) that includes the information about the residences in 2010 in the

US. In this file, every single line (except the first one, since it is the header) belongs to a

distinct residence. The following information is provided in the data set:

Population: The population for that segment

Minimum age (can be null): Minimum age in the range

Maximum age (can be null): Maximum age in the range

Gender (can be null)

Zip code: Zip code of that residence

Geo_id: Geo code of that residence

The department wants to have this data sorted in this manner: The data should be sorted

considering the population. However, if two residence has the same number of habitants,

geo_id should be considered. If both parts match, the behavior of your algorithm should

decide which residence should appear first.

Code (50 points)

You are required to implement Quicksort algorithm by considering the constraints described

above for this problem. Your program should be executed from the command line as follows

with the following parameters:

./yourExecutableName N

N: number of the residences to be processed (You can just take the first N entries from the

file)

After the execution, a message including the elapsed time should be printed out, and an output

file that includes the sorted data with the same format of the input file should be produced.

Report (50 points)

a. (10 points) Give the asymptotic upper bound on the running time for Quicksort. If there is a

recurrence for the algorithm, give and solve it.

b. (10 points) Run your program for N as {10, 100, 1000, 10000, 100000, 500000, 1000000},

and calculate the average time of execution (for example run 10 times for N = 10 and take the

average, 10 times for N = 100 and take the average, and so on…) for each value of N. After

calculating average execution times, you are required to prepare a two-lined plot in order to

visualize the run time complexity of Quicksort for different values of N. Comment on the

results by considering the asymptotic bound that you have found in (a).

Note: You can use the clock() function under ctime library for calculating time of execution

for the search functions. Refer to http://www.cplusplus.com/reference/clibrary/ctime/clock for

more details.

c. (20 points) What is the worst case for Quicksort? Construct a data set in order to simulate

this case, and execute your program for different values of N as {10, 100, 1000, 10000,

100000, 500000, 1000000}. Report the average execution times, and introduce a plot like you

have done in part (b). Introduce a solution in order to overcome the worst case of Quicksort

(You do not need to implement that solution).

d. (10 points) Is Quicksort stable? Explain your answer by illustrating your answer with a

small fraction from the data set. You may modify the values if necessary.

Submission

You should be aware that the Ninova system clock may not be synchronized with your

computer, watch, or cell phone. Do not e-mail the teaching assistant or the instructors your

submission after the Ninova submission has closed. If you have submitted to Ninova once and

want to make any changes to your report, you should do it before the Ninova submission

system closes. Your changes will not be accepted by e-mail. Connectivity problems to the

Internet or to Ninova in the last few minutes are not valid excuses for being unable to submit.

You should not risk leaving your submission to the last few minutes. After uploading to

Ninova, check to make sure that your submission appears there.

Policy: You may discuss the problem addressed by the project at an abstract level with your

classmates, but you should not share or copy code from your classmates or from the Internet.

You should submit your own, individual project. Plagiarism and any other forms of

cheating will have serious consequences, including failing the course.

Submission Instructions: Please submit your assignment through Ninova. Include both your

report (as a PDF file) and your code (including header files) in the archive file you submit.

All your code must be written in C++, and must compile and run on the common ITU

Linux Server (accessible through SSH) using g++. If your code requires non-standard

compiling, please state compilation instructions in your report.

When you write your code, try to follow an object-oriented methodology with well-chosen

variable, method, and class names and comments where necessary. Your code must compile

without any errors; otherwise, you may get a grade of zero on the coding part of the

assignment.

If a question is not clear, please let the teaching assistant know by email (daltan@itu.edu.tr).