CS2040S: Data Structures and Algorithms PS7 solved

$35.00

Category: You will receive a download link of the .ZIP file upon Payment

Description

5/5 - (1 vote)

Data Analysis Speed Demon
Overview
We live in a world overwhelmed with information. Every two days, we create as much new information as was created in the entire history of human civilization up until 2003 (according to Eric
Schmidt, the CEO of Google). We produce more than 3 exabytes of data per day! Much of this
data is stored in large databases, and one of the challenges today is to rapidly process and analyze
all the data. Since the databases are so large, it requires very fast algorithms and data structures
that are highly optimized for maximum efficiency. In this competition, you will try to develop the
fastest algorithm for analyzing a large dataset.
When analyzing a large dataset, there are many different goals. We will focus on a particular
type of data mining in which we want to discover properties and patterns of the underlying data.
Frequently, we want to know various statistics: the average, the median and the mode. Often, we
also want to know about patterns: how often do users of a certain profile (e.g., males between the
ages of 18 and 32) buy a certain item? Often, we want to know about correlations: how often does
a user who buys item A click on link B?
For the purpose of this competition, we define an abstract data mining problem that involves
finding correlations in our data set. The database consists of a large set of very large data entries.
The goal is to find how many pairs of entries are identical, i.e., contain the same information. Your
job is to implement a data mining program that reads in the database and performs this analysis
as fast as possible.
Problem Details
We now describe the details of the competition more precisely.
Input. The input “database” is a file consisting of a set of lines of text, each of which represents
one entry. Each line of text contains a large number of characters. (Notice that the lines may
consist of thousands, or even tens of thousands, of characters.)
Your program will implement a single public method: public int processData(String filename).
This method will take a filename as an input. It should then parse the file and process the data in
the file. Finally, it should return an integer as the answer. (We have given you an existing class
file template to use.)
The format of the input is as follows. The first line of the database contains a single integer i,
which represents the number of entries in the database. It is followed by i lines, each containing an
arbitrary number of characters and ended by an end-of-line character (ASCII character 10). Note
that entries may consist of any of the 128 legal ASCII characters, except for 10 and 13 (which
indicate a new line). Characters may be repeated, and entries may be of any length.
1
Output. Your program should calculate the number of pairs of entries that contain an identical
multiset (i.e. multiple instances of the same character are allowed) of characters. Notice that the
characters may appear in any order: two entries e1 and e2 are equivalent if e1 is equal to some
permutation of e2. You should print your output, which should consist of an integer, followed by a
newline character.
Example. The following is an example of an input database:
7
BCDEFGH
ABACD
BDCEF
BDCAA
DBACA
DABACA
DABAC
The appropriate output in this case is:
6
In particular, note the following six pairs of equivalent entries:
(ABACD, BDCAA)
(ABACD, DBACA)
(ABACD, DABAC)
(BDCAA, DBACA)
(BDCAA, DABAC)
(DBACA, DABAC)
Rules
The following are the rules of the competition:
• Your solution must be written in Java.
• You may use any available Java libraries. (I should note that I do not expect the fastest
solutions to rely heavily on existing libraries.)
• You are allowed to submit only one final program.
• All programs must be written entirely by you. You may use any ideas or algorithms that you
find on the internet, in books, etc. However, all the submitted code must be written by you.
• The competition will end on Friday, April 3 at 11:59pm.
• You may continue to update your solution up until the deadline. The final winner will be
determined by testing that occurs after the competition ends.
2
Submission Details
Here are some instructions for submitting the code:
• You need to submit your solution on Coursemology as usual.
• If you want to qualify for the competition, you will need to separately submit your solution
on a different platform. More details about this will be given later via an announcement.
• The top 10 winners of the competition will get bonus marks.
Hints
A few hints toward achieving good performance:
• First, develop and test a working solution that achieves good asymptotic performance. Then
improve it.
• Think about the performance of the data structures you are using and the actual costs of the
operations involved.
• Remember that for large databases, memory usage is important. Maintaining big data structures that use a lot of memory can be slow.
• Think about data locality in memory: accessing elements in memory that are near to each
other is much cheaper than accessing elements that are far away from each other.
• Beware of the small costs that add up. For example, declaring a variable leads to a memory
allocation which has a cost.
• Beware the costs of recursion.
• Profile your solution to determine what to optimize.
3
Java Help
How do I read in a file?
There are many different ways (and some may be faster than others). The easiest, perhaps, is to
use the BufferedReader and FileReader classes:
import java.io.BufferedReader;
import java.io.FileReader;
The entire access of the file needs to be wrapped in a try/catch loop in order to catch any exceptions
that may occur while reading the file:
try {
// Code for accessing the ile goes here
} catch (Exception e) {
System.out.println(e);
}
The first thing you need to do is to open the file using the FileReader:
FileReader dataFile = new FileReader(fileName);
You then access the file via a BufferedReader:
BufferedReader bufferedDataFile = new BufferedReader(dataFile);
You can then read the file:
String line = bufferedDataFile.readline();
For more information on the BufferedReader and FileReader (and other file access mechanisms),
see the Oracle Java Reference:
https://docs.oracle.com/javase/8/docs/api/java/io/BufferedReader.html
How do I measure how fast my program runs?
We have included several sample databases to test your program on. And remember the StopWatch
class you used back in the Sorting Detectives Problem Set? That might be useful here too.
4