CPrE/SE 419: SOFTWARE TOOLS FOR LARGE-SCALE DATA ANALYSIS Lab 3 solved

$35.00

Category: You will receive a download link of the .ZIP file upon Payment

Description

5/5 - (1 vote)

Purpose In this lab, your goal is to write a program using MapReduce that can sort a large data set quickly. Sorting is a common task on large data sets, and you will be working on a standard input data set, which is used as a benchmark for comparing different implementations of sort. At the end of the lab, you will be able to write an algorithm and implementation for sorting a large data set, and measure its performance on various input data sets. Constraints: • Use no more than 10 reducers at any MapReduce stage. • In your program, make sure to delete all the temporary directories which contains the intermediate data before exiting. Metric of Success: • The output of your program could be one or more sorted lists. Suppose you have n reducers, e.g. R1, R2, …, Rn, all data in R1 need to be smaller than all data in R2, and all data in R2 need to be smaller than R3 and so on. Submission (50 points) Create a zip (or tar) archive and hand it in through Canvas. Label your submission “lab4_”. Your submission includes: ● A writeup describing: • The results for each experiment including top 5 and last 5 lines for each partition. • The strategy you used for experiment 2. • The arguments for your jar file to run, for each experiment. • Any other resources to support your work. ● Commented Code for your program. Include all source files needed for compilation. Resource • Lecture notes on sorting and partitioners in MR • http://hadoop.apache.org/docs/r3.1.1/api/org/apache/hadoop/mapreduce/Partitioner.html • https://hadoop.apache.org/docs/r3.1.1/api/org/apache/hadoop/mapreduce/lib/partition/Total OrderPartitioner.html Template Code • Chapter 9 MapReduce Features: Total Sort, Hadoop-The Definitive Guide Electrical and Computer Engineering / Software Engineering Dataset We have multiple datasets in the directory “/cpre419/”. Each dataset is made of multiple lines, one line per record. Each record has a key (first 15 chars) and a payload (40 chars), separated by tab character. The final output should be sorted by key, and the output should also include the payload for the key. Example of the input data: Experiment 1 (25 points) Sort the data in “/cpre419/input-50m” by keys, using TotalOrderPartitioner. Use up to 10 reducers. Experiment 2 (25 points) Write your own “MyPartitioner” class and use it to sort the data in “/cpre419/input-5k”. You MUST NOT use InputSampler class. Explain your algorithm/strategy to partition in the submission. Use up to 10 reducers. Bonus We have the following datasets, sorted in size. Try TotalOrderPartitioner and your own partitioner (MyPartitioner) on each dataset. Do you get different results? What makes it different? Which is the largest dataset your solution can sort? You get the bonus points for the largest dataset you can sort by using only up to 10 reducers. • /cpre419/input-5k 0 point – requested • /cpre419/input-50k 3 points Electrical and Computer Engineering / Software Engineering • /cpre419/input-500k 5 points • /cpre419/input-5m 10 points • /cpre419/input-50m 20 points