INF553 Homework #1: MapReduce & Spark solved

$35.00

Category: You will receive a download link of the .ZIP file upon Payment

Description

5/5 - (1 vote)

1. [Hadoop MapReduce, 60 points] In this problem, you are asked to write a Hadoop MapReduce
program, TwoPhase.java, that computes the multiplication of two given matrices using the twophase approach described in class.
A template file is provided to you, which contains the skeleton for mapper and reducer of phase one
and two. You only need to supply codes for the block indicated by “// fill in your code” add import
statements at the beginning of the file if necessary (you can only import packages from java.*). DO
NOT modify other parts of the template.
Note that the input matrices A and B are stored in two separate directories, e.g., mat-A and mat-B.
Each directory contains a single text file, each line of which is a tuple (row-index, column-index,
value) that specifies a non-zero value in the matrix.
For example, the following is the content of mat-A/values.txt which stores the entries for the matrix
A shown above.
0,0,2
0,1,2
0,2,1
1,0,2
1,1,1
2,0,1
2,2,2
MultileInputs class is used to specify different mappers for different input directory.
You also need to submit your jar file with name __2p.jar.
Example invocation of your program is as follows:
bin/hadoop jar __2p.jar TwoPhase mat-A
mat-B output
Your output directory (the file part-r-00000) should contain the entries of matrix C (= A * B) in the
following format (tab-separated). For example,
INF 553 – Spring 2017
2
1,1 3
1,2 7
2,1 2
2,2 4
3,1 3
3,2 3
Submissions: Submit both __TwoPhase.java and
__2p.jar. DO NOT make them into folder or zip file.
2. [40 points] Write a Spark program in Python, TwoPhase.py, which implements the same 2-phase
approach as in Problem 1.
Restrictions: you can NOT use join (and any outer join variations) in your code, but may use any
other transformations and actions provided in the lecture. This gives you a chance to think about
how to implement join using other methods in Spark. It will also maximize the benefits of this
exercise. In the end, you will implement a very similar program in Spark as you have done in Hadoop
MapReduce for Problem 1.
Your program should be invoked as follows.
bin/spark-submit __TwoPhase.py matA/values.txt mat-B/values.txt output.txt
It should produce the same output as Problem 1 and store the output in a file “output.txt”.
Submissions: Submit __TwoPhase.py.
Notes:
 In both questions, you may assume that matrix entries are all integers.
 Make sure to follow the output format (The order of output entries doesn’t matter) and the
naming format. If you don’t follow either one or both of them, 20% points will be deducted.
 You MUST implement your program using the two-phase approach described in class, and
you CAN NOT use join transformation in your python program. If you don’t follow these
instructions, 80% points will be deducted.
 Make sure that your program can be invoked correctly. If not, no more than 50% of the
points can be earned.