CSE 4334/5334 – Data Mining Project 3 solved



Project 3 asks you to cluster jobs based on their descriptions and requirements.
You are given jobs.tsv, a tab-separated file with three columns: JobID, description, and
requirements. Note that this is neither the file used in Project 1 nor the one used in Project 2. Your task is to
design and implement a clustering algorithm that groups the given jobs into a number of clusters. You
decide which method to use and how many clusters to generate.
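Since the choice of method is left open, the following is only one possible starting point: a minimal k-means sketch over normalized term-frequency vectors, written in plain Python. The function names (vectorize, squared_distance, mean_vector, kmeans), the whitespace tokenization, the fixed iteration cap, and the 1-based cluster numbers are all assumptions made for illustration, not requirements of the assignment.

```python
import math
import random
from collections import Counter


def vectorize(texts):
    """Turn each text into a length-normalized term-frequency dict."""
    vectors = []
    for text in texts:
        counts = Counter(text.lower().split())  # assumed: whitespace tokens
        norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
        vectors.append({term: c / norm for term, c in counts.items()})
    return vectors


def squared_distance(a, b):
    """Squared Euclidean distance between two sparse vectors (dicts)."""
    keys = set(a) | set(b)
    return sum((a.get(key, 0.0) - b.get(key, 0.0)) ** 2 for key in keys)


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = {}
    for vec in vectors:
        for key, value in vec.items():
            total[key] = total.get(key, 0.0) + value
    return {key: value / len(vectors) for key, value in total.items()}


def kmeans(vectors, k, iterations=20, seed=0):
    """Return one cluster number (1..k) per vector; needs k <= len(vectors)."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)  # initial centroids: k random points
    labels = None
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(vec, centroids[c]))
            for vec in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean_vector(members)
    return [lab + 1 for lab in labels]
```

For real job descriptions you would likely want better tokenization, stop-word removal, or TF-IDF weighting, and you would compare several values of k before settling on one.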
More specific tasks include:
1) (10 points) read information from input files.
2) (50 points) implement your clustering program.
3) (20 points) print your cluster assignments to an output file named output.tsv. The output file should
look like the following. An example file sampleoutput.tsv is given to you.
1020868 1
628097 3
… …
284009 3
628097 9
891097 1
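As a rough illustration of tasks 1) and 3), the sketch below reads jobs.tsv and writes the two-column output. It assumes the file is UTF-8 encoded, has no header row, and holds exactly one job per line; the handout confirms none of these, so check them against your copy of the data. The helper names read_jobs and write_assignments are hypothetical.

```python
def read_jobs(path):
    """Read a tab-separated jobs file into a list of (job_id, text) pairs.

    Assumes: UTF-8, no header row, one job per line, JobID in column 1.
    """
    jobs = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            job_id = fields[0]
            # Join description and requirements into one text per job.
            text = " ".join(fields[1:])
            jobs.append((job_id, text))
    return jobs


def write_assignments(path, job_ids, labels):
    """Write one 'JobID<TAB>ClusterNo' line per job, in input order."""
    with open(path, "w", encoding="utf-8") as handle:
        for job_id, label in zip(job_ids, labels):
            handle.write(f"{job_id}\t{label}\n")
```

Writing one line per job that was read keeps the line count of output.tsv equal to that of jobs.tsv, provided the one-job-per-line assumption holds.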

Your output file output.tsv must contain exactly the same number of lines as jobs.tsv, each with two
tab-separated fields (JobID, ClusterNo). ClusterNo is the cluster number assigned to that JobID.
We will use your output.tsv to assess how good your clustering is. Whichever clustering method you
choose to implement, you should always have an evaluation metric (e.g., SSE) to assess the quality of
your clustering result.
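To make the SSE example concrete, here is a minimal sketch that computes the sum of squared Euclidean distances from each point to the centroid of its assigned cluster. It assumes jobs have already been turned into equal-length numeric vectors; the function name sse and the dense list-of-floats representation are illustrative choices, not part of the handout.

```python
def sse(points, labels):
    """Sum of squared distances from each point to its cluster centroid.

    points: sequence of equal-length numeric sequences.
    labels: one cluster number per point, in the same order.
    """
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        # Centroid = component-wise mean of the cluster's members.
        centroid = [sum(p[i] for p in members) / len(members) for i in range(dim)]
        for p in members:
            total += sum((x - c) ** 2 for x, c in zip(p, centroid))
    return total
```

For example, sse([(0.0, 0.0), (2.0, 0.0), (10.0, 10.0)], [1, 1, 2]) evaluates to 2.0: the two points in cluster 1 each lie at squared distance 1 from the centroid (1, 0), and the single point in cluster 2 coincides with its centroid. Lower SSE means tighter clusters, but SSE always decreases as the number of clusters grows, so it is most useful for comparing runs with the same number of clusters.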
Your program must be executable with the following commands, and the arguments must be in the same order:
java your-main-class-name /path/to/data/file/directory/ /path/to/output.tsv
python your-script-file.py /path/to/data/file/directory/ /path/to/output.tsv
./a.out /path/to/data/file/directory/ /path/to/output.tsv
* /path/to/data/file/directory/ is the path (e.g., /home/john/data-mining/data/) to the directory that has
the input file: jobs.tsv.
* /path/to/output.tsv is the path to the output file, e.g. /home/john/data-mining/output.tsv
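For the Python variant of the command, the two positional arguments could be handled roughly as follows. This is a sketch under the stated conventions: the first argument is the data directory containing jobs.tsv and the second is the full path of the output file. The helper name resolve_paths and the usage message are hypothetical.

```python
import os


def resolve_paths(argv):
    """Map [script, data_dir, output_path] to (input_path, output_path)."""
    if len(argv) != 3:
        script = argv[0] if argv else "your-script-file.py"
        raise SystemExit(f"usage: python {script} <data-directory> <output.tsv>")
    data_dir, output_path = argv[1], argv[2]
    # The input file name is fixed by the assignment; only its directory varies.
    input_path = os.path.join(data_dir, "jobs.tsv")
    return input_path, output_path
```

A script would typically call resolve_paths(sys.argv) once at startup (after import sys). Using os.path.join means the data directory works with or without a trailing slash.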