CSE 4334/5334 – Data Mining: Project 2





Project 2 asks you to predict which jobs users will apply to. This will give us a basis for recommending
jobs to career website users. For a satisfactory user experience, it is important to recommend only jobs
that interest users.
You are given 5 files:
* jobs.tsv: The same file used in Project 1.
* users.tsv, apps.tsv, user_history.tsv: The schema and format of these files are identical to those of
the files used in Project 1, but they now contain more users and their applications/history.
* user2.tsv: A single-column file that contains the UserIDs of a subset of the users in users.tsv.
Conceptually, users are partitioned into 2 mutually exclusive sets: those in user2.tsv (denoted by U2)
and those not in it (denoted by U1). Timestamps are partitioned into 2 ranges: before 2012-04-09 00:00:00
(denoted by T1) and after 2012-04-09 00:00:00 (denoted by T2). The file apps.tsv contains all
applications made by U1 (during both T1 and T2) and all applications made during T1
(by both U1 and U2). It therefore omits exactly the applications made by U2 during T2.
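
The partition described above can be sketched in a few helper functions. This is only an illustration: the names parse_ts, time_range, user_set and visible_in_apps are hypothetical, the timestamp layout 'YYYY-MM-DD HH:MM:SS' is assumed, and treating the exact cutoff instant as part of T2 is an assumption, since the description only says "before" and "after".

```python
from datetime import datetime

# Boundary between T1 and T2, taken from the project description.
CUTOFF = datetime(2012, 4, 9, 0, 0, 0)

def parse_ts(text):
    """Parse a timestamp; assumes 'YYYY-MM-DD HH:MM:SS' (any fractional part is ignored)."""
    return datetime.strptime(text[:19], "%Y-%m-%d %H:%M:%S")

def time_range(ts):
    """Return 'T1' before the cutoff, else 'T2' (the cutoff instant itself is assumed to be T2)."""
    return "T1" if ts < CUTOFF else "T2"

def user_set(user_id, u2_ids):
    """Return 'U2' if the user is listed in user2.tsv, else 'U1'."""
    return "U2" if user_id in u2_ids else "U1"

def visible_in_apps(user_id, ts, u2_ids):
    """apps.tsv holds every application except those made by U2 users during T2."""
    return not (user_set(user_id, u2_ids) == "U2" and time_range(ts) == "T2")
```

The only hidden cell of the 2x2 grid is (U2, T2), which is exactly what the project asks you to predict.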
Your task is to predict which jobs the users in U2 applied to during T2. (One thing to remember is that
no one can apply to a job during T2 if that job's EndDate is before 2012-04-09 00:00:00.)
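
One way to apply the EndDate restriction is to collect the eligible JobIDs once and use that set as a filter later. The sketch below assumes that jobs.tsv has a header row with columns named JobID and EndDate and that dates use the 'YYYY-MM-DD HH:MM:SS' layout. Neither is confirmed by this description, so check both against the actual file. The function name eligible_job_ids is hypothetical.

```python
import csv
import io
from datetime import datetime

CUTOFF = datetime(2012, 4, 9)

def eligible_job_ids(tsv_file, id_col="JobID", end_col="EndDate"):
    """Return the IDs of jobs whose end date is after the cutoff.

    Column names are assumptions; adjust them to the real jobs.tsv header.
    """
    eligible = set()
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        end_text = (row.get(end_col) or "").strip()
        if not end_text:
            continue  # assumption: jobs without an end date are skipped
        end = datetime.strptime(end_text[:19], "%Y-%m-%d %H:%M:%S")
        if end > CUTOFF:
            eligible.add(row[id_col])
    return eligible

# Tiny in-memory stand-in for jobs.tsv (made-up rows, for illustration only).
sample = io.StringIO(
    "JobID\tEndDate\n"
    "10\t2012-04-08 23:59:59\n"
    "11\t2012-05-01 23:59:59\n"
)
print(eligible_job_ids(sample))  # {'11'}
```

With a real file, pass an open handle instead of the StringIO object, for example open(path, newline="", encoding="utf-8"). Free-text fields in the real file may need different csv quoting settings.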
More specific tasks include:
1) (10 points) read information from input files.
2) (50 points) build your prediction tool.
3) (20 points) print the prediction results to an output file named output.tsv. It should look like the
following. An example file sampleoutput.tsv is given to you.
1471976 1020868
1471976 628097
… …
1471976 284009
1471983 628097
1471983 891097

Your output file output.tsv should contain 150 lines, each of which has two tab-separated fields
(UserID, JobID). The UserID must belong to U2, and the JobID must have an EndDate after
2012-04-09 00:00:00.
The 150 pairs of (UserID, JobID) should be ordered by how likely it is that UserID applied to JobID
during T2, most likely first. (It is known that a user doesn't apply to the same job twice. So if you find
an application for a given UserID and JobID in apps.tsv, that pair shouldn't appear in output.tsv.)
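
The output rules above (U2 users only, eligible jobs only, no pair already in apps.tsv, most likely first, 150 lines) can be enforced in one selection step. The following is a minimal sketch, assuming you already have a dictionary of scores keyed by (UserID, JobID). The function names and the tie-breaking rule are illustrative choices, not part of the assignment.

```python
def select_predictions(scores, u2_ids, eligible_jobs, known_apps, limit=150):
    """Pick the top-scoring valid (user, job) pairs.

    scores:        {(user_id, job_id): numeric score}, higher means more likely
    u2_ids:        set of UserIDs from user2.tsv
    eligible_jobs: set of JobIDs whose EndDate is after the cutoff
    known_apps:    set of (user_id, job_id) pairs already present in apps.tsv
    """
    candidates = [
        (score, user, job)
        for (user, job), score in scores.items()
        if user in u2_ids and job in eligible_jobs and (user, job) not in known_apps
    ]
    # Highest score first; IDs break ties so the output is deterministic.
    candidates.sort(key=lambda t: (-t[0], t[1], t[2]))
    return [(user, job) for _, user, job in candidates[:limit]]

def format_output(pairs):
    """Render pairs as tab-separated lines, one pair per line."""
    return "".join(f"{user}\t{job}\n" for user, job in pairs)

def write_output(pairs, path):
    """Write the ranked pairs to the output file."""
    with open(path, "w", newline="") as out:
        out.write(format_output(pairs))
```

If fewer than 150 valid pairs have a score, this sketch writes fewer lines, so a real solution needs a fallback source of candidates to reach the required count.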
We will use your output.tsv to assess how accurate your prediction is, including whether more likely
applications are ordered before less likely ones. (We have ground truth data about all the job
applications made by U2 during T2.)
To accomplish the tasks, you need to look for clues in users' previous applications, demographic
information, and work history. You should consider comparing different approaches, and you should
tune and improve your prediction.
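
As one possible starting point among many, previous applications alone support a simple co-occurrence baseline: a job scores higher for a user when other users who share applications with that user also applied to it. This is a hedged sketch of that idea, not the method the assignment prescribes. The function name cooccurrence_scores is hypothetical, and it ignores demographic information and work history entirely.

```python
from collections import Counter, defaultdict

def cooccurrence_scores(apps, target_users):
    """Score unseen jobs for each target user by shared-applicant counts.

    apps:         iterable of (user_id, job_id) pairs, e.g. the rows of apps.tsv
    target_users: users to score, e.g. the U2 set
    Returns {(user_id, job_id): count}.
    """
    jobs_by_user = defaultdict(set)
    users_by_job = defaultdict(set)
    for user, job in apps:
        jobs_by_user[user].add(job)
        users_by_job[job].add(user)

    scores = {}
    for user in target_users:
        own_jobs = jobs_by_user.get(user, set())
        counts = Counter()
        for job in own_jobs:
            for other in users_by_job[job]:
                if other == user:
                    continue
                # Every job the co-applicant applied to is a candidate,
                # except jobs the target user already applied to.
                for candidate in jobs_by_user[other]:
                    if candidate not in own_jobs:
                        counts[candidate] += 1
        for candidate, count in counts.items():
            scores[(user, candidate)] = count
    return scores
```

Users with no applications in apps.tsv get no scores from this baseline, which is one reason to compare it against approaches that also use the demographic and history files.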
Your program should be executed by the following commands:
java your-main-class-name /path/to/data/file/directory/ /path/to/output.tsv
python your-script-file.py /path/to/data/file/directory/ /path/to/output.tsv
./a.out /path/to/data/file/directory/ /path/to/output.tsv
* /path/to/data/file/directory/ is the path (e.g., /home/john/data-mining/data/) to the directory that has
all 5 input files: users.tsv, jobs.tsv, apps.tsv, users_history.tsv, user2.tsv.
* /path/to/output.tsv is the path to the output file, e.g. /home/john/data-mining/output.tsv
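
For the Python variant of the command line above, the two arguments can be handled as in the sketch below. The function names are hypothetical. The history file is left out of the name list because this description spells it both user_history.tsv and users_history.tsv, so check the actual file name in the data directory.

```python
import os

# Input files whose names are spelled consistently in the description.
INPUT_NAMES = ["users.tsv", "jobs.tsv", "apps.tsv", "user2.tsv"]

def parse_args(argv):
    """Return (data_dir, output_path) from an argv-style list, or exit with a usage message."""
    if len(argv) != 3:
        prog = argv[0] if argv else "your-script-file.py"
        raise SystemExit(
            f"usage: python {prog} /path/to/data/file/directory/ /path/to/output.tsv"
        )
    return argv[1], argv[2]

def input_paths(data_dir):
    """Map each input file name to its full path inside the data directory."""
    return {name: os.path.join(data_dir, name) for name in INPUT_NAMES}

def main(argv):
    """Resolve all paths; the actual loading and prediction steps would follow here."""
    data_dir, output_path = parse_args(argv)
    return input_paths(data_dir), output_path
```

In a script, main would typically be called as main(sys.argv) under the usual if __name__ == "__main__" guard. os.path.join works whether or not the directory argument ends with a slash.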