CS 4341 Project 3: Decision Trees for Connect-4

Project Goal

This project is designed to explore the use of Decision Trees.  When done, you should be able to fit trees and interpret their output.  Your goal is to run some studies using a standard package and build your intuition on how they work.  


Project Task
 
Input:
Training Dataset: The dataset is collected from multiple Connect-4 game plays. It includes 1000 instances of <Board state, Winner>. For each board state, the data file tells you whether the result is a WIN, LOSS, or TIE.
P.S.
This dataset was collected from the original Connect-4 game rather than the "Pop-out" version we used. You can find the training dataset in the attached trainDataset(2).csv and a detailed description in dataDescription.txt.
Tasks:
    Feature Extraction
        Program Assignment
You need to write a program to extract at least 3 features from the raw board states given in the training dataset.
Of the 5+ features you need to complete the assignment, two are required.
        The two required features you must build are:
                Which player has a piece at the bottom left corner of the board, and
                Which player has more pieces in the center columns.
 
Features can be any characteristic of the board, such as how many pieces a player has in a certain column, or who controls the bottom center piece. (A sketch of how such features might be computed appears after this Tasks section.)
 
        Report Assignment
In your project report, you should provide a brief verbal description of each feature and some sentences to justify the feature design. 
In other words, you should describe how to calculate those features and why you think those features would work. 
 

Suggestions from Professor Heffernan

“This is a key part of the project, so please be creative. The better your features are, the more likely they are to perform well during testing. Though don’t expect that every feature you generate will be a guaranteed success.”

    Model Building & Testing (Decision Tree)

        Implement a decision tree model in the programming language of your choice. The tree should take a board state as input and predict who will win if the game is continued. Make sure to include cross-validation functionality.
        Report Assignment
In your project report, you should report which features seem to be best at predicting winners. Are there any surprises? Why is cross-validation important?
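
As a rough sketch (not an official specification), here is how the two required features might be computed in Python. The board layout assumed below — 42 cells listed column by column with the bottom cell first, encoded as 0 = empty, 1 = player 1, 2 = player 2 — and the choice of the middle three columns as the "center columns" are assumptions; verify them against dataDescription.txt.

# Sketch only: board layout and encoding are assumptions; check dataDescription.txt.
def parse_board(values):
    """Turn 42 raw values into a list of 7 columns, bottom cell first."""
    return [values[c * 6:(c + 1) * 6] for c in range(7)]

def bottom_left_owner(board):
    """Required feature 1: who has a piece at the bottom left corner (0 = empty)."""
    return board[0][0]

def center_pieces_leader(board):
    """Required feature 2: which player has more pieces in the center columns
    (taken here to be the middle three columns; 0 means tied)."""
    p1 = sum(cell == 1 for col in board[2:5] for cell in col)
    p2 = sum(cell == 2 for col in board[2:5] for cell in col)
    return 1 if p1 > p2 else 2 if p2 > p1 else 0

# Example: the first row of trainDataSet, with the final class-label value dropped
row = [1,0,0,0,0,0, 0,0,0,0,0,0, 2,1,2,0,0,0, 0,0,0,0,0,0,
       2,1,2,1,0,0, 0,0,0,0,0,0, 0,0,0,0,0,0]
board = parse_board(row)
print(bottom_left_owner(board))     # -> 1
print(center_pieces_leader(board))  # -> 2 (player 2 has 4 center pieces to 3)
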
Submission Guidelines

    Due Date: Sunday, September 15, 11:59 pm
 
    Format:
Submit a zipped file named member1_member2_project3.zip on Canvas.
A wrong name format will lead to a 10-point loss. Everything should be in a single zip file.

    Submission Content

The zip file contains:
    1. Program to generate features for the decision trees.
    2. Whatever you did to create your decision tree. If you wrote Python code, you can submit that. If you used Weka, then show me some screenshots of the key things you did.
    3. A README file providing any information to help grade and execute your programs.
    4. A member1_member2_p3_report.(doc/pdf) file containing the information outlined in the Report section below.  
        (Only Word or PDF format is accepted!)
 

    Report: 

1. Program for Feature Generation

Your program should work on the command line, take an input file name as a parameter, and generate your defined features as output, as follows:

 

./yourProgram  input.csv  output.csv

A simple readMe.txt is necessary to explain how to execute your program.
The <input.csv> is where you can pass in the file with 1000 training examples.
The TA will take your program and pass in a different file containing another set of 1000 instances that we do not provide to you.
The <output.csv> should look exactly like the data in <input.csv>, but with new columns concatenated to the right.
Since you have to create extra features, each extra feature you have will add an extra column to this file.
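
To make the expected command-line behavior concrete, here is a minimal sketch in Python, assuming plain CSV rows with the 42 board cells first; the feature function is a made-up placeholder for your own:

#!/usr/bin/env python3
# Sketch: read input.csv, append one column per feature, write output.csv.
import csv
import sys

def piece_count_p1(board_values):      # hypothetical placeholder feature
    return sum(v == 1 for v in board_values)

def main():
    in_path, out_path = sys.argv[1], sys.argv[2]
    with open(in_path, newline="") as fin, open(out_path, "w", newline="") as fout:
        writer = csv.writer(fout)
        for row in csv.reader(fin):
            board = [int(v) for v in row[:42]]   # last column is the class label
            writer.writerow(row + [piece_count_p1(board)])

if __name__ == "__main__":
    main()
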

 

There is something about using Weka for uploading files in the FAQ section.

If you are using Python 3, there is an example script and a *.csv file in argumenttester.zip that you can use to learn how to read in and output *.csv files from the command line in Windows. Just extract the folder to somewhere on your machine where permissions are not an issue (the Desktop works). Then type “cmd” into the search bar on the lower left and hit Enter. If you are using the Anaconda prompt, you can open that instead. The Windows command prompt starts up on my machine like this:

c:\Users\MySuperSecretUsername>

In order to navigate to the Desktop, type this in:

cd Desktop

You can use autocomplete (pressing the Tab key) to finish the name once you have typed “De”.

Then run this command to run the script:

python argumenttester.py 1 takemeon.csv rejectme.csv

You will see some printed messages on the console. This script will also create a *.csv file called rejectme.csv.

Running your Python script works similarly on a Unix machine.

The graders will then be able to run

sudo chmod +x yourpythonfile.py

on a Unix machine and then run

./yourpythonfile.py argument1.csv argument2.csv

to run your file. Hopefully we will have an example for command-line Java or C++ in a bit.
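
One detail the chmod step assumes: for ./yourpythonfile.py to launch directly, the first line of the script must be a shebang naming the interpreter; chmod +x only marks the file as executable. For example:

#!/usr/bin/env python3
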

2. Decision Tree Program

Go learn how to create decision trees and random forests.
Use whatever library you want to learn a decision tree. You can use Weka (if you like Java) or scikit-learn if you like Python (http://scikit-learn.org/stable/). (You don’t have to implement decision trees by hand, but that site shows you how to do that.)
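
As a rough sketch of the scikit-learn route (the file and column names here are assumptions; use whatever your feature program produces):

import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Load the feature file produced by your feature-generation program.
# "winner" is an assumed name for the class column; adjust to your file.
data = pd.read_csv("output.csv")
X = data.drop(columns=["winner"])
y = data["winner"]

# Estimate accuracy with 10-fold cross-validation.
tree = DecisionTreeClassifier(max_depth=5, random_state=0)
scores = cross_val_score(tree, X, y, cv=10)
print("mean CV accuracy:", scores.mean())

# Fit on all the data to inspect which features the tree actually uses.
tree.fit(X, y)
for name, importance in zip(X.columns, tree.feature_importances_):
    print(name, round(importance, 3))
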

 

3. Project Report

3.1 Decision Trees
  1. Explain your definitions of your 5+ features and why you thought they should be useful.
  2. Explain what you did to test your decision trees, what sort of cross-validation you used, the corresponding results, and your final analysis.
    1. Similar to the past project, test three ideas on how to make a more informative model. (Most packages have many knobs you can turn; find some knobs, turn them, and report what you find.)
      1. You might want to compare the number of trees in a random forest.
      2. You might want to compare pruning methods in a decision tree (e.g., turning chi-square pruning on or off).
Hints: 
        Tell us a little bit about the value of your different features. Which one is most useful? How did you figure that out? Or are some subsets of them useful? How can you figure these out?  (I would like you to take one of the trees you learned and tell me about the importance of each feature by looking at the tree and telling me something about it. I would also like for you to come up with a statistical way of saying which features worked best.  Can you come up with a way?)
        I suggest you compare a strategy of leave-one-feature-out with a strategy of testing each feature separately. This will require you to run many different models. Do these two strategies rank the importance of the features the same way? Tell me how you can use such a strategy to explain how the features were ranked.
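
A minimal sketch of the leave-one-feature-out strategy, reusing the scikit-learn setup sketched earlier (the file and column names remain assumptions):

import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

data = pd.read_csv("output.csv")             # file from your feature program
X, y = data.drop(columns=["winner"]), data["winner"]

# Drop each feature in turn and see how much cross-validated accuracy falls;
# a big drop suggests the feature carries information the others do not.
baseline = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10).mean()
for feature in X.columns:
    score = cross_val_score(DecisionTreeClassifier(random_state=0),
                            X.drop(columns=[feature]), y, cv=10).mean()
    print(feature, "drop in accuracy:", round(baseline - score, 3))
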
Bonus: 
        Have you tried other feature selection approaches? If yes, name them and tell us a little about their performance or advantages compared to the decision tree. (5 points)

Grade Rubric
Your program will be tested on another 1000 board states. Evaluation will be based on prediction accuracy. The only file your program needs for training is trainDataSet.csv.
 
Component | Content | Points
Report    | Feature description | 10
          | Feature design strategy. Did you make some good features? | 20
          | Did you use cross-validation intelligently? | 10
          | How did you figure out which features were important? Tell us what you did and show us what you found. Did you use sound reasoning in interpreting your output? | 40
          | Clear and understandable report. | 20
Sum       | | 100

FAQ

 
Q1: I like Java. What are the acceptable file formats for WEKA?
A1: WEKA accepts multiple file formats, including:
        WEKA’s ARFF format,
        CSV format,
        C4.5 format, and
        serialized Instances format.
        This information can be found in the Weka documentation: “Using the Open file… button you can read files in a variety of formats.”
        Detail: ARFF files typically have a .arff extension, CSV files a .csv extension, C4.5 files a .data and .names extension, and serialized Instances objects have a .bsi extension. For this project, a .csv input file is most convenient.
A .csv file looks like:
feature1, feature2, …, classLabel
value1, value2, …, classValue
The first line contains the names of the defined features plus the class label.
 
Q2: Does player 1 move first from each board state?
A2: Yes. The first player, player 1, is the one who moves next from each given board state.
 
Q3: Is the number of moves in board states fixed? 
A3: No. Different board states may have different numbers of moves. 
 
Q4: What else can you tell us about feature definitions?
A4: Any kind of creative feature is welcome.
        You can define features about the difference between the two players in some evaluation metric (think about what you did in Project 1).
        The purpose of the example figure is to show the data format for submission, so feel free to define creative features.
 
        Representative feature definition:
        centerControl:
  • center discs: the discs in the center columns owned by each player
  • centerControl is the difference between the number of center discs of the first player and that of the second player.
  • Type: numeric
 
The first given input in trainDataSet:
 
1,0,0,0,0,0,0,0,0,0,0,0,2,1,2,0,0,0,0,0,0,0,0,0,2,1,2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
 
 
should be read as a board state. (The original handout showed the board as a figure, with Red marking the first player’s discs and Yellow the second player’s, and highlighted the center columns.) Counting discs in the three center columns, the first player has 3 and the second player has 4, so the value of centerControl is 3 - 4 = -1.

Extra Credit Options

 
 
1. “Binning” Attributes:  Moving from continuous to categorical values
 
Hint from Prof Heffernan:
Your d-trees are probably treating each different value of each feature separately. Meaning, let’s say you have a feature like “control of the center” that is computed by taking the number of chips player one has in the center columns and subtracting the number of chips the second player has there. This will be a continuous number ranging from, say, -7 (player 2 has complete control) to positive 7 (player 1 has complete control).
Your d-tree might learn that when the value is -3, player 2 is very likely to win in most cases, but at values -4 and -5, player 1 actually has a high chance of winning, regardless of player 2’s supposed advantage. To deal with this you can make a new feature where you bin your feature into a small set of bins. For instance, in this example you might split on whether the value is less than -2, between -2 and 2, or greater than 2. This would give you three bins.
You might find value in your features this way. You will lose precision for each value, but since each bin will have more data points, you are more likely to estimate a good value for that bin’s mean. Be sure not to take this concept too far: if you bin an attribute too much, the cases become more and more specific, to the point that the data becomes less than useful in most normal situations.
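
A tiny sketch of the binning idea, using the centerControl feature and the -2/+2 thresholds from the hint above:

def bin_center_control(value):
    """Map a continuous centerControl value (-7..7) into three bins."""
    if value < -2:
        return "player2_control"
    elif value > 2:
        return "player1_control"
    else:
        return "contested"

print(bin_center_control(-5))   # -> "player2_control"
print(bin_center_control(1))    # -> "contested"
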
 
2. Feature Generation by Combining Features: 
 
Another thing you can do is take any two of the features you have and make new features! For instance, you can add, subtract, or multiply features together to make new, more specific features. Basically, take your 5 features and make 25 new ones by doing pairwise multiplication. By squaring a feature you might find value, but I would think that would only be true when you are using the features in a continuous way.
This could help, but maybe not if the new features take on many different values. The ones that don’t help you can ignore. You can then take all 5 features and add them together pairwise, or subtract them pairwise, to get 50 new features (subtracting might be handy if you have two features that are similar but for the two different players). In class I talked about how the winner of the KDD Cup generated 1 million features; this is the sort of thing they did. Most of the time the features will be no good, but you might find some really great features that increase your predictive accuracy. (A rough sketch of this idea follows below.)
 
        One somewhat easy idea is to take your features and bin them into, say, two or three values based upon their means. Then try multiplying them together, and adding and subtracting them. Binizing (“discretizing” is the formal term; no one but me says “binizing” for putting values into bins, so don’t use my language 🙂) is something Weka can possibly do for you. Here’s a tutorial:
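
A rough sketch of pairwise feature combination with pandas (the file and column names are placeholders for whatever your feature program produces):

import itertools
import pandas as pd

data = pd.read_csv("output.csv")                 # file from your feature program
feature_names = [c for c in data.columns if c != "winner"]   # "winner" is assumed

# Make products, sums, and differences for every pair of features.
for a, b in itertools.combinations(feature_names, 2):
    data[a + "_x_" + b] = data[a] * data[b]
    data[a + "_plus_" + b] = data[a] + data[b]
    data[a + "_minus_" + b] = data[a] - data[b]

data.to_csv("combined_features.csv", index=False)
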
 
 
3. If you like Java, you might like WEKA. They implement decision trees.

Here is a cool video by the creator of Weka, Ian Witten. He was a super nice guy when I met him. Check out this video, as he gives some conceptual background but also shows how to do some things in Weka.