EECS 349 Problem Set 1

Due 11:59PM Tuesday, Jan 21

Updated Jan 08 12:02:11 CST 2014


In this assignment you will run a machine learning experiment using Weka, an open source framework for machine learning and data mining. You will generate a model that predicts the quality of wine based on its chemical attributes. You will train the model on the supplied training data and use the model to predict the correct output for unlabeled test data.

Submission Instructions

You will turn in your homework as a single zip file via Blackboard. Specifically:

  1. Create a text file with your answers to the questions below. Be sure you have answered all the questions. Name this file PS1.txt.
  2. Create a file containing your Weka model (instructions below). Be sure this file can be loaded into Weka and that it runs. Name this file PS1.model.
  3. Create a text file in ARFF format with your predicted labels for the test set (instructions below). Name this file PS1.arff.
  4. Create a single ZIP file containing the three files above (PS1.txt, PS1.model, and PS1.arff).
  5. Turn the zip file in under Problem Set 1 in Blackboard.

Download and Install Weka

Weka is available for Windows, Mac, and Linux from http://www.cs.waikato.ac.nz/ml/weka/. Click on the "Download" link on the left-hand side and download the Stable GUI version, which is currently 3.6. You may also wish to download a Weka manual from the "Documentation" page.

Download the Dataset

The dataset files are here:

This dataset is adapted from:

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

This dataset contains data for 2700 white variants of the Portuguese "Vinho Verde" wine. For each variant, 11 chemical features were measured. Each of these is a numeric attribute. They are:

  1. fixed acidity
  2. volatile acidity
  3. citric acid
  4. residual sugar
  5. chlorides
  6. free sulfur dioxide
  7. total sulfur dioxide
  8. density
  9. pH
  10. sulphates
  11. alcohol

Each variant was tasted by three experts. Their ratings have been combined into a single quality label: "good" or "bad". Therefore this is a binary classification problem with numeric attributes.

The dataset has been randomly split into a training set (1890 variants) and a test set (810 variants). The training set contains both chemical features and quality labels. The test set contains only the chemical features.
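
The kind of random split described above can be sketched in Python. This is an illustrative stand-in only, not the actual procedure used to produce train.arff and test.arff:

```python
import random

def train_test_split(instances, n_train, seed=0):
    """Randomly partition a list of instances into train and test sets."""
    rng = random.Random(seed)
    shuffled = instances[:]          # copy so the original order is preserved
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

# 2700 variants split into 1890 training and 810 test instances
variants = list(range(2700))
train, test = train_test_split(variants, 1890)
```

Every variant lands in exactly one of the two sets, which is what lets the test set serve as a fair estimate of performance on unseen data.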

Examine the Data

It is a good idea to inspect your data by hand before running any machine learning experiments, to ensure that the dataset is in the correct format and that you understand what the dataset contains. The following sections will familiarize you with the data and introduce some tools in Weka.

The ARFF Format

View train.arff and test.arff in a text editor. You should see something like this:

The files are in ARFF (Attribute-Relation File Format), a text format developed for Weka. At the top of each file you will see a list of attributes, followed by a data section with rows of comma-separated values, one for each instance. The test and training files look similar, except that the last value for each training instance is a quality label and the last value for each test instance is a question mark, since these instances are unlabeled.

For this assignment you will not need to deal with the ARFF format directly, as Weka will handle reading and writing ARFF files for you. In future experiments you may have to convert between ARFF and another data format. (You can close the text editor.)
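
To make the format concrete, here is a minimal sketch of an ARFF reader. It handles only the simple subset of ARFF used in this assignment (named attributes, a @data marker, comma-separated rows), and the sample text is invented for illustration:

```python
def parse_arff(text):
    """Parse a simplified ARFF string into (attribute_names, rows).

    Handles only the subset of ARFF used here: @attribute lines,
    an @data marker, and comma-separated value rows.
    """
    attributes, rows = [], []
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('%'):      # skip blanks and comments
            continue
        lower = line.lower()
        if lower.startswith('@attribute'):
            attributes.append(line.split()[1])    # second token is the name
        elif lower.startswith('@data'):
            in_data = True
        elif in_data:
            rows.append(line.split(','))
    return attributes, rows

sample = """@relation wine
@attribute alcohol numeric
@attribute quality {good,bad}
@data
10.8,good
9.5,?
"""
attrs, rows = parse_arff(sample)
```

Note the '?' in the last row: this is how ARFF marks a missing value, which is exactly how the test set represents its unknown quality labels.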

The Weka ARFF Viewer

Run Weka. You will get a screen like the following:

From the Tools menu choose ArffViewer. In the window that opens, choose FileOpen and open one of the data files. You should see something like the following:

Here you see the same data as in the text editor, but parsed into a spreadsheet-like format. Although you will not need the ArffViewer for this assignment, it is a useful tool to know about when working with Weka. (You can close the ArffViewer window.)

The Weka Explorer

From the Weka GUI Chooser, click on the Explorer button to open the Weka Explorer. The Explorer is the main tool in Weka, and the one you are most likely to work with when setting up an experiment. For the remainder of this assignment you will work within the Weka Explorer. The Explorer should open to the "Preprocess" tab. The Preprocess tab allows you to inspect and modify your dataset before passing it to a machine learning algorithm. Click on the button that says "Open file..." and open train.arff. You should see something like this:

The attributes are listed in the bottom left, and summary statistics for the currently selected attribute are shown on the right side, along with a histogram. Click on each attribute (or use the down arrow key to move through them) and look at the corresponding histogram. You will notice that many numeric attributes have a "hump" shape; this is a common pattern for numeric attributes drawn from real-world data.

You will also notice that some attributes appear to have outliers on one or both sides of the distribution. The proper treatment of outliers varies from one experiment to another. For this assignment you can leave the outliers alone.

Now answer Question #1.

Classifier Basics

In this section you will see how to train a classifier on the data.

Baseline Classifier

Click on the "Classify" tab. Choose ZeroR as the Classifier if it is not already chosen (it is under the "rules" subtree when you click on the "Choose" button). When used in a classification problem, ZeroR simply chooses the majority class. Under "Test options" select "Use training set", then click the "Start" button to run the classifier. You should see something like this:

The classifier output pane displays information about the model created by the classifier as well as the evaluated performance of the model. In the Summary section, the row "Correctly Classified Instances" reports the accuracy of the model.
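
ZeroR's behavior on a classification problem amounts to the following computation (a conceptual sketch with toy labels, not Weka's implementation; the real class proportions come from train.arff):

```python
from collections import Counter

def zero_r_accuracy(labels):
    """Accuracy of always predicting the majority class (a ZeroR-style baseline)."""
    counts = Counter(labels)
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(labels)

# toy labels: 7 "good", 3 "bad"
labels = ['good'] * 7 + ['bad'] * 3
acc = zero_r_accuracy(labels)   # 0.7
```

In other words, ZeroR's training-set accuracy is simply the fraction of instances belonging to the most common class.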

Now answer Question #2.

Decision Trees

J48 is the Weka implementation of the C4.5 decision tree algorithm.

Click on the "Choose" button and select J48 under the "trees" section. Notice that the field to the right of the "Choose" button updates to say "J48 -C 0.25 -M 2". This is a command-line representation of the current settings of J48. Click on this field to open up the configuration dialog for J48:

Each classifier has a configuration dialog such as this that shows the parameters of the algorithm as well as buttons at the top for more information. When you change the settings and close the dialog, the command line representation updates accordingly. For now we will use the default settings, so hit "Cancel" to close the dialog.

Under "Test options" select "Use training set", then click the "Start" button to run the classifier. After the classifier finishes, scroll up in the output pane. You should see a textual representation of the generated decision tree.
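
Decision tree learners like C4.5 choose each split by how much it reduces uncertainty about the class label. C4.5 actually uses the gain ratio and picks thresholds for numeric attributes; this simplified sketch shows plain information gain for a single binary split, with toy labels:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, left, right):
    """Entropy reduction from splitting the labels into two subsets."""
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted

labels = ['good', 'good', 'good', 'bad', 'bad', 'bad']
# a perfect split separates the classes completely
gain = information_gain(labels, labels[:3], labels[3:])   # 1.0 bit
```

The attribute tested at the root of the tree in the output pane is the one the algorithm found most informative in this sense, which is relevant to Question #3.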

Now answer Question #3.

Scroll back down and record the percentage of Correctly Classified Instances. Now, under "Test options", select "Cross-validation" with 10 folds. Run the classifier again and record the percentage of Correctly Classified Instances.

In both cases, the final model that is generated is based on all of the training data. The difference is in how the accuracy of that model is estimated.
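
The cross-validation procedure can be sketched as follows. This is a simplified illustration: Weka additionally stratifies and randomizes the folds, which this sketch omits, and the ZeroR-style learner at the bottom is just a toy example:

```python
from collections import Counter

def k_fold_indices(n, k):
    """Partition indices 0..n-1 into k contiguous folds."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_val_accuracy(instances, labels, train_fn, predict_fn, k=10):
    """Average accuracy over k train/test splits of the data."""
    accs = []
    for fold in k_fold_indices(len(instances), k):
        held_out = set(fold)
        train_X = [x for i, x in enumerate(instances) if i not in held_out]
        train_y = [y for i, y in enumerate(labels) if i not in held_out]
        model = train_fn(train_X, train_y)
        correct = sum(predict_fn(model, instances[i]) == labels[i] for i in fold)
        accs.append(correct / len(fold))
    return sum(accs) / len(accs)

# example: a ZeroR-style learner under 5-fold cross-validation
train_majority = lambda X, y: Counter(y).most_common(1)[0][0]
predict_majority = lambda model, x: model
labels = ['good'] * 8 + ['bad'] * 2
acc = cross_val_accuracy(list(range(10)), labels, train_majority, predict_majority, k=5)
```

Because every instance is held out exactly once, the averaged accuracy estimates performance on data the model was not trained on.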

Now answer Question #4.

Build Your Own Classifier

This is the main part of the assignment. Search through the classifiers in Weka and run some of them on the training set. You may want to try varying some of the classifier parameters as well. Choose the one you feel is most likely to generalize well to unseen examples, namely the unlabeled examples in the test set. Feel free to use validation strategies other than 10-fold cross-validation.

When you have built the classifier you want to submit, move on to the following sections.

Saving the Model

To export a classifier model you have built:

  1. Right-click on the model in the "Result list" in the bottom left corner of the Classify tab.
  2. Select "Save model".
  3. In the dialog that opens, ensure that the File Format is "Model object files".
  4. Save the model using the name given in the submission instructions (PS1.model).

In order to grade your assignment it must be possible to load your model file in Weka and run it on a labeled version of test.arff. You can load your model by right-clicking in the Result list pane and selecting "Load model".

Generating Predictions

To generate an ARFF file with predictions for the test data, perform the following steps from within the Classify tab:

  1. Under "Test options" select "Supplied test set".
  2. Click on the "Set..." button.
  3. In the "Test Instances" dialog that opens click "Open file...".
  4. Open test.arff.
  5. Close the Test Instances dialog.
  6. Right-click on your model in the Result list and select "Re-evaluate model on current test set". Your output will look something like the picture below. Notice that the output contains a bunch of NaNs. This is because the test data is unlabeled and therefore Weka cannot compute the accuracy.
  7. Right-click again on your model and select "Visualize classifier errors".
  8. In the dialog that opens, click on the "Save" button.
  9. Save the ARFF file using the name given in the submission instructions (PS1.arff).
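
Conceptually, the file you save in the last step is the test ARFF with each trailing '?' replaced by your model's prediction. A simplified stand-in for what Weka produces (it assumes the class is the last attribute and every data row ends with '?'; the sample text and labels are invented):

```python
def fill_predictions(arff_text, predictions):
    """Replace the trailing '?' in each data row with the predicted label."""
    out = []
    preds = iter(predictions)
    in_data = False
    for line in arff_text.splitlines():
        if line.strip().lower().startswith('@data'):
            in_data = True
            out.append(line)
        elif in_data and line.strip():
            values = line.split(',')
            values[-1] = next(preds)     # substitute the predicted label
            out.append(','.join(values))
        else:
            out.append(line)
    return '\n'.join(out)

unlabeled = "@relation wine\n@attribute alcohol numeric\n@attribute quality {good,bad}\n@data\n10.8,?\n9.5,?"
labeled = fill_predictions(unlabeled, ['good', 'bad'])
```

The graders will compare the labels in your submitted PS1.arff against the true labels for the test set, so make sure every '?' has been replaced before you submit.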
Now answer the remaining questions.

Questions

Put concise answers to the following questions in a text file, as described in the submission instructions.

  1. Which attributes in the training set do not appear to have a "hump" distribution? Which attributes appear to have outliers? (Do not worry too much about being precise here. The point is for you to inspect the data and interpret what you see.)
  2. What is the accuracy - the percentage of correctly classified instances - achieved by ZeroR when you run it on the training set? Explain this number. How is the accuracy of ZeroR a helpful baseline for interpreting the performance of other classifiers?
  3. Using a decision tree Weka learned over the training set, what is the most informative single feature for this task, and what is its influence on wine quality?
  4. What is 10-fold cross-validation? What is the main reason for the difference between the percentage of Correctly Classified Instances when you used the entire training set directly versus when you ran 10-fold cross-validation on the training set? Why is cross-validation important?
  5. What is the "command-line" for the model you are submitting? For example, "J48 -C 0.25 -M 2". What is the reported accuracy for your model using 10-fold cross-validation?
  6. In a few sentences, describe how you chose the model you are submitting. Be sure to mention your validation strategy and whether you tried varying any of the model parameters.
  7. You used the entire training set to train an algorithm for evaluation on the test set. In principle, what number of cross-validation folds on the training set is the best predictor of performance on the test set?
  8. The Wired article on the "Petabyte Age" suggests that increasingly huge data sets, coupled with machine learning techniques, make model building obsolete. In a short paragraph (about four sentences), state whether you agree with this conclusion, and why or why not.