Updated Jan 08 12:02:11 CDT 2014
In this assignment you will run a machine learning experiment using Weka, an open source framework for machine learning and data mining. You will generate a model that predicts the quality of wine based on its chemical attributes. You will train the model on the supplied training data and use the model to predict the correct output for unlabeled test data.
You'll turn in your homework as a single zip file, in Blackboard. Specifically, the zip file should contain:
PS1.txt
PS1.model
PS1.arff
Weka is available for Windows, Mac, and Linux from http://www.cs.waikato.ac.nz/ml/weka/. Click on the "Download" link on the left-hand side and download the Stable GUI version, which is currently 3.6. You may also wish to download a Weka manual from the "Documentation" page.
The dataset files are here:
This dataset is adapted from:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
This dataset contains data for 2700 white variants of the Portuguese "Vinho Verde" wine. For each variant, 11 chemical features were measured. Each of these is a numeric attribute. They are:
Each variant was tasted by three experts. Their ratings have been combined into a single quality label: "good" or "bad". This is therefore a binary classification problem with numeric attributes.
The dataset has been randomly split into a training set (1890 variants) and a test set (810 variants). The training set contains both chemical features and quality labels. The test set contains only the chemical features.
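The random train/test split described above can be sketched as follows. This is an illustrative reconstruction, not the exact procedure the dataset authors used; the seed and function names are assumptions.

```python
import random

def split_dataset(instances, n_train, seed=0):
    """Randomly split a list of instances into train and test sets."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)   # deterministic shuffle for reproducibility
    return shuffled[:n_train], shuffled[n_train:]

# The wine dataset has 2700 variants, split into 1890 train / 810 test.
data = list(range(2700))                    # stand-ins for the 2700 instances
train, test = split_dataset(data, 1890)
print(len(train), len(test))                # 1890 810
```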
View train.arff and test.arff in a text editor.
You should see something like this:
The files are in ARFF (Attribute-Relation File Format), a text format developed for Weka. At the top of each file you will see a list of attributes, followed by a data section with rows of comma-separated values, one for each instance. The test and training files look similar, except that the last value for each training instance is a quality label and the last value for each test instance is a question mark, since these instances are unlabeled.
For this assignment you will not need to deal with the ARFF format directly, as Weka will handle reading and writing ARFF files for you. In future experiments you may have to convert between ARFF and another data format. (You can close the text editor.)
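To make the format concrete, here is a minimal sketch of an ARFF reader. It handles only the subset of ARFF used in these files (no sparse data, no quoted values); the function name and sample data are illustrative, not part of the assignment.

```python
def parse_arff(text):
    """Minimal ARFF reader: returns (attribute names, data rows).

    Handles @attribute / @data sections and comma-separated rows;
    '?' in a row marks a missing (unlabeled) value.
    """
    attributes, rows, in_data = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('%'):      # skip blanks and comments
            continue
        lower = line.lower()
        if lower.startswith('@attribute'):
            attributes.append(line.split()[1])    # second token is the name
        elif lower.startswith('@data'):
            in_data = True
        elif in_data:
            rows.append([v.strip() for v in line.split(',')])
    return attributes, rows

sample = """@relation wine
@attribute pH numeric
@attribute quality {good,bad}
@data
3.2,good
3.0,?
"""
attrs, rows = parse_arff(sample)
print(attrs)   # ['pH', 'quality']
print(rows)    # [['3.2', 'good'], ['3.0', '?']]
```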
Run Weka. You will get a screen like the following:
From the Tools menu choose ArffViewer. In the window that opens, choose File→Open and open one of the data files. You should see something like the following:
Here you see the same data as in the text editor, but parsed into a spreadsheet-like format. Although you will not need the ArffViewer for this assignment, it is a useful tool to know about when working with Weka. (You can close the ArffViewer window.)
From the Weka GUI Chooser, click on the Explorer button to open the
Weka Explorer. The Explorer is the main tool in Weka, and the one you are most
likely to work with when setting up an experiment. For the remainder of this
assignment you will work within the Weka Explorer. The Explorer should open to
the "Preprocess" tab. The Preprocess tab allows you to inspect and
modify your dataset before passing it to a machine learning algorithm. Click
on the button that says "Open file..." and open train.arff. You should see something like this:
The attributes are listed in the bottom left, and summary statistics for the currently selected attribute are shown on the right side, along with a histogram. Click on each attribute (or use the down arrow key to move through them) and look at the corresponding histogram. You will notice that many numeric attributes have a "hump" shape; this is a common pattern for numeric attributes drawn from real-world data.
You will also notice that some attributes appear to have outliers on one or both sides of the distribution. The proper treatment of outliers varies from one experiment to another. For this assignment you can leave the outliers alone.
In this section you will see how to train a classifier on the data.
Click on the "Classify" tab. Choose ZeroR as the Classifier if it is not already chosen (it is under the "rules" subtree when you click on the "Choose" button). When used in a classification problem, ZeroR simply chooses the majority class. Under "Test options" select "Use training set", then click the "Start" button to run the classifier. You should see something like this:
The classifier output pane displays information about the model created by the classifier as well as the evaluated performance of the model. In the Summary section, the row "Correctly Classified Instances" reports the accuracy of the model.
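The two ideas just introduced, ZeroR and classification accuracy, are simple enough to sketch directly. This is not Weka's implementation, just an illustration of what "majority class" and "Correctly Classified Instances" mean; the labels below are made up.

```python
from collections import Counter

def zero_r_train(labels):
    """ZeroR: ignore the attributes entirely and return the majority class."""
    return Counter(labels).most_common(1)[0][0]

def accuracy(predictions, labels):
    """Fraction of correctly classified instances."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

labels = ['good', 'good', 'bad', 'good', 'bad']
majority = zero_r_train(labels)            # 'good' (3 of 5)
preds = [majority] * len(labels)           # ZeroR predicts it for every instance
print(accuracy(preds, labels))             # 0.6
```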
Click on the "Choose" button and select J48 under the "trees" section. Notice that the field to the right of the "Choose" button updates to say "J48 -C 0.25 -M 2". This is a command-line representation of the current settings of J48. Click on this field to open up the configuration dialog for J48:
Each classifier has a configuration dialog such as this that shows the parameters of the algorithm as well as buttons at the top for more information. When you change the settings and close the dialog, the command line representation updates accordingly. For now we will use the default settings, so hit "Cancel" to close the dialog.
Under "Test options" select "Use training set", then click the "Start" button to run the classifier. After the classifier finishes, scroll up in the output pane. You should see a textual representation of the generated decision tree.
Scroll back down and record the percentage of Correctly Classified Instances. Now, under "Test options", select "Cross-validation" with 10 folds. Run the classifier again and record the percentage of Correctly Classified Instances.
In both cases, the final model that is generated is based on all of the training data. The difference is in how the accuracy of that model is estimated.
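The mechanics of k-fold cross-validation can be sketched as follows. This is a simplified illustration (Weka's folds are stratified by class, which this sketch omits); the demo model is the ZeroR-style majority-class rule, and all names and data are made up.

```python
import random
from collections import Counter

def cross_val_accuracy(X, y, train_fn, predict_fn, k=10, seed=0):
    """Estimate accuracy with k-fold cross-validation.

    Each instance is held out exactly once; a fresh model is trained
    on the other k-1 folds and evaluated on the held-out fold.
    """
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]            # k roughly equal folds
    correct = 0
    for fold in folds:
        held = set(fold)
        train_idx = [i for i in idx if i not in held]
        model = train_fn([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct += sum(predict_fn(model, X[i]) == y[i] for i in fold)
    return correct / len(y)

# Demo with a majority-class (ZeroR-style) model, just to show the mechanics.
train_fn = lambda X, y: Counter(y).most_common(1)[0][0]
predict_fn = lambda model, x: model

X = [[i] for i in range(100)]
y = ['good'] * 70 + ['bad'] * 30
print(cross_val_accuracy(X, y, train_fn, predict_fn))   # 0.7
```

Every training fold here still has "good" as its majority, so the estimate equals the base rate of 70%.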
This is the main part of the assignment. Search through the classifiers in Weka and run some of them on the training set. You may want to try varying some of the classifier parameters as well. Choose the one you feel is most likely to generalize well to unseen examples, namely the unlabeled examples in the test set. Feel free to use validation strategies other than 10-fold cross-validation.
When you have built the classifier you want to submit, move on to the following sections.
To export a classifier model you have built, right-click the corresponding entry in the Result list pane and select "Save model", naming the file as described in the submission instructions (e.g., PS1-James-Bond.model). In order to grade your assignment it must be possible to load your model file in Weka and run it on a labeled version of test.arff. You can load your model by right-clicking in the Result list pane and selecting "Load model".
To generate an ARFF file with predictions for the test data, work from within the Classify tab: use test.arff as the supplied test set, and save the resulting predictions with a name as described in the submission instructions (e.g., PS1-James-Bond.arff).
Put concise answers to the following questions in a text file, as described in the submission instructions.