EECS 349 Problem Set 1

Due 11:59PM Tuesday, Apr 14

Updated Apr 2 11:30:00 CDT 2015


In this assignment you will run a machine learning experiment using Weka, an open source framework for machine learning and data mining. You will generate a model that predicts the quality of wine based on its chemical attributes. You will train the model on the supplied training data and use the model to predict the correct output for unlabeled test data.

Submission Instructions

You'll turn in your homework as a single zip file via Canvas. Specifically:

  1. Create a text file with your answers to the questions below. Be sure you have answered all the questions. Name this file PS1.txt.
  2. Create a file containing your Weka model (instructions below). Be sure this file can be loaded into Weka and that it runs. Name this file PS1.model.
  3. Create a text file in ARFF format with your predicted labels for the test set (instructions below). Name this file PS1.arff.
  4. Create a single ZIP file containing PS1.txt, PS1.model, and PS1.arff.
  5. Turn the zip file in under Problem Set 1 in Canvas.

Download and Install Weka

Weka is available for Windows, Mac, and Linux from http://www.cs.waikato.ac.nz/ml/weka/. Click on the "Download" link on the left-hand side and download the Stable GUI version, which is currently 3.6. You may also wish to download a Weka manual from the "Documentation" page.

Note

Some points that may help with Weka installation on Mac OS X:

  1. The Weka download page lists a version that works with JVM 1.6 and one that comes bundled with JVM 1.7. If you have a JVM version higher than 1.7 already installed on your Mac, you should download the Weka version bundled with JVM 1.7.
  2. Installing Weka on a Mac simply requires copying the downloaded contents to a new folder under Applications. In other words, there is no .dmg file to run.

Download the Dataset

The dataset files are here:

This dataset is adapted from:

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

This dataset contains data for 2700 white variants of the Portuguese "Vinho Verde" wine. For each variant, 11 chemical features were measured. Each of these is a numeric attribute. They are: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol.

Each variant was tasted by three experts. Their ratings have been combined into a single quality label: "good" or "bad". Therefore this is a binary classification problem with numeric attributes.

The dataset has been randomly split into a training set (1890 variants) and a test set (810 variants). The training set contains both chemical features and quality labels. The test set contains only the chemical features.

Examine the Data

It is a good idea to inspect your data by hand before running any machine learning experiments, to ensure that the dataset is in the correct format and that you understand what the dataset contains. The following sections will familiarize you with the data and introduce some tools in Weka.

The ARFF Format

View train.arff and test.arff in a text editor. You should see something like this:

The files are in ARFF (Attribute-Relation File Format), a text format developed for Weka. At the top of each file you will see a list of attributes, followed by a data section with rows of comma-separated values, one for each instance. The test and training files look similar, except that the last value for each training instance is a quality label and the last value for each test instance is a question mark, since these instances are unlabeled.

For this assignment you will not need to deal with the ARFF format directly, as Weka will handle reading and writing ARFF files for you. In future experiments you may have to convert between ARFF and another data format. (You can close the text editor.)
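If you do need to work with ARFF files directly in a later experiment, the format is simple enough to parse by hand. The following is a minimal sketch in Python (not part of the assignment; the function name and the sample text are illustrative, and it ignores quoting and other corner cases of the full ARFF specification):

```python
def parse_arff(lines):
    """Split ARFF text into attribute names and data rows."""
    attributes, data, in_data = [], [], False
    for line in lines:
        line = line.strip()
        if not line or line.startswith('%'):      # skip blanks and comments
            continue
        low = line.lower()
        if low.startswith('@attribute'):
            attributes.append(line.split()[1])    # second token is the name
        elif low.startswith('@data'):
            in_data = True                        # everything after @data is instances
        elif in_data:
            data.append(line.split(','))          # comma-separated values
    return attributes, data

sample = """@relation wine
@attribute acidity numeric
@attribute quality {good,bad}
@data
7.0,good
6.3,?
"""
attrs, rows = parse_arff(sample.splitlines())
# attrs is ['acidity', 'quality']; the '?' marks an unlabeled test instance
```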

The Weka ARFF Viewer

Run Weka. You will get a screen like the following:

From the Tools menu choose ArffViewer. In the window that opens, choose File > Open and open one of the data files. You should see something like the following (see important note below):

Here you see the same data as in the text editor, but parsed into a spreadsheet-like format. Although you will not need the ArffViewer for this assignment, it is a useful tool to know about when working with Weka. (You can close the ArffViewer window.)

Important Note

You may find that the ARFF files are grayed out in the Open dialog, and that the All Files option must be selected from the File Format dropdown menu before the files can be selected. Even then, the ARFF Viewer may not read the files properly. This often happens because a .txt extension was appended to the filename when the files were downloaded; however, the Viewer may have trouble even when no .txt extension was added, or after an inadvertently added one has been removed. The following steps should resolve the issue:

  1. View the ARFF in your Web browser by clicking on the link in the instructions or open the downloaded ARFF file in a text editor.
  2. Copy all the text and paste it to a new text file.
  3. If you copied the contents from the downloaded ARFF file, do not overwrite that file when saving in the next step; save the new file separately and delete the downloaded one.
  4. Save the new text file with a .arff extension, carefully making sure that a .txt extension does not get appended.
  5. Open the newly saved ARFF file in the Weka ARFF Viewer to verify the Viewer can display the file in the manner illustrated in the image above.

The Weka Explorer

From the Weka GUI Chooser, click on the Explorer button to open the Weka Explorer. The Explorer is the main tool in Weka, and the one you are most likely to work with when setting up an experiment. For the remainder of this assignment you will work within the Weka Explorer. The Explorer should open to the "Preprocess" tab. The Preprocess tab allows you to inspect and modify your dataset before passing it to a machine learning algorithm. Click on the button that says "Open file..." and open train.arff. You should see something like this:

The attributes are listed in the bottom left, and summary statistics for the currently selected attribute are shown on the right side, along with a histogram. Click on each attribute (or use the down arrow key to move through them) and look at the corresponding histogram. You will notice that many numeric attributes have a "hump" shape; this is a common pattern for numeric attributes drawn from real-world data.

You will also notice that some attributes appear to have outliers on one or both sides of the distribution. The proper treatment of outliers varies from one experiment to another. For this assignment you can leave the outliers alone.

Now answer Question #1.

Classifier Basics

In this section you will see how to train a classifier on the data.

Baseline Classifier

Click on the "Classify" tab. Choose ZeroR as the Classifier if it is not already chosen (it is under the "rules" subtree when you click on the "Choose" button). When used in a classification problem, ZeroR simply chooses the majority class. Under "Test options" select "Use training set", then click the "Start" button to run the classifier. You should see something like this:

The classifier output pane displays information about the model created by the classifier as well as the evaluated performance of the model. In the Summary section, the row "Correctly Classified Instances" reports the accuracy of the model.
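ZeroR's training-set accuracy can be checked by hand: since it always predicts the majority class, its accuracy is simply the relative frequency of that class. A minimal sketch in Python (the labels below are made up for illustration; they are not the actual class counts in train.arff):

```python
from collections import Counter

def zero_r_accuracy(labels):
    """ZeroR predicts the majority class for every instance, so its
    training-set accuracy equals the majority class's relative frequency."""
    majority_count = Counter(labels).most_common(1)[0][1]
    return majority_count / len(labels)

labels = ['good'] * 6 + ['bad'] * 4   # hypothetical label counts
print(zero_r_accuracy(labels))        # majority class 'good' covers 6 of 10
```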

Now answer Question #2.

Decision Trees

J48 is the Weka implementation of the C4.5 decision tree algorithm.

Click on the "Choose" button and select J48 under the "trees" section. Notice that the field to the right of the "Choose" button updates to say "J48 -C 0.25 -M 2". This is a command-line representation of the current settings of J48. Click on this field to open up the configuration dialog for J48:

Each classifier has a configuration dialog such as this that shows the parameters of the algorithm as well as buttons at the top for more information. When you change the settings and close the dialog, the command line representation updates accordingly. For now we will use the default settings, so hit "Cancel" to close the dialog.

Under "Test options" select "Use training set", then click the "Start" button to run the classifier. After the classifier finishes, scroll up in the output pane. You should see a textual representation of the generated decision tree.

Now answer Question #3.

Scroll back down and record the percentage of Correctly Classified Instances. Now, under "Test options", select "Cross-validation" with 10 folds. Run the classifier again and record the percentage of Correctly Classified Instances.

In both cases, the final model that is generated is based on all of the training data. The difference is in how the accuracy of that model is estimated.
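To make the distinction concrete, 10-fold cross-validation partitions the training instances into 10 folds and holds each fold out once for evaluation, training on the remaining nine. The sketch below shows only the fold-splitting step in Python; unlike Weka, it does not shuffle or stratify the data, so treat it as a simplification:

```python
def k_fold_indices(n, k=10):
    """Partition instance indices 0..n-1 into k near-equal folds.
    Each fold serves once as the held-out test set; the rest train."""
    splits = []
    for j in range(k):
        test = [i for i in range(n) if i % k == j]    # every k-th instance
        train = [i for i in range(n) if i % k != j]   # all the others
        splits.append((train, test))
    return splits

# With 1890 training instances and 10 folds, each model trains on 1701
# instances and is evaluated on the 189 it never saw.
splits = k_fold_indices(1890, 10)
```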

Now answer Question #4.

Build Your Own Classifier

This is the main part of the assignment. Search through the classifiers in Weka and run some of them on the training set. You may want to try varying some of the classifier parameters as well. Choose the one you feel is most likely to generalize well to unseen examples--namely the unlabeled examples in the test set. Feel free to use validation strategies other than 10-fold cross-validation.

When you have built the classifier you want to submit, move on to the following sections.

Saving the Model

To export a classifier model you have built:

  1. Right-click on the model in the "Result list" in the bottom left corner of the Classify tab.
  2. Select "Save model".
  3. In the dialog that opens, ensure that the File Format is "Model object files".
  4. Save the model using the naming convention given in the submission instructions (e.g. PS1.model).

In order to grade your assignment it must be possible to load your model file in Weka and run it on a labeled version of test.arff. You can load your model by right-clicking in the Result list pane and selecting "Load model".

Generating Predictions

To generate an ARFF file with predictions for the test data, perform the following steps from within the Classify tab:

  1. Under "Test options" select "Supplied test set".
  2. Click on the "Set..." button.
  3. In the "Test Instances" dialog that opens click "Open file...".
  4. Open test.arff.
  5. Close the Test Instances dialog.
  6. Right-click on your model in the Result list and select "Re-evaluate model on current test set". Your output will look something like the picture below. Notice that the output contains a bunch of NaNs. This is because the test data is unlabeled and therefore Weka cannot compute the accuracy.
  7. Right-click again on your model and select "Visualize classifier errors".
  8. In the dialog that opens, click on the "Save" button.
  9. Save the ARFF file using the naming convention given in the submission instructions (e.g. PS1.arff).
Now answer Questions #5 through #8.

Try Another Data Set

You will now build a classifier for a second data set concerning the evaluation of cars, following which you will answer only the last two questions. (You do not have to answer Questions #1 through #8 again.)

In order to answer the questions, perform the following steps:

Download the Car Evaluation Dataset

The car evaluation dataset files are here (see important note below):

This dataset is adapted from:

Car Evaluation Database, which was derived from a simple hierarchical decision model originally developed for the demonstration of DEX. M. Bohanec, V. Rajkovic: Expert system for decision making. Sistemica 1(1), pp. 145-157, 1990.

Important Note

The main data file for the car evaluation data set ends in a .data extension and has an associated auxiliary file ending in a .names extension. These files are used the same way as the .arff files you are already familiar with (including the important note applicable to the wine evaluation data set files), with two additional points to watch for:

  1. When opening the main data file (car_train.data or car_test.data) in the Weka Explorer, the C4.5 data files (*.data) option needs to be selected from the File Format dropdown menu.
  2. The auxiliary data files (car_train.names and car_test.names) must be located in the same folder as the main data files. (You do not need to take any action on these auxiliary data files other than to keep them in the same folder as the main data files, but inspecting the contents of the files should help you interpret how the main data files work.)

Build Classifiers

You will perform four experiments, measuring the 10-fold cross-validation accuracy of two types of classifiers (call them classifiers A and B) on two data sets (cars and wine). You can choose A and B however you like -- they can be different classifiers (nearest-neighbor vs. decision trees) or the same classifier with different settings (different numbers of nearest neighbors, for example). Your goal is to choose settings such that classifier A performs well on wine evaluation but poorly on car evaluation, and vice versa for classifier B. In other words, you should strive to make the expression below as large as you can:

wine_acc(A) + car_acc(B) - wine_acc(B) - car_acc(A)

where wine_acc(A) refers to the accuracy of Classifier A on the wine data set, and car_acc(B) refers to the accuracy of Classifier B on the car data set, and so on.

Note: You do not need to obtain the largest possible quantity for the above expression, and it is okay to use classifiers we have discussed in class as long as you can achieve some positive value for the above expression (a value of 2% is sufficient for the assignment).
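The expression above can be sketched as a one-line Python function. All accuracies below are hypothetical numbers for illustration only, not results you should expect:

```python
def separation_score(wine_acc, car_acc):
    """The assignment's objective: how much A beats B on wine plus how much
    B beats A on cars. Accuracies are given in percent."""
    return (wine_acc['A'] + car_acc['B']) - (wine_acc['B'] + car_acc['A'])

# Hypothetical accuracies: A tuned for wine, B tuned for cars.
wine = {'A': 95.0, 'B': 80.0}
car = {'A': 70.0, 'B': 90.0}
print(separation_score(wine, car))   # (95 + 90) - (80 + 70) = 35.0
```

A positive score means each classifier is better matched to "its" data set; a score of at least 2% satisfies the assignment.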

Note that you will only need to use the training data (car_train.data) for this task. The test data (car_test.data) is provided for your personal reference, should you choose to try your car evaluation classifier on it to gain experience. Therefore, you do not need to perform the steps under Generating Predictions for this task.

Now answer the remaining questions.

Questions

Put concise answers to the following questions in a text file, as described in the submission instructions.

  1. Which attributes in the training set do not appear to have a "hump" distribution? Which attributes appear to have outliers? (Do not worry too much about being precise here. The point is for you to inspect the data and interpret what you see.)
  2. What is the accuracy - the percentage of correctly classified instances - achieved by ZeroR when you run it on the training set? Explain this number. How is the accuracy of ZeroR a helpful baseline for interpreting the performance of other classifiers?
  3. Using a decision tree Weka learned over the training set, what is the most informative single feature for this task, and what is its influence on wine quality?
  4. What is 10-fold cross-validation? What is the main reason for the difference between the percentage of Correctly Classified Instances when you used the entire training set directly versus when you ran 10-fold cross-validation on the training set? Why is cross-validation important?
  5. What is the "command-line" for the model you are submitting? For example, "J48 -C 0.25 -M 2". What is the reported accuracy for your model using 10-fold cross-validation?
  6. In a few sentences, describe how you chose the model you are submitting. Be sure to mention your validation strategy and whether you tried varying any of the model parameters.
  7. You used the entire training set to train an algorithm for evaluation on the test set. In principle, what number of cross-validation folds on the training set is the least biased predictor (from the standpoint of measuring on the right spot on the learning curve) of performance on the test set? You may refer to the number of training examples n in your answer.
  8. A Wired article from several years ago on the 'Peta Age' suggests that increasingly huge data sets, coupled with machine learning techniques, make model building obsolete. In particular it says: This is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear. Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology… In a short paragraph (about four sentences), state whether you agree with this statement, and why or why not.
  9. In your opinion, what is/are the reasons why Classifiers A and B are each suited better for one data set and not the other?
  10. What is the key difference about the output space for the car task, as compared to the wine task?