EECS 349 Problem Set 4
Due 11:59PM Tuesday June 2
Instructions
Answer clearly and concisely. Some questions ask you to "describe a data
set." These data sets can be completely abstract creations. Often it
will help to draw a picture. Your argument doesn't have to be rigorously
formal, but it should be convincing. To give you an idea of what we're looking
for, consider the following sample question:
Sample Question: With continuous attributes, nearest neighbor
sometimes outperforms decision trees. Describe a data set in which nearest
neighbor is likely to outperform decision trees.
Sample Answer: Consider a data set with two continuous attributes x1
and x2 which lie between 0 and 1, where the target function is "return 1
if x2 > x1, and 0 otherwise." Decision trees must approximate the
separating line x1 = x2 using axis-parallel lines (a "stair-step"
function), which requires many distinct splits.
Thus, decision trees will be inefficient at both training and test time, and
could be inaccurate if there isn't enough data to generate enough splits to
approximate the separating line x1 = x2 well. On the other hand, the
boundaries of the Voronoi diagram in nearest neighbor can be parallel or
nearly parallel to the separating line x1 = x2, so with a reasonable number
of training examples we would expect nearest neighbor to approximate the
target function well.
This sample answer is not mathematically precise,
but it is plausible and demonstrates that the writer knows the key
concepts about each approach.
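To make the sample answer concrete, here is a minimal sketch (not part of
the original answer) that generates the data set it describes; the sample
size and random seed are arbitrary choices.

import numpy as np

# Two continuous attributes x1, x2 drawn uniformly from [0, 1];
# target function: return 1 if x2 > x1, and 0 otherwise.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 2))   # columns are x1 and x2
y = (X[:, 1] > X[:, 0]).astype(int)        # label: 1 if x2 > x1, else 0

# A decision tree must approximate the diagonal boundary x1 = x2 with
# axis-parallel splits; nearest neighbor's Voronoi boundaries can run
# nearly parallel to the diagonal.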
Questions
Note: aside from the extra credit, this homework is intended to be shorter
than the previous homeworks in the course.
1. Give a data set with three examples that fall into two natural clusters
   (one with two examples and the other with one example) such that the
   following properties hold:
   - hierarchical clustering always puts the right two examples together, and
   - sequential clustering with a limit of q = 2 clusters and Θ = 1.1 will
     output the right clusters for one example ordering, but not for some
     other example ordering.
   Include an explanation for why the data set has each of these properties,
   as in the example given at the top of this document. Assume both
   algorithms measure the distance from an example x to a cluster C as the
   distance from x to the nearest example in C; a sketch of the sequential
   clustering procedure appears after this question. (3 points)
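For reference, here is a minimal sketch of sequential clustering on
one-dimensional examples, using the distance measure stated above (the
distance from an example to a cluster is the distance to the cluster's
nearest member). This is one common formulation and may differ in small
details from the version covered in lecture.

def sequential_cluster(examples, q, theta):
    # Process examples in presentation order.
    clusters = []
    for x in examples:
        if not clusters:
            clusters.append([x])
            continue
        # Distance from x to each cluster = distance to its nearest member.
        dists = [min(abs(x - m) for m in c) for c in clusters]
        best = min(range(len(clusters)), key=lambda i: dists[i])
        if dists[best] > theta and len(clusters) < q:
            clusters.append([x])      # far from all clusters: start a new one
        else:
            clusters[best].append(x)  # otherwise join the nearest cluster
    return clusters

# The output can depend on the order in which examples are presented.
print(sequential_cluster([0.0, 1.0, 5.0], q=2, theta=1.1))  # [[0.0, 1.0], [5.0]]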
2. Comparing clustering techniques:
   - What is the key difference between k-means clustering and the other
     clustering techniques we discussed (sequential and greedy hierarchical
     clustering) that makes k-means less applicable to examples with nominal
     attributes? (0.5 points)
   - Explain a method for adapting k-means to overcome this difficulty.
     (0.5 points)
3. Two major differences between SVMs and perceptrons are that SVMs can use
   kernels and that SVMs maximize margin. Why are these two properties of
   SVMs valuable? (1 point)
4. In the first homework, you were asked to try to find two algorithms that
   gave very different performance on a given data set. In this question,
   you'll instead try to create a data set that gives very different
   performance for two given algorithms. Specifically, can you create a data
   set of 1000 examples with at most 1000 attributes such that Weka gives
   very different 10-fold cross-validation performance (a large absolute
   difference in accuracy) for nearest neighbor (IBk) vs. decision trees
   (J48)?
   Important: use the default settings in Weka for each algorithm -- you can
   find IBk under "lazy" in the "classifiers" section. Report exactly how you
   generated the data set (including the code, if applicable -- see the ARFF
   sketch after this question) and a table giving the 10-fold CV accuracy of
   each algorithm on your data. You do not have to include an explanation of
   why your data set works the way it does; just describe it clearly and
   include the results. (4 points)
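If you generate your data programmatically, the sketch below shows one way
to write it out in Weka's ARFF format, assuming Python with numpy. The file
name, attribute count, and target function are placeholders; choosing a
target function that actually separates IBk from J48 is the substance of
the question.

import numpy as np

rng = np.random.default_rng(42)
n_examples, n_attrs = 1000, 10          # the problem allows up to 1000 attributes
X = rng.uniform(0.0, 1.0, size=(n_examples, n_attrs))
y = (X[:, 0] > 0.5).astype(int)         # placeholder target function

with open("ps4.arff", "w") as f:
    f.write("@RELATION ps4\n")
    for j in range(n_attrs):
        f.write(f"@ATTRIBUTE x{j} NUMERIC\n")
    f.write("@ATTRIBUTE class {0,1}\n")
    f.write("@DATA\n")
    for row, label in zip(X, y):
        f.write(",".join(f"{v:.6f}" for v in row) + f",{label}\n")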
Extra Credit
- Repeat question 4, except using multi-layer perceptrons (found under "functions"
in Weka) vs. Naive Bayes classifiers (found under "bayes" in Weka). (2 points)