EECS 349 Problem Set 4

Due 11:59PM Thursday March 13

v1.0 Thu Mar 6 12:14:13 CDT 2014


Instructions

Answer clearly and concisely. Some questions ask you to "describe a data set." These data sets can be completely abstract creations. Often it will help to draw a picture. Your argument doesn't have to be rigorously formal but it should be convincing. To give you an idea of what we're looking for, consider the following sample question:

Sample Question: With continuous attributes, nearest-neighbor sometimes outperforms decision trees. Describe a data set in which nearest neighbor is likely to outperform decision trees.
Sample Answer: Consider a data set with two continuous attributes x1 and x2 which lie between 0 and 1, where the target function is "return 1 if x2 > x1, and 0 otherwise." Decision trees must attempt to approximate the separating line x1 = x2 using axis-parallel splits (a "stair-step" function), which will require many distinct splits. Thus, decision trees will be inefficient at both training and test time, and could be inaccurate if there isn't enough data to generate enough splits to approximate the separating line x1 = x2 well. On the other hand, the cell boundaries of the Voronoi diagram in nearest neighbor can be parallel or nearly parallel to the separating line x1 = x2, so with a reasonable number of training examples we would expect NN to approximate the target function well.

This sample answer is not mathematically precise, but it is plausible and demonstrates that the writer knows the key concepts about each approach.
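
To make the sample answer concrete, here is a minimal sketch in Python. It assumes scikit-learn and NumPy, which are not tools used in this course (we use Weka), so treat it purely as an illustration of the data set described above.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.random((500, 2))             # two attributes drawn from [0, 1]
    y = (X[:, 1] > X[:, 0]).astype(int)  # target: 1 if x2 > x1, else 0

    # Compare 1-nearest-neighbor against a decision tree via 10-fold CV
    for name, clf in [("1-NN", KNeighborsClassifier(n_neighbors=1)),
                      ("tree", DecisionTreeClassifier())]:
        print(name, cross_val_score(clf, X, y, cv=10).mean())

On data like this we would expect the 1-NN score to be at least as high as the tree's, for the reasons given in the sample answer; the exact numbers depend on the random draw.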

Questions

Note: aside from the extra credit, this homework is intended to be shorter than the previous homeworks in the course.
  1. Give a data set with three examples that fall into two natural clusters, such that both of the following properties hold:
    1. hierarchical clustering, at the point when it has two clusters, always has the right clusters, and
    2. sequential clustering with a limit of q = 2 clusters and Θ = 1.1 will output the right clusters for one example ordering, but not for some other example ordering (a sketch of sequential clustering appears after this list of questions).
    Assume both algorithms measure the distance from an example x to a cluster C as the distance from x to the nearest example in C. (5 points)
  2. Clustering Techniques Compared.
    1. What is the key difference between k-means clustering and the other clustering techniques we discussed (sequential and greedy hierarchical clustering) that makes k-means less applicable to examples with nominal attributes? (1 point)
    2. Can you devise a way to adapt that part of k-means to overcome this difficulty? (2 points)
  3. In two sentences, name two key differences between SVMs and perceptrons. (2 points)
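
As referenced in question 1, here is a minimal sketch of sequential clustering, under the stated assumption that the distance from an example to a cluster is the distance to the cluster's nearest member. The function name and exact control flow here are our own; consult the lecture notes for the precise variant we discussed.

    def sequential_cluster(examples, q, theta, dist):
        """Process examples in order: join the nearest cluster, or start
        a new one if the nearest cluster is farther than theta and fewer
        than q clusters exist so far."""
        clusters = []
        for x in examples:
            if not clusters:
                clusters.append([x])
                continue
            # distance from x to a cluster = distance to its nearest member
            best = min(clusters, key=lambda c: min(dist(x, m) for m in c))
            if min(dist(x, m) for m in best) > theta and len(clusters) < q:
                clusters.append([x])   # too far from every cluster: new one
            else:
                best.append(x)         # otherwise join the nearest cluster
        return clusters

    # Tiny one-dimensional example with dist(a, b) = |a - b|
    print(sequential_cluster([0.0, 1.0, 5.0], 2, 1.1, lambda a, b: abs(a - b)))

Note that the order in which examples arrive can change the output, which is exactly what question 1 asks you to exploit.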

Extra Credit

  1. In homework #3 question 1, we asked which algorithms would perform best on certain data sets. For data sets (b) and (c), test empirically which of nearest neighbor, J48 decision trees, or neural networks performs best. You should create multiple (e.g., 30 or so) different synthetic data sets of both types, and then try 10-fold CV using each classifier in Weka (a sketch of one way to generate the data appears after this list). You can receive up to 4 points of extra credit for this question. Also, if your data makes a convincing case that your homework #3 answers were correct and we marked them wrong, note this in your assignment and we will return those points to you as well. Turn in the code that generated the data, along with a brief report on your results. (4 points)
  2. Registration for the Google Code Jam opens March 11. The contest is not hugely relevant to machine learning, but it is a fun programming exercise. For extra credit, register for the contest. Unfortunately, the qualification round for 2014 doesn't happen right away, so for now you can practice on last year's. Try one or more of the four questions from 2013's qualification round, and report on which ones you answered and whether you got them right. Include your code. (2 points)
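
For extra credit question 1, one straightforward way to feed synthetic data to Weka is to write ARFF files directly. Here is a minimal sketch in Python; the two-attribute target below is only a placeholder, so substitute generators matching data sets (b) and (c) from homework #3.

    import random

    def write_arff(filename, n=200, seed=None):
        """Write one synthetic two-attribute data set in Weka's ARFF format."""
        random.seed(seed)
        with open(filename, "w") as f:
            f.write("@relation synthetic\n")
            f.write("@attribute x1 numeric\n")
            f.write("@attribute x2 numeric\n")
            f.write("@attribute class {0,1}\n")
            f.write("@data\n")
            for _ in range(n):
                x1, x2 = random.random(), random.random()
                f.write("%f,%f,%d\n" % (x1, x2, 1 if x2 > x1 else 0))  # placeholder target

    # e.g., 30 data sets; open each in Weka and run 10-fold CV with
    # IBk (nearest neighbor), J48, and MultilayerPerceptron
    for i in range(30):
        write_arff("synthetic_%02d.arff" % i, seed=i)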

Version History
1.0 Thu Mar 6 12:14:13 CDT 2014 Initial version.