EECS 349 Problem Set 4
Due 11:59PM Tuesday June 2
Instructions
Answer clearly and concisely. Some questions ask you to "describe a data
set." These data sets can be completely abstract creations. Often it
will help to draw a picture. Your argument doesn't have to be rigorously
formal, but it should be convincing. To give you an idea of what we're looking
for, consider the following sample question:
Sample Question: With continuous attributes, nearest neighbor
sometimes outperforms decision trees. Describe a data set in which nearest
neighbor is likely to outperform decision trees.
Sample Answer: Consider a data set with two continuous attributes x1
and x2 which lie between 0 and 1, where the target function is "return 1
if x2 > x1, and 0 otherwise." Decision trees must approximate the
separating line x1 = x2 using axis-parallel lines (a "stair-step"
function), which requires many distinct splits.
Thus, decision trees will be inefficient at both training and test time, and
could be inaccurate if there isn't enough data to generate enough splits to
approximate the separating line x1 = x2 well. On the other hand, the
boundaries of the Voronoi diagram in nearest neighbor can be parallel or
nearly parallel to the separating line x1 = x2, so with a reasonable number
of training examples we would expect nearest neighbor to approximate the
target function well.
This sample answer is not mathematically precise,
but it is plausible and demonstrates that the writer knows the key
concepts about each approach.
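To make the sample answer concrete, here is a minimal sketch (not part of
the original answer) that generates the data set it describes; the sample
size and random seed are arbitrary choices.

import numpy as np

# Two continuous attributes x1, x2 drawn uniformly from [0, 1];
# target function: return 1 if x2 > x1, and 0 otherwise.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 2))   # columns are x1 and x2
y = (X[:, 1] > X[:, 0]).astype(int)        # label: 1 if x2 > x1, else 0

# A decision tree must approximate the diagonal boundary x1 = x2 with
# axis-parallel splits; nearest neighbor's Voronoi boundaries can run
# nearly parallel to the diagonal.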
Questions
Note: aside from the extra credit, this homework is intended to be shorter
than the previous homeworks in the course.
1. Give a data set with three examples that fall into two natural clusters
   (one with two examples and the other with one example) such that the
   following properties hold:
   - hierarchical clustering always puts the right two examples together, and
   - sequential clustering with a limit of q = 2 clusters and Θ = 1.1 will
     output the right clusters for one example ordering, but not for some
     other example ordering.
   Include an explanation for why the data set has each of these properties,
   as in the example given at the top of this document. Assume both
   algorithms measure the distance from an example x to a cluster C as the
   distance from x to the nearest example in C; a sketch of the sequential
   clustering procedure appears after this question. (3 points)
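For reference, here is a minimal sketch of sequential clustering on
one-dimensional examples, using the distance measure stated above (the
distance from an example to a cluster is the distance to the cluster's
nearest member). This is one common formulation and may differ in small
details from the version covered in lecture.

def sequential_cluster(examples, q, theta):
    # Process examples in presentation order.
    clusters = []
    for x in examples:
        if not clusters:
            clusters.append([x])
            continue
        # Distance from x to each cluster = distance to its nearest member.
        dists = [min(abs(x - m) for m in c) for c in clusters]
        best = min(range(len(clusters)), key=lambda i: dists[i])
        if dists[best] > theta and len(clusters) < q:
            clusters.append([x])      # far from all clusters: start a new one
        else:
            clusters[best].append(x)  # otherwise join the nearest cluster
    return clusters

# The output can depend on the order in which examples are presented.
print(sequential_cluster([0.0, 1.0, 5.0], q=2, theta=1.1))  # [[0.0, 1.0], [5.0]]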
2. Comparing clustering techniques:
   - What is the key difference between k-means clustering and the other
     clustering techniques we discussed (sequential and greedy hierarchical
     clustering) that makes k-means less applicable to examples with nominal
     attributes? (0.5 points)
   - Explain a method for adapting k-means to overcome this difficulty.
     (0.5 points)
3. Two major differences between SVMs and perceptrons are that SVMs can use
   kernels and that SVMs maximize margin. Why are these two properties of
   SVMs valuable? (1 point)
4. In the first homework, you were asked to try to find two algorithms that
   gave very different performance on a given data set. In this question,
   you'll instead try to create a data set that gives very different
   performance for two given algorithms. Specifically, can you create a data
   set of 1000 examples with at most 1000 attributes such that Weka gives
   very different 10-fold cross-validation performance (a large absolute
   difference in accuracy) for nearest neighbor (IBk) vs. decision trees
   (J48)?
   Important: use the default settings in Weka for each algorithm -- you can
   find IBk under "lazy" in the "classifiers" section. Report exactly how you
   generated the data set (including the code, if applicable -- see the ARFF
   sketch after this question) and a table giving the 10-fold CV accuracy of
   each algorithm on your data. You do not have to include an explanation of
   why your data set works the way it does; just describe it clearly and
   include the results. (4 points)
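If you generate your data programmatically, the sketch below shows one way
to write it out in Weka's ARFF format, assuming Python with numpy. The file
name, attribute count, and target function are placeholders; choosing a
target function that actually separates IBk from J48 is the substance of
the question.

import numpy as np

rng = np.random.default_rng(42)
n_examples, n_attrs = 1000, 10          # the problem allows up to 1000 attributes
X = rng.uniform(0.0, 1.0, size=(n_examples, n_attrs))
y = (X[:, 0] > 0.5).astype(int)         # placeholder target function

with open("ps4.arff", "w") as f:
    f.write("@RELATION ps4\n")
    for j in range(n_attrs):
        f.write(f"@ATTRIBUTE x{j} NUMERIC\n")
    f.write("@ATTRIBUTE class {0,1}\n")
    f.write("@DATA\n")
    for row, label in zip(X, y):
        f.write(",".join(f"{v:.6f}" for v in row) + f",{label}\n")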
Extra Credit
- Repeat question 4, except using multi-layer perceptrons (found under "functions"
in Weka) vs. Naive Bayes classifiers (found under "bayes" in Weka). (2 points)