EECS 349 Problem Set 4
Due 11:59PM Thursday March 13
v1.0
Thu Mar 6 12:14:13 CDT 2014
Instructions
Answer clearly and concisely. Some questions ask you to "describe a data
set." These data sets can be completely abstract creations. Often it
will help to draw a picture. Your argument doesn't have to be rigorously
formal, but it should be convincing. To give you an idea of what we're looking
for, consider the following sample question:
Sample Question: With continuous attributes, nearest-neighbor
sometimes outperforms decision trees. Describe a data set in which nearest
neighbor is likely to outperform decision trees.
Sample Answer: Consider a data set with two continuous attributes x1
and x2 which lie between 0 and 1, where the target function is "return 1
if x2 > x1, and 0 otherwise." Decision trees must approximate the
separating line x1 = x2 using axis-parallel splits (a
"stair-step" boundary), which will require many distinct splits.
Thus, decision trees will be inefficient at both training and test time, and
could be inaccurate if there isn't enough data to generate enough splits to
approximate the separating line x1 = x2 well. On the other hand, the boundaries
of the Voronoi diagram induced by nearest neighbor can be parallel or nearly
parallel to the separating line x1 = x2, so with a reasonable number of training
examples we would expect nearest neighbor to approximate the target function well.
This sample answer is not mathematically precise,
but it is plausible and demonstrates that the writer knows the key
concepts about each approach.
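To make this sample answer concrete, here is a minimal sketch, assuming Python
with scikit-learn (not part of this course's Weka toolchain), that generates
such a data set and compares a decision tree against 1-nearest-neighbor using
10-fold cross-validation; running it lets you check the sample answer's
prediction empirically.

    # Sketch: compare a decision tree and 1-NN on the "1 if x2 > x1" target.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.random((500, 2))             # columns are x1 and x2, uniform on [0, 1]
    y = (X[:, 1] > X[:, 0]).astype(int)  # target: 1 if x2 > x1, else 0

    for name, clf in [("decision tree", DecisionTreeClassifier(random_state=0)),
                      ("1-nearest-neighbor", KNeighborsClassifier(n_neighbors=1))]:
        scores = cross_val_score(clf, X, y, cv=10)
        print(name, "mean 10-fold CV accuracy:", round(float(scores.mean()), 3))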
Questions
Note: aside from the extra credit, this homework is intended to be shorter than
the previous homeworks in the course.
- Give a data set with three examples that fall into two natural clusters,
  such that both of the following properties hold:
  - hierarchical clustering, at the point when it has two clusters, always
    has the right clusters, and
  - sequential clustering with a limit of q = 2 clusters and Θ = 1.1 will
    output the right clusters for one example ordering, but not for some
    other example ordering.

  Assume both algorithms measure the distance from an example x to a cluster C
  as the distance from x to the nearest example in C. (A sketch of the
  sequential procedure appears after this list.) (5 points)
- Clustering Techniques Compared.
  - What is the key difference between k-means clustering and the other
    clustering techniques we discussed (sequential and greedy hierarchical
    clustering) that makes k-means less applicable to examples with nominal
    attributes? (1 point)
  - Can you devise a way to adapt part of k-means to overcome this
    difficulty? (2 points)
- In two sentences, name two key differences between SVMs and perceptrons. (2 points)
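As promised above, here is a minimal Python sketch of the sequential clustering
procedure from the first question, using the stated single-link distance (the
distance from an example x to a cluster C is the distance from x to the nearest
example in C). The function names and the tie-breaking behavior are illustrative
assumptions, not the exact algorithm from lecture.

    # Sketch of sequential clustering on one-dimensional examples.
    def single_link(x, cluster):
        # Distance from example x to cluster C: distance to C's nearest member.
        return min(abs(x - c) for c in cluster)

    def sequential_cluster(examples, q=2, theta=1.1):
        clusters = []
        for x in examples:
            if not clusters:
                clusters.append([x])      # the first example starts the first cluster
                continue
            nearest = min(clusters, key=lambda c: single_link(x, c))
            if single_link(x, nearest) > theta and len(clusters) < q:
                clusters.append([x])      # far from every cluster: start a new one
            else:
                nearest.append(x)         # otherwise join the closest cluster
        return clusters

    print(sequential_cluster([0.0, 1.0, 3.0]))  # -> [[0.0, 1.0], [3.0]]

Because examples are assigned as they arrive, the output can depend on the
order in which the examples are presented, which is exactly the behavior the
question asks you to exploit.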
Extra Credit
- In homework #3 question 1, we asked which algorithms would perform best on certain data sets.
For data sets (b) and (c), test empirically which of nearest neighbor, J48 decision trees, or neural networks performs best. You
should create multiple (e.g., 30 or so) different synthetic data sets of both types, and then try 10-fold CV using each classifier
in Weka. You can receive up to 4 points extra credit for this question. Plus, if your data makes a convincing case that your
homework #3 answers were correct and we marked them wrong, you should note this in your assignment and we will return those points
to you as well. Turn in the code that generated your data, along with a brief report on your results. (A sketch of one way to
write Weka-readable data files appears after this list.) (4 points)
- Registration for the Google Code Jam opens March 11. This contest is not
hugely relevant to machine learning, but it is a fun programming exercise. For extra credit, register for the contest. Unfortunately,
the qualification round for 2014 doesn't happen right away, so for now you can practice on last year's qualification round.
Try one or more of the four questions from 2013's qualification round, and report on which ones you attempted and whether
you got them right. Include your code. (2 points)
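For the first extra-credit question, one way to generate data that Weka can read
is to write ARFF files directly, as in the sketch below. The two-attribute
"x2 > x1" target here is only a placeholder, since data sets (b) and (c) are
defined in homework #3 and not reproduced in this document; substitute their
actual definitions.

    # Illustrative sketch: write synthetic ARFF data sets for Weka.
    import random

    def write_arff(path, n=200, seed=0):
        rnd = random.Random(seed)
        with open(path, "w") as f:
            f.write("@relation synthetic\n")
            f.write("@attribute x1 numeric\n")
            f.write("@attribute x2 numeric\n")
            f.write("@attribute class {0,1}\n")
            f.write("@data\n")
            for _ in range(n):
                x1, x2 = rnd.random(), rnd.random()
                f.write("%.4f,%.4f,%d\n" % (x1, x2, 1 if x2 > x1 else 0))

    # Generate 30 data sets, each with a different random seed.
    for i in range(30):
        write_arff("synthetic_%d.arff" % i, seed=i)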
Version History
1.0 | Thu Mar 6 12:14:13 CDT 2014 | Initial version.