...

Use this link to accept the assignment and create your repository on GitHub:  


1) Unsupervised Learning!

Modify class25.ipynb from your repository to complete the following tasks.


Use pandas for all tasks where it is possible. Make sure to print all requested information clearly to the "screen". Since this is a notebook, you should pay attention to when it might be best to use the print function or when you might want to have it as the last line of code in a cell. 
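As a small illustration of the print-versus-last-line point, the sketch below uses a made-up two-column table (the column names `a` and `b` are assumptions, not part of the assignment). It shows the usual notebook behavior: only the last expression of a cell is displayed automatically, so anything earlier in the cell needs an explicit `print`.

```python
import pandas as pd

# Hypothetical small table, used only to illustrate notebook display behavior.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# In the middle of a cell, a bare expression is evaluated but not displayed,
# so use print() for anything that must appear in the output.
print(df.describe())
print("rows:", len(df))

# The last expression in a cell is displayed automatically when run in
# Jupyter (a DataFrame is rendered as a formatted table).
df.head()
```

Run as a plain script, the final `df.head()` line shows nothing; the automatic display is a notebook feature.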


In class, we have focused on Supervised Machine Learning Classification tasks. Now we will explore an Unsupervised ML method for cluster finding.


Imagine that you are given a dataset that includes an unknown number of different classes. You want to figure out how many distinct classes exist in this dataset. An unsupervised training algorithm, one that can find clusters in N dimensions, is what you want to use!

...

This is called a clustering algorithm. scikit-learn includes the KMeans clustering algorithm, which can be used for this task.


KMeans is described in Chapter 5 of VanderPlas, in the section "In Depth: k-Means Clustering". You can find the documentation here:

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
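As a rough orientation to the API documented at that link, here is a minimal sketch on made-up toy data (two synthetic blobs, not the Iris dataset). The data, the variable names, and the choice of `n_init=10` and `random_state=0` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (an assumption for illustration, not the Iris dataset):
# two well-separated blobs in 2 dimensions.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(50, 2))
X = np.vstack([blob_a, blob_b])

# n_clusters must be chosen up front; n_init and random_state are set
# explicitly so that repeated runs give the same result.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X)

labels = kmeans.labels_            # cluster index assigned to each sample
centers = kmeans.cluster_centers_  # one row per cluster, one column per feature
print(centers)
print(kmeans.inertia_)
```

After the fit, `labels_`, `cluster_centers_`, and `inertia_` are the attributes most relevant to the tasks that follow. Note that the cluster indices are arbitrary: which blob is called cluster 0 carries no meaning.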


Use the pandas and KMeans references to figure out how to complete the following tasks with the Iris dataset. Each part is worth 1/2 point.

  1. The Iris dataset is included in your repository. Load it into a Pandas DataFrame. 
  2. Use Seaborn to make the pairplot for the Iris dataset. 
  3. Make a new DataFrame removing the classification information from the Iris dataset and print 5 rows to the screen. Forget that there are three distinct classes there! The rest of this assignment is about using the data to discover how many classes are in it! Hint: consider pandas.DataFrame.loc.

  4. The number of clusters must be specified before running the KMeans algorithm. Since, in principle, we would not know the right number of clusters for an arbitrary dataset (and we forgot how many are in the Iris dataset), we need to try several values. Write a loop that runs KMeans with the number of clusters set to each value from 1 to 10. Each time through the loop, after the fit, store the "kmeans.inertia_" attribute for analysis. "Inertia" is a measure of the quality of the clustering: it is calculated as the sum of squared distances of samples to their closest cluster center. Explain why this is a reasonable figure of merit for the quality of the clustering result.

  5. Make a plot of the inertia vs. the number of clusters used in the KMeans algorithm.

  6. Google "Elbow method for finding number of clusters", and apply it to your figure.  Is it consistent with 3 clusters?  Explain.

  7. Assuming that you can justify three clusters, run the fit again for the case of three clusters. Make a scatter plot of petal width vs. petal length for the 3 clusters obtained.

  8. Draw a scatter plot of petal width vs. petal length for the 3 actual classes in the Iris dataset. Does it look the same as the figure in 7? Did clustering work well? Also draw on the plots the cluster centers found above. Use a large star for the point representing the cluster center.

  9. Compare the mean values for each feature with the cluster center values for the relevant cluster.  Are they close?  
  10. Extra challenge! Try this: using techniques similar to those in class, compare how each point was clustered to its true Iris class. Were any samples put into the wrong class by the clustering algorithm? Discuss the features of the samples that were not clustered properly.

    Make sure to add, commit, and push your repository to GitHub.
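To make the inertia figure of merit from task 4 concrete, here is a minimal sketch on made-up toy data (three synthetic blobs, not the Iris dataset). The blob positions, the column names `x` and `y`, and the range of cluster counts tried are all assumptions for illustration. It recomputes the inertia by hand for one fit and compares it with `inertia_`, then collects the inertia for several cluster counts.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical toy data: three blobs in 2 dimensions (not the Iris dataset).
rng = np.random.default_rng(1)
true_centers = [[0.0, 0.0], [4.0, 0.0], [2.0, 4.0]]
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(40, 2)) for c in true_centers])
df = pd.DataFrame(X, columns=["x", "y"])

# Fit once per candidate number of clusters and record the inertia.
inertias = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(df)
    inertias[k] = km.inertia_

# Manual check for k = 3: look up the center assigned to each sample,
# then sum the squared distances from each sample to that center.
km3 = KMeans(n_clusters=3, n_init=10, random_state=0).fit(df)
assigned_centers = km3.cluster_centers_[km3.labels_]
manual_inertia = ((df.to_numpy() - assigned_centers) ** 2).sum()

print(pd.Series(inertias, name="inertia"))
print(manual_inertia, km3.inertia_)
```

On data like this, the inertia falls steeply until the number of clusters matches the number of blobs and only slowly afterwards, which is the pattern the elbow method looks for. Real data rarely shows such a sharp bend.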



Done?  Make sure your plot and work are cleaned up and push files to GitHub.