0) Clone the GitHub repository:
https://classroom.github.com/a/K47aaGoE
Note: part 1 of this HW must be done from a terminal window (not in a Jupyter notebook, although you could use the terminal window in Jupyter). Part 2 can be done either in a Python program or in a Jupyter notebook.
Batch jobs on Rivanna: Calculating Pi to high precision
In decimal form, the value of pi is approximately 3.14. But pi is an irrational number, meaning that its decimal form neither ends (like 1/4 = 0.25) nor becomes repetitive (like 1/6 = 0.166666...). (To only 18 decimal places, pi is 3.141592653589793238.)
NOTE: For our class we only have about 20,000 CPU hours allocated to us. This assignment should take less than 100 CPU hours. Be responsible!
1) Submit Batch jobs: (3 Points)
Note: you should not wait until the last possible day to do this problem. It requires that you get 100 nodes on Rivanna and you may have to wait for those slots (depending on how busy the cluster is, and your priority). I suggest doing this problem early! “The cluster was too busy” is not a valid excuse (although I often hear it in research!).
...
7) Submit 100 jobs. Don't worry if a few sections fail (≤5% failure rate is acceptable).
2) Combine Results: (2 Points)
1) Write a program [combine_pi.py] to combine the results: loop over the 100 output files, add up all of the N and N_circle values, and calculate the combined result. Also sum up the total number of seconds. The output of this program should be the same as the one above (except that you should use hours instead of seconds): N_total, N_circle, pi, hours. Is your value correct to the 6th position beyond the decimal point? For reference, pi = 3.141592653...
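The combining step can be sketched roughly as follows. This is only an illustration under stated assumptions: it assumes each job wrote a single line of four whitespace-separated values (N, N_circle, pi, seconds), that pi is estimated as 4 * N_circle / N, and that the files match the hypothetical pattern `pi_*.out`. Your actual output format from part 1 may differ.

```python
import glob


def combine(paths):
    """Sum N, N_circle and seconds over the given output files."""
    n_total = 0
    n_circle = 0
    seconds = 0.0
    for path in paths:
        try:
            with open(path) as f:
                fields = f.read().split()
            # Assumed layout of each file: N  N_circle  pi  seconds
            n, n_in, secs = int(fields[0]), int(fields[1]), float(fields[3])
        except (OSError, IndexError, ValueError):
            # A failed job may leave a missing, empty or truncated file; skip it.
            continue
        n_total += n
        n_circle += n_in
        seconds += secs
    # Assumed Monte Carlo estimator: pi ~ 4 * N_circle / N
    pi = 4.0 * n_circle / n_total if n_total else float("nan")
    return n_total, n_circle, pi, seconds / 3600.0


if __name__ == "__main__":
    # "pi_*.out" is a hypothetical file name pattern.
    files = sorted(glob.glob("pi_*.out"))
    if files:
        print(*combine(files))
```

Note that the combined pi is computed from the summed counts rather than by averaging the per-job pi values; the two agree only when every job used the same N.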
...
Indeed, the uncertainty on the mean is about 10x smaller than the sigma of the distribution. Why did this come out Gaussian? It is not magic - just statistics! Another win for the Central Limit Theorem!
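The factor of about 10 is what you would expect if the mean is taken over 100 independent results, since the uncertainty on a mean of n values is sigma / sqrt(n) and sqrt(100) = 10. The small simulation below illustrates this; the number of points per "job", the seed, and the function name `estimate_pi` are arbitrary choices for this sketch, not the settings used on Rivanna.

```python
import math
import random
import statistics


def estimate_pi(n_points, rng):
    """Monte Carlo estimate of pi from n_points random points in the unit square."""
    inside = sum(1 for _ in range(n_points)
                 if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return 4.0 * inside / n_points


rng = random.Random(12345)  # fixed seed so the run is repeatable
estimates = [estimate_pi(2_000, rng) for _ in range(100)]  # 100 small "jobs"

mean = statistics.mean(estimates)
sigma = statistics.stdev(estimates)             # spread of the individual results
sigma_mean = sigma / math.sqrt(len(estimates))  # uncertainty on their mean

print(f"mean = {mean:.4f}")
print(f"sigma = {sigma:.4f}, sigma_mean = {sigma_mean:.4f}")
print(f"ratio = {sigma / sigma_mean:.1f}")
```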
Modify HW11.ipynb from your repository to complete the following tasks. Use pandas for all tasks where it is possible. Make sure to print all requested information clearly to the "screen". Since this is a notebook, you should pay attention to when it might be best to use the print function or when you might want to have it as the last line of code in a cell.
3) Use pandas references to figure out how to complete the following tasks with the Iris dataset. (5 points)
- The iris data set is included in your HW10 repository. Load it into a pandas DataFrame and print the top 10 rows to the screen.
How many rows does it contain? How many columns?
Compute the average petal length and print it to the screen. Also, do this for each class.
Compute the average of each numerical column and print it to the screen.
Compute the average of each numerical column for each class of Iris and print it to the screen.
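As a starting point, the pandas calls involved in these averaging tasks look roughly like the sketch below. It uses a tiny hand-made stand-in for the Iris data, and the column names (`sepal_length`, `petal_length`, `species`) are assumptions; check the names in the file from your repository.

```python
import pandas as pd

# Tiny stand-in for the Iris data; the column names are assumptions.
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "petal_length": [1.4, 1.5, 4.7, 4.5, 6.0, 5.1],
    "species": ["setosa", "setosa", "versicolor", "versicolor",
                "virginica", "virginica"],
})

n_rows, n_cols = df.shape                                   # rows, columns
overall = df["petal_length"].mean()                         # one column, full data set
per_class = df.groupby("species")["petal_length"].mean()    # one column, per class
all_means = df.mean(numeric_only=True)                      # every numerical column
class_means = df.groupby("species").mean(numeric_only=True) # every column, per class

print(n_rows, n_cols)
print(overall)
print(per_class)
print(all_means)
print(class_means)
```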
Extract the petal length outliers (defined as those rows whose petal length is more than 2 standard deviations away from the mean petal length for the full set of data). Print these rows to the screen.
Compute the standard deviation of each numerical column, both for the full data set and for each Iris species.
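One way to select outliers relative to the full data set is a boolean mask, sketched below on made-up numbers. The column name `petal_length` is an assumption, and note that pandas' `std()` returns the sample standard deviation (ddof=1) by default.

```python
import pandas as pd

# Made-up petal lengths with one obviously extreme value.
df = pd.DataFrame({
    "petal_length": [1.4, 1.5, 1.3, 1.6, 1.4, 1.5, 1.4, 1.5, 1.6, 9.0],
})

mean = df["petal_length"].mean()
std = df["petal_length"].std()  # sample standard deviation (ddof=1)

# Keep rows whose distance from the mean exceeds 2 standard deviations.
outliers = df[(df["petal_length"] - mean).abs() > 2 * std]
print(outliers)
```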
- Extract the petal length outliers (i.e. those rows whose petal length is more than 2 standard deviations away from the mean petal length for each class of Iris). There are many ways to do this; you may want to explore groupby(), aggregate(), and merge(). Print these rows to the screen.
Investigate seaborn.pairplot and use it to make the pairplot for the Iris dataset. Save the pairplot as Iris.pairplot.png.
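For the per-class outliers, one possible route through groupby(), aggregate(), and merge() is sketched below on made-up data. The column names are assumptions, and this is only one of the many ways mentioned above.

```python
import pandas as pd

# Made-up data: two classes of eight rows, one odd value in "setosa".
df = pd.DataFrame({
    "species": ["setosa"] * 8 + ["versicolor"] * 8,
    "petal_length": [1.4, 1.5, 1.4, 1.5, 1.4, 1.5, 1.4, 3.0,
                     4.5, 4.6, 4.4, 4.5, 4.6, 4.4, 4.5, 4.5],
})

# Per-class mean and standard deviation, one row per class.
stats = (df.groupby("species")["petal_length"]
           .aggregate(["mean", "std"])
           .reset_index())

# Attach each row's class statistics to it.
merged = df.merge(stats, on="species")

# Keep rows more than 2 class standard deviations from the class mean.
outliers = merged[(merged["petal_length"] - merged["mean"]).abs()
                  > 2 * merged["std"]]
print(outliers)
```

In this toy data the value 3.0 is an outlier within its own class but lies close to the mean of the full data set, so the per-class and full-set definitions can select different rows.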
Want to be an A student? Make the pairplot again, but this time draw the outliers from part 8 in a different color on the off-diagonal scatter plots. Hint - you may need to make some new class types in your pandas DataFrame.
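The hint about new class types could be read as adding a label column in which outliers get their own category. The sketch below shows only that pandas step on made-up data; the column names and the label text are assumptions.

```python
import pandas as pd

# Made-up data: two classes of eight rows, one odd value in "setosa".
df = pd.DataFrame({
    "species": ["setosa"] * 8 + ["versicolor"] * 8,
    "petal_length": [1.4, 1.5, 1.4, 1.5, 1.4, 1.5, 1.4, 3.0,
                     4.5, 4.6, 4.4, 4.5, 4.6, 4.4, 4.5, 4.5],
})

# Distance from the class mean in units of the class standard deviation.
grouped = df.groupby("species")["petal_length"]
z = (df["petal_length"] - grouped.transform("mean")) / grouped.transform("std")
is_outlier = z.abs() > 2

# Non-outliers keep their species name; outliers get a new label.
df["label"] = df["species"].where(~is_outlier, df["species"] + " (outlier)")
print(df["label"].value_counts())
```

A column like `label` could then be passed to seaborn.pairplot through its `hue` argument. Be aware that `hue` also affects the diagonal plots, so limiting the color change to the off-diagonal panels may take some extra work.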
...