0) Clone the GitHub repository:
https://classroom.github.com/a/ee39p7RU
Note: Part 1 of this HW must be done from a terminal window (not in a Jupyter notebook, although you could use the terminal window in Jupyter). Part 2 can be done in either a Python program or a Jupyter notebook.
Batch jobs on Rivanna: Calculating Pi to high precision
In decimal form, the value of pi is approximately 3.14. But pi is an irrational number, meaning that its decimal form neither ends (like 1/4 = 0.25) nor becomes repetitive (like 1/6 = 0.166666...). (To only 18 decimal places, pi is 3.141592653589793238.)
NOTE: For our class we only have about 20,000 CPU hours allocated to us. This assignment should take less than 100 CPU hours. Be responsible!
1) Submit Batch jobs: (3 Points)
Note: you should not wait until the last possible day to do this problem. It requires that you get 100 nodes on Rivanna and you may have to wait for those slots (depending on how busy the cluster is, and your priority). I suggest starting this problem early! “The cluster was too busy” is not a valid excuse (although I often hear it in research!).
In class20 we used an MC method to calculate pi and we submitted some test jobs, MC_pi.py, to the batch system on Rivanna. Now, we will submit batch jobs to calculate pi at a higher precision. Review the course slides and/or in-class work for details on submitting batch jobs.
The goal is to run 100 jobs with a combined result of better precision on pi (I was able to get ~1/10^6).
To do this you need to:
1) Verify that you get the same numbers if you run your program twice with the same random number seed.
2) Verify that you get different answers if you run your program twice with a different random number seed.
3) Benchmark your pi program. Make the printed output of your program a single line: N_total, N_circle, pi, seconds, seed.
4) Investigate values of N_total to figure out what value would take ~1/2 hour to run on a compute node. (Note: you don't need to run a 1/2-hour job to do this. A 30-second job is fine, and then you can calculate what N would run in about an hour.)
5) Write a SLURM script [pi_slurm.sh] to execute your job. Hint: you need a SLURM array! Start with the one you used in class.
6) Before running the 100 jobs, submit a single job or two to make sure that your calculations are reasonable (time is ~<1 hour, seed is working, output files are named properly…). This step is very important! You have a responsibility here to make sure you use the class resources properly. The assignment requires about 100 CPU hours. If you use more than 1000 CPU hours you will receive a large penalty on your score for this problem! Hint: Limit the job time in your SLURM script to some reasonable amount larger than the time that you think your jobs should run (maybe 2 hours for a 1-hour job?).
7) Submit 100 jobs. Don't worry if a few sections fail (a ≤5% failure rate is acceptable).
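As a rough illustration of steps 1 to 4, here is a minimal sketch of a seeded hit-or-miss estimator that prints the single output line and extrapolates N from a short benchmark. It is not the class MC_pi.py: the function names estimate_pi and scale_n are hypothetical, and using the SLURM array task index (the SLURM_ARRAY_TASK_ID environment variable) as the seed is an assumption about how you might give each job a different seed.

```python
import os
import random
import time


def estimate_pi(n_total, seed):
    """Hit-or-miss MC: count random points in the unit square that land inside the quarter circle."""
    rng = random.Random(seed)  # private generator, so the seed fully determines the result
    start = time.perf_counter()
    n_circle = 0
    for _ in range(n_total):
        x = rng.random()
        y = rng.random()
        if x * x + y * y <= 1.0:
            n_circle += 1
    seconds = time.perf_counter() - start
    pi_estimate = 4.0 * n_circle / n_total
    return n_total, n_circle, pi_estimate, seconds, seed


def scale_n(n_benchmark, seconds_benchmark, target_seconds):
    """Extrapolate N linearly from a short benchmark run to a target run time."""
    return int(n_benchmark * target_seconds / seconds_benchmark)


# Assumption: inside a SLURM array job the task index serves as the seed; default to 1 elsewhere.
seed = int(os.environ.get("SLURM_ARRAY_TASK_ID", "1"))
result = estimate_pi(100_000, seed)

# Single output line: N_total, N_circle, pi, seconds, seed
print(", ".join(str(value) for value in result))
```

Running the same seed twice should reproduce N_circle and pi exactly (only the seconds field changes). A pure-Python loop like this is slow; a vectorized NumPy version would likely let you reach a much larger N in the same wall time.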
2) Combine Results: (2 Points)
1) Write a program [combine_pi.py] to combine the results: loop over the 100 output files, add up all of the N_total and N_circle values, and calculate the combined result. Also sum up the total number of seconds. The output of this program should be the same as the one above (except you should use hours instead of seconds): N_total, N_circle, pi, hours. Is your value correct to the 6th position beyond the decimal? 3.141592653
2) In [combine_pi.py] add a histogram for the distribution of Pi values obtained in your 100 jobs. Note: the values should be very close together!
3) Fit this distribution of the 100 estimates of pi to a Gaussian. Extract the mean and sigma as well as the uncertainty on the mean. The sigma is a measure of the uncertainty on a single result from one of your 100 sections. The uncertainty on the mean is a measure of the uncertainty of the combined result for pi. The uncertainty on MC integration is related to 1/sqrt(N). So, since you ran 100 sections, the error of the combined result should be about 10 times smaller than the uncertainty on one of your job sections. Comment on this with a comment added to the header of your MC_pi.py program. Save this image as a png file [pi_100.png].
My result looks like this:
Indeed, the uncertainty on the mean is about 10x smaller than the sigma of the distribution. Why did this come out Gaussian? It is not magic - just statistics! Another win for the Central Limit Theorem!
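The combination step and the sigma versus uncertainty-on-the-mean relation can be sketched as follows. This is only an illustration under stated assumptions: the function names parse_line and combine are hypothetical, the four sample lines are made-up values (not real job output), and the sample mean and standard deviation stand in for the Gaussian fit to the histogram. In a real combine_pi.py you would read one such line from each of your 100 output files.

```python
import math
import statistics


def parse_line(line):
    """Parse one 'N_total, N_circle, pi, seconds, seed' output line."""
    n_total, n_circle, pi_value, seconds, seed = [f.strip() for f in line.split(",")]
    return int(n_total), int(n_circle), float(pi_value), float(seconds), int(seed)


def combine(lines):
    """Sum the counts and times over all jobs and summarize the spread of the pi estimates."""
    rows = [parse_line(line) for line in lines if line.strip()]
    n_total = sum(row[0] for row in rows)
    n_circle = sum(row[1] for row in rows)
    hours = sum(row[3] for row in rows) / 3600.0
    estimates = [row[2] for row in rows]
    sigma = statistics.stdev(estimates)  # spread of a single job's estimate
    return {
        "n_total": n_total,
        "n_circle": n_circle,
        "pi": 4.0 * n_circle / n_total,  # combined estimate from the summed counts
        "hours": hours,
        "mean": statistics.mean(estimates),
        "sigma": sigma,
        "sigma_mean": sigma / math.sqrt(len(estimates)),  # uncertainty on the mean
    }


# Made-up illustrative lines, NOT real results.
sample_lines = [
    "1000, 790, 3.16, 1800.0, 1",
    "1000, 780, 3.12, 1800.0, 2",
    "1000, 786, 3.144, 1800.0, 3",
    "1000, 784, 3.136, 1800.0, 4",
]
summary = combine(sample_lines)
print(f'{summary["n_total"]}, {summary["n_circle"]}, {summary["pi"]}, {summary["hours"]}')
```

With 100 jobs, len(estimates) is 100, so sigma_mean comes out as sigma / 10, which is the factor of 10 discussed above.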
Modify HW07.ipynb from your repository to complete the following tasks. Use pandas for all tasks where it is possible. Make sure to print all requested information clearly to the "screen". Since this is a notebook, you should pay attention to when it might be best to use the print function or when you might want to have it as the last line of code in a cell.
3) Pandas
Use pandas references to figure out how to complete the following tasks with the Iris dataset. (5 points)
1) Load the Iris dataset into a pandas DataFrame and print the top 10 rows to the screen.
2) How many rows does it contain? How many columns?
3) Compute the average petal length and print it to the screen. Also, do this for each class.
4) Compute the average of each numerical column and print it to the screen.
5) Compute the average of each numerical column for each class of Iris and print it to the screen.
6) Extract the petal length outliers (defined as those rows whose petal length is more than 2 standard deviations away from the mean petal length for the full set of data). Print these rows to the screen.
7) Compute the standard deviation of all columns and for each iris species.
8) Extract the petal length outliers (i.e. those rows whose petal length is more than 2 standard deviations away from the mean petal length for each class of Iris). There are many ways to do this; you may want to explore groupby(), aggregate(), and merge(). Print these rows to the screen.
9) Investigate seaborn.pairplot and use it to make the pairplot for the Iris dataset. Save the pairplot as Iris.pairplot.png.
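The difference between outliers relative to the full data set and outliers relative to each class can be sketched with a toy DataFrame. The values below are made up for illustration (they are not Iris measurements), and the column names species and petal_length are an assumption that matches the copy of the dataset shipped with seaborn; yours may differ. This sketch uses groupby().transform(), which is one alternative to the groupby/aggregate/merge route.

```python
import pandas as pd

# Toy data, NOT the Iris dataset: two classes with one unusually long petal in class "a".
df = pd.DataFrame({
    "species": ["a"] * 10 + ["b"] * 10,
    "petal_length": [1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 1.1, 0.9, 3.0,
                     5.0, 5.1, 4.9, 5.0, 5.2, 4.8, 5.0, 5.1, 4.9, 5.0],
})

# Outliers relative to the full data set (pandas std() is the sample standard deviation).
full_mean = df["petal_length"].mean()
full_std = df["petal_length"].std()
global_outliers = df[(df["petal_length"] - full_mean).abs() > 2 * full_std]

# Outliers relative to each class: transform() broadcasts the per-class
# mean and std back onto every row, so the comparison stays row-aligned.
grouped = df.groupby("species")["petal_length"]
class_mean = grouped.transform("mean")
class_std = grouped.transform("std")
per_class_outliers = df[(df["petal_length"] - class_mean).abs() > 2 * class_std]

print(global_outliers)
print(per_class_outliers)
```

In this toy example the long petal in class "a" is not flagged against the full data set, because the two classes are far apart and inflate the overall standard deviation, but it is flagged against its own class.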
Want to be an A student? Make the pairplot again, but this time draw the outliers from part 8 in a different color on the off-diagonal scatter plots. Hint - you may need to make some new class types in your pandas DataFrame.
Push your results to GitHub: MC_pi.py, combine_pi.py, pi_100.png, pi_slurm.sh, and HW07.ipynb.