0) Clone the GitHub repository:
https://classroom.github.com/a/ee39p7RU
Note, part 1 of this HW must be done from a terminal window (not in a Jupyter notebook, but you could use the terminal window in Jupyter). Part 2 can be done either in a Python program or in a Jupyter notebook.
...
Note: you should not wait until the last possible day to do this problem. It requires that you get 100 nodes on Rivanna and you may have to wait for those slots (depending on how busy the cluster is, and your priority). I suggest starting this problem early! “The cluster was too busy” is not a valid excuse (although I often hear it in research!).
...
The goal is to run 100 jobs with a combined result of better precision on Pi (I was able to get ~1/10^6).
To do this you need to:
1) Verify that you get the same numbers if you run your program twice with the same random number seed.
2) Verify that you get different answers if you run your program twice with a different random number seed.
3) Benchmark your pi program. Make the printed output of your program a single line: N_total, N_circle, pi, seconds, seed.
...
5) Write a SLURM script [pi_slurm.sh] to execute your job. Hint: you need a SLURM array! Start with the one you used in class.
6) Before running the 100 jobs, submit a single job or two to make sure that your calculations are reasonable (time is ~<1 hour, seed is working, output files are named properly…). This step is very important! You have a responsibility here to make sure you use the class resources properly. The assignment requires about 100 CPU hours. If you use more than 1000 CPU hours you will receive a large penalty on your score for this problem! Hint: Limit the job time in your SLURM script to some reasonable amount larger than the time that you think your jobs should run (maybe 2 hours for a 1-hour job?).
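Steps 1) to 3) above can be sketched roughly as follows. This is only an illustration, not the required solution: the function names estimate_pi and format_result are hypothetical, NumPy is assumed to be available, and the sample size is kept tiny so the checks run in well under a second.

```python
import time

import numpy as np


def estimate_pi(n_total, seed):
    """Throw n_total random points into the unit square and count
    how many land inside the quarter circle of radius 1."""
    rng = np.random.default_rng(seed)  # generator fully determined by the seed
    start = time.perf_counter()
    x = rng.random(n_total)
    y = rng.random(n_total)
    n_circle = int(np.count_nonzero(x * x + y * y <= 1.0))
    seconds = time.perf_counter() - start
    pi_est = 4.0 * n_circle / n_total
    return n_total, n_circle, pi_est, seconds, seed


def format_result(result):
    """Single output line: N_total, N_circle, pi, seconds, seed."""
    n_total, n_circle, pi_est, seconds, seed = result
    return f"{n_total}, {n_circle}, {pi_est:.8f}, {seconds:.3f}, {seed}"


if __name__ == "__main__":
    first = estimate_pi(100_000, seed=1)
    repeat = estimate_pi(100_000, seed=1)
    other = estimate_pi(100_000, seed=2)
    # Same seed: the hit count must be identical.
    assert first[1] == repeat[1]
    print(format_result(first))
    print(format_result(other))  # different seed: compare by eye
```

For a realistic N_total the two arrays would not fit in memory at once, so the points would have to be generated in chunks; that detail is left out of this sketch.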
...
Modify HW07.ipynb from your repository to complete the following tasks. Use pandas for all tasks where it is possible. Make sure to print all requested information clearly to the "screen". Since this is a notebook, you should pay attention to when it might be best to use the print function or when you might want to have it as the last line of code in a cell.
3) Pandas
Use pandas references to figure out how to complete the following tasks with the Iris dataset. (5 points)
- The Iris dataset is included in your HW10 repository. Load it into a pandas DataFrame and print the top 10 rows to the screen.
How many rows does it contain? How many columns?
Compute the average petal length and print it to the screen. Also, do this for each class.
Compute the average of each numerical column and print it to the screen.
Compute the average of each numerical column for each class of Iris and print it to the screen.
Extract the petal length outliers (defined as those rows whose petal length is more than 2 standard deviations away from the mean petal length for the full set of data). Print these rows to the screen.
Compute the standard deviation of all columns and for each iris species.
- Extract the petal length outliers (i.e. those rows whose petal length is more than 2 standard deviations away from the mean petal length for each class of Iris). There are many ways to do this; you may want to explore groupby(), aggregate(), and merge(). Print these rows to the screen.
Investigate seaborn.pairplot and use it to make the pairplot for the Iris dataset. Save the pairplot as Iris.pairplot.png.
Want to be an A student? Make the pairplot again, but this time draw the outliers from part 8 in a different color on the off-diagonal scatter plots. Hint - you may need to make some new class types in your pandas DataFrame.
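One possible shape for the per-class outlier selection is sketched below. It runs on made-up stand-in data, not the real Iris file, and the column names petal_length and species as well as the helper per_class_outliers are assumptions for illustration. It uses groupby() with transform() rather than the aggregate() and merge() route suggested above; both approaches should select the same rows.

```python
import pandas as pd

# Made-up stand-in data; the real Iris file and its column names may differ.
df = pd.DataFrame({
    "species": ["a"] * 10 + ["b"] * 10,
    "petal_length": [1.4] * 5 + [1.5] * 4 + [3.0]
                    + [4.5] * 5 + [4.4] * 4 + [6.5],
})


def per_class_outliers(frame, value_col, class_col, n_sigma=2.0):
    """Return rows whose value lies more than n_sigma standard
    deviations from the mean of their own class."""
    grouped = frame.groupby(class_col)[value_col]
    # transform() broadcasts each class statistic back onto every row,
    # so the comparison below lines up with the original index.
    class_mean = grouped.transform("mean")
    class_std = grouped.transform("std")
    mask = (frame[value_col] - class_mean).abs() > n_sigma * class_std
    return frame[mask]


outliers = per_class_outliers(df, "petal_length", "species")
print(outliers)
```

The boolean mask could also be stored as a new column, which is one way to obtain the extra class types mentioned in the hint for recoloring the pairplot.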
Push your results to GitHub: [MC_pi.py], [combine_pi.py], pi_100.png, pi_slurm.sh, and HW07.ipynb.