Work with a partner!
Goals
- Learn how to submit a "batch" job.
- For some of you, this will be very useful in research. So, put some effort in, and figure out how it works.
0) Get the GitHub repository and clone it to your area on Rivanna.
Use this link to accept the assignment and create your repository on GitHub: https://classroom.github.com/a/9MFdUZNa
1) SLURM
Before getting started, please familiarize yourself with the SLURM documentation page for Rivanna:
https://www.rc.virginia.edu/userinfo/rivanna/slurm/
2) Batch Computing
Much of scientific computing is done in a “batch” (non-interactive) computing environment. In these environments, many jobs can run in parallel, reducing the “wall time” required to complete a task by a large factor.
That is, imagine if your 1000-hour job could be split across 1000 CPUs: it would be done in only 1 hour! In today's lab, we will learn how to create and submit a batch job on Rivanna.
Earlier this semester we used a Monte Carlo (MC) method to calculate Pi. For this class, I've included one of my versions of that code in the repository: MC_pi.py
Take a look at it, and make sure you can run it and understand how it works.
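If you want a reminder of the method before opening the repo, the essential idea looks something like this (a minimal sketch, not necessarily identical to the MC_pi.py in the repository):

import sys
import random

# Usage: python MC_pi.py <n_trials> <seed>
n_trials = int(sys.argv[1])
random.seed(int(sys.argv[2]))

# Throw random points into the unit square; the fraction landing inside
# the quarter circle of radius 1 approaches pi/4 as n_trials grows.
n_inside = 0
for _ in range(n_trials):
    x, y = random.random(), random.random()
    if x * x + y * y <= 1.0:
        n_inside += 1

print(4.0 * n_inside / n_trials)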
Submitting a “Batch” Job
MC_pi.py takes two arguments at the command line: the number of trials, and an integer for the random number seed.
From the head nodes, we can submit batch jobs to run non-interactively on the compute nodes. Let's try a simple job. First you need to make a SLURM script (I called mine myslurm.sh; it is just a shell script with some SLURM directives).
Take a look at myslurm.sh and see if you can understand what it is doing. Basically, it sets some parameters for the batch job (like which allocation to use, and some other information about the job), and then it specifies the command to run on the compute node. In our case:
python ./MC_pi.py 1000000 2
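For reference, a minimal myslurm.sh looks roughly like this (a sketch only; the allocation and partition names below are placeholders, so use whatever the script in the repository actually sets):

#!/bin/bash
#SBATCH --job-name=mc_pi            # name shown by squeue
#SBATCH --account=your_allocation   # placeholder: use your class allocation
#SBATCH --partition=standard        # placeholder: use the partition from the repo script
#SBATCH --ntasks=1                  # a single serial task
#SBATCH --time=00:10:00             # wall-time limit

python ./MC_pi.py 1000000 2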
On the command line type:
sbatch myslurm.sh
It should submit a batch job that runs your program, and it should finish very quickly. The sbatch command sends your 'job' (here, running MC_pi.py) to a compute node and runs it there. The results from your job should appear as log files in the directory from which you submitted it. Were any output files produced in your directory?
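A quick sanity check: sbatch prints the ID it assigned to your job, and by default SLURM writes the log to a file named after that ID in the submission directory (unless myslurm.sh sets its own output name):

sbatch myslurm.sh
# prints something like: Submitted batch job 1234567
ls slurm-*.out
# the default log file, named after the job ID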
Modify your SLURM script to set the total number of points to a value large enough that the job takes ~2 minutes. Do this by checking how long MC_pi.py takes to run locally with some value of N, then figure out what factor, F, to multiply N by to get about 2 minutes. Put the new value (F*N) into myslurm.sh (a worked scaling example appears after the diagnostics link below). This job should take long enough to try some simple diagnostics while it is running. For example, try this right after you submit it:
squeue -u your_username
Review and try some of the other diagnostic checks listed under “Displaying Job Status” here (resubmit the job if it finishes before you're done): https://www.rc.virginia.edu/userinfo/rivanna/slurm/
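Here is the scaling arithmetic promised above, with made-up numbers just to show the idea:

time python ./MC_pi.py 1000000 2
# suppose 'real' reports about 1.2 seconds
# target is ~120 seconds, so F = 120 / 1.2 = 100
# new trial count: F*N = 100 * 1000000 = 100000000
# put that value (from your own timing, not these numbers!) into myslurm.sh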
Submitting a Meaningful “Batch” Job
The job above could be submitted many times, but it would do the same thing every time (unless you changed the random number seed by hand). Ideally, we want to run many copies of a job with each copy doing something slightly different, for example, using a different random seed each time so that the results differ. To do this we can use a SLURM job array!
myslurm_array.sh is another SLURM script, but this time it uses an array.
This script does a few new things:
- It splits the output into a ".out" file for standard output and a ".err" file for standard error.
- It names these files using the job ID and the array task ID, so each job section gets unique file names and is never overwritten by another section.
- When it runs MC_pi.py, it uses the array task ID as the input parameter that sets the random number seed.
So, you can submit a job with multiple sections, and each one will have a unique random seed and write its results to a different output file! Read that again; it is very important.
This time the main command run is:
python MC_pi.py 10000000 ${SLURM_ARRAY_TASK_ID} #This task ID will provide a unique random number seed for each section!!!
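Put together, the array script probably looks roughly like this (again just a sketch; the real myslurm_array.sh in the repo is the reference). The %A and %a patterns are how SLURM substitutes the job ID and the array task ID into file names:

#!/bin/bash
#SBATCH --job-name=mc_pi_array
#SBATCH --account=your_allocation    # placeholder: use your class allocation
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
#SBATCH --output=mc_pi_%A_%a.out     # %A = job ID, %a = array task ID
#SBATCH --error=mc_pi_%A_%a.err      # standard error goes to its own file

python MC_pi.py 10000000 ${SLURM_ARRAY_TASK_ID}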
For example:
sbatch --array=15,150 myslurm_array.sh
will run a job with 2 sections: one with the input seed set to 15 and one with the input seed set to 150. So, the results will be different, and they could be combined.
Try it and observe which files appear in your directory.
Here is the syntax for submitting an array over a range of values (1 through 4 in this case):
sbatch --array=1-4 myslurm_array.sh
Run that one too. Use the diagnostic tools to check on it...
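Once the sections finish, you can combine the per-seed estimates. A sketch that averages them, assuming the output files follow the mc_pi_*.out pattern from the sketch above and that each file ends with the Pi estimate on its last line (check what MC_pi.py actually prints and adjust):

import glob

# Collect the estimate from each array section's output file.
estimates = []
for fname in glob.glob("mc_pi_*.out"):    # assumed file pattern; match yours
    with open(fname) as f:
        lines = [line.strip() for line in f if line.strip()]
    estimates.append(float(lines[-1]))    # assumes the estimate is the last line

# A simple average; with equal trial counts per section this is fine.
print(sum(estimates) / len(estimates))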
Done? Push your .out files to GitHub.
3) Do the first part of HW07: submit your batch jobs...
Submit a test job
Prof. Group and Isabella can help you make sure that the jobs you submit are reasonable.