This document describes the "mechanics" of using an MPI program on Virginia Tech's System X. It assumes that an MPI program has already been written.
Refer to An Introduction to MPI for a very brief overview of MPI.
This document will simply walk you through a typical series of steps that take an MPI program on your "home" machine, run it on System X, and bring the results back home.
At the end of this document are instructions on how to actually carry out these steps, using sample files available on the web.
For this simple introduction, we'll make a number of assumptions.
(One file, simple name): We'll assume you have a program, already written, which uses MPI, that the program consists of a single file, written in C, and that this file is called program.c.
(No input/Only standard output): We'll also assume for now that the program needs no input, and that the output of the program is entirely directed to the standard output device. In other words, the executing program does not read from or write to any auxiliary files.
(Source code on home machine): We'll assume this source code file is sitting in the source_code subdirectory on your home machine home_mac.
Our goal then is to transfer the file to System X, compile it, run it, and retrieve the output.
System X comprises 1100 nodes, each containing two processors. Most of these nodes are compute nodes, which are used exclusively for computation. But a few nodes, known as compile nodes, are set aside for interactive use, allowing users to create file directories, store files, compile them, submit jobs, and so on.
The System X compile nodes we are interested in have the following IP addresses:
To compile your program, you will need to transfer the source code of your MPI program to one of these nodes. This can be done with the secure FTP program sftp. Here is a typical session, which suggests how you might transfer the file. We are assuming here that you have already set up a subdirectory on System X called work_directory.
home_mac: sftp sysx1.arc.vt.edu
sysx1: Password for user: xxxxx
sysx1: cd work_directory
sysx1: lcd source_code
sysx1: put program.c
sysx1: ls
sysx1: program.c
sysx1: quit
home_mac:
Note that the commands cd, pwd and ls are carried out on the remote machine (sysx1 in this case) while the corresponding commands lcd, lpwd and lls will be carried out on the local machine (home_mac in this example). The put command moves files from the local to the remote machine, while the get command brings files from the remote machine to the local one. If multiple files are to be transferred, the mget and mput commands can be used.
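If you repeat this transfer often, the same commands can be kept in a file and replayed with sftp's batch mode (the -b option). The following is only a sketch: the file name transfer.batch is our own invention, and since batch mode cannot type a password, it assumes you have arranged non-interactive (key-based) authentication, which may or may not be permitted on System X.

```shell
# Write the sftp commands from the session above into a batch file.
# "transfer.batch" is a hypothetical name chosen for this sketch.
batch_file=transfer.batch
cat > "$batch_file" <<'EOF'
cd work_directory
lcd source_code
put program.c
ls
quit
EOF

# Show the commands that sftp would be asked to carry out.
cat "$batch_file"

# On your home machine, the batch file would then be replayed with:
#   sftp -b transfer.batch sysx1.arc.vt.edu
# (assumes key-based login; batch mode stops at the first failing command)
```

The sftp line itself is left as a comment so that the sketch can be tried anywhere without contacting System X.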
Once the source code file has been transferred to one of the compile nodes, you can log in to the compile node and compile your file. Since there is a single file server shared by all the compile nodes, you can log in to any one of the compile nodes you like, and you will see the same set of files.
To log in interactively, we use the Secure Shell program, ssh.
home_mac: ssh sysx2.arc.vt.edu
sysx2: Password for user: xxxxx
sysx2: cd work_directory
sysx2: mpicc program.c
sysx2: mv a.out program
Note that you must use the mpicc compiler to compile a C program that invokes the MPI library. If the compile command fails because the mpicc command cannot be found, you may need to invoke it with the full path name:
/nfs/compilers/mpich-1.2.5/bin/mpicc program.c
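The choice between the two forms of the command can be made automatically. Here is a minimal sketch, assuming a Bourne-style shell; the function name find_mpicc and the variable MPICC are our own, and the fallback path is simply the one quoted above.

```shell
# Prefer an mpicc found on the PATH; otherwise fall back to the full path
# quoted in the text. find_mpicc is a hypothetical helper for this sketch.
find_mpicc() {
    if command -v mpicc >/dev/null 2>&1; then
        command -v mpicc
    else
        echo /nfs/compilers/mpich-1.2.5/bin/mpicc
    fi
}

MPICC=$(find_mpicc)
echo "Using compiler: $MPICC"

# On a compile node you would then run (left as a comment here):
#   "$MPICC" -o program program.c
```

The -o program option names the executable directly, which would make the separate "mv a.out program" step unnecessary.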
If the compilation fails, you will need to revise your program. You can either edit the program on your home machine and transfer it again, or make the changes directly on the System X copy.
In our example, we assume the compilation was successful. We allowed the compiler to assign the default name of a.out to the executable program it created, and then we renamed it to program. We're now ready to submit the program for execution, so we stay logged in.
Once the executable program has been created, you need a shell script to run the program in parallel. This script specifies the number of processors to be used, the time limit, and so on. An example of such a shell script, with explanatory comments, is available in the System X file system as "/nfs/docs/qsub-example.sh" or you can refer to this copy of qsub-example.sh.
Here is a simplified shell script for our example, which we will call program.sh.
#!/bin/bash
#
#PBS -lwalltime=00:00:30
#PBS -lnodes=2:ppn=2
#PBS -W group_list=???
#PBS -q production_q
#PBS -A $$$
NUM_NODES=`/bin/cat $PBS_NODEFILE | /usr/bin/wc -l | /usr/bin/sed "s/ //g"`
cd $PBS_O_WORKDIR
export PATH=/nfs/software/bin:$PATH
jmdrun -printhostname -np $NUM_NODES -hostfile $PBS_NODEFILE \
./program &> program_output.txt
exit;
Replace the "???" field in this file by your group information. To get your group, log into one of the System X compile nodes and type
groups
Ignore the "staff" group in the output; use the other group that is listed as the value of the "???" field in the shell script.
Also replace the "$$$" field by your "hat", that is, the account to which your computer work is to be billed. The hat value was assigned when your project was approved and set up for System X.
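The NUM_NODES line in the script counts the lines in the file named by $PBS_NODEFILE and strips any blanks from the count. As far as we know, the batch system writes one line to this file for each processor assigned to the job, so a request for nodes=2:ppn=2 should give a count of 4. The sketch below imitates this with a temporary stand-in file; the host names node001 and node002 and the helper count_slots are invented for illustration.

```shell
# Build a stand-in for $PBS_NODEFILE: one line per assigned processor.
# node001 and node002 are hypothetical host names.
nodefile=$(mktemp)
printf '%s\n' node001 node001 node002 node002 > "$nodefile"

# Count the lines and remove the padding blanks that some versions of wc emit.
count_slots() {
    wc -l < "$1" | tr -d ' '
}

NUM_NODES=$(count_slots "$nodefile")
echo "Processes to start: $NUM_NODES"

rm -f "$nodefile"
```

Despite its name, NUM_NODES therefore holds the number of processes to start, not the number of physical nodes.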
We'll assume that the shell script program.sh is stored in work_directory, the subdirectory which contains program.c and the executable program. To run the job, we must "submit" the shell script to the queuing system. To do this, move to work_directory and issue the command:
qsub program.sh
The qsub command asks the queuing system to schedule your job to run. The immediate response from the queuing system is a message that assigns a job number. The job number can be used to check on the progress of your job, and it will also be used as part of the name of the log files created when your job is done.
For example, the response to your qsub command might be
40316.queue.tcf-int.vt.edu
in which case your job number is 40316.
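If you want to keep the job number in a shell variable for later use, it can be cut out of the response with ordinary shell parameter expansion. This is a small sketch using the sample response above; the variable names response and job_number are our own.

```shell
# In practice the response would be captured from qsub itself:
#   response=$(qsub program.sh)
# Here we use the sample response from the text instead.
response="40316.queue.tcf-int.vt.edu"

# Remove everything from the first "." onward, leaving the job number.
job_number=${response%%.*}
echo "Job number: $job_number"
```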
Although our example job is small (only 30 seconds on 4 processors) and should run quickly, it is always possible to check on the status of all the jobs you have in the queue, by issuing the command
showq | grep YOUR_NAME
which might show you
JOBID USERNAME STATE PROCS REMAINING STARTTIME
40316 your_name Idle 4 00:00:30 Mon Oct 15 14:06:00
You can also use the convenient command
qstat -u YOUR_NAME
whose output format is a little different
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
-------------------- -------- -------- ---------- ------ ----- --- ------ ----- - ----
40316.queue.tcf-int. YOUR_NAM producti program -- 2 1 -- 00:00 Q --
This command gives you information about the number of nodes requested, the amount of
time and memory requested and so on. The "S" (for "status") field lists a
value of "Q", which means the job has been queued, but has not started to run.
(Note that the output under each heading is truncated if it is long).
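To pick out just the status letter, for instance in a script that waits for a job to start, the job's line can be split into fields. The sketch below works on the sample line shown above rather than on live qstat output, and it assumes the column layout shown here, in which the status is the tenth blank-separated field; check this against the output on your own system.

```shell
# Sample job line copied from the qstat output above.
line='40316.queue.tcf-int. YOUR_NAM producti program -- 2 1 -- 00:00 Q --'

# Assumption: the "S" column is the tenth blank-separated field.
status=$(printf '%s\n' "$line" | awk '{print $10}')

if [ "$status" = "Q" ]; then
    echo "Job is queued but has not started to run"
else
    echo "Job status is: $status"
fi
```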
Our shell script directs the output of the executable to the file program_output.txt. When you see this file appear in your directory, you know the program has begun to execute; however, you can't assume the program is done yet. The program is done executing (and all the commands in the shell script are completed) when the standard output and standard error log files appear in the directory.
For our example, these files would have the names
program.sh.o40316
program.sh.e40316
because they are the standard output "O" and standard error "E" associated with the run of program.sh which had been assigned the job number 40316. If you redirected the output of your executable program to a file (we did) then typically these log files won't contain anything of interest. However, if your job ran out of time, or had a run-time error, for instance, this information would be stored in the standard error log file.
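The naming rule and the "is it done yet?" check can be written down as a short sketch. It assumes the log files appear in the current directory, as in our example; the helper name job_finished and the variable names are our own.

```shell
# Build the log file names from the script name and the job number.
script=program.sh
job_number=40316
stdout_log="$script.o$job_number"
stderr_log="$script.e$job_number"

# job_finished is a hypothetical helper: true once both log files exist.
job_finished() {
    [ -e "$1" ] && [ -e "$2" ]
}

if job_finished "$stdout_log" "$stderr_log"; then
    echo "Job $job_number is done"
    # A non-empty standard error log deserves a look.
    if [ -s "$stderr_log" ]; then
        echo "Check $stderr_log for error messages"
    fi
else
    echo "Job $job_number has not finished yet"
fi
```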
Assuming the job executed satisfactorily, you can examine the results or pull them back to your home machine using the sftp program:
home_mac: sftp sysx3.arc.vt.edu
sysx3: Password for user: xxxxx
sysx3: cd work_directory
sysx3: lcd source_code
sysx3: get program_output.txt
sysx3: lls
sysx3: program.c program_output.txt
sysx3: quit
home_mac:
Sample files are available, so that you can try out the procedures for file transfer, compilation, job submission, and output file recovery.
The implementation of MPI used on System X is known as MPICH. System X is currently using version 1.2.5 of MPICH. MPICH was developed at Argonne National Laboratory, which maintains the MPICH web site at http://www-unix.mcs.anl.gov/mpi/mpich1/.
Compilers available on System X include:
The qsub command used to submit jobs is part of the Portable Batch System or "PBS". The version of PBS installed on System X is known as the TORQUE Resource manager. TORQUE was developed by a number of supercomputing centers. The web site for TORQUE is http://www.clusterresources.com/pages/products/torque-resource-manager.php.