CONDOR
A Job Control System


CONDOR is a job control system which can accept jobs from a variety of users, and assign them for execution on one or more computers that it communicates with. In particular, if a user wishes to run a program in parallel, with MPI, CONDOR can find the necessary machines and manage the parallel execution of the job.

A job that is to be controlled by CONDOR must be "noninteractive". If the program would normally read from the keyboard, then a file of input must be prepared beforehand, and CONDOR must be told to use that file for input. If the program would normally display results to the terminal, then CONDOR must be told to save such results in an output file.
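One way to check that a program is ready for CONDOR is to run it locally with the keyboard and terminal replaced by files. The following Python sketch is only an illustration: the helper name run_noninteractive and the file names are made up for this example, and a one-line Python command stands in for your own program.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_noninteractive(command, input_path, output_path):
    """Run command with stdin taken from input_path and stdout saved to output_path.

    Returns the exit status of the command.
    """
    with open(input_path, "r") as fin, open(output_path, "w") as fout:
        completed = subprocess.run(command, stdin=fin, stdout=fout)
    return completed.returncode


# Stand-in for "your program": reads one line from standard input and
# prints it back in upper case. Replace with the real command, e.g. ["./myprog"].
demo_program = [sys.executable, "-c", "print(input().upper())"]

workdir = Path(tempfile.mkdtemp())
input_file = workdir / "myprog_input.txt"    # hypothetical file names
output_file = workdir / "myprog_output.txt"

# Prepare beforehand what would otherwise be typed at the keyboard.
input_file.write_text("hello condor\n")

status = run_noninteractive(demo_program, input_file, output_file)
print("exit status:", status)
print("saved output:", output_file.read_text().strip())
```

If the program runs to completion this way, without waiting for anything to be typed, it should also be suitable for running under CONDOR.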

Master Nodes

A CONDOR job will run on one or more nodes in a cluster. Think of a node as an individual computer containing one or more processors; a cluster is a collection of cooperating nodes. One special node, called the master node, is the one from which you submit jobs to be run on the cluster.

The nodes in a cluster usually share a common name, and are distinguished by number. The master node is usually (but not always!) the first named node. And usually, you can log into the master node by specifying the common name, without having to specify the number.

For example, the phoenix cluster has 36 nodes. The master node is named phoenix000, the "next" node is phoenix001, and so on. You can log into the master node with the simpler command


        ssh phoenix
      
rather than the more correct

        ssh phoenix000
      

SCS maintains three general access clusters. Most users will be interested in Phoenix.

There are several other clusters which can be used as well, but a user requires permission from the cluster owner before getting access.

The CONDOR Universe

CONDOR allows the user to choose a universe. You choose a universe based on which features of CONDOR you need.

In the simplest universe, known as vanilla, CONDOR simply finds a suitable computer, sends the necessary files (input and executable program) to that machine, runs the program, and returns the output.

The vanilla universe is a good way to start using CONDOR. For one thing, the other universes require you to recompile your program with a special CONDOR library. Without that library, you give up the ability to do parallel processing, checkpointing, and remote system calls, but the simplest jobs need none of these features.

The standard universe also lets you submit a job to be run by CONDOR, but adds checkpointing and remote system calls. For these features to work, your executable program must be compiled with the CONDOR libraries.

The MPI universe allows you to submit a job that runs an MPI program on a given number of processors. At the moment, the compilers needed for MPI are in nonstandard locations, and no suitable aliases have been set up for them. Therefore, you need to enter the following commands (in csh or tcsh syntax) to define the MPI directory and the compilers you want to use:

        setenv MPIDIR /usr/local/mpich-1.2.4/ch_p4
        alias mpicc $MPIDIR/bin/mpicc
        alias mpiCC $MPIDIR/bin/mpiCC
        alias mpif77 $MPIDIR/bin/mpif77
        alias mpif90 $MPIDIR/bin/mpif90
        alias mpirun $MPIDIR/bin/mpirun
      

Before you submit a job in the MPI universe, you need to compile your program, using whichever of the following commands matches its language:


        mpicc myprog.c
        mpiCC myprog.C
        mpif77 myprog.f
        mpif90 myprog.f90
      
As part of your job, you also specify the number of processors on which the job is to be run. CONDOR takes care of finding a suitable number of available processors of the appropriate architecture; it copies your program to each of the processors, starts up the MPI process, and at the end of execution, gathers up the output files created by each process.

Submitting a job

To use CONDOR, the user prepares a submit description file, a text file which specifies the values of certain parameters, such as the name of the program to be run, the location of a file to be associated with standard input, a starting default directory, and so on.

Once a submit description file is prepared, the user may ask CONDOR to run the job, or more precisely, to manage the process of running the job. This is done by issuing the condor_submit command. For instance, if the submit description file was named foo.txt, then the user would type

        condor_submit foo.txt

Of course, the condor_submit command can only be issued from a machine that is running CONDOR. However, the job doesn't have to run there; the user can request that the job be run on any suitable machine in the local collection of machines.
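As a rough illustration of what a submit description file contains, the following Python sketch writes a small one for a vanilla universe job and then checks whether condor_submit is available on the current machine. The helper name write_submit_file and all the file names are hypothetical; the parameter names are the ones described on this page.

```python
import shutil
import tempfile
from pathlib import Path


def write_submit_file(path, settings):
    """Write one 'name = value' line per setting, followed by the queue command."""
    lines = [f"{name} = {value}" for name, value in settings.items()]
    lines.append("queue")  # queue must be the last command in the file
    text = "\n".join(lines) + "\n"
    Path(path).write_text(text)
    return text


# Hypothetical values for a small vanilla-universe job.
settings = {
    "universe": "vanilla",
    "executable": "foo.csh",
    "log": "foo_log.txt",
    "output": "foo_output.txt",
}

submit_path = Path(tempfile.mkdtemp()) / "foo.txt"
submit_text = write_submit_file(submit_path, settings)
print(submit_text)

# condor_submit exists only on machines running CONDOR, so check first.
if shutil.which("condor_submit"):
    print("to submit, run: condor_submit", submit_path)
else:
    print("condor_submit is not available on this machine")
```

The sketch deliberately stops short of submitting anything; it only prints the command you would type.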

The user can also request that the job be run as an MPI job, in which case CONDOR is responsible for finding a suitable number of available processors on which the job can be executed.

The condor_submit command transfers the responsibility for running the job to CONDOR. But usually your job will not run right away. While you are waiting for results, you may be curious whether the job is still waiting for a suitable machine to be found, or has started, or is well on its way to completion. To find out the status of your jobs, issue the command

        condor_q username

To find out the status of all jobs, issue the command

        condor_q

Here is some of the output from the condor_q command:

 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
 307.0   bleason         7/7  17:31   0+07:24:31 I  0   0.0  toy 1978          
 362.5   manowar         8/18 11:13  46+03:55:59 I  0   152  emigrate
 544.0   ely            10/7  13:46   0+02:25:51 R  20  2.1  sos        
 545.0   burdett        10/7  15:38   0+00:00:13 I  0   0.0  foo.csh           
      
Note, in particular, under the ST (status) column, that I indicates that the job is idle, that is, not running, while R means the job is currently running.
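If you want to summarize such a listing automatically, something like the following Python sketch may help. It assumes the column layout of the sample above (in particular, that SUBMITTED is always a date followed by a time); real condor_q output may include extra header and summary lines, which this sketch simply skips. The function name parse_condor_q is made up for this example.

```python
from collections import Counter

# Sample listing copied from the condor_q output shown above.
SAMPLE = """\
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
 307.0   bleason         7/7  17:31   0+07:24:31 I  0   0.0  toy 1978
 362.5   manowar         8/18 11:13  46+03:55:59 I  0   152  emigrate
 544.0   ely            10/7  13:46   0+02:25:51 R  20  2.1  sos
 545.0   burdett        10/7  15:38   0+00:00:13 I  0   0.0  foo.csh
"""


def parse_condor_q(text):
    """Return a list of dicts, one per job line in a condor_q style listing."""
    jobs = []
    for line in text.splitlines():
        # SUBMITTED holds a date and a time, so a job line has 8 fields
        # before the command, which may itself contain spaces.
        fields = line.strip().split(None, 8)
        if len(fields) < 9 or fields[0] == "ID":
            continue  # skip the header and blank or incomplete lines
        jobs.append({
            "id": fields[0],
            "owner": fields[1],
            "submitted": fields[2] + " " + fields[3],
            "run_time": fields[4],
            "status": fields[5],
            "priority": int(fields[6]),
            "size": float(fields[7]),
            "cmd": fields[8],
        })
    return jobs


jobs = parse_condor_q(SAMPLE)
status_counts = Counter(job["status"] for job in jobs)
print("idle:", status_counts["I"], "running:", status_counts["R"])
for job in jobs:
    if job["status"] == "R":
        print(job["id"], job["owner"], job["cmd"])
```

For the sample listing, this reports three idle jobs and one running job (544.0, owned by ely).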

CONDOR Script Parameters:

universe = vanilla | standard | mpi
The universe parameter specifies how CONDOR is to be used.
initialdir = /home/your_directory
This command specifies the directory in which your executable is stored. Note that most SCS directories will start with /a/fs.scs.fsu.edu/ but this should be replaced by /home.
executable = myprog
This command specifies the name of the executable program or shell script that is to be run.
log = logfile
This command specifies the name of a file into which CONDOR will write a running commentary of the process by which it set up and ran the job. This is occasionally useful if the job fails.
output = outputfile
This command specifies the name of the file into which the output should go. For MPI jobs, it is possible to specify a separate output file for each process, by including the symbol $(NODE) in the name, which will be replaced by the MPI processor number. Thus, you might specify an output file name of myprog_$(NODE).output.
machine_count = number
This command is only needed for MPI jobs, and specifies the number of computers to be used.
arguments = arg1 arg2 arg3
This command allows you to supply the command line arguments you would give if you were running your executable interactively.
queue
This command should be the last command in your submit file. It causes CONDOR to process your job.
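To see how these parameters fit together for an MPI job, here is a small Python sketch that assembles a submit description and lists the output files you might expect afterwards. It assumes that $(NODE) is numbered from 0 up to machine_count - 1, as MPI process ranks are; the function names and the log file name are hypothetical.

```python
def mpi_submit_text(executable, machine_count):
    """Return a submit description for an MPI-universe job using the parameters above."""
    lines = [
        "universe = mpi",
        f"executable = {executable}",
        f"log = {executable}.log",                  # hypothetical log file name
        f"output = {executable}_$(NODE).output",    # one output file per process
        f"machine_count = {machine_count}",
        "queue",                                    # must come last
    ]
    return "\n".join(lines) + "\n"


def expected_output_files(executable, machine_count):
    """List the output files, assuming $(NODE) runs from 0 to machine_count - 1."""
    template = f"{executable}_$(NODE).output"
    return [template.replace("$(NODE)", str(node)) for node in range(machine_count)]


print(mpi_submit_text("myprog", 4))
print(expected_output_files("myprog", 4))
```

Under the numbering assumption above, a job with machine_count = 4 would leave files myprog_0.output through myprog_3.output.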

Useful CONDOR commands

condor_compile cc | CC | f77 | javac myprog.c | myprog.C | myprog.f | myprog.java
is how you compile a program when the executable is to be run under CONDOR's standard universe.

condor_submit file_name
is how you submit a job.

condor_q
is how you find out the status of all jobs.

condor_q username
is how you find out the status of just your jobs.

condor_rm username
removes all jobs submitted by username from the queue, whether they are waiting or executing.

Related Data and Programs:

MPI is the Message Passing Interface, a standard that makes it possible to write programs that run in parallel on many computers. Information about MPI is available for users of specific programming languages: C, C++, FORTRAN77, and FORTRAN90.

Reference:

  1. condor.pdf,
    Condor Team,
    University of Wisconsin, Madison,
    Condor Version 6.6.10 Manual,
    the manual for Condor, a job submission system;
  2. http://www.cs.wisc.edu/condor/,
    The Condor home page;
  3. https://www.scs.fsu.edu/twiki/bin/view/TechHelp/UsingCondor,
    The SCS Condor page;

Examples and Tests:

FOO is the simplest example I could think of that would demonstrate the most basic use of CONDOR: running a basic shell script in the "vanilla" universe. This only took me a week to get right.

GOO is an example job that uses the "standard" universe. This only took me 10 minutes to get right.

HOO is an example job that uses the "MPI" universe to run a very simple MPI program on 4 processors. This only took me 20 minutes to get right.

MOO is a simple example of how to run a compiled executable program (written in C, C++ or FORTRAN) using CONDOR. We assume that no MPI stuff is going on, and no checkpointing is being done. This is just the FOO example, but with a "more interesting" executable. To make this work, compile the program on Phoenix and submit the job on Phoenix. By default, that guarantees that the executable will run on a machine just like the one from which the CONDOR job was submitted.



Last revised on 04 April 2006.