Frequently Asked Questions (FAQ)

High Performance Computing

The UA HPC offers free high performance computing resources in the Research Data Center (RDC), a state-of-the-art facility that hosts large computer clusters. Accounts are available to all University faculty for the purpose of research; faculty can sponsor accounts for staff, students, and visiting scholars. To get started, go to IT High Performance Computing or HPC Documentation.

LPL's Planetary Atmospheres Computing and Modeling ANalysis (PACMAN) is a 28-node (448-CPU) cluster. PACMAN is available to the Barman research group.

LPL's High Performance Astrophysics Simulator (HiPAS) is a 48-node (384-CPU) cluster. HiPAS is available to the Giacalone, Malhotra, and Yelle research groups.

PIRL (Planetary Image Research Laboratory) HPC clusters are available for use by those who have a PIRL user account. To get started, visit PIRL Laboratory Resources.

For extensive documentation on using the UA HPC, go to UA HPC Documentation.

PACMAN and HiPAS use the Torque system for queueing and Maui for scheduling batch jobs. The goal is to allocate our limited computing resources to users, on demand, as fairly as possible.

Run your jobs on the compute nodes using qsub.

Jobs running on the head node will be killed by the systems administrators.

Run your jobs from your cdata directory, not from your hipas home directory.
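For example, a typical submission might look like this (the cdata path and job script name are hypothetical):

[bjosh@hipas]$ cd /cdata/bjosh
[bjosh@hipas]$ qsub myjob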

Using Torque

To use Torque, simply put the commands you would normally use to run your job into a job script, and submit the job script to the cluster using qsub. Refer to man qsub for more detailed information as you read the following overview.

The man page for qsub is also available online at Adaptive Computing's qsub man page.

The qsub program has many options, which may be supplied on the command line or as special directives inside the PBS job script.

Example Job Script

The following job script declares a job named myjob requiring one node. It changes to the work directory, sends the execution host name, current date, and working directory to standard output, and then uses beorun to run the job on a compute node.

#!/bin/sh

## Set the job name
#PBS -N myjob
#PBS -l nodes=1

## Change to the directory the job was submitted from
cd $PBS_O_WORKDIR

echo Host: $HOSTNAME
echo Date: $(date)
echo Dir: $PWD

## Run my job on one processor, keeping the head node free
beorun --nolocal --np 1 /path/to/my/job

Assuming the above job script is in a file called myjob, you would submit it as follows:

[bjosh@hipas]$ qsub myjob
15.hipas

Note that qsub returns the Job ID immediately, although the job is simply queued to run at some future time to be decided by the scheduler. The Job ID is an incrementing integer followed by the name of the submit host.

Equivalent Job Started From Command Line

You are not required to use job scripts; you could instead type all the options and commands at the command line. However, job scripts make it easier to manage your actions and their results. The following is roughly the command-line equivalent of the job script above.

[bjosh@hipas]$ qsub -N myjob -l nodes=1:ppn=1 -j oe
cd $PBS_O_WORKDIR
echo Host: $HOSTNAME
echo Date: $(date)
echo Dir: $PWD
^D
16.hipas

We entered all of the qsub options on the initial command line. qsub then read our job commands line by line until we typed Control-D, the end-of-file character. At that point, qsub queued the job and returned the Job ID to us.

A More Complex Job Script Using MPICH

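A minimal sketch of such a script, assuming an MPI program built with the MPICH wrappers described later on this page (the job name, program path, and processor count are hypothetical):

#!/bin/sh

## Set the job name and request four processors on one node
#PBS -N mpi_demo
#PBS -l nodes=1:ppn=4
## Join the output and error streams
#PBS -j oe

## Change to the directory the job was submitted from
cd $PBS_O_WORKDIR

## Start four MPI processes, keeping the head node free
mpirun -np 4 -nolocal /path/to/my/mpi_program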

Checking Job Status

Check the status of your job using qstat. Here's an example with output:

$ qsub myjob && watch qstat -n
 
hipas:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
15.hipas         bjosh   default  myjob         --    1  --    --  00:01 Q   --
    --

The watch command is used to execute the qstat -n command every 2 seconds by default. This will help you see the progression of events. Press Control-C to interrupt watch.
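If every 2 seconds is too frequent, watch accepts an explicit interval in seconds:

$ watch -n 10 qstat -n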

Some Helpful Commands

Command                Purpose
ps -ef | bpstat -P     Display all running jobs, with the node number for each.
qstat -Q               Display the status of all queues.
qstat -n               Display the status of queued jobs.
qstat -f JOBID         Display very detailed information about JOBID.
qstat -Q -f            Display the status of all queues in more detail.
pbsnodes -a            Display the status of all nodes.

How to Find Which Nodes Your Job Is Using

1. Run qstat -an and note your job ID(s).

2. Run qstat -f JOBID and note the process ID(s) of your job(s).

3. Run ps -ef | bpstat -P | grep yourname; the number of the node running your job is displayed in the first column of the output.

Where To Find Job Output

When your job terminates, Torque will store its output and error streams in files in the script's work directory.

The output file is [JOBNAME].o[JOBID] by default. You can override that using the qsub -o PATH option.

The error file is [JOBNAME].e[JOBID] by default. You can override that using the qsub -e PATH option.

The qsub -j oe option can be used to join the output and error streams into a single file.
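For example, these directives could appear near the top of a job script (the file paths here are hypothetical):

## Send the output and error streams to specific files
#PBS -o /cdata/bjosh/myjob.out
#PBS -e /cdata/bjosh/myjob.err

## Or, instead, join both streams into the single output file
#PBS -j oe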

There are two ways to monitor the overall status of the PACMAN or HiPAS cluster: Text Mode and GUI Mode. Text Mode is the quickest.

Text Mode

If you do not have a DISPLAY redirected to your local machine, you can monitor the cluster by typing beostatus -c. You will see a screen, very similar to top, that automatically updates with information about the nodes.

GUI Mode

If you do have a display redirected, you can use the graphical monitoring tool. Access it by typing beostatus. After a short delay, a color GUI appears that you can use to monitor various aspects of the cluster. You can change the style of the graphs from the Mode menu, and exit the program either by closing the window or by selecting Quit from the File menu.


Using MPICH

MPICH is a freely available, portable implementation of MPI, the standard for message-passing libraries.

Building Your Applications

To build applications with MPICH, you should replace direct references to compilers (gcc, g77, etc.) with references to the appropriate MPICH wrapper scripts (mpicc, mpif77, etc.). The wrapper scripts are intended to supply correct include and library paths and options automatically. The wrapper script names are:

  • mpicc, the C compiler wrapper (for gcc, pgcc, or icc).
  • mpif77, the Fortran 77 compiler wrapper (for f77, pgf77, or ifc).
  • mpif90, the Fortran 90 compiler wrapper (for pgf90 or f90com).
  • mpiCC, the C++ compiler wrapper (for g++, pgCC, or icc).
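For example, compiling a simple MPI program in C (the file names here are hypothetical):

% mpicc -O2 hello_mpi.c -o hello_mpi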

To troubleshoot problems compiling with a wrapper script, use the -show argument to see what commands it would execute, and try running those commands manually. Try running mpicc -show alone to see the general effect of the wrapper script. Here's an example:

% mpicc -show test.c -o test
gcc -DUSE_STDARG -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_UNISTD_H=1 -DHAVE_STDARG_H=1 -DUSE_STDARG=1
-DMALLOC_RET_VOID=1 -L/usr/local/mpich/1.2.5.2/gcc/i686/lib  test.c -o test -lmpich

Running Your Applications

Start MPICH programs using the mpirun wrapper script.

The most common mpirun arguments are briefly described below:

Argument   Purpose
-np N      Request N processors.
-nolocal   Do not run the job on the local node (for example, the head node).
-show      Show what mpirun would do, but don't actually do it. Useful for troubleshooting.
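For example, to run the hypothetical program built above on four processors while keeping the head node free:

% mpirun -np 4 -nolocal ./hello_mpi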

For more information, please read man mpirun. Also note that some options have no effect on execution because they don't apply to the cluster's configuration.

Order Matters When Compiling and (especially) Linking

The order of arguments, flags, libraries, and files matters when compiling and linking; we must be careful to link the MKL libraries in the correct order.

Putting the include directories (e.g., -I /opt/intel/include) in the correct place on the command line may matter as well. In our case, the Makefile needed some adjusting to get all of its flags in the correct order.

MKL Libraries: Care Needed

There are a few dozen separate MKL libraries: a handful of main libraries, each in several versions (e.g., for different processors, different thread models, etc.). Care is needed to know which specific libraries a given project requires. Some sources claim it's as easy as adding "-mkl" to the compile command, but this doesn't appear to hold true in practice. Fortunately, Intel offers an online MKL Link Line Assistant tool to figure out what the link line should look like. It really does help, and anyone using the MKL libraries should use this tool early in their work. It helps in two ways: first, it shows the correct library order (see above); second, it identifies which versions of the libraries are needed (e.g., libmkl_lapack95.a vs. libmkl_lapack95_lp64.a vs. libmkl_lapack95_ilp64.a).
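As a rough illustration, and assuming the compilervars.sh script described below has already been run so that $MKLROOT is set, a link line produced by the tool for a sequential, LP64, dynamically linked ifort build might look something like this (the program name is hypothetical, and exact paths and library names vary with the installed MKL version):

% ifort myprog.f90 -o myprog -I${MKLROOT}/include \
      -L${MKLROOT}/lib/intel64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm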

Version of MKL and Setting Environment Variables

Anyone using Intel's compilers (icc, ifort) and/or MKL should make sure to run the compilervars.sh (or equivalent) script. Intel has an article about it on its developer site, "Your Official Source for Developing on Intel® Hardware and Software". Here is what you need to know about it:

  • Each installed version of Intel's software (e.g., 11.0 vs. 11.1 on the cluster) has its own compilervars.sh script. Either should work fine; usually, one should use the most up-to-date version.
  • When running it, one must include the processor argument (intel64 for the cluster and most workstations). The following should all work on the cluster, listed in descending order of specificity, and all have the same effect:
    • source /opt/intel/bin/compilervars.sh -arch intel64 -platform linux
    • source /opt/intel/bin/compilervars.sh -arch intel64
    • source /opt/intel/bin/compilervars.sh intel64
  • The script must be run before any compilation or linking of code.
  • It only needs to be run once per session to set the environment variables correctly, not every time ifort is called.
  • Running the script in one terminal affects only that terminal session. If multiple terminal windows are open, it must be run in each one.
    • Better yet, add it to your .bashrc or .cshrc file so that it runs automatically for each session, as sketched below.
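For example, in bash, a single line in ~/.bashrc sets up the Intel environment for every new session (the path is the cluster's, as given above; adjust it for other machines):

# ~/.bashrc: set up Intel compiler and MKL environment variables
source /opt/intel/bin/compilervars.sh intel64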