There are three ways to monitor the overall status of the PACMAN or HiPAS cluster. Text Mode is the quickest.
Text Mode
If you do not have a DISPLAY redirected to your local machine, you can monitor the cluster by typing beostatus -c. You will see a screen very similar to top that automatically updates with information about the nodes.
GUI Mode
If you do have a display redirected, you can use the graphical monitoring tool. You can access it by typing beostatus. After a short delay, a color GUI appears that you can use to monitor various aspects of the cluster. You can change the style of the graphs by clicking on the Mode menu. You can exit the program by either closing the window or clicking on the File menu and selecting Quit.
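The choice between the two modes can be scripted. The following is a minimal sketch, assuming that a non-empty DISPLAY variable means a display is redirected; choose_beostatus is a hypothetical helper name and is not part of the cluster software.

```shell
#!/bin/sh
# choose_beostatus (hypothetical helper): print the beostatus command
# that suits the current session.
choose_beostatus() {
    if [ -n "${DISPLAY:-}" ]; then
        # A display appears to be redirected: use the graphical tool.
        echo "beostatus"
    else
        # No display: fall back to text mode.
        echo "beostatus -c"
    fi
}
```

On a cluster login node, running `$(choose_beostatus)` would start the monitor in the matching mode.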
Monitoring Job Status
Check the status of your job using qstat. Here's an example with output:
    $ qsub myjob && watch qstat -n

    master:
                                                                Req'd  Req'd   Elap
    Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
    --------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
    15.hipas        bjosh    default  myjob          --   1  --     -- 00:01 Q    --
       --
The watch command re-runs qstat -n every 2 seconds by default, which lets you follow the job as it progresses. Press Control-C to interrupt watch.
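When only the job state matters, the qstat -n listing can be reduced to one line per job. This is a rough sketch, assuming the eleven-column layout shown in the example output (state in the tenth column) and plain numeric job IDs such as 15.hipas; job_states is a hypothetical helper name.

```shell
#!/bin/sh
# job_states (hypothetical helper): read `qstat -n` output on stdin and
# print "JOBID STATE" for every job line.
# Assumption: job lines start with a numeric ID followed by a dot and
# carry at least 11 whitespace-separated fields, with the state in field 10.
# Header, separator, and node-list lines do not match and are skipped.
job_states() {
    awk '$1 ~ /^[0-9]+\./ && NF >= 11 { print $1, $10 }'
}
```

With Torque available, `qstat -n | job_states` would print lines such as `15.hipas Q`.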
Some Helpful Commands
Command | Purpose |
---|---|
ps -ef \| bpstat -P | Display all running jobs, with node number for each. |
qstat -Q | Display status of all queues. |
qstat -n | Display status of queued jobs. |
qstat -f JOBID | Display very detailed information about JOBID. |
qstat -Q -f | Display status of all queues in more detail. |
pbsnodes -a | Display status of all nodes. |
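The pbsnodes -a listing can be long on a large cluster, so a per-state count is sometimes easier to read. The sketch below assumes the usual Torque layout, in which each node name is followed by indented attribute lines such as "state = free"; node_state_summary is a hypothetical helper name.

```shell
#!/bin/sh
# node_state_summary (hypothetical helper): read `pbsnodes -a` output on
# stdin and print how many nodes report each state.
# Assumption: every node block contains a line of the form "state = VALUE".
# Combined values such as "down,offline" are counted as their own state.
node_state_summary() {
    awk '$1 == "state" && $2 == "=" { count[$3]++ }
         END { for (s in count) print s, count[s] }' | sort
}
```

Typical use would be `pbsnodes -a | node_state_summary`.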
How to Find Which Nodes Your Job is Using
Run qstat -an and note your job ID(s).
Run qstat -f JOBID for each job and note the process ID(s) of your job(s).
Run ps -ef | bpstat -P | grep yourname, replacing yourname with your username. The number of the node running your job is displayed in the first column of the output.
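The last step can be narrowed to a list of node numbers. This is a minimal sketch, assuming that bpstat -P puts the node number in the first column and leaves the standard ps -ef columns after it, so that the user name lands in the second column; nodes_for_user is a hypothetical helper name.

```shell
#!/bin/sh
# nodes_for_user (hypothetical helper): read `ps -ef | bpstat -P` output on
# stdin and print the distinct node numbers running processes owned by USER.
# Assumption: column 1 is the node number and column 2 is the user name.
# Lines without a numeric node number (for example the header) are skipped.
nodes_for_user() {
    user=$1
    awk -v u="$user" '$1 ~ /^[0-9]+$/ && $2 == u { print $1 }' | sort -un
}
```

On the cluster this would be run as `ps -ef | bpstat -P | nodes_for_user yourname`.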
Where To Find Job Output
When your job terminates, Torque will store its output and error streams in files in the script's work directory.
The output file is [JOBNAME].o[JOBID] by default. You can override that using the qsub -o PATH option.
The error file is [JOBNAME].e[JOBID] by default. You can override that using the qsub -e PATH option.
The qsub -j oe option can be used to join the output and error streams into a single file.
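The default names can be worked out ahead of time, for example to tail a file once the job finishes. The sketch below assumes that the [JOBID] part of the file name is the numeric portion of the job ID (15 for 15.hipas); default_output_files is a hypothetical helper name.

```shell
#!/bin/sh
# default_output_files (hypothetical helper): print the default output and
# error file names for a job, given its name and its job ID.
# Assumption: only the numeric part of the job ID appears in the file name,
# so anything after the first dot is dropped.
default_output_files() {
    jobname=$1
    seq=${2%%.*}                 # "15.hipas" -> "15"
    echo "${jobname}.o${seq}"    # standard output
    echo "${jobname}.e${seq}"    # standard error
}
```

For the job in the earlier example, `default_output_files myjob 15.hipas` prints myjob.o15 and myjob.e15. A submission such as `qsub -j oe -o run.log myjob` (run.log is an illustrative path) would instead write both streams to a single file.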