
Quick Start Guide to Running Jobs on TSCC

System Access - Logging In

To login to the TSCC, use the following hostname:

  • tscc-login.sdsc.edu

Following are examples of Secure Shell (ssh) commands that may be used to login to the TSCC:

  • ssh <your_username>@tscc-login.sdsc.edu
  • ssh -l <your_username> tscc-login.sdsc.edu

More information about Secure Shell may be found in the New User guide. SDSC security policy may be found at the SDSC Security site. Download the TSCC Quick Reference Guide [PDF].

Important Guidelines for Running Jobs

  1. Please do not write job output to your home directory (/home/$USER). NFS filesystems have a single server which handles all the metadata and storage requirements. This means that if a job writes from multiple compute nodes and cores, the load is focused on this one server.
  2. The Lustre parallel filesystem (/oasis/tscc/scratch) is optimized for efficient handling of large files; however, it does not work nearly as well when writing many small files. We recommend using this filesystem only if your metadata load is modest – i.e., you have O(10)-O(200) files open simultaneously.
  3. Use local scratch (/state/partition1/$USER/$PBS_JOBID) if your job writes a lot of files from each task. The local scratch filesystem is purged at the end of each job, so you will need to copy out files that you want to retain after the job completes.
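
As a minimal sketch of guideline 3 above (assuming a hypothetical application my_app and an input file kept in your Lustre scratch directory), a job can stage its output through local scratch and copy back what it needs before exiting:

#!/bin/bash
#PBS -q hotel
#PBS -l nodes=1:ppn=1
#PBS -l walltime=1:00:00

# Work in node-local scratch, which is purged when the job ends.
SCRATCH=/state/partition1/$USER/$PBS_JOBID
mkdir -p "$SCRATCH"
cd "$SCRATCH"

# Hypothetical application and input file; replace with your own.
/oasis/tscc/scratch/$USER/my_app /oasis/tscc/scratch/$USER/input.dat

# Copy results you want to keep back to Lustre scratch before the job exits.
mkdir -p /oasis/tscc/scratch/$USER/results.$PBS_JOBID
cp -r "$SCRATCH"/. /oasis/tscc/scratch/$USER/results.$PBS_JOBID/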

Running Batch Jobs

Running Jobs with TORQUE

TSCC uses the TORQUE Resource Manager (also known by its historical name Portable Batch System, or PBS) with the Maui Cluster Scheduler to define and manage job queues. TORQUE allows the user to submit one or more jobs for execution, using parameters specified in a job script.

Job Queue Summary Descriptions

The intended uses for the submit queues are as follows:

  • hotel The hotel queue supports all non-contributor users of TSCC. Jobs submitted to this queue will use only the nodes purchased for the RCI project shared cluster. As such, the total number of cores in use by all hotel jobs is limited to the total number of cores in that cluster (currently 640).
  • home This is a routing queue intended for all submissions to group-specific clusters; if you intend for your job to run only within the nodes you have contributed, submit to this queue. Some users belong to more than one home group; in that case a default group is in effect, and a non-default group must be specified explicitly in the job submission.
  • condo The condo queue is exclusive to contributors, but allows jobs to run on nodes in addition to those purchased. This means that more cores can be in use than were contributed by the project, but it also limits the run time to eight hours to allow the node owners to have access per their contracted agreement.
  • glean The glean queue will allow jobs to run free of charge on any idle condo nodes. These jobs will be terminated whenever the other queues receive job submissions that can use the idle nodes. This queue is exclusive to condo participants.
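
For example, assuming a hypothetical batch script named job.sh, the target queue is chosen at submission time:

qsub -q hotel job.sh
qsub -q condo job.sh
qsub -q glean job.sh
qsub -q home job.sh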

Job Queue Characteristics

Default Walltimes Changed

The default walltime for all queues is now one hour, and the maximum core counts have been updated on some queues. Maximum walltimes remain in force, per the list below.

   Queue Name   limits
------------------------

   condo        max walltime = 8 hours
                default walltime = 1 hour
                max user cores = 512

   gpu-condo    max walltime = 8 hours
                max user cores = 84

   hotel        max walltime = 168 hours
                default walltime = 1 hour
                max user cores = varies

   gpu-hotel    max walltime = 168 hours
                max user cores = unlimited

   pdafm        max walltime = 72 hours
                default walltime = 1 hour
                max user cores = 96

   home         max walltime = unlimited
                default walltime = 1 hour
                max user cores = unlimited

   glean        default walltime = 1 hour
                max user cores = 1024

For the hotel, condo, pdafm, and home queues, job charges are based on the number of cores allocated. Memory is allocated in proportion to the number of cores on the node.

Memory per Allocated Core by Queue Type
   Queue    Cores per Node   Memory per Node (GB)   Memory per Core (GB)
   ------   --------------   --------------------   --------------------
   hotel    16               64                     4
   condo    16               64 or 128              4 or 8
   pdafm    32               512                    16
   home     16               64 or 128              4 or 8
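
For example, a hotel job requesting nodes=1:ppn=4 is allocated 4 cores and therefore 4 x 4 GB = 16 GB of memory on a 64 GB node, and it is charged for 4 cores.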

Queue Usage Policies and Restrictions

All nodes in the system are shared, and up to 16 jobs can run on each.

Anyone can submit to the hotel queue. The total number of processors in use by all jobs running via this queue is capped at 640. Jobs submitted to this queue run only on machines with 64 GB of memory (some nodes have 128 GB; hotel jobs do not run there).

The home queue is available to groups that have contributed nodes to the TSCC. Usage limits for those queues are equal to the number of cores contributed. Similarly, the condo queue is also restricted to contributors, so that sharing access to nodes in this queue becomes a benefit of contributing nodes to the cluster.

The glean queue is available only to node contributors of the condo cluster. Jobs are not charged, but they must run on idle cores and will be canceled immediately when those cores are needed for a regular condo job.

Only members of Unix groups defined for node contributors are allowed to submit to the home queue. The home queue routes jobs to group-specific queues based on the submitter's group membership, so the specific queue name is not used in the job submission. The total number of processors in use by all jobs running via each contributor's home queue is equal to the number of cores they contributed to the condo cluster.

Only members of Unix home groups are allowed to submit to condo (i.e., no hotel users). There is no total processor limit for the condo queue. If the system is sufficiently busy that all available processors are in use and both the hotel and condo queues have jobs waiting, the hotel jobs will run first as long as the total processors used by hotel jobs doesn't exceed the 640-processor limit. Condo jobs do not run on hotel nodes.

Note!

In practice, all TSCC nodes have slightly less than the nominal amount of memory available, due to system overhead. Jobs that attempt to use more than their proportional share of memory will be killed.

To submit a job for the PDAFM nodes, specify the pdafm queue, either with #PBS -q pdafm in your batch script or on the qsub command line. For example:
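
qsub -q pdafm <batch_script>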

To reduce email load on the mailservers, please specify an email address in your TORQUE script. For example:

#!/bin/bash
#PBS -l walltime=00:20:00
#PBS -M <your_username@ucsd.edu>
#PBS -m mail_options

or using the command line:

qsub -m mail_options -M <your_username@ucsd.edu> 

These mail_options are available:

    n no mail
    a mail is sent when the job is aborted by the batch system.
    b mail is sent when the job begins execution.
    e mail is sent when the job terminates.
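
These letters may be combined; for example, #PBS -m ae requests mail both when the job is aborted and when it ends.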

See also the Charge Policies page.

Submitting a Job

Submitting with a Job Script

Submit a script to TORQUE:

qsub <batch_script> 

The following is an example of a TORQUE batch script for running an MPI job. The script lines are discussed in the comments that follow.
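
#!/bin/bash
#PBS -q <queue name>
#PBS -N <job name>
#PBS -l nodes=10:ppn=2
#PBS -l walltime=0:50:00
#PBS -o <output file>
#PBS -e <error file>
#PBS -V
#PBS -M <email address list>
#PBS -m abe
#PBS -A <account name>

cd /oasis/tscc/scratch/<user name>
mpirun -v -machinefile $PBS_NODEFILE -np 20 <./mpi.out>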

Comments for the above script:

  • #PBS -q <queue name>
    Specify queue to which job is being submitted, one of:
    • hotel
    • gpu-hotel
    • condo
    • gpu-condo
    • pdafm
    • glean
  • #PBS -N <job name>
    Specify name of job
  • #PBS -l nodes=10:ppn=2
    Request 10 nodes and 2 processors per node.
  • #PBS -l walltime=0:50:00
    Reserve the requested nodes for 50 minutes
  • #PBS -o <output file>
    Redirect standard output to a file
  • #PBS -e <error file>
    Redirect standard error to a file
  • #PBS -V
    Export all my environment variables to the job
  • #PBS -M <email address list>
    Comma-separated list of users to whom email is sent
  • #PBS -m abe
    Set of conditions under which the execution server will send email about the job: (a)bort, (b)egin, (e)nd
  • #PBS -A <account name>
    Specify account to be charged for running the job; optional if user has only one account. If more than one account is available and this line is omitted, job will be charged to default account.

    To ensure the correct account is charged, it is recommended that the -A option always be used.

  • cd /oasis/tscc/scratch/<user name>
    Change to user's working directory in the Lustre filesystem
  • mpirun -v -machinefile $PBS_NODEFILE -np 20 <./mpi.out>
    Run the MPI executable (here the placeholder ./mpi.out) as a parallel job, in verbose output mode, using 20 processes, on the nodes listed in the file referenced by $PBS_NODEFILE

TORQUE Commands

Command                  Description
-------                  -----------
qstat -a                 Display the status of batch jobs
qdel <pbs_jobid>         Delete (cancel) a queued job
qstat -r                 Show all running jobs on the system
qstat -f <pbs_jobid>     Show detailed information for the specified job
qstat -q                 Show all queues on the system
qstat -Q                 Show queue limits for all queues
qstat -B                 Show summary information about the server
pbsnodes -a              Show node status

*View the qstat manpage for more options.

Submitting an Interactive Job

The following is an example of a TORQUE command for running an interactive job.

qsub -I -l nodes=10:ppn=2 -l walltime=0:50:00 

The standard input, output, and error streams of the job are connected through qsub to the terminal session in which qsub is running.
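
The same queue and resource options used for batch submissions may also be given on the command line; for example, to request an interactive session in the hotel queue:

qsub -I -q hotel -l nodes=1:ppn=2 -l walltime=0:50:00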

Monitoring Batch Queues

 

Users can monitor batch queues using these commands:

qstat

The command output shows the job IDs and queues, for example:

Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
90.tscc-46                PBStest          hocks                  0 R hotel
91.tscc-46                PBStest          hocks                  0 Q hotel
92.tscc-46                PBStest          hocks                  0 Q hotel

showq

This command shows running, eligible (queued), and blocked jobs:

active jobs------------------------
JOBID              USERNAME      STATE PROCS   REMAINING            STARTTIME
94                    hocks    Running     8    00:09:53  Fri Apr  3 13:40:43
1 active job               8 of 16 processors in use by local jobs (50.00%)
                            8 of 8 nodes active      (100.00%)

eligible jobs----------------------
JOBID              USERNAME      STATE PROCS     WCLIMIT              QUEUETIME
95                    hocks       Idle     8    00:10:00  Fri Apr  3  13:40:04
96                    hocks       Idle     8    00:10:00  Fri Apr  3  13:40:05
2 eligible jobs

blocked jobs-----------------------
JOBID              USERNAME      STATE PROCS     WCLIMIT             QUEUETIME
0 blocked jobs
Total jobs:  3

showbf

This command gives information on available time slots:

Partition     Tasks  Nodes      Duration   StartOffset       StartDate
---------     -----  -----  ------------  ------------  --------------
ALL               8      8      INFINITY      00:00:00  13:45:30_04/03

Users who are trying to choose parameters that allow their jobs to run more quickly may find this a convenient way to find open nodes and time slots.
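
In the sample output above, for example, 8 tasks on 8 nodes are free for an unlimited (INFINITY) duration starting immediately, so a job needing no more than 8 processors could be expected to start right away.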

Obtaining Support for TSCC Jobs

For any questions, please send email to TSCC Support.

Have a question or concern?

We provide two methods of reporting problems: the mailing list and the ticketing system. We make every effort to respond to both in a timely manner. If you have an individual problem or question, we strongly encourage you to use the ticketing system.