tscc-login.sdsc.edu
Following are examples of Secure Shell (ssh) commands that may be used to log in to the TSCC:
ssh <your_username>@tscc-login.sdsc.edu
ssh -l <your_username> tscc-login.sdsc.edu
More information about Secure Shell may be found in the New User guide. SDSC security policy may be found at the SDSC Security site. Download the TSCC Quick Reference Guide [PDF].
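If you log in often, an entry in the ssh configuration file on your local machine can shorten the command. A minimal sketch follows; the alias "tscc" is arbitrary and <your_username> is a placeholder:
# in ~/.ssh/config on your local machine
Host tscc
    HostName tscc-login.sdsc.edu
    User <your_username>
With this entry in place, "ssh tscc" is equivalent to the full commands shown above.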
TSCC provides three filesystems:
The home filesystem (/home/$USER) is served over NFS. NFS filesystems have a single server that handles all the metadata and storage requests, which means that if a job writes from multiple compute nodes and cores, the load is focused on this one server.
The parallel scratch filesystem (/oasis/tscc/scratch) is optimized for efficient handling of large files; however, it does not work nearly as well when writing many small files. We recommend using this filesystem only if your metadata load is modest, i.e., you have O(10)-O(200) files open simultaneously.
Use the node-local scratch filesystem (/state/partition1/$USER/$PBS_JOBID) if your job writes a lot of files from each task. The local scratch filesystem is purged at the end of each job, so you will need to copy out files that you want to retain after the job completes.
TSCC uses the TORQUE Resource Manager (also known by its historical name Portable Batch System, or PBS) with the Maui Cluster Scheduler to define and manage job queues. TORQUE allows the user to submit one or more jobs for execution, using parameters specified in a job script.
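The sketch below ties these pieces together: a job script, submitted through TORQUE, that stages data to the node-local scratch filesystem and copies its results out before the purge. The queue, walltime, filenames, and program are placeholders, not TSCC defaults.
#!/bin/bash
#PBS -q hotel
#PBS -l nodes=1:ppn=4
#PBS -l walltime=1:00:00
# node-local scratch; purged automatically when the job ends
SCRATCH=/state/partition1/$USER/$PBS_JOBID
mkdir -p $SCRATCH
# stage input, run locally, then copy results to the shared scratch filesystem
cp $HOME/input.dat $SCRATCH/
cd $SCRATCH
$HOME/my_program input.dat > output.dat    # placeholder executable
cp output.dat /oasis/tscc/scratch/$USER/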
The intended uses for the submit queues are as follows:
The default walltime for all queues is now one hour. The maximum number of cores has also been updated on some queues. Maximum walltimes remain in force per the list below.
Queue Name | Max Walltime | Default Walltime | Max User Cores |
---|---|---|---|
condo | 8 hours | 1 hour | 512 |
gpu-condo | 8 hours | 1 hour | 84 |
hotel | 168 hours | 1 hour | varies |
gpu-hotel | 168 hours | 1 hour | unlimited |
pdafm | 72 hours | 1 hour | 96 |
home | unlimited | 1 hour | unlimited |
glean | | 1 hour | 1024 |
For the hotel, condo, pdafm, and home queues, job charges are based on the number of cores allocated. Memory is allocated in proportion to the number of cores on the node.
Queue | Cores per Node | Memory (GB) | Memory per Core (GB) |
---|---|---|---|
hotel | 16 | 64 | 4 |
condo | 16 | 64 or 128 | 4 or 8 |
pdafm | 32 | 512 | 16 |
home | 16 | 64 or 128 | 4 or 8 |
All nodes in the system are shared, and up to 16 jobs can run on each.
Anyone can submit to the hotel queue. The total number of processors in use by all jobs running via this queue is capped at 640. Jobs submitted to this queue run only on machines with 64 GB of memory (some nodes have 128 GB; hotel jobs do not run on those).
The home queue is available to groups that have contributed nodes to the TSCC. Usage limits for those queues are equal to the number of cores contributed. Similarly, the condo queue is also restricted to contributors, so that sharing access to nodes in this queue becomes a benefit of contributing nodes to the cluster.
The glean queue is available only to node contributors of the condo cluster. Jobs are not charged but must run on idle cores and will be canceled immediately when the core is needed for a regular condo job.
Only members of the Unix groups defined for node contributors are allowed to submit to the home queue. The home queue routes jobs to specific queues based on the submitter's group membership, so the specific queue name is not used in the job submission. The total number of processors in use by all jobs running via each contributor's home queue is equal to the number of cores they contributed to the condo cluster.
Only members of Unix home groups are allowed to submit to condo (i.e., no hotel users). There is no total processor limit for the condo queue. If the system is sufficiently busy that all available processors are in use and both the hotel and condo queues have jobs waiting, the hotel jobs will run first as long as the total processors used by hotel jobs doesn't exceed the 640-processor limit. Condo jobs do not run on hotel nodes.
In practice, all TSCC nodes have slightly less than the nominal amount of memory available because of system overhead. Jobs that attempt to use more than their allocated proportion of memory will be killed.
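Because memory is allocated in proportion to cores, the usual way to give a job more memory is to request more cores per node. The sketch below (queue and walltime chosen for illustration only) reserves 8 of the 16 cores on a hotel node, and with them roughly half of the node's 64 GB of memory:
# roughly 32 GB of a 64 GB hotel node: request half of its 16 cores
#PBS -q hotel
#PBS -l nodes=1:ppn=8
#PBS -l walltime=2:00:00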
To submit a job for the PDAFM nodes, specify the pdafm queue. For example,
#PBS -q pdafm
#PBS -l nodes=2:ppn=20
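A complete submission script for a large-memory job might look like the following sketch; the job name, walltime, and application are illustrative placeholders:
#!/bin/bash
#PBS -q pdafm
#PBS -N bigmem_test
#PBS -l nodes=1:ppn=32
#PBS -l walltime=12:00:00
cd /oasis/tscc/scratch/$USER
./large_memory_app    # placeholder for your application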
To reduce email load on the mail servers, please specify an email address in your TORQUE script. For example:
#!/bin/bash
#PBS -l walltime=00:20:00
#PBS -M <your_username@ucsd.edu>
#PBS -m mail_options
or using the command line:
qsub -m mail_options -M <your_username@ucsd.edu>
These mail_options are available:
Option | Description |
---|---|
n | no mail is sent |
a | mail is sent when the job is aborted by the batch system |
b | mail is sent when the job begins execution |
e | mail is sent when the job terminates |
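For example, to receive mail when the job aborts, begins, and ends, the options can be combined (the address is a placeholder):
qsub -m abe -M <your_username@ucsd.edu> <batch_script>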
See also the Charge Policies page.
Submit a script to TORQUE:
qsub <batch_script>
The following is an example of a TORQUE batch script for running an MPI job. The script lines are discussed in the comments that follow.
#!/bin/csh
#PBS -q <queue name>
#PBS -N <job name>
#PBS -l nodes=10:ppn=2
#PBS -l walltime=0:50:00
#PBS -o <output file>
#PBS -e <error file>
#PBS -V
#PBS -M <email address list>
#PBS -m abe
#PBS -A <account name>
cd /oasis/tscc/scratch/<user name>
mpirun -v -machinefile $PBS_NODEFILE -np 20 <./mpi.out>
Comments for the above script:
Script Line | Description |
---|---|
#PBS -q <queue name> | Submit the job to the specified queue (for example, hotel or condo) |
#PBS -N <job name> | Assign a name to the job |
#PBS -l nodes=10:ppn=2 | Request 10 nodes with 2 processors per node (20 processors in total) |
#PBS -l walltime=0:50:00 | Request 50 minutes of wall clock time |
#PBS -o <output file> | Write the job's standard output to the named file |
#PBS -e <error file> | Write the job's standard error to the named file |
#PBS -V | Export the environment variables of the submitting shell to the job |
#PBS -M <email address list> | Send job status email to the listed addresses |
#PBS -m abe | Send email when the job aborts (a), begins (b), and ends (e) |
#PBS -A <account name> | Charge the job to the specified account. To ensure the correct account is charged, it is recommended that the -A option always be used. |
cd /oasis/tscc/scratch/<user name> | Change to the shared scratch directory before running the job |
mpirun -v -machinefile $PBS_NODEFILE -np 20 <./mpi.out> | Launch the MPI executable on the 20 allocated processors, using the node file supplied by TORQUE |
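Rather than hard-coding -np 20, the process count can be derived from $PBS_NODEFILE, which contains one line per allocated processor. A bash sketch of the same job (the executable name is the same placeholder as above):
#!/bin/bash
#PBS -q <queue name>
#PBS -l nodes=10:ppn=2
#PBS -l walltime=0:50:00
# one line per allocated processor, so 20 lines for nodes=10:ppn=2
NPROCS=$(wc -l < $PBS_NODEFILE)
cd /oasis/tscc/scratch/$USER
mpirun -v -machinefile $PBS_NODEFILE -np $NPROCS ./mpi.out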
Command | Description |
---|---|
qstat -a | Display the status of batch jobs |
qdel <pbs_jobid> | Delete (cancel) a queued job |
qstat -r | Show all running jobs on system |
qstat -f <pbs_jobid> | Show detailed information of the specified job |
qstat -q | Show all queues on system |
qstat -Q | Show queue limits for all queues |
qstat -B | Show quick information of the server |
pbsnodes -a | Show node status |
*View the qstat manpage for more options.
The following is an example of a TORQUE command for running an interactive job.
qsub -I -l nodes=10:ppn=2 -l walltime=0:50:00
The standard input, output, and error streams of the job are connected through qsub to the terminal session in which qsub is running.
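For example, the following hypothetical session requests two processors on one node interactively; once the job starts you are given a shell on the allocated compute node, and exiting the shell ends the job:
# request an interactive session (queue, size, and walltime are examples)
qsub -I -q hotel -l nodes=1:ppn=2 -l walltime=0:30:00
# ...after the prompt appears on the compute node:
cd /oasis/tscc/scratch/$USER
./my_program     # placeholder executable
exit             # ends the interactive job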
Users can monitor batch queues using these commands:
qstat
The command output shows the job Ids and queues, for example:
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
90.tscc-46                PBStest          hocks                  0 R hotel
91.tscc-46                PBStest          hocks                  0 Q hotel
92.tscc-46                PBStest          hocks                  0 Q hotel
showq
This command shows the jobs running, queued and blocked:
active jobs------------------------
JOBID    USERNAME    STATE    PROCS   REMAINING            STARTTIME
94       hocks       Running      8    00:09:53  Fri Apr  3 13:40:43

1 active job          8 of 16 processors in use by local jobs (50.00%)
                      8 of 8 nodes active (100.00%)

eligible jobs----------------------
JOBID    USERNAME    STATE    PROCS     WCLIMIT            QUEUETIME
95       hocks       Idle         8    00:10:00  Fri Apr  3 13:40:04
96       hocks       Idle         8    00:10:00  Fri Apr  3 13:40:05

2 eligible jobs

blocked jobs-----------------------
JOBID    USERNAME    STATE    PROCS     WCLIMIT            QUEUETIME

0 blocked jobs

Total jobs:  3
showbf
This command gives information on available time slots:
Partition   Tasks  Nodes      Duration   StartOffset       StartDate
---------   -----  -----  ------------  ------------  --------------
ALL             8      8      INFINITY      00:00:00  13:45:30_04/03
Users who are trying to choose parameters that allow their jobs to run more quickly may find this a convenient way to find open nodes and time slots.
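For example, given the showbf output above (8 free processors, one per node, with no time limit), a request sized to fit within that window should start quickly; the script name is a placeholder:
# fits within the 8 processors reported free by showbf
qsub -l nodes=8:ppn=1 -l walltime=0:30:00 <batch_script>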
For any questions, please send email to TSCC Support.
We provide two methods of reporting problems: the mailing list and the ticketing system. We make every effort to be responsive and timely on both paths. If you have an individual problem or question, we strongly encourage you to use the ticketing system.