# Custom-Torque Cluster

The custom_torque setup uses the open-source Torque job manager (a free derivative of the PBS job manager) for the cluster. The custom_torque job manager requires the Matlab Compiler, but does not require the Matlab Parallel Computing Toolbox or the Matlab Distributed Computing Server. Functions that are run on the cluster nodes are compiled by the Matlab Compiler. Each job runs a "worker" shell script which calls the worker_task.m function; worker_task.m is the function that gets compiled. The actual function you want to call is passed in to worker_task as a function pointer (e.g. @csarp_task), and that is how it gets called.

## Steps for setting up and using the custom-torque

1. You can only use machines that are directly connected to the cluster.
2. Set the gRadar.sched.type variable to "custom_torque".
3. Set the gRadar.sched.data_location variable to your scratch space (this is where all the batch information will be stored). A batch is a set of jobs that use a common data location. The scratch space contains the user name file, job list file, input args, output args, stdout, and errors.
4. The gRadar.sched.url variable is not used.
5. Set the gRadar.sched.submit_arguments variable. For example:
   • IU: '-l nodes=1:ppn=1:dcwan,pmem=2gb,walltime=15:00'
   • KU/Field: '-l nodes=1:ppn=1,pmem=2gb,walltime=15:00'
   The primary difference for IU is that you need to specify that the "dcwan" resource must be available on the node. For memory-intensive applications you may need to set pmem higher than 2gb. For CPU-intensive applications you may need to increase the processing time (walltime). Increasing ppn (processors per node) can also help reduce the load per node, which may relieve memory bus, network bandwidth, and other bottlenecks, but it does mean fewer jobs can run simultaneously.
6. Set gRadar.sched.max_in_queue to 100 to avoid clogging the queue (recommended).
7. Set gRadar.sched.stop_on_fail to true (recommended).
8. Set gRadar.sched.max_retries to 4 (recommended).
9. Set your gRadar.sched.hidden_depend_funs variable (see "Compiling and hidden_depend_funs" below).
10. Set gRadar.sched.force_compile to false (recommended).
11. Set gRadar.sched.rerun_only to false (recommended).
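Several of the setup steps amount to composing the submit-argument string. As a minimal shell sketch (the pmem and walltime values below are hypothetical, not recommendations), the IU-style string can be built like this:

```shell
# Hypothetical values; tune pmem/walltime per the notes above
pmem=4gb
walltime=30:00
printf -- '-l nodes=1:ppn=1:dcwan,pmem=%s,walltime=%s\n' "$pmem" "$walltime"
```

The printed string is what would go into gRadar.sched.submit_arguments.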

## Usage

We define "batch" to be a group of jobs. Usually, but not necessarily, any one Matlab session will have one batch of jobs running. Matlab's terminology for cluster computing is different: our "batch" is a Matlab "job" and our "job" is a Matlab "task".

The job ID in Matlab is not the same as the Torque job ID. Matlab job IDs start at "1" for each batch. Torque job IDs are unique across all jobs submitted since the torque server was restarted. From the Bash shell you will always use torque job IDs and from the Matlab shell you will use Matlab job IDs.

All the status information for a batch is stored in the "ctrl" variable. This variable is passed to and modified by the custom torque functions. From the job submission Matlab, the ctrl variable is immediately available. To update ctrl and show the most recent Torque information, we use torque_job_status. For other instances of Matlab, use torque_job_list to get the ctrl structure. The most useful fields for the ctrl structure are:

| Field | Description |
| --- | --- |
| error_mask | Vector which contains the error mask code for every job. It is not valid if the job has not completed or exited. 0 = no error, 1 = output file does not exist, 2 = corrupt output file, 3 = failure in the job's Matlab code, generally caused by an exception (i.e. the "error" function being called) not being caught, 4 = non-conforming output in output file |
| job_status | Vector which contains the Torque job status for every job: 'Q' for queued, 'H' for hold, 'R' for running, 'E' for exiting, and 'C' for complete. Display using "char(ctrl.job_status)" |
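As a quick illustration of the job_status vector: char(ctrl.job_status) yields a string such as 'QQRRCCE'. A shell sketch (the status string below is hypothetical) of counting the completed jobs in such a string:

```shell
status='QQRRCCE'   # hypothetical output of char(ctrl.job_status)
# Keep only the 'C' characters and count them
completed=$(( $(printf %s "$status" | tr -cd 'C' | wc -c) ))
echo "completed jobs: $completed"
```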

Useful commands for debugging from Matlab:

• torque_batch_list to see list of batches (only shows batches created in your gRadar.sched.data_location directory)
• torque_cleanup to delete specific batches by hand
• torque_compile compiles the worker_task.m function (a compile is required every time a change is made to the cluster task functions or their dependencies)
• torque_exec_job to manually rerun specific jobs locally. For example to run batch_id 1, job_id's 2 and 3: torque_exec_job(torque_job_list([],1),[2 3]). If you already have the batch's ctrl structure, then torque_exec_job(ctrl,[2 3]). There are several modes that the job can be run in.
• torque_hold to stop/start new jobs from running or to force the running script into a keyboard statement for debugging
• torque_job_list to get the ctrl structure for a particular batch
• torque_job_status to get the status of all jobs for a particular batch
• torque_print to print all the information for a job out and to get the input and output arguments
• torque_submit_batch is an example function for submitting a batch job (useful for running one off jobs on the cluster and testing the cluster interface)

Common Matlab debugging commands:

  torque_print(BATCH_ID,JOB_ID) % Prints information about JOB_ID of BATCH_ID

  ctrl = torque_job_list([],BATCH_ID) % Gets the ctrl structure (usually used by Matlab sessions other than the submission session)
  torque_exec_job(ctrl,JOB_ID) % Execute job in local mode with no compiling
  % Fix things...
  torque_compile; % Recompile after fixes
  torque_exec_job(ctrl,JOB_ID,3) % Execute compiled version of job

  torque_hold(BATCH_ID,2) % Add a hold, which causes the Matlab submission task to drop to the debug prompt (and stop submitting jobs)
  torque_hold(BATCH_ID,3) % Remove the hold so that dbcont will continue the Matlab submission task and it can start submitting jobs again

% If in the keyboard prompt after torque_rerun runs out of retries:
% * Look at ctrl.error_mask to see which jobs failed
% * Use torque_print to see why they failed
% * Use torque_exec_job to run them again and debug them
% Once done debugging, just call "ctrl = torque_rerun(ctrl);" again (don't forget to compile and test compiled version!)
% If for some reason you fall out of debug mode and lose your session, you can either:
% 1) use torque_job_list to get the ctrl structure again and then call torque_rerun.
%    This will only run the batch's jobs and not any subsequent code.
% 2) Set "ctrl.rerun_only" to true and call the function that submits the batch again.
%    Most functions will then only recreate the outputs that do not already exist.

torque_cleanup('.*'); % Deletes all your batch jobs including those that are running
torque_cleanup([2 3]); % Deletes batch jobs 2 and 3


Useful commands for debugging from Command Line Shell (bash):

• pbsnodes, pbsnodes -l, pbsnodes -a, showq, qstat -q, qstat | grep jpaden, qstat -r, qdel, qdel 'all', checkjob TORQUE_JOB_ID, diagnose -q etc. You'll need these to see what other users are doing.
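These commands chain well with standard text tools. For example, a sketch of extracting your numeric Torque job IDs from qstat output (the sample lines below are hypothetical; on the cluster you would pipe real `qstat` output instead):

```shell
# Hypothetical qstat-style lines: job id, name, user, time, state, queue
sample='2466505.m2  worker  jpaden  00:01:02 R batch
2466506.m2  worker  jpaden  00:00:00 Q batch
2466507.m2  other   smith   00:00:10 R batch'
# Print the numeric Torque job IDs belonging to user jpaden
printf '%s\n' "$sample" | awk '$3 == "jpaden" { split($1, a, "."); print a[1] }'
```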

Two of the debugging commands (torque_cleanup, torque_hold) take a variety of inputs. You can specify a specific "ctrl" structure, a vector of numeric batch IDs, or a regular expression that matches specific users (remember that '.*' matches all users, not '*', because the argument uses regular expression syntax rather than the usual shell wildcards).

Useful files for debugging (or use torque_print):

• Watch stdout files in the batch directory, each job has its own stdout file (contains stdout outputs)
• Watch error files in the batch directory, each job has its own error file (contains stderr outputs). Most errors show up in stdout, but some show up here.
• In and out directories may also be helpful if you want to load the input or output arguments respectively for a particular job

### Job Failures

If a job fails and gRadar.sched.stop_on_fail is true, then after all jobs have completed, the function should go into "keyboard" mode. View the "ctrl.error_mask" variable to look for jobs that have failed (any job with its error mask set to non-zero). Then run torque_print(batch_id,job_id) to determine why the job failed. At IU, about 0.1% of jobs fail for a non-repeatable reason (e.g. the file system fails temporarily), and sometimes it is convenient to set stop_on_fail to false when running at IU so you do not have to worry about keyboard mode stopping execution. However, this should generally only be done when you know things are working and that resubmission of the same job will probably succeed, so you do not waste cluster resources running a broken job multiple times.

If for some reason execution of a batch fails, you can set gRadar.sched.rerun_only to true and rerun the batch. A compliant function will check which outputs already exist and only submit jobs for the missing outputs. Typically, rerun_only should be set to false, because when rerun_only is true the system does not remove temporary files, so you will have to remove temporary output files manually.

Often it is best to leave your gRadar.sched variables in the default state and use the run_master param_override functionality to set them to the non-default state in special cases only.

### Running Interactively (for debugging)

Interactive mode lets you login to the cluster node with the environment setup in the same way it is when the job actually runs. This is helpful for debugging problems that have to do with running your code on the cluster nodes themselves.

To run interactively, set sched.interactive to true and then run your batch as you normally would. For each job, the torque_submit_job function will print out the steps to run the job interactively. It is not critical, but to keep the torque submission routines in sync, one of these steps has you assign the new_job_id. Look for output from qsub like this:

qsub: waiting for job 2466505.m2 to start
... INTERACTIVE SESSION ...
qsub: job 2466505.m2 completed


In this case, run "new_job_id = 2466505" in Matlab. The "m2" suffix is the queue and is not part of the job ID.
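The numeric ID can also be pulled out of the qsub message mechanically; a small shell sketch using the message text from the example above:

```shell
line='qsub: waiting for job 2466505.m2 to start'
# Extract the digits between "waiting for job " and the first dot
new_job_id=$(printf '%s\n' "$line" | sed -n 's/.*waiting for job \([0-9]*\)\..*/\1/p')
echo "$new_job_id"   # 2466505
```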

When debugging at IU, it is a good idea to add '-q debug' (short jobs) or '-q interactive' (long jobs) to your sched.submit_arguments because your job will get processed much faster than the default queue.

There are two methods to group multiple tasks per job: 'group' and 'fill'. Each method uses a special form of the submit arguments that allows the wall time to be a function of the number of tasks per job. This is set using group_submit_arguments and group_walltime parameters. A single "%d" should be placed where the walltime is set in group_submit_arguments. The group_walltime is the amount of time per task.

#### Group

In the group mode, you set submit_mode to the string 'group' and you set the group_size to be the number of tasks per job. The example below would submit 10 tasks per job since the group_size equals 10.

param_override.sched.submit_mode = 'group'; % 'group' or 'fill'
param_override.sched.group_size = 10;
param_override.sched.group_submit_arguments = '-l nodes=1:ppn=1,pmem=10000mb,walltime=%d:00';
param_override.sched.group_walltime = 480;
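Presumably the "%d" in group_submit_arguments receives the per-task walltime multiplied by the number of tasks in the job; a shell sketch of that substitution (this arithmetic is an assumption about the toolbox's behavior, not taken from its source):

```shell
group_size=10       # tasks per job, from the example above
group_walltime=480  # walltime per task, from the example above
# Substitute the total walltime into the %d slot of group_submit_arguments
printf -- '-l nodes=1:ppn=1,pmem=10000mb,walltime=%d:00\n' $(( group_size * group_walltime ))
```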


#### Fill

In the fill mode, you set submit_mode to the string 'fill' and you set num_submissions to the total number of jobs. The tasks are then divided equally between these jobs. The example below runs 16 jobs in total, with all the tasks divided between them.

param_override.sched.submit_mode = 'fill'; % 'group' or 'fill'
param_override.sched.num_submissions = 16;
param_override.sched.group_submit_arguments = '-l nodes=1:ppn=1,pmem=10000mb,walltime=%d:00';
param_override.sched.group_walltime = 480;
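With 'fill', the per-job task count follows from the totals; a sketch assuming tasks are split as evenly as possible (the total task count below is hypothetical):

```shell
total_tasks=100      # hypothetical number of tasks in the batch
num_submissions=16   # total number of jobs, from the example above
# Ceiling division: the largest number of tasks any single job receives
per_job=$(( (total_tasks + num_submissions - 1) / num_submissions ))
echo "at most $per_job tasks per job"
```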


### Issues and Solutions

• The "_task" function cannot be inside a package, but all other functions (including functions called by the _task function) may be in a package. (A package is a namespace device in Matlab where all package functions are put into a directory starting with a "+".)
• The Matlab compiler should not be run from a package directory. For example, you must not be inside the package "+tomo" when you run the Matlab compiler. Function dependencies can end up incorrect in the compiled file if the compiler is run in a package directory. torque_compile tries to check for this condition to prevent it from happening.
• Had to "rm -rf ~/.matlab/mcr*" to fix a directory permission problem in this directory.
• To force a recompile (necessary if you change matlab versions or add hidden dependencies to startup since this cannot automatically be detected): run torque_compile.m with no input arguments
• Occasionally a temporary file permission problem will occur (i.e. just typing dbcont at the torque_rerun keyboard prompt will work). Use torque_print(batch_id,job_id) to print the stdout from the failed job. An example of this type of failure in the stdout file:
stdin: is not a tty
terminate called after throwing an instance of 'dsFileBasedLockError'
what():  File system error occurred: Exclusive create failed on /users/paden/.matlab/mcr_v716/.deploy_lock.1 (reason: Stale NFS file handle)


If this happens, verify the .deploy_lock file is gone before running dbcont to resubmit the job(s) that failed:

ls -la /users/paden/.matlab/mcr_v716/
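A small shell check along these lines (the MCR cache path is the one from the error above; adjust it for your Matlab version):

```shell
mcr_dir="$HOME/.matlab/mcr_v716"   # MCR cache path from the error message above
# Check whether any .deploy_lock files are still present
if ls "$mcr_dir"/.deploy_lock* >/dev/null 2>&1; then
  echo "deploy lock still present -- wait or investigate before dbcont"
else
  echo "no deploy lock"
fi
```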


### Compiling and hidden_depend_funs

With a proper setup, changes to dependent functions will automatically be detected and the worker_task will be recompiled automatically before it runs.

For this to happen, all "_task" functions called from worker_task.m through the function pointer AND all hidden dependencies (e.g. functions that are passed in as function pointers/strings) need to be listed in gRadar.sched.hidden_depend_funs list. This is set in the startup.m. Each cell entry is a 2 element cell vector where the first entry is the filename string and the second entry is a date-check-level integer. Currently there are three levels of date checking:

• 0: Do not check this file for changes (this is for Matlab functions that never change)
• 1: Always check this file and its dependencies for changes (this is for cresis-toolbox functions that are called through function pointers, so Matlab cannot automatically detect the dependency on the function)
• 2: Only check this file for changes when the "fun" parameter is not specifically set when calling torque_compile (i.e. if a specific function is not specified during compile time, then this forces all functions to be date checked)

Technically every file could be set to level 1, but that would slow the date checking down a lot. It is better to specify 0 or 2 when it is appropriate.
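Conceptually, level-1 checking is just a newer-than comparison between each dependency and the compiled binary; a toy shell sketch of the idea (the temp files are stand-ins, not the toolbox's actual mechanism):

```shell
compiled=$(mktemp)   # stands in for the compiled worker_task
sleep 1              # ensure a strictly later modification time
src=$(mktemp)        # stands in for a dependency that changed after the compile
# -nt: true if the first file's modification time is newer than the second's
if [ "$src" -nt "$compiled" ]; then
  echo "recompile needed"
fi
```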

Occasionally, you will need to force a recompile. If things are working properly, this should only need to be done when changes are made to the hidden dependency list and the function you want to call is going to use those new functions. Three examples are given below (note the time to execute each):

% Force a check on just the function to be called: csarp_task function
% and compile if there were changes (best)
Elapsed time is 6.715058 seconds.

% Force a check on EVERY function and compile if there were changes
>> tic; torque_compile([],[],0); toc
Elapsed time is 18.371072 seconds.

% Force a compile (skips the check and just runs the compile, necessary when new function dependencies are added)
>> torque_compile
Start Compiling 07-Dec-2012 07:11:57

Startup Script Running
Resetting path
Setting global preferences in global variable gRadar
path instance.

Done Compiling 07-Dec-2012 07:12:42
Elapsed time is 45 seconds.


## Developing code with the custom-torque

There are typically five steps you need to add to your source code:

1. Create a new batch and make sure compiled worker_task.m is up to date (torque_new_batch and torque_compile):
 global ctrl; % Make this global for convenience in debugging
 ctrl = torque_new_batch(param); % The param structure should have gRadar fields at least
 fprintf('Torque batch: %s\n', ctrl.batch_dir); % Nice for user tracking

2. Create jobs, this function is typically run in a loop (torque_create_task):
 fh = @get_heights_task;
 create_task_param.conforming = true; % Currently, this has no effect since all functions must be conforming
 create_task_param.notes = sprintf('%d/%d records %d-%d',frame, break_idx, cur_recs(1), cur_recs(end));
 % ... then call torque_create_task with ctrl, fh, and create_task_param
 % (see torque_create_task.m for the exact argument list)

3. Wait for jobs to complete and rerun jobs that fail (torque_rerun):
 ctrl = torque_rerun(ctrl);
 if any(ctrl.error_mask) % Some jobs still failed after all retries
   if ctrl.sched.stop_on_fail
     torque_cleanup(ctrl);
     error('Not all jobs completed, but out of retries (%s)', datestr(now));
   else
     warning('Not all jobs completed, but out of retries (%s)', datestr(now));
     keyboard;
   end
 else
   fprintf('Jobs completed (%s)\n\n', datestr(now));
 end

4. Clean up files, which deletes temporary files in the batch directory and kills any queued or running jobs (torque_cleanup):
 torque_cleanup(ctrl);

5. Add the function called by torque_create_task (in the above example, get_heights_task) AND any hidden dependencies to the gRadar.sched.hidden_depend_funs list in your startup script. Any function which is not explicitly called (e.g. because it is passed in as a variable from the parameter spreadsheet, like the windowing functions) must be listed here. If for some reason torque_compile.m does not automatically detect changes, you can set gRadar.sched.force_compile to true to make it compile every time the function is run (or you can call torque_compile directly). If this is a permanent addition to the toolbox, then the example_startup.m file in the toolbox should be updated and pushed to the Git repository.

# Monitoring Jobs on the Matlab local or jobmanager Scheduler

All the job monitoring functions (list_jobs, destroy_jobs, and get_job) make use of the gRadar.sched global structure. If your startup has not run and populated this variable, these functions will not work without specifying the scheduler structure on the command line.

To list all jobs and return a handle to the job manager:

jm = list_jobs;

To destroy jobs by job ID:

destroy_jobs(job_id_vector);

To destroy jobs by owner:

destroy_jobs(job_owner_regular_expression);

destroy_jobs('.*'); % Destroys all jobs, since '.*' is a regular expression which matches every string

To get information about a specific job after having gotten the job manager handle by calling list_jobs:

job = get_job(jm,job_id)
[job,tasks] = get_job(jm,job_id) % To get job and tasks handles associated with the job ID

The tasks output of get_job is the list of tasks for that job.

To debug jobs, run them with "no scheduler" so you can use Matlab's IDE to debug the code. When you do this, enable the "Debugger-->Stop on Error" feature of Matlab.

Pay careful attention to which jobs/tasks failed and set breakpoints in the corresponding scheduling function (e.g. csarp.m) to force it to run that job/task first in no scheduler mode so you don't have to wait a long time to get to the particular task that failed.

All jobs run on the cluster as the mdce user/mdce group. Outputs generally need to be in directories with "rwxrwsr-x" permissions (set group bit). Occasionally permissions in directories are not correct and processing will fail when run from the cluster. The error may be something like:

Job 1 Task 2: Unable to write file /cresis/scratch2/mdce/mcords2/2012_Greenland_P3/CSARP_qlook/20120509_01/ql_data_001_01_01/20120509_110849_img_01.mat: No such file or directory.
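The "rwxrwsr-x" mode mentioned above corresponds to octal 2775 (the leading 2 sets the setgid bit so new files inherit the directory's group); a sketch on a scratch directory:

```shell
dir=$(mktemp -d)/output_example   # hypothetical output directory
mkdir -p "$dir"
chmod 2775 "$dir"   # rwxrwsr-x: group-writable with setgid
stat -c '%A' "$dir" # prints drwxrwsr-x
```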


# Monitoring Jobs on the Torque Scheduler

If you have problems at IU:

1. Karst: Help Desk Indiana (hps-admin@indiana.edu)
2. Data capacitor (Lustre file system on dcwan or dc2): hpfs@rtinfo.indiana.edu

Try to provide the following information:

• Job ID
• Time
• Host
• Command that failed/Error message
• Directory and file command failed on

## Torque Scheduler Shell Commands

Submission arguments

-q pg@qm2

 submits to queue pg (polargrid) at qm2 (torque scheduler)


 error files are placed here (not sure if this is used or not)


-l nodes=1:ppn=8:dcwan,walltime=15:00

  each process requires 1 node with 8 processors/cores per node, and also
  requires the dcwan file server
  each process requires a maximum of 15 minutes to run

-b: time-out for qsub response
-q: specify a queue and server (pg = polargrid, qm2 = torque scheduler)
-l: constraints, for example nodes=1:ppn=2:dcwan:gpfs,walltime=15:00

  each process will get 1 node (computer)
  ppn=2 requires 2 cores per process (forces 4 processes per node)
  ppn=8 requires 8 cores per process (forces 1 process per node, good for doing records)
  walltime: MM:SS format for how long the process will take
    For records generation you may need to set 720:00 (12 hours)
    For everything else you should set 15:00 (15 minutes)
    If you set walltime too small, your jobs will get killed before they are done
    If you set walltime too large, it causes the torque scheduler to break Matlab

-m p: disable email messages
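The MM:SS walltimes above are easy to derive from hours; e.g. the 12-hour records-generation walltime:

```shell
hours=12
# Convert hours to the MM:SS walltime form used in the submit arguments
printf 'walltime=%d:00\n' $(( hours * 60 ))   # walltime=720:00
```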

Common status commands:

• qstat -q: list the queues
• qstat -r: list running jobs
• showq -i: list idle (eligible) jobs
• qalter: alter a queued job's attributes
• qsub: submit a job
• qdel: delete jobs (qdel -all or qdel TORQUE_JOB_ID)
• showstart TORQUE_JOB_ID: shows the scheduled start time for a job, e.g. showstart 3031675.m2
• pbsnodes -l: prints out node reservations
• checkjob TORQUE_JOB_ID: prints out complete information about the job, including its scheduled time

## Temporary and Debug Files

When a job is created for the Torque scheduler it creates temporary files in your matlab_torque directory. Before the job is deleted by torque_cleanup, you can put a breakpoint and monitor the outputs (stdout files are especially useful for debugging since they contain all the stdout for the job).

## Compiling Code

A common problem is forgetting to add hidden dependencies to the gRadar.sched.hidden_depend_funs list in startup.m. This must include any functions you use that are passed in as variables; the most common are the lever arm functions and window functions.

## Random Information

CLUSTER NETWORK

Each node is connected to a gigabit Ethernet switch; each switch serves 42 PG nodes that share a 10-gigabit connection into the research network. DCWAN is connected to this network via 10-gigabit Ethernet. http://rtinfo.indiana.edu/dc

INFO ON MAKING LUSTRE FASTER:

From a chat with IU support (Justin Miller): you should also stripe the directory you are writing to (see http://www.nersc.gov/users/data-and-networking/optimizing-io-performance-for-lustre/#toc-anchor-3), with something like:

lfs setstripe --count 4 directory

This spreads the I/O out across 4 disk objects, but it only applies to new data; applying it to data already on disk has no effect unless the files are moved.

INFO ON QUARRY TORQUE SCHEDULER

Also pg228 and pg229 are dedicated to polargrid. No need for a scheduler there. You also get a priority boost, which should help out.

QUEUE INFO

Under normal circumstances all 230 nodes (minus problem nodes and login nodes) will be available via the scheduler. All nodes have the dcwan feature. In the event of a dcwan outage we will always update the message of the day and remove the dcwan feature from Torque.

The easiest way to see the status of nodes at a glance is probably to run "pbsnodes -l", which shows a list of nodes that are down and/or offline. Keep in mind that login (and special-use) nodes will always show up in the down list.

 225-227: test nodes


Torque: resource manager (pbsnodes)

 Keep track of which nodes/processors are available


MOAB: scheduler

 Queues are classes (showq)
showq -w class=pg
Much better than Torque's scheduler


Which computers and file servers to use for NX and for submitting jobs:

pg228.quarry.teragrid.iu.edu
pg227.quarry.teragrid.iu.edu

These have the NX server on them. You will connect here with the NX client. See the wiki for help.

Quarry: THIS IS WHERE THE CLUSTER IS

Add these to your ~/.bashrc if you would like fast ways to connect from a cresis terminal:

alias quarry="ssh -Y quarry.teragrid.iu.edu"
alias pg228="ssh -Y pg228.quarry.teragrid.iu.edu"

Now you can just type "quarry" and it will ssh into the Quarry cluster. When you ssh into quarry you will randomly be assigned to one of the head nodes b001 through b004. From there you can ssh to any of the other machines. You can also directly ssh into the other machines:

 ssh -Y b005.quarry.teragrid.iu.edu
ssh -Y pg228.quarry.teragrid.iu.edu


IMPORTANT DEFINITION
wall-time: the time a process has on the CPU before it is killed. Note: this is actual processing time, so idle time does not count.

b001 through b004: 20 minute wall-time for every process

  Do not use these to submit jobs, since your Matlab session which submits the jobs could run into the 20 minute time limit and be killed.

b005-b008: 24 hour wall-time

  These should be fine, but are often overused.
Check "top" to see how each node is loaded


pg228, pg229: these are CReSIS only and have no limits, but "showq -u jpaden" does not work from these nodes... not a big deal since you can log in to one of the head nodes for this.

From quarry, it is easy to ssh to other nodes; no password is required

 ssh b005, ssh b006, ssh b007, ssh b008