High Throughput Computing Administration Guide

This document contains SLURM administrator information specifically for high throughput computing, namely the execution of many short jobs. Getting optimal performance for high throughput computing does require some tuning, and this document should help get you off to a good start. A working knowledge of SLURM should be considered a prerequisite for this material.

Performance Results

SLURM has been validated to execute 500 simple batch jobs per second on a sustained basis, with short bursts of activity at a much higher level. Actual performance depends upon the jobs to be executed plus the hardware and configuration used.

System Configuration

Three system configuration parameters must be set to support a large number of open files and TCP connections with large bursts of messages. Changes can be made in the /etc/rc.d/rc.local script or the /etc/sysctl.conf file so that they persist across reboots. You can also write values directly into the /proc files to take effect immediately (e.g. "echo 32832 > /proc/sys/fs/file-max"). A sample sysctl.conf fragment appears after the list below.

  • /proc/sys/fs/file-max: The maximum number of concurrently open files. We recommend a limit of at least 32,832.
  • /proc/sys/net/ipv4/tcp_max_syn_backlog: Maximum number of remembered connection requests that have not yet received an acknowledgment from the connecting client. The default value is 1024 for systems with more than 128 MB of memory, and 128 for low-memory machines. If the server suffers from overload, try increasing this number.
  • /proc/sys/net/core/somaxconn: Limit of socket listen() backlog, known in userspace as SOMAXCONN. Defaults to 128. The value should be raised substantially to support bursts of requests. For example, to support a burst of 1024 requests, set somaxconn to 1024.
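
For reference, a minimal /etc/sysctl.conf fragment reflecting the recommendations above might look like the following (the tcp_max_syn_backlog value shown is an illustrative choice rather than a recommendation from this guide; tune all values for your site):

    # Example /etc/sysctl.conf entries for high throughput SLURM use
    fs.file-max = 32832
    net.ipv4.tcp_max_syn_backlog = 2048
    net.core.somaxconn = 1024

Running "sysctl -p" applies the values from /etc/sysctl.conf without a reboot.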

The transmit queue length (txqueuelen) may also need to be modified using the ifconfig command. A value of 4096 has been found to work well for one site with a very large cluster (e.g. "ifconfig <interface> txqueuelen 4096").
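
For example, on an interface named eth0 (a hypothetical interface name; substitute your own), either of the following commands sets the transmit queue length; the second form uses the newer iproute2 tooling. Add the command to /etc/rc.d/rc.local to make it persist across reboots.

    ifconfig eth0 txqueuelen 4096
    ip link set dev eth0 txqueuelen 4096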

User Limits

The ulimit values in effect for the slurmctld daemon should be set quite high for memory size, open file count and stack size.
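
One way to raise these limits is in the script that starts the slurmctld daemon, before the daemon is launched. The values below are an illustrative sketch, not prescribed settings:

    # Raise limits for slurmctld before starting it
    ulimit -m unlimited    # maximum resident memory size
    ulimit -n 32832        # open file descriptors
    ulimit -s unlimited    # stack size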

SLURM Configuration

NOTE: Substantial changes were made in SLURM version 2.4 to support higher throughput rates. Version 2.5 includes more enhancements.

Several SLURM configuration parameters should be adjusted to reflect the needs of high throughput computing. The changes described below will not be possible in all environments, but these are the configuration options that you may want to consider for higher throughput. A sample slurm.conf fragment combining several of them appears after the list.

  • AccountingStorageType: Disabling accounting will improve job throughput. Disable storing of accounting by using the accounting_storage/none plugin.
  • JobAcctGatherType: Disabling the collection of job accounting information will improve job throughput. Disable collection of accounting by using the jobacct_gather/none plugin.
  • JobCompType: Disabling recording of job completion information will improve job throughput. Disable recording of job completion information by using the jobcomp/none plugin.
  • MaxJobCount: Controls how many jobs may be in the slurmctld daemon records at any point in time (pending, running, suspended, or recently completed and temporarily retained). The default value is 10,000.
  • MessageTimeout: Controls how long to wait for a response to messages. The default value is 10 seconds. While the slurmctld daemon is highly threaded, its responsiveness is load dependent. This value might need to be increased somewhat.
  • MinJobAge: Controls how soon the record of a completed job can be purged from the slurmctld memory and thus not visible using the squeue command. The record of jobs run will be preserved in accounting records and logs. The default value is 300 seconds. The value should be reduced to a few seconds if possible. Use of accounting records for older jobs can increase the job throughput rate compared with retaining old jobs in the memory of the slurmctld daemon.
  • ProctrackType: Avoid using proctrack/cgroup; it is considerably slower than other alternatives. proctrack/pgid (the default) is advised.
  • PriorityType: The priority/builtin plugin is considerably faster than other options, but schedules jobs only on a First In, First Out (FIFO) basis.
  • SchedulerParameters: Several scheduling parameters are available.
    • Setting option defer will avoid attempting to schedule each job individually at job submit time, but defer it until a later time when scheduling multiple jobs simultaneously may be possible. This option may improve system responsiveness when large numbers of jobs (many hundreds) are submitted at the same time, but it will delay the initiation time of individual jobs.
    • A variation of defer would be to configure default_queue_depth to a relatively small number to avoid attempting to schedule large numbers of jobs every time some job completes or another routine action occurs. (NOTE: the default value of default_queue_depth should be fine in most cases).
    • The sched/backfill plugin has relatively high overhead if used with large numbers of jobs. Configuring max_job_bf to a modest size (say 100 jobs or less) and bf_interval to 30 seconds or more will limit the overhead of backfill scheduling (NOTE: the default values are fine for both of these parameters). Other options available for tuning backfill scheduling include bf_max_job_user, bf_resolution and bf_window. See the slurm.conf man page for details.
  • SchedulerType: If most jobs are short lived then use of the sched/builtin plugin is recommended. This manages a queue of jobs on a First-In-First-Out (FIFO) basis and eliminates logic used to sort the queue by priority.
  • SelectType: If only serial jobs (single CPU jobs) are to be executed, use of the select/serial plugin is recommended. This plugin eliminates much of the logic found in other select plugins to optimize job allocations with respect to network topology. It also reduces the communications required to execute batch jobs by 75 percent through the use of a "pull" model, in which the slurmd daemon sends a single message to the slurmctld daemon upon completion of each job's epilog script. The response to this message can contain the information required to initiate another job using the resources just released by the previous job. NOTE: The use of select/serial prevents the job's Epilog program from being initiated with any SPANK environment variables.
  • SlurmctldPort: It is desirable to configure the slurmctld daemon to accept incoming messages on more than one port in order to avoid having incoming messages discarded by the operating system due to exceeding the SOMAXCONN limit described above. Using between two and ten ports is suggested when large numbers of simultaneous requests are to be supported.
  • SlurmctldDebug: More detailed logging will decrease system throughput. Set to 2 (log errors only) or 3 (general information logging). Each increment in the logging level will increase the number of messages by a factor of about 3.
  • SlurmdDebug: More detailed logging will decrease system throughput. Set to 2 (log errors only) or 3 (general information logging). Each increment in the logging level will increase the number of messages by a factor of about 3.
  • SlurmdLogFile: Writing to local storage is recommended.
  • TaskPlugin: Avoid using task/cgroup; it is slower than other alternatives. It is not as heavy as the cgroup proctrack plugin, but it still adds overhead. On the same note, task/affinity does not appear to add any measurable overhead.
  • Other: Configure logging, accounting and other overhead to a minimum appropriate for your environment.
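
As a concrete reference, a slurm.conf fragment combining several of the recommendations above might look like the following. All values are illustrative starting points rather than prescriptions, and the port range and log file path in particular are hypothetical; adjust everything for your site:

    # Example slurm.conf settings for high throughput computing
    AccountingStorageType=accounting_storage/none
    JobAcctGatherType=jobacct_gather/none
    JobCompType=jobcomp/none
    MaxJobCount=100000
    MessageTimeout=30
    MinJobAge=10
    PriorityType=priority/builtin
    ProctrackType=proctrack/pgid
    SchedulerType=sched/builtin
    SchedulerParameters=defer
    SelectType=select/serial
    SlurmctldPort=6820-6825
    SlurmctldDebug=3
    SlurmdDebug=3
    SlurmdLogFile=/var/log/slurmd.log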

Last modified 21 May 2013