Commit

Merge pull request #463 from argonne-lcf/fix-queue-changes
the changes I pushed yesterday had the default queue backwards
felker committed Aug 14, 2024
2 parents 84d5ad3 + 7df4913 commit 7f87673
Showing 1 changed file with 4 additions and 3 deletions: `docs/sophia/queueing-and-running-jobs/running-jobs.md`
@@ -2,14 +2,14 @@

### Nodes vs Queue
The GPU nodes are NVIDIA DGX A100 nodes; each node contains eight (8) A100 GPUs. The majority of the nodes have the 40 GB A100 model, but two special nodes contain the 80 GB A100 model (see below). You may request either an entire node or a single GPU, depending on your job needs. What you get is determined by the queue you submit to (see the Queues section below).
If the queue name contains "node", you will get full nodes; if it contains "GPU", you will get a single GPU. Note that if you need more than a single GPU, you should submit to the `single-node` queue.
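As a sketch of the two request styles (the queue names come from this page, but the project name, walltime, and script name are placeholders, not taken from the docs):

```shell
# Hypothetical examples: MyProject and ./train.sh are placeholders.
# Request 2 full DGX nodes (16 GPUs total) from a node-granularity queue:
qsub -q by-node -l select=2 -l walltime=01:00:00 -A MyProject ./train.sh

# Request a single GPU from a GPU-granularity queue:
qsub -q by-gpu -l select=1 -l walltime=01:00:00 -A MyProject ./train.sh
```

Here `-l select` counts whole nodes in the node queue but individual GPUs in the GPU queue, which is the distinction the paragraph above describes.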


## <a name="Sophia-Queues"></a>Queues

There are three primary queues:

- - `by-node`: This is the **default^1^** production queue and is targeted at jobs that can utilize more than 8 GPUs. The number of "chunks" you specify in your `qsub` (i.e. `-l select=4`) will be the number of Sophia DGX nodes (with 8 GPUs each) you are allocated. Valid values are 1-22 in your select statement. If you request more than 22 your job will never run due to lack of resources.
- - `by-gpu`: This is the general production queue for jobs that can utilize 1-8 GPUs. The number of "chunks" you specify in your `qsub` (i.e. `-l select=4`) will be the number of GPUs you are allocated and they will all be on the same node. Valid values are 1,2,4, or 8 in your select statement. These restrictions ensure you get a sane set of resources (RAM is in the same NUMA node as the cores, the GPU has the minimal hops to the GPU, etc). If you specify a different value your qsub will issue an error and fail.
+ - `by-gpu`: This is the **default^1^** production queue and is targeted at jobs that can utilize 1-8 GPUs. The number of "chunks" you specify in your `qsub` (i.e. `-l select=4`) will be the number of GPUs you are allocated and they will all be on the same node. Valid values are 1, 2, 4, or 8 in your select statement. These restrictions ensure you get a sane set of resources (RAM is in the same NUMA node as the cores, the GPU has the minimal hops to the GPU, etc). If you specify a different value your qsub will issue an error and fail.
+ - `by-node`: This production queue is targeted at jobs that can utilize more than 8 GPUs. The number of "chunks" you specify in your `qsub` (i.e. `-l select=4`) will be the number of Sophia DGX nodes (with 8 GPUs each) you are allocated. Valid values are 1-22 in your select statement. If you request more than 22 your job will never run due to lack of resources.
- `bigmem`: 2 of the nodes have 80 GB of memory per GPU, while the other 22 have 40 GB per GPU (640 GB vs 320 GB of aggregate GPU memory per node). Use this queue to access one of these 2 nodes by specifying `-q bigmem` in your script. A max of 1 node (`-l select=1`) can be requested in this queue.

**1:** The default queue is where your job will be submitted if you do not include `-q <queue name>` in your `qsub` command.
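A minimal PBS job script for the default `by-gpu` queue might look like the following sketch. The walltime, project name, and the final commands are assumptions for illustration, not taken from this page:

```shell
#!/bin/bash
#PBS -q by-gpu
#PBS -l select=4            # 4 GPUs, all on one node (valid values: 1, 2, 4, or 8)
#PBS -l walltime=01:00:00   # placeholder walltime
#PBS -A MyProject           # placeholder project/allocation name

# Run from the directory the job was submitted from.
cd "$PBS_O_WORKDIR"

# Placeholder workload: list the GPUs this job was allocated.
nvidia-smi
```

Submitting the same script with `#PBS -q by-node` instead would change the meaning of `select` from GPUs to whole DGX nodes, as described above.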
@@ -39,3 +39,4 @@ Here are the initial queue limits. You may not violate any of these policies when submitting jobs:
- MaxRunning will be 5 concurrent jobs (per project)

The initial queue policy will be simple First-In-First-Out (FIFO) based on priority with EASY backfill. The `by-node` and `by-gpu` queues target non-bigmem nodes.
The old `single-node` queue is now a routing queue (redirect) to the `by-node` queue, and the old `single-gpu` queue is now a routing queue (redirect) to the `by-gpu` queue.
