Bristen¶
Bristen is an Alps cluster that provides GPU accelerators and filesystems designed to meet the needs of machine learning workloads on the Machine Learning Platform (MLP).
Cluster Specification¶
Compute Nodes¶
Bristen consists of 32 NVIDIA A100 nodes. The number of nodes can change when nodes are added or removed from other clusters on Alps.
node type | number of nodes | total CPU sockets | total GPUs |
---|---|---|---|
a100 | 32 | 32 | 128 |
Nodes are in the normal Slurm partition.
Storage and file systems¶
Bristen uses the MLP filesystems and storage policies.
Getting started¶
Logging into Bristen¶
To connect to Bristen via SSH, first refer to the SSH guide.
Then add a Host entry for Bristen to your SSH configuration file, ~/.ssh/config, so that you can connect directly using ssh bristen.
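As a rough illustration of what such an entry can look like, the sketch below assembles a Host block from standard OpenSSH ssh_config keywords (Host, HostName, User, ProxyJump, IdentityFile, IdentitiesOnly). The function name ssh_host_entry and every value passed to it are hypothetical placeholders: the real hostname, jump host, username, and key path must come from the SSH guide, and whether a jump host is required at all is an assumption here.

```python
from typing import Optional


def ssh_host_entry(
    alias: str,
    hostname: str,
    user: str,
    identity_file: str,
    jump_host: Optional[str] = None,
) -> str:
    """Return the text of one Host block for ~/.ssh/config.

    All arguments are placeholders supplied by the caller; nothing here
    is specific to Bristen beyond the alias you choose to pass in.
    """
    lines = [
        f"Host {alias}",
        f"    HostName {hostname}",
        f"    User {user}",
    ]
    if jump_host is not None:
        # Assumption: the cluster is reached through an intermediate host.
        lines.append(f"    ProxyJump {jump_host}")
    lines.append(f"    IdentityFile {identity_file}")
    # Only offer the key named above instead of every key in the agent.
    lines.append("    IdentitiesOnly yes")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder values only; replace them with those from the SSH guide.
    print(ssh_host_entry(
        alias="bristen",
        hostname="HOSTNAME_FROM_SSH_GUIDE",
        user="YOUR_USERNAME",
        identity_file="~/.ssh/YOUR_KEY",
        jump_host="JUMP_HOST_FROM_SSH_GUIDE",
    ))
```

The printed block can be reviewed and pasted into ~/.ssh/config by hand; the sketch deliberately does not write to the file itself.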
Software¶
Users are encouraged to use containers on Bristen.
- Jobs using containers can be easily set up and submitted using the container engine.
- To build images, see the guide to building container images on Alps.
Running Jobs on Bristen¶
SLURM¶
Bristen uses SLURM as the workload manager, which is used to launch and monitor distributed workloads, such as training runs.
There is currently a single Slurm partition on the system:
- the normal partition is for all production workloads.
  - nodes in this partition are not shared.
name | nodes | max nodes per job | time limit |
---|---|---|---|
normal | 32 | - | 24 hours |
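The partition limits above can be turned into a small helper that writes a batch script and rejects requests outside them. This is a minimal sketch, not a recommended template: the function names, the job name, and the training command are hypothetical, and running one task per GPU (4 per node, derived from 128 GPUs across 32 nodes) is an assumption about the workload. The #SBATCH options used (--job-name, --partition, --nodes, --ntasks-per-node, --time) are standard sbatch options.

```python
TOTAL_NODES = 32     # nodes in the normal partition (from the table above)
MAX_HOURS = 24       # time limit of the normal partition
GPUS_PER_NODE = 4    # 128 GPUs / 32 nodes


def format_walltime(hours: float) -> str:
    """Convert a duration in hours to Slurm's HH:MM:SS format."""
    total_seconds = int(round(hours * 3600))
    h, rem = divmod(total_seconds, 3600)
    m, s = divmod(rem, 60)
    return f"{h:02d}:{m:02d}:{s:02d}"


def make_sbatch_script(job_name: str, nodes: int, hours: float, command: str) -> str:
    """Return the text of a batch script for the normal partition."""
    if not 1 <= nodes <= TOTAL_NODES:
        raise ValueError(f"nodes must be between 1 and {TOTAL_NODES}")
    if not 0 < hours <= MAX_HOURS:
        raise ValueError(f"time limit must be at most {MAX_HOURS} hours")
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        "#SBATCH --partition=normal",
        f"#SBATCH --nodes={nodes}",
        # Assumption: one task per GPU; nodes are not shared with other jobs.
        f"#SBATCH --ntasks-per-node={GPUS_PER_NODE}",
        f"#SBATCH --time={format_walltime(hours)}",
        "",
        f"srun {command}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical job: 2 nodes for 12 hours.
    print(make_sbatch_script("train", 2, 12, "python train.py"))
```

The generated text can be saved to a file and submitted with sbatch. Because the node count can change over time, TOTAL_NODES is best treated as a snapshot rather than a fixed limit.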
FirecREST¶
Bristen can also be accessed using FirecREST at the https://api.cscs.ch/ml/firecrest/v1 API endpoint.
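To show how the endpoint is used, the sketch below builds (but does not send) an authenticated GET request with Python's standard library. Several things are assumptions: that requests carry an OIDC access token as a Bearer header, that /status/systems is an available path as in the FirecREST v1 API reference, and the helper name build_request itself. How to obtain a token is not covered here.

```python
import urllib.request

FIRECREST_URL = "https://api.cscs.ch/ml/firecrest/v1"


def build_request(path: str, token: str) -> urllib.request.Request:
    """Build a GET request for a path below the FirecREST endpoint.

    `token` is an access token obtained separately; this function does
    not contact any server.
    """
    url = FIRECREST_URL.rstrip("/") + "/" + path.lstrip("/")
    request = urllib.request.Request(url, method="GET")
    # Assumption: the API expects a Bearer token in the Authorization header.
    request.add_header("Authorization", f"Bearer {token}")
    return request


if __name__ == "__main__":
    # Placeholder token; "/status/systems" is assumed from the v1 API reference.
    req = build_request("/status/systems", "YOUR_ACCESS_TOKEN")
    print(req.get_method(), req.full_url)
    # To send it: urllib.request.urlopen(req) with a valid token.
```

For anything beyond a quick check, a dedicated FirecREST client library is likely a better fit than hand-built requests.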
Scheduled Maintenance¶
Wednesday mornings from 08:00 to 12:00 CET are reserved for periodic updates, and services may be unavailable during this window. If the queues must be drained (for example to redeploy node images or reboot compute nodes), a Slurm reservation will be in place to prevent jobs from running into the maintenance window.
Exceptional and non-disruptive updates may happen outside this time frame and will be announced on the users mailing list and on the CSCS status page.
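When a reservation is in place, a job whose requested time limit reaches into the Wednesday window will wait until after maintenance. The sketch below checks whether a planned start time and time limit would overlap the 08:00 to 12:00 window, which can help when choosing a shorter time limit. It is an illustration only: the function name is hypothetical, datetimes are naive and assumed to be in the cluster's local time, and it does not know whether a reservation actually exists in a given week.

```python
from datetime import datetime, time, timedelta

MAINTENANCE_WEEKDAY = 2        # Monday is 0, so 2 is Wednesday
WINDOW_START = time(8, 0)      # 08:00 local time
WINDOW_END = time(12, 0)       # 12:00 local time


def overlaps_maintenance(start: datetime, limit: timedelta) -> bool:
    """Return True if [start, start + limit) touches a Wednesday window."""
    end = start + limit
    day = start.date()
    # Walk over every calendar day the job could span.
    while day <= end.date():
        if day.weekday() == MAINTENANCE_WEEKDAY:
            window_start = datetime.combine(day, WINDOW_START)
            window_end = datetime.combine(day, WINDOW_END)
            # Two half-open intervals overlap if each starts before the other ends.
            if start < window_end and end > window_start:
                return True
        day += timedelta(days=1)
    return False


if __name__ == "__main__":
    # 2025-03-04 is a Tuesday; a 24-hour limit from 20:00 reaches Wednesday 20:00.
    print(overlaps_maintenance(datetime(2025, 3, 4, 20, 0), timedelta(hours=24)))
```

A job starting on Tuesday evening with the full 24-hour limit overlaps the window, while the same job with a limit that ends before Wednesday 08:00 does not.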
Change log¶
2025-03-05 container engine updated
The container engine now supports better containers that go faster. Users do not need to change their workflow to take advantage of these updates.