CP2K¶
CP2K is a quantum chemistry and solid state physics software package that can perform atomistic simulations of solid state, liquid, molecular, periodic, material, crystal, and biological systems.
CP2K provides a general framework for different modeling methods, such as DFT using the mixed Gaussian and plane waves approaches GPW and GAPW. Supported theory levels include DFTB, LDA, GGA, MP2, RPA, semi-empirical methods (AM1, PM3, PM6, RM1, MNDO, …), and classical force fields (AMBER, CHARMM, …). CP2K can perform simulations of molecular dynamics, metadynamics, Monte Carlo, Ehrenfest dynamics, vibrational analysis, core level spectroscopy, energy minimization, and transition state optimization using NEB or the dimer method. See CP2K Features for a detailed overview.
uenvs
CP2K is provided on ALPS via uenv. Please have a look at the uenv documentation for more information about uenvs and how to use them.
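For example, a possible interactive workflow on a login node looks like the following sketch; the image name cp2k/2025.1:v1 is only an illustration, use uenv image find cp2k to see which versions are available on your system:
uenv image find cp2k                    # list the CP2K uenv images available
uenv start --view=cp2k cp2k/2025.1:v1   # start a shell with the cp2k view loaded
cp2k.psmp --version                     # the CP2K executable is now in the PATH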
Dependencies¶
On our systems, CP2K is built with the following dependencies:
- COSMA
- Cray MPICH
- DBCSR
- DLA-Future
- dftd4 (from cp2k@2025.1 onwards)
- ELPA
- FFTW
- Libxc
- libint
- OpenBLAS
- PLUMED (from cp2k@2024.1 onwards)
- ScaLAPACK
- SIRIUS
- Spglib
- spla
GPU-aware MPI
COSMA and DLA-Future are built with GPU-aware MPI, which requires setting MPICH_GPU_SUPPORT_ENABLED=1.
On the HPC platform, MPICH_GPU_SUPPORT_ENABLED=1 is set by default.
CUDA cache path for JIT compilation
DBCSR uses JIT compilation for its CUDA kernels. The default cache location is in the home directory, which can put an unnecessary burden on the filesystem and lead to performance degradation. Because of this, we set CUDA_CACHE_PATH to point to the in-memory filesystem in /dev/shm.
On the HPC platform, CUDA_CACHE_PATH is set to a directory under /dev/shm by default.
BLAS/LAPACK on Eiger
On Eiger, the default BLAS/LAPACK library is Intel oneAPI MKL (oneMKL) up to and including cp2k@2024.3. From cp2k@2025.1 onwards, the default BLAS/LAPACK library is OpenBLAS.
Running CP2K¶
Running on the HPC platform¶
To start a job, two bash scripts are potentially required: a Slurm submission script and a wrapper that starts the CUDA MPS daemon so that multiple MPI ranks can share the same GPU.
#!/bin/bash -l
#SBATCH --job-name=cp2k-job
#SBATCH --time=00:30:00 # (1)!
#SBATCH --nodes=4
#SBATCH --ntasks-per-core=1
#SBATCH --ntasks-per-node=32 # (2)!
#SBATCH --cpus-per-task=8 # (3)!
#SBATCH --account=<ACCOUNT>
#SBATCH --hint=nomultithread
#SBATCH --hint=exclusive
#SBATCH --no-requeue
#SBATCH --uenv=<CP2K_UENV>
#SBATCH --view=cp2k
export CUDA_CACHE_PATH="/dev/shm/$USER/cuda_cache" # (5)!
export MPICH_GPU_SUPPORT_ENABLED=1 # (6)!
export MPICH_MALLOC_FALLBACK=1
export OMP_NUM_THREADS=$((SLURM_CPUS_PER_TASK - 1)) # (4)!
ulimit -s unlimited
srun --cpu-bind=socket ./mps-wrapper.sh cp2k.psmp -i <CP2K_INPUT> -o <CP2K_OUTPUT>
1. Time format: HH:MM:SS
2. Number of MPI ranks per node
3. Number of CPUs per MPI rank
4. OpenBLAS spawns an extra thread, therefore it is necessary to set OMP_NUM_THREADS to SLURM_CPUS_PER_TASK - 1 for good performance. With Intel MKL, this is not necessary and one can set OMP_NUM_THREADS to SLURM_CPUS_PER_TASK.
5. DBCSR relies on extensive JIT compilation, and we store the cache in memory to avoid I/O overhead. This is set by default on the HPC platform, but it is set here explicitly because it is essential to avoid performance degradation.
6. CP2K's dependencies use GPU-aware MPI, which requires enabling support at runtime. This is set by default on the HPC platform, but it is set here explicitly because it is required in general to enable GPU-aware MPI.

- Change <ACCOUNT> to your project account name
- Change <CP2K_UENV> to the name (or path) of the actual CP2K uenv you want to use
- Change <PATH_TO_CP2K_DATA_DIR> to the actual path to the CP2K data directory
- Change <CP2K_INPUT> and <CP2K_OUTPUT> to the actual input and output files
With the above scripts, you can launch a CP2K calculation on 4 nodes, with 32 MPI ranks per node and 8 OpenMP threads per rank.
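For example, if the submission script above is saved as run_cp2k.sh (the file name is only an illustration), the job is submitted with:
sbatch run_cp2k.sh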
Note
The mps-wrapper.sh script, required to properly over-subscribe the GPU, is provided at the following page: NVIDIA GH200 GPU nodes: multiple ranks per GPU.
Warning
The --cpu-bind=socket option is necessary to get good performance.
Warning
Each GH200 node has 4 modules, each composed of an ARM Grace CPU with 72 cores and an NVIDIA H100 GPU directly attached to it. Please see Alps hardware for more information.
It is important that the number of MPI ranks passed to Slurm with --ntasks-per-node is a multiple of 4.
Note
In the example above, we use 32 MPI ranks with 8 OpenMP threads, for a total of 64 cores per GPU and 256 cores per node. Experiments have shown that CP2K performs and scales better when the number of MPI ranks is a power of 2, even if some cores are left idling.
Running regression tests
If you want to run CP2K regression tests with the CP2K executable provided by the uenv, make sure to use the version of the regression tests corresponding to the version of CP2K provided by the uenv. The regression test data is sometimes adjusted, and using the wrong version of the regression tests can lead to test failures.
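If you need to run them, a possible invocation with the uenv-provided binary is sketched below; the do_regtest.py options shown are assumptions and should be checked against ./do_regtest.py --help for your CP2K version:
uenv start --view=cp2k <CP2K_UENV>
cd <PATH_TO_CP2K_SOURCE>/tests   # source tree matching the uenv's CP2K version
./do_regtest.py --mpiranks 4 --ompthreads 2 $(dirname $(which cp2k.psmp)) psmp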
Scaling of the QS/H2O-1024 benchmark
The QS/H2O-1024 benchmark is a DFT molecular dynamics simulation of liquid water. It relies on DBCSR for block-sparse matrix-matrix multiplication.
All calculations were run with 32 MPI ranks per node, and 8 OpenMP threads per rank (best configuration for this benchmark).
Note
H2O-1024.inp is the largest example of a DFT molecular dynamics simulation of liquid water that fits on a single GH200 node.
| Number of nodes | Wall time (s) | Speedup | Efficiency |
|---|---|---|---|
| 1 | 793.1 | 1.00 | 1.00 |
| 2 | 535.2 | 1.48 | 0.74 |
| 4 | 543.9 | 1.45 | 0.36 |
| 8 | 487.3 | 1.64 | 0.20 |
| 16 | 616.7 | 1.28 | 0.08 |
Scaling is not ideal on more than two nodes.
Scaling of the QS_mp2_rpa/128-H2O/H2O-128-RI-MP2-TZ benchmark
The QS_mp2_rpa/128-H2O/H2O-128-RI-MP2-TZ benchmark is a straightforward modification of the QS_mp2_rpa/64-H2O/H2O-64-RI-MP2-TZ benchmark. It is an RI-MP2 calculation of a cluster of 128 water molecules.
Input file
&GLOBAL
PRINT_LEVEL MEDIUM
PROJECT H2O-128-RI-MP2-TZ
RUN_TYPE ENERGY
&END GLOBAL
&FORCE_EVAL
METHOD Quickstep
&DFT
BASIS_SET_FILE_NAME ./BASIS_H2O
POTENTIAL_FILE_NAME ./POTENTIAL_H2O
WFN_RESTART_FILE_NAME ./H2O-128-PBE-TZ-RESTART.wfn
&MGRID
CUTOFF 800
REL_CUTOFF 50
&END MGRID
&QS
EPS_DEFAULT 1.0E-12
&END QS
&SCF
EPS_SCF 1.0E-6
MAX_SCF 30
SCF_GUESS RESTART
&OT
MINIMIZER CG
PRECONDITIONER FULL_ALL
&END OT
&OUTER_SCF
EPS_SCF 1.0E-6
MAX_SCF 20
&END OUTER_SCF
&PRINT
&RESTART OFF
&END RESTART
&END PRINT
&END SCF
&XC
&HF
FRACTION 1.0
&INTERACTION_POTENTIAL
CUTOFF_RADIUS 6.0
POTENTIAL_TYPE TRUNCATED
T_C_G_DATA ./t_c_g.dat
&END INTERACTION_POTENTIAL
&MEMORY
MAX_MEMORY 16384
&END MEMORY
&SCREENING
EPS_SCHWARZ 1.0E-8
SCREEN_ON_INITIAL_P TRUE
&END SCREENING
&END HF
&WF_CORRELATION
MEMORY 1200
NUMBER_PROC 1
&INTEGRALS
&WFC_GPW
CUTOFF 300
EPS_FILTER 1.0E-12
EPS_GRID 1.0E-8
REL_CUTOFF 50
&END WFC_GPW
&END INTEGRALS
&RI_MP2
&END RI_MP2
&END WF_CORRELATION
&XC_FUNCTIONAL NONE
&END XC_FUNCTIONAL
&END XC
&END DFT
&SUBSYS
&CELL
ABC 15.6404 15.6404 15.6404
&END CELL
&KIND H
BASIS_SET cc-TZ
BASIS_SET RI_AUX RI-cc-TZ
POTENTIAL GTH-HF-q1
&END KIND
&KIND O
BASIS_SET cc-TZ
BASIS_SET RI_AUX RI-cc-TZ
POTENTIAL GTH-HF-q6
&END KIND
&TOPOLOGY
COORD_FILE_FORMAT XYZ
COORD_FILE_NAME ./H2O-128.xyz
&END TOPOLOGY
&END SUBSYS
&END FORCE_EVAL
All calculations for this scaling test used 32 MPI ranks per node and 8 OpenMP threads per rank. The smallest number of nodes on which this calculation can run is 8.
| Number of nodes | Wall time (s) | Speedup | Efficiency |
|---|---|---|---|
| 8 | 2037.0 | 1.00 | 1.00 |
| 16 | 1096.2 | 1.85 | 0.92 |
| 32 | 611.5 | 3.33 | 0.83 |
| 64 | 410.5 | 4.96 | 0.62 |
| 128 | 290.9 | 7.00 | 0.43 |
MP2 calculations scale well on GH200, up to a large number of nodes (\(> 50\%\) efficiency with 64 nodes).
Scaling of the QS_mp2_rpa/128-H2O/H2O-128-RI-dRPA-TZ benchmark
The QS_mp2_rpa/128-H2O/H2O-128-RI-dRPA-TZ benchmark is an RPA energy calculation, traditionally used to benchmark the performance of the COSMA library. It is a very large calculation, which requires at least 8 GH200 nodes to run.
The calculations were run with 16 MPI ranks per node and 16 OpenMP threads per rank: for RPA workloads, a higher number of OpenMP threads per rank was found to be beneficial.
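For reference, a minimal sketch of the Slurm settings used for this benchmark; all other settings are as in the submission script shown earlier:
#SBATCH --ntasks-per-node=16   # 16 MPI ranks per node
#SBATCH --cpus-per-task=16     # 16 cores (OpenMP threads) per rank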
| Number of nodes | Wall time (s) | Speedup | Efficiency |
|---|---|---|---|
| 8 | 575.4 | 1.00 | 1.00 |
| 16 | 465.8 | 1.23 | 0.61 |
| 32 | 281.1 | 2.04 | 0.51 |
| 64 | 205.3 | 2.80 | 0.35 |
| 128 | 185.8 | 3.09 | 0.19 |
This RPA input scales well up to 32 GH200 nodes.
Running on Eiger¶
On Eiger, a similar sbatch script can be used:
#!/bin/bash -l
#SBATCH --job-name=cp2k-job
#SBATCH --time=00:30:00 # (1)!
#SBATCH --nodes=1
#SBATCH --ntasks-per-core=1
#SBATCH --ntasks-per-node=32 # (2)!
#SBATCH --cpus-per-task=4 # (3)!
#SBATCH --account=<ACCOUNT>
#SBATCH --hint=nomultithread
#SBATCH --hint=exclusive
#SBATCH --constraint=mc
#SBATCH --uenv=<CP2K_UENV>
#SBATCH --view=cp2k
export OMP_NUM_THREADS=$((SLURM_CPUS_PER_TASK - 1)) # (4)!
ulimit -s unlimited
srun --cpu-bind=socket cp2k.psmp -i <CP2K_INPUT> -o <CP2K_OUTPUT>
1. Time format: HH:MM:SS
2. Number of MPI ranks per node
3. Number of CPUs per MPI rank
4. OpenBLAS spawns an extra thread, therefore it is necessary to set OMP_NUM_THREADS to SLURM_CPUS_PER_TASK - 1 for good performance. With Intel MKL, this is not necessary and one can set OMP_NUM_THREADS to SLURM_CPUS_PER_TASK.

- Change <ACCOUNT> to your project account name
- Change <CP2K_UENV> to the name (or path) of the actual CP2K uenv you want to use
- Change <PATH_TO_CP2K_DATA_DIR> to the actual path to the CP2K data directory
- Change <CP2K_INPUT> and <CP2K_OUTPUT> to the actual input and output files
Warning
The --cpu-bind=socket option is necessary to get good performance.
Running regression tests
If you want to run CP2K regression tests with the CP2K executable provided by the uenv, make sure to use the version of the regression tests corresponding to the version of CP2K provided by the uenv. The regression test data is sometimes adjusted, and using the wrong version of the regression tests can lead to test failures.
Building CP2K from Source¶
Warning
The following installation instructions are up-to-date with the latest version of CP2K provided by the uenv. That is, they work when manually compiling the CP2K source code corresponding to the CP2K version provided by the uenv. They are not necessarily up-to-date with the latest version of CP2K available on the master branch. If you are trying to build CP2K from source, make sure you understand what is different in master compared to the latest version of CP2K provided by the uenv.
The CP2K uenv provides all the dependencies required to build CP2K from source, with several optional features enabled. You can follow these steps to build CP2K from source:
uenv start --view=develop <CP2K_UENV> # (1)!
cd <PATH_TO_CP2K_SOURCE> # (2)!
mkdir build && cd build
CC=mpicc CXX=mpic++ FC=mpifort cmake \
-GNinja \
-DCMAKE_CUDA_HOST_COMPILER=mpicc \ # (3)!
-DCP2K_USE_LIBXC=ON \
-DCP2K_USE_LIBINT2=ON \
-DCP2K_USE_SPGLIB=ON \
-DCP2K_USE_ELPA=ON \
-DCP2K_USE_SPLA=ON \
-DCP2K_USE_SIRIUS=ON \
-DCP2K_USE_COSMA=ON \
-DCP2K_USE_PLUMED=ON \
-DCP2K_USE_DFTD4=ON \
-DCP2K_USE_DLAF=ON \
-DCP2K_USE_ACCEL=CUDA -DCP2K_WITH_GPU=H100 \ # (4)!
..
ninja -j 32
1. Start the CP2K uenv and load the develop view (which provides all the necessary dependencies)
2. Go to the CP2K source directory
3. Use the MPI compiler wrapper as the CUDA host compiler
4. The H100 option enables the sm_90 architecture for the CUDA backend
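If the build completes successfully, a quick sanity check of the resulting binary can look like the sketch below; the bin/ location is an assumption based on a default CMake layout and may differ in your setup:
./bin/cp2k.psmp --version   # print the CP2K version and build information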
Eiger: libxsmm
On x86 we deploy with libxsmm. Add -DCP2K_USE_LIBXSMM=ON to the CMake invocation to use libxsmm.
Eiger: Intel MKL (before cp2k@2025.1)
On x86 we deployed with intel-oneapi-mkl before cp2k@2025.1. If you are using a pre-cp2k@2025.1 uenv, add -DCP2K_SCALAPACK_VENDOR=MKL to the CMake invocation to find MKL.
CUDA architecture for cp2k@2024.1 and earlier
cp2k@2024.1 (and earlier) does not support compiling for cuda_arch=90. Use -DCP2K_WITH_GPU=A100 instead, which enables the sm_80 architecture.
See manual.cp2k.org/CMake for more details.
Known issues¶
DLA-Future¶
The cp2k/2025.1 uenv provides CP2K with DLA-Future support enabled. The DLA-Future library is initialized even if you don't explicitly ask to use it. This can lead to some surprising warnings and failures, described below.
CUSOLVER_STATUS_INTERNAL_ERROR during initialization¶
If you are heavily over-subscribing the GPU by running multiple ranks per GPU, you may encounter the following error:
created exception: cuSOLVER function returned error code 7 (CUSOLVER_STATUS_INTERNAL_ERROR): pika(bad_function_call)
terminate called after throwing an instance of 'pika::cuda::experimental::cusolver_exception'
what(): cuSOLVER function returned error code 7 (CUSOLVER_STATUS_INTERNAL_ERROR): pika(bad_function_call)
The reason is that too many cuSOLVER handles are created. If you don't need DLA-Future, you can manually set the number of GPU BLAS and LAPACK handles to 1 by setting the following environment variables:
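The variable names below are a sketch based on DLA-Future's GPU tuning parameters; if they have no effect, check the DLA-Future documentation for the version shipped with your uenv:
export DLAF_NUM_GPU_BLAS_HANDLES=1     # limit the number of cuBLAS handles created
export DLAF_NUM_GPU_LAPACK_HANDLES=1   # limit the number of cuSOLVER handles created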
Warning about pika only using one worker thread¶
When running CP2K with multiple tasks per node and only one core per task, the initialization of DLA-Future may trigger the following warning:
The pika runtime will be started with only one worker thread because the
process mask has restricted the available resources to only one thread. If
this is unintentional make sure the process mask contains the resources
you need or use --pika:ignore-process-mask to use all resources. Use
--pika:print-bind to print the thread bindings used by pika.
This warning is triggered because pika, the runtime used by DLA-Future, is normally expected to run with more than one worker thread, so the warning usually points to a configuration mistake. However, if you are not using DLA-Future, the warning is harmless and can be ignored. The warning cannot be silenced.
DBCSR GPU scaling¶
On the GH200 architecture, it has been observed that the GPU-accelerated version of DBCSR does not perform optimally in some cases.
For example, in the QS/H2O-1024 benchmark above, CP2K does not scale well beyond 2 nodes.
The CPU implementation of DBCSR does not suffer from this. A workaround was implemented in DBCSR to allow switching GPU acceleration on and off with an environment variable:
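For example (the value semantics are an assumption; check the DBCSR documentation for the exact behavior):
export DBCSR_RUN_ON_GPU=0   # run DBCSR on the CPU; unset it (or set it to 1) to re-enable GPU acceleration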
While GPU acceleration is very good on a few nodes, the CPU implementation scales better. Therefore, for CP2K jobs running on many nodes, it is worth investigating the use of the DBCSR_RUN_ON_GPU environment variable.
Some niche application cases, such as the QS_low_scaling_postHF benchmarks, only run efficiently with the CPU version of DBCSR. Generally, if the function dbcsr_multiply_generic takes a significant portion of the timing report (at the end of the CP2K output file), it is worth investigating the effect of the DBCSR_RUN_ON_GPU environment variable.
CUDA grid backend with high angular momenta basis sets¶
The CP2K grid CUDA backend is currently buggy on Alps. Using basis sets with high angular momenta (\(l \ge 3\)) results in slow calculations, especially for force calculations with meta-GGA functionals.
As a workaround, you can disable CUDA acceleration for the grid backend:
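A sketch of the corresponding input change, assuming the grid backend is selected with the BACKEND keyword of the &GRID subsection in &GLOBAL (check the CP2K manual for your version):
&GLOBAL
  &GRID
    BACKEND CPU   ! use the CPU grid backend instead of the CUDA one
  &END GRID
&END GLOBAL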
Fix available upon request
A fix for this issue for the HIP backend is currently being tested by CSCS engineers. If you would like to test it, please contact us and we will be able to provide the source code. The fix will eventually land on the upstream CP2K repository.