Health checks¶
Internally, FirecREST periodically checks the health of all the underlying services it needs in order to work correctly.
These periodic tests not only give users up-to-date information about the current status of the HPC center infrastructure, but also help FirecREST run faster and more efficiently.
As shown in the figure above, when enabled, the periodic health check updates the status of:
- HPC cluster connectivity (via SSH)
- Workload manager and scheduler (via SSH or API calls, depending on the service)
- Filesystem availability (read only)
- Object storage availability (whether the S3 interface is reachable)
Status reports¶
When you query the API endpoint /status/systems, the response JSON includes the status of the health checks. The servicesHealth field is a list of entries, each identified by one of the serviceType values listed above.
The common format includes:
- serviceType: one of (1) ssh, (2) scheduler, (3) filesystem, or (4) s3
- lastChecked: timestamp of the last time the periodic test was performed
- latency: used to establish whether the service is healthy or not (depends on the configuration of FirecREST)
- healthy: true or false
- message: in case of error, describes why the probe failed
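As an illustration of how these fields relate to each other, the following minimal sketch runs a single probe and builds a record in the common format. It is not FirecREST's implementation: the `run_probe` helper, the `check` callable, the `max_latency` threshold, and the assumption that latency is measured in seconds are all hypothetical.

```python
import time
from datetime import datetime, timezone


def run_probe(service_type, check, max_latency=5.0):
    """Run one health probe and return a record in the common format.

    `check` is any zero-argument callable that raises on failure.
    `max_latency` (seconds) is a hypothetical threshold, not a FirecREST setting.
    """
    started = time.monotonic()
    message = None
    try:
        check()
    except Exception as exc:
        # The probe failed: keep the reason for the "message" field.
        message = str(exc)
    latency = time.monotonic() - started

    # A probe that succeeds but is too slow is also reported as unhealthy.
    if message is None and latency > max_latency:
        message = f"Probe exceeded {max_latency}s"

    return {
        "serviceType": service_type,
        "lastChecked": datetime.now(timezone.utc).isoformat().replace("+00:00", "Z"),
        "latency": latency,
        "healthy": message is None,
        "message": message,
    }
```

Calling `run_probe("ssh", some_check)` returns a dictionary with the five common fields, where `message` stays `None` for a healthy service.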
Health check response for serviceType: ssh
{
"systems":[
{
"name": "system01",
(...)
"servicesHealth": [
(...)
{
"serviceType":"ssh",
"lastChecked":"2025-04-02T07:48:48.367139Z",
"latency":0.4212830066680908,
"healthy":true,
"message":null
},
(...)
]
},
{
"name": "system02",
(...)
"servicesHealth": [
(...)
{
"serviceType": "ssh",
"lastChecked": "2025-04-02T07:48:49.082887Z",
"latency": 1.1373629570007324,
"healthy": false,
"message": "Too many authentication failures"
},
(...)
]
}
]
}
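A response like the one above can be filtered on the client side to find failing probes. The sketch below assumes the response has already been parsed into a Python dictionary; the `unhealthy_services` helper is hypothetical, and the sample data only mirrors the fields shown above, with the elided parts left out.

```python
def unhealthy_services(status):
    """Return (system name, serviceType, message) for every failing probe."""
    failing = []
    for system in status.get("systems", []):
        # Tolerate systems that report no servicesHealth list at all.
        for entry in system.get("servicesHealth") or []:
            if not entry.get("healthy", False):
                failing.append(
                    (system["name"], entry["serviceType"], entry.get("message"))
                )
    return failing


# Sample data mirroring the ssh example (elided parts omitted).
sample = {
    "systems": [
        {
            "name": "system01",
            "servicesHealth": [
                {
                    "serviceType": "ssh",
                    "lastChecked": "2025-04-02T07:48:48.367139Z",
                    "latency": 0.4212830066680908,
                    "healthy": True,
                    "message": None,
                }
            ],
        },
        {
            "name": "system02",
            "servicesHealth": [
                {
                    "serviceType": "ssh",
                    "lastChecked": "2025-04-02T07:48:49.082887Z",
                    "latency": 1.1373629570007324,
                    "healthy": False,
                    "message": "Too many authentication failures",
                }
            ],
        },
    ]
}

print(unhealthy_services(sample))
# [('system02', 'ssh', 'Too many authentication failures')]
```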
Health check response for serviceType: scheduler
{
"systems":[
{
"name": "system01",
(...)
"servicesHealth": [
{
"serviceType":"scheduler",
"lastChecked":"2025-04-01T00:01:00.000000Z",
"latency":1.1002883911132812,
"healthy":true,
"message":null,
"nodes": {
"available":130,
"total":143
}
}
]
},
{
"name": "system02",
(...)
"servicesHealth": [
{
"serviceType":"scheduler",
"lastChecked":"2025-04-01T00:01:00.000000Z",
"latency":0.02857375144958496,
"healthy":false,
"message":"ClientConnectorError: Cannot connect to host",
"nodes":null
}
]
}
]
}
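The scheduler entry carries an extra nodes object, which is null when the probe fails. As a small sketch of how a client might use it, the hypothetical `node_availability` helper below computes the fraction of available nodes and returns `None` when no node counts are reported.

```python
def node_availability(entry):
    """Return available/total nodes for a scheduler entry, or None if unknown."""
    nodes = entry.get("nodes")
    # Failed probes report "nodes": null, so there is nothing to compute.
    if not entry.get("healthy") or not nodes or not nodes.get("total"):
        return None
    return nodes["available"] / nodes["total"]


healthy_entry = {
    "serviceType": "scheduler",
    "healthy": True,
    "message": None,
    "nodes": {"available": 130, "total": 143},
}
failed_entry = {
    "serviceType": "scheduler",
    "healthy": False,
    "message": "ClientConnectorError: Cannot connect to host",
    "nodes": None,
}

print(round(node_availability(healthy_entry), 3))  # 0.909
print(node_availability(failed_entry))             # None
```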
Health check response for serviceType: filesystem
{
"systems":[
{
"name": "system01",
(...)
"servicesHealth": [
{
"serviceType":"filesystem",
"lastChecked":"2025-04-01T00:01:00.000000Z",
"latency":1.3535213470458984,
"healthy":true,
"message":null,
"path":"/path/to/filesystem01"
},
{
"serviceType": "filesystem",
"lastChecked":"2025-04-01T00:01:00.000000Z",
"latency":0.9655811786651611,
"healthy":false,
"message":"Too many authentication failures",
"path":"/path/to/filesystem02"
}
]
}
]
}
Health check response for serviceType: s3
Improving command execution¶
Health checks not only report the status of the HPC center's underlying infrastructure; they also let FirecREST anticipate whether a command sent to the cluster is very likely to execute as expected.
When enabled, FirecREST will prevent commands from being executed on unhealthy services.
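The idea of refusing to run commands against unhealthy services can be sketched as a simple guard. This is only an illustration under stated assumptions, not FirecREST's internal logic: `ServiceUnhealthyError`, `ensure_healthy`, `run_command`, and the choice of which service types a command requires are all hypothetical.

```python
class ServiceUnhealthyError(RuntimeError):
    """Raised when a required service failed its last health check."""


def ensure_healthy(services_health, required):
    """Raise ServiceUnhealthyError unless every required serviceType is healthy.

    A serviceType with no entry at all is treated as unhealthy (an assumption).
    """
    for service_type in required:
        entries = [e for e in services_health if e.get("serviceType") == service_type]
        if not entries:
            raise ServiceUnhealthyError(f"{service_type}: no health information")
        for entry in entries:
            if not entry.get("healthy"):
                raise ServiceUnhealthyError(f"{service_type}: {entry.get('message')}")


def run_command(services_health, command):
    """Hypothetical dispatcher: check health first, then hand off the command."""
    ensure_healthy(services_health, required=("ssh", "scheduler"))
    # A real service would execute the command on the cluster here.
    return f"dispatched: {command}"
```

With a guard like this, a request fails fast with the message from the last probe instead of waiting for a connection that is unlikely to succeed.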