Emergency node maintenance in US-East-1
Resolved
Nov 4, 2025 at 4:27am UTC
The rollout is complete. Mild degradation in startup time was limited to the window between 11:08 and 11:13. Please report any issues to support@cerebrium.ai or elijah@cerebrium.ai.
Affected services
Updated
Nov 4, 2025 at 4:10am UTC
The fix has been applied, and we are rolling out new nodes.
Created
Nov 4, 2025 at 4:00am UTC
A critical error in the mechanism that attaches GPU devices to containers is affecting several workloads on the platform, causing NVML to report "Device not found" when calling nvidia-smi or attempting to use the GPU (described in https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/troubleshooting.html#containers-losing-access-to-gpus-with-error-failed-to-initialize-nvml-unknown-error). This maintenance will update all GPU nodes to use the Container Device Interface (CDI) and apply a few container runtime upgrades.
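To check whether a running container is affected, something like the following sketch could be run inside the workload. It is not part of the platform tooling: the function names, the status labels, and the list of failure strings are assumptions made for illustration. The only external command used is `nvidia-smi -L`, which lists the visible GPUs.

```python
import shutil
import subprocess

# Substrings treated as a sign that the container lost its GPU.
# "device not found" is the message quoted in this notice; the second
# string is taken from the heading of the linked NVIDIA page.
# This list is an assumption and may need extending.
FAILURE_MARKERS = ("device not found", "failed to initialize nvml")


def classify_output(returncode: int, output: str) -> str:
    """Map an nvidia-smi exit code and its output to a coarse status label."""
    text = output.lower()
    if any(marker in text for marker in FAILURE_MARKERS):
        return "gpu-lost"
    if returncode != 0:
        return "error"
    return "ok"


def check_gpu(timeout: float = 10.0) -> str:
    """Run `nvidia-smi -L` and classify the result."""
    if shutil.which("nvidia-smi") is None:
        return "nvidia-smi-missing"
    try:
        proc = subprocess.run(
            ["nvidia-smi", "-L"],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    # Check both streams, since the error text may land on either one.
    return classify_output(proc.returncode, proc.stdout + proc.stderr)


if __name__ == "__main__":
    print(check_gpu())
```

A result of "gpu-lost" would suggest the container hit the issue described above and should be restarted onto one of the new nodes; any other non-"ok" result points to a different problem.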