
Maintenance

Emergency node maintenance in US-East-1

Nov 4, 2025 at 4:00am UTC – Nov 4, 2025 at 4:27am UTC

Affected services

US EAST 1

Metrics (Global)

Resolved
Nov 4, 2025 at 4:27am UTC

The rollout is complete. Mild degradation in startup time was limited to the period between 11:08 and 11:13. Please report any issues to support@cerebrium.ai or elijah@cerebrium.ai.

Updated
Nov 4, 2025 at 4:10am UTC

The fix has been applied and we are rolling out new nodes.

Created
Nov 4, 2025 at 4:00am UTC

A critical error in the mechanism GPU devices use to attach to containers is affecting several workloads on the platform. It causes NVML to report "Device not found" when calling nvidia-smi or attempting to use the GPU (described in https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/troubleshooting.html#containers-losing-access-to-gpus-with-error-failed-to-initialize-nvml-unknown-error). This maintenance will update all GPU nodes to use the Container Device Interface (CDI) and apply a few container runtime upgrades.
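
To check whether a running container is affected by the symptom described above, a small probe along the following lines may help. This is a minimal sketch and not part of the platform: the function names `classify` and `probe_gpu` and the status labels are hypothetical, and the matched strings are assumptions taken from the error quoted in this notice and from the title of the linked NVIDIA troubleshooting section. It only assumes that `nvidia-smi` is on the container's PATH.

```python
import subprocess

# Error fragments to look for. The first is quoted in this notice; the second
# comes from the title of the linked NVIDIA troubleshooting section. Both are
# assumptions about what the failing output contains.
GPU_LOSS_MARKERS = ("Device not found", "Failed to initialize NVML")


def classify(returncode: int, output: str) -> str:
    """Map an nvidia-smi exit code and its combined output to a status label."""
    lowered = output.lower()
    if any(marker.lower() in lowered for marker in GPU_LOSS_MARKERS):
        return "gpu-lost"
    if returncode != 0:
        return "error"
    return "ok"


def probe_gpu(timeout_s: float = 10.0) -> str:
    """Run nvidia-smi once and classify the result."""
    try:
        result = subprocess.run(
            ["nvidia-smi"],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except FileNotFoundError:
        # The binary is not installed or not on PATH in this container.
        return "nvidia-smi-missing"
    except subprocess.TimeoutExpired:
        return "timeout"
    return classify(result.returncode, result.stdout + result.stderr)


if __name__ == "__main__":
    print(probe_gpu())
```

A result of `gpu-lost` would suggest the container has lost access to its GPU in the way this notice describes. Any other non-`ok` result points to a different problem and is worth including in a support report.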