Maintenance

Emergency node maintenance in US-East-1

Nov 03 at 11:00pm EST  –  Nov 03 at 11:27pm EST
Affected services
US EAST 1
Metrics Virginia

Resolved
Nov 03 at 11:27pm EST

The rollout is complete. Mild degradation in startup time was limited to the window between 11:08pm and 11:13pm EST. Please report any issues to support@cerebrium.ai or elijah@cerebrium.ai.

Updated
Nov 03 at 11:10pm EST

The fix has been applied and we are rolling out new nodes.

Created
Nov 03 at 11:00pm EST

A critical error in the mechanism that attaches GPU devices to containers is affecting several workloads on the platform. It causes NVML to report "Device not found" when calling nvidia-smi or attempting to use the GPU (see https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/troubleshooting.html#containers-losing-access-to-gpus-with-error-failed-to-initialize-nvml-unknown-error). This maintenance will update all GPU nodes to use the Container Device Interface (CDI) and apply a few container runtime upgrades.
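
If you want to confirm from inside a workload whether a container still has GPU access, a check along these lines may help. This is a minimal sketch, not an official Cerebrium tool: the function names and the list of failure strings are assumptions for illustration. "Device not found" is the message quoted in this notice, and "Failed to initialize NVML" comes from the linked NVIDIA troubleshooting page.

```python
import shutil
import subprocess

# Substrings assumed to indicate a lost GPU (illustrative, not exhaustive).
FAILURE_MARKERS = ("Device not found", "Failed to initialize NVML")


def gpu_looks_healthy(returncode: int, output: str) -> bool:
    """Return True if an nvidia-smi run exited cleanly with no known failure marker."""
    if returncode != 0:
        return False
    return not any(marker in output for marker in FAILURE_MARKERS)


def check_gpu() -> bool:
    """Run `nvidia-smi -L` (which lists GPUs) and classify the result."""
    if shutil.which("nvidia-smi") is None:
        # The binary is not on PATH, so the GPU cannot be verified.
        return False
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"],
            capture_output=True,
            text=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return gpu_looks_healthy(result.returncode, result.stdout + result.stderr)


if __name__ == "__main__":
    print("GPU accessible" if check_gpu() else "GPU not accessible")
```

Because the affected containers can lose the GPU some time after starting, a one-off check at startup may not be enough. Running `check_gpu()` periodically, or right before GPU-heavy work, is more likely to catch the condition.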