Emergency node maintenance in US-East-1
Resolved
Nov 03 at 11:27pm EST
The rollout is complete. Mild degradation in startup time was limited to the window between 11:08pm and 11:13pm EST. Please report any issues to support@cerebrium.ai or elijah@cerebrium.ai.
Affected services
Updated
Nov 03 at 11:10pm EST
The fix has been applied and we are rolling out new nodes.
Created
Nov 03 at 11:00pm EST
A critical error in the mechanism that attaches GPU devices to containers is affecting several workloads on the platform. NVML reports "Device not found" when nvidia-smi is called or when a workload attempts to use the GPU (see https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/troubleshooting.html#containers-losing-access-to-gpus-with-error-failed-to-initialize-nvml-unknown-error). This maintenance will update all GPU nodes to use the Container Device Interface (CDI) and will apply a few container runtime upgrades.
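To check whether a running container is affected, a minimal sketch like the following may help. It assumes nvidia-smi is on the PATH inside the container and that it exits with a non-zero status when NVML cannot reach the device. The function name check_gpu is hypothetical and is not part of any Cerebrium or NVIDIA API.

```python
import subprocess


def check_gpu(cmd=("nvidia-smi", "-L"), timeout=10):
    """Run a GPU listing command and report whether it succeeded.

    Returns (ok, detail), where detail is the command output or a short
    description of the failure. Assumption: a non-zero exit status means
    the GPU is not reachable from this container.
    """
    try:
        result = subprocess.run(
            list(cmd), capture_output=True, text=True, timeout=timeout
        )
    except FileNotFoundError:
        # The binary is missing, so no driver tooling is visible here.
        return False, "command not found: " + cmd[0]
    except subprocess.TimeoutExpired:
        return False, "timed out after %s seconds" % timeout

    # Combine both streams, since errors may be written to either one.
    output = (result.stdout + result.stderr).strip()
    if result.returncode != 0:
        return False, output or "exit code %d" % result.returncode
    return True, output
```

Calling check_gpu() with no arguments runs nvidia-smi -L. A False result with NVML-related text in the detail would be consistent with the issue described above and is worth including in a support report.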