On May 8, 2026, ECSO CLOUD experienced its first major service disruption of the year. The affected host system, col1-compute-amd-101, encountered a critical failure that impacted several Generation 1 Cloud Servers. This report provides a detailed technical breakdown of the incident, our response, and the measures taken to prevent future occurrences.
System Overview
Host 101 is a compute node in our Cologne data center, equipped with an AMD CPU, 512 GB of RAM, and 8 TB of NVMe storage. At the time of the incident, it was hosting more than 50 active customer environments.
May 8, 2026: Incident Discovery & Initial Response
10:58 – Customer instances on Host 101 became unreachable. Initial monitoring did not trigger an alert because the host’s primary IP remained responsive and resource allocation appeared normal.
13:11 – The first manual report was received via a Key Account Manager. The customer was advised to open a formal ticket to initiate a technical investigation.
13:42 – A second formal ticket was received, confirming a broader issue.
13:46 – 1st Level Support completed initial diagnostics; when a noVNC console connection could not be established, the case was escalated to 2nd Level Support.
14:52 – 2nd Level Support confirmed a low-level system fault. After SSH access also failed, a hard reboot was initiated; the system then hung at the GRUB bootloader.
15:02 – 3rd Level Support was mobilized for an on-site investigation.
15:21 – Root Cause Identified: an incompatible Debian kernel (v6.12.63) prevented the RAID controller from recognizing the NVMe drives. The trigger was an earlier disruption at DENIC (the .de registry): an automated update failed mid-execution during the DNS outage, and the system's self-correction routine then installed the incompatible kernel version.
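The failure chain above comes down to an updater proceeding without a reachable mirror and then falling back to an untested kernel. The idea of a pre-flight guard can be sketched as follows (the hostname handling and the fallback behavior are illustrative assumptions, not our actual update tooling):

```python
import socket

def mirror_resolvable(hostname: str) -> bool:
    """Return True only if the package mirror's hostname resolves via DNS."""
    try:
        socket.getaddrinfo(hostname, 443)
        return True
    except socket.gaierror:
        return False

def run_update(mirror: str) -> str:
    # During a DNS outage, abort cleanly instead of letting a
    # "self-correction" path install an incompatible kernel.
    if not mirror_resolvable(mirror):
        return "aborted: mirror unresolvable; no kernel fallback attempted"
    return "update applied"
```

With such a guard, the DE-NIC outage would have produced a skipped update run rather than a half-applied one.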
16:13 – All customers with active Disaster Recovery plans were successfully migrated to alternative host systems, restoring their services.
May 9, 2026: Recovery & Troubleshooting
01:51 – Technical teams successfully rolled back the kernel and cleared leftover artifacts from the failed update. Although the system booted, network connectivity could not be established. After six hours of continuous work, the team paused for a scheduled rest.
14:53 – 3rd Level Support resumed troubleshooting the network stack.
23:28 – A comprehensive hardware audit (CPU and RAM) returned no errors. However, bit-level inconsistencies were found on the storage layer, likely caused by the abrupt system halt combined with incompatible drivers. These were corrected after a full system backup.
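Storage checks of this kind typically compare per-block checksums of the live volume against a known-good backup image, so that only the affected blocks need to be restored. A simplified sketch of the idea, with in-memory byte strings standing in for disk blocks:

```python
import hashlib

def corrupted_blocks(reference: bytes, current: bytes,
                     block_size: int = 4096) -> list[int]:
    """Return the indices of fixed-size blocks whose SHA-256 digests differ."""
    assert len(reference) == len(current)
    return [
        i // block_size
        for i in range(0, len(reference), block_size)
        if hashlib.sha256(reference[i:i + block_size]).digest()
           != hashlib.sha256(current[i:i + block_size]).digest()
    ]
```

Blocks flagged this way can then be restored selectively from the backup instead of rewriting the whole volume.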
May 10, 2026: Final Resolution
15:42 – The IT Director identified the final bottleneck: a bridge configuration mismatch in the OSV database was causing the system to hang during the systemd startup sequence. After clearing and re-initializing the bridges, the RAID and partitions were verified as healthy.
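A mismatch of this kind can be surfaced before boot by diffing the bridge names the configuration database expects against those actually defined on the host. A hedged sketch of that check (the bridge names are hypothetical; a real implementation would read the OSV database and the host's interface list):

```python
def bridge_mismatches(db_bridges: set[str],
                      host_bridges: set[str]) -> dict[str, set[str]]:
    """Diff the bridges a config database expects against those on the host."""
    return {
        "missing_on_host": db_bridges - host_bridges,  # DB references a bridge the host lacks
        "stale_on_host": host_bridges - db_bridges,    # host carries a bridge the DB dropped
    }
```

For example, `bridge_mismatches({"vmbr0", "vmbr1"}, {"vmbr0"})` would report `vmbr1` as missing on the host, flagging the inconsistency before systemd ever tries to bring it up.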
16:19 – Final Clearance: Internal testing confirmed full stability. The official status was updated to "Resolved" at 16:23.
Lessons Learned & Future Prevention
This incident highlighted several vulnerabilities in our current infrastructure and internal workflows:
1. Monitoring Gaps: Our legacy monitoring was focused on host availability rather than guest-level health. We have now fully integrated Netdata across all services to provide granular, real-time insights that will detect similar failures instantly.
2. Internal Coordination: The transition between 2nd and 3rd Level Support revealed a need for better diagnostic protocols. Moving forward, we will prioritize root-cause analysis over immediate "trial-and-error" recovery attempts.
3. Communication: We recognize that our communication during this period was insufficient. Status updates were delayed as staff focused entirely on the technical investigation. We are revising our incident communication policy to ensure customers are kept informed in real-time, regardless of the investigative workload.
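The monitoring gap in point 1 came down to probing the wrong layer: the host answered pings while its guests were unreachable. Guest-level reachability can be verified with a plain TCP connect probe per instance; a minimal sketch, where the addresses and ports would come from the guest inventory:

```python
import socket

def guest_reachable(address: str, port: int, timeout: float = 3.0) -> bool:
    """TCP connect probe against a guest's service port; a host-level
    ping alone would not have caught the May 8 outage."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run against every guest on a host, a probe like this turns "host is up" into the question that actually matters: "are the customers' services up?"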
We sincerely apologize for the disruption and are committed to using this experience to build a more resilient ECSO CLOUD.
Written By
Robin Holl
Founder & Chief Executive Officer