Major incident: Major Outage affecting all systems

Post-mortem

Incident Summary
On 7 January 2026, starting at 15:00 CET, a major service outage occurred affecting the web-client in the EU region. The root cause was identified as an unexpected load peak triggered by a specific data-heavy endpoint used for bulk operations. This led to resource exhaustion for the underlying services supporting the application, preventing users from logging in or accessing the web-client interface. The incident was mitigated by 15:20 CET through infrastructure scaling, and a permanent software fix was implemented later that evening.

Impact on Users
This incident resulted in a total outage of the web-client for all customers in the EU-West region for approximately 25 minutes. During this window, users were unable to log into the application or perform any tasks within the interface. While the US region remained unaffected, all EU-based users experienced a complete disruption of access to the web-client, including management tools and member-facing platforms.

Our Response
15:00 CET: Alerts were triggered as reports surfaced of users being unable to log into the web-client.

15:02 CET: An incident coordination team was mobilized to investigate why the web-client was failing to process requests.

15:03 CET: Initial analysis identified that one of the underlying services of the web-client was struggling to maintain database connections and was experiencing significant performance bottlenecks.

15:15 CET: Engineering teams performed a rolling restart of the affected service replicas and manually increased capacity to handle the traffic surge.

15:20 CET: System stability was restored, and successful logins to the web-client were confirmed by customers.

16:00 CET: After analyzing traffic patterns, a specific legacy endpoint called by the web-client was identified as the catalyst for the load spike when used with large datasets.

16:05 CET: A plan was coordinated to disable the problematic endpoint and deploy an optimized version for the web-client.

21:02 CET: A permanent fix was implemented and deployed, ensuring the web-client remains stable during bulk operations.

Resolution
The incident was resolved through immediate manual scaling of the infrastructure supporting the web-client and the recycling of exhausted service instances. This provided the necessary overhead to process the queued requests and restore application access. The long-term resolution involved replacing the inefficient endpoint with a more robust, optimized version to prevent future resource exhaustion within the web-client ecosystem.

Lessons Learned
This incident highlighted the risks associated with legacy components within the web-client that lack strict resource constraints when processing large payloads. While manual intervention successfully restored service, the event underscores the need for more aggressive auto-scaling triggers to protect the web-client during traffic spikes. Moving forward, we will conduct a comprehensive audit of all endpoints utilized by the web-client to ensure they meet current performance standards and implement stricter safeguards against unexpected loads.

Our engineering team will review the issue and implement additional measures to prevent similar incidents in the future.

If you continue to experience any problems, please open a ticket with our support team.

We apologize for any inconvenience caused.

January 9, 2026 · 13:42 CET

Monitoring

Full service functionality has been recovered. The engineering team is monitoring the situation.

January 7, 2026 · 15:57 CET

Update

The services are available again.
As the underlying root cause has not yet been found, the team is continuing its investigation.

January 7, 2026 · 15:46 CET

Update

The engineering team has implemented interim mitigation measures to restore service availability.

January 7, 2026 · 15:39 CET

Investigating

The team is continuing the investigation.

January 7, 2026 · 15:24 CET

Resolved

A permanent fix was applied which resolved the underlying root cause.

January 7, 2026 · 15:20 CET

Investigating

We are currently experiencing a complete service disruption affecting all systems. Access to applications and services is unavailable. Our team is working with the highest priority to restore services.

January 7, 2026 · 15:04 CET
