The Root Cause

On the 16th of November, around 2 AM UTC, a part of the data center at our cloud provider, OVH, faced an electrical failure.

The outage hit the data center room hosting our core database, which in turn prevented the Events Capture API from recording new website events. The API depends on the database to validate incoming traffic before recording it.

Resolution

The OVH technician restored service in the rack hosting our units around 5:30 AM UTC. Our services began coming back online immediately and fully recovered within the next 15 minutes.

Customer Impact

In the affected time window, between 2:00 AM UTC and 5:30 AM UTC, some or all of the website tracking events sent to Wide Angle Analytics were not captured properly.

Next Steps

Short Term

We will improve incident reporting, as we observed a gap between the state reported by our monitoring infrastructure and the actual service state. Although monitoring captured the incident, an issue in the escalation procedure meant the responsible engineer was not notified in time.
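To illustrate the kind of escalation rule this refers to, here is a minimal sketch in Python. It is not our actual procedure: the names `Alert` and `escalation_level`, the five-minute acknowledgement timeout, and the idea of a fixed-length on-call chain are all assumptions made for the example. The rule is that an unacknowledged alert moves one step up the chain each time the timeout elapses.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert record; times are in seconds on a shared clock."""
    raised_at: float
    acked_at: Optional[float] = None  # None until someone acknowledges


def escalation_level(alert: Alert, now: float, chain_length: int,
                     ack_timeout: float = 300.0) -> Optional[int]:
    """Return the index in the on-call chain that should be paged at `now`.

    Returns None once the alert has been acknowledged. The 300-second
    timeout is an assumed value, not a documented setting.
    """
    if chain_length < 1:
        raise ValueError("the on-call chain needs at least one contact")
    if alert.acked_at is not None and alert.acked_at <= now:
        return None  # acknowledged, nobody else needs to be paged
    elapsed = max(0.0, now - alert.raised_at)
    # One step up the chain per elapsed timeout, capped at the last contact.
    level = int(elapsed // ack_timeout)
    return min(level, chain_length - 1)
```

With a rule of this shape, an alert that nobody acknowledges keeps moving to the next contact instead of staying with a single engineer who may not have been reached.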

Long Term

The architecture of services with a critical dependency on the database will be reviewed, and an HA/FT (High Availability/Fault Tolerance) solution will be implemented to prevent such a failure in the future.
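As a rough illustration of what removing a single point of failure can look like, the sketch below tries a list of validation backends in order and uses the first one that answers. It is only one possible shape under stated assumptions, not the design that will be implemented: `validate_with_failover`, `ValidationUnavailable`, and the notion of a replica lookup are hypothetical names introduced for the example.

```python
from typing import Callable, Sequence


class ValidationUnavailable(Exception):
    """Raised when none of the configured backends could answer."""


def validate_with_failover(site_id: str,
                           backends: Sequence[Callable[[str], bool]]) -> bool:
    """Ask each backend in order whether traffic for `site_id` is valid.

    Each backend is a hypothetical lookup function (for example a primary
    database and a replica) that raises ConnectionError when unreachable.
    """
    failures = 0
    for lookup in backends:
        try:
            return lookup(site_id)  # first reachable backend decides
        except ConnectionError:
            failures += 1  # unreachable, try the next backend
    raise ValidationUnavailable(f"all {failures} backends failed")
```

In this sketch, an outage of the first backend alone no longer blocks validation, provided a second backend lives outside the affected failure domain.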

Posted Nov 16, 2023 - 06:53 CET

Resolved

This incident has been resolved.

Posted Nov 16, 2023 - 06:40 CET

Update

The incident has now been resolved.

Posted Nov 16, 2023 - 06:40 CET

Update

The data ingestion has been restored. The reporting system is currently recovering.

Posted Nov 16, 2023 - 06:35 CET

Monitoring

The affected section of the data center is being restored. Our services are coming online. We are monitoring the status.

Posted Nov 16, 2023 - 06:31 CET

Update

We are continuing to investigate this issue.

Posted Nov 16, 2023 - 06:04 CET

Investigating

Due to an ongoing critical failure at our cloud provider, we are experiencing downtime of a few critical components. No events are currently being consumed.

Posted Nov 16, 2023 - 06:04 CET

This incident affected: Website and Events API.