On the 16th of November, around 2 AM UTC, part of a data center operated by our cloud provider, OVH, suffered an electrical failure.
The outage affected the data center room hosting our core database, which in turn prevented the Events Capture API from recording new website events. The API relies on the database to assert the validity of incoming traffic before it is recorded.
The OVH technician restored service in the rack hosting our units around 5:30 AM UTC. Our services instantly came back online and fully recovered within the next 15 minutes.
In the affected time window, between 2:00 AM UTC and 5:30 AM UTC, some or all of the website tracking events sent to Wide Angle Analytics were not captured properly.
We will improve our incident reporting, because this incident revealed a gap between our monitoring infrastructure and the actual service state. Although monitoring captured the incident, an issue in the escalation procedure meant that the responsible engineer was not notified in time.
The architecture of services with a critical dependency on the database will be reviewed, and an HA/FT (High Availability/Fault Tolerance) solution will be implemented to prevent such a failure in the future.
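To illustrate the escalation gap described above, here is a minimal sketch of a time-based escalation policy, in which an unacknowledged alert reaches additional contacts the longer it stays open. It is not our actual tooling: the contact names, the wait times, and the `contacts_to_page` helper are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EscalationStep:
    contact: str       # hypothetical contact name
    wait_minutes: int  # minutes after the alert fires before this contact is paged


def contacts_to_page(steps: List[EscalationStep],
                     minutes_since_alert: int,
                     acknowledged: bool) -> List[str]:
    """Return every contact who should have been paged by now.

    An acknowledged alert stops the escalation, so nobody else is paged.
    """
    if acknowledged:
        return []
    return [s.contact for s in steps if minutes_since_alert >= s.wait_minutes]


# Hypothetical policy: values are assumptions, not our real configuration.
POLICY = [
    EscalationStep("primary-on-call", 0),
    EscalationStep("secondary-on-call", 15),
    EscalationStep("engineering-lead", 30),
]

if __name__ == "__main__":
    # Twenty minutes into an unacknowledged alert, two contacts are paged.
    print(contacts_to_page(POLICY, minutes_since_alert=20, acknowledged=False))
```

The point of such a policy is that a single missed notification does not leave an alert unattended: if the first contact does not acknowledge it, the alert keeps widening its audience until someone does.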