Executive summary
On 15 April, a network switch failure at our hosting provider’s data centre caused a disruption affecting all MatchMaker servers. Customers were unable to establish new VPN connections; existing VPN connections were not impacted and remained active throughout the incident.
The service degradation lasted approximately five hours, from 18:52 to 00:10 (UTC+3), after which the platform resumed normal operation.
The issue was detected via automated alerts shortly after onset. On‑call engineers responded immediately, escalated the incident to the hosting provider, and worked to stabilise the service. Customer updates were provided via the public status page and email communications.
The root cause was a network switch failure at the hosting provider’s data centre. The failure triggered a surge of client reconnection attempts that exceeded the remaining servers’ capacity for new connections. To prevent recurrence, we are eliminating single points of failure through duplicated connectivity, increasing system capacity and redundancy, and improving failover behaviour and escalation processes with the hosting provider.
Root Cause Analysis report
All systems were running normally until the incident.
Fault
At 18:52 (UTC+3), a network switch at the hosting data centre failed, isolating a critical MatchMaker server and blocking all traffic to and from it. Clients attempted to reconnect to other MatchMaker (MM) servers, triggering a sudden load spike. Those servers became unresponsive in sequence, resulting in all MM servers becoming unavailable.
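The cascade above is a classic reconnection storm: every client retries at once, so each server that hangs redirects the full load onto the next. A standard way to damp such storms is exponential backoff with jitter on the client side. The sketch below is illustrative only (it is not our production client; names and parameters are assumed for the example):

```python
import random

# Illustrative sketch (assumed parameters, not the real client):
# exponential backoff with "full jitter" between MatchMaker
# reconnection attempts. Spreading retries over a growing window
# prevents a synchronized reconnection spike from hitting the
# remaining MM servers all at once.

BASE_DELAY = 1.0    # seconds before the first retry (assumed)
MAX_DELAY = 120.0   # cap so clients still retry reasonably often (assumed)

def backoff_delay(attempt: int, rng: random.Random = random) -> float:
    """Delay in seconds before reconnection attempt `attempt` (0-based).

    Full jitter: draw uniformly from [0, min(MAX_DELAY, BASE_DELAY * 2**attempt)],
    so a fleet of clients disconnected at the same instant does not
    retry at the same instant.
    """
    ceiling = min(MAX_DELAY, BASE_DELAY * (2 ** attempt))
    return rng.uniform(0.0, ceiling)
```

A client would sleep for `backoff_delay(n)` before its n-th reconnection attempt; the retry window doubles each time, so even a large disconnected fleet spreads its load over minutes rather than seconds.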
Impact
Following the switch failure, the remaining MM servers received a surge of reconnection attempts and progressively became unresponsive. The number of active MM connections dropped until recovery began about five hours later, once the isolated server was restored.
Customers could not start new VPN connections during the outage. Existing connections stayed up. Impact was greater for US-based users due to local business hours.
Timeline
All times are UTC+3.
2026-04-15
18:52 – Network switch failed; critical MatchMaker server isolated.
19:00 – Monitoring alert reached on‑call; investigation started.
19:27 – Public status page reported widespread MatchMaker unavailability.
22:43 – Customer notification sent via email.
23:08 – Network configuration corrected on the critical server.
23:56 – All servers accepting connections again.
2026-04-16
00:10 – Connections stabilised; public status page updated.
00:16 – Recovery notification emailed to customers.
08:45 – Public status page incident closed.
Follow-up
Capacity & resilience: Add MM server capacity and headroom; improve load balancing.
Failover: Improve network connectivity; validate automatic failover under load.
Process & comms: Speed up external comms and clarify roles and processes; ensure timely status page and email updates.
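One way the capacity and load-balancing items above could degrade gracefully under a future surge is server-side admission control: rejecting excess new connections quickly instead of queueing them until the server hangs. The token-bucket gate below is a minimal sketch of that idea, under assumed rates; it is not our current implementation:

```python
import time

# Illustrative sketch (assumed design, not the deployed system):
# a token-bucket admission gate in front of an MM server's
# connection handler. Under a reconnection surge, attempts beyond
# the sustained rate are rejected immediately, so clients can back
# off and retry instead of the server hanging.

class ConnectionGate:
    def __init__(self, rate: float, burst: int, clock=time.monotonic):
        self.rate = rate            # sustained new connections per second
        self.burst = burst          # short-term burst allowance
        self.tokens = float(burst)  # start with a full bucket
        self.clock = clock
        self.last = clock()

    def try_admit(self) -> bool:
        """Return True if a new connection may be accepted now."""
        now = self.clock()
        # Refill tokens for the time elapsed, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Rejected clients get an immediate, cheap "try again later" rather than a hung socket, which is what allows the rest of the fleet to stay responsive during a spike.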