Tosi Platform - MatchMaker infrastructure issues – Incident details

MatchMaker infrastructure issues

Resolved
Major outage
Started about 1 month ago
Lasted about 13 hours

Affected

MatchMaker

Major outage from 4:57 PM to 9:11 PM, Operational from 9:11 PM to 5:46 AM

FI1

Major outage from 4:57 PM to 9:11 PM, Operational from 9:11 PM to 5:46 AM

FI2

Major outage from 4:57 PM to 9:11 PM, Operational from 9:11 PM to 5:46 AM

DE1

Major outage from 4:57 PM to 9:11 PM, Operational from 9:11 PM to 5:46 AM

DE2

Major outage from 4:57 PM to 9:11 PM, Operational from 9:11 PM to 5:46 AM

Updates
  • Postmortem

    Executive summary

    On 15 April, a network switch failure at our hosting provider’s data centre caused a disruption impacting all MatchMakers. Customers were unable to establish new VPN connections. Existing VPN connections were not impacted and remained active throughout the incident.

    The service degradation lasted approximately five hours, from 18:50 to 00:10 (UTC+3), after which the platform resumed normal operation.

    The issue was detected via automated alerts shortly after onset. On‑call engineers responded immediately, escalated the incident to the hosting provider, and worked to stabilise the service. Customer updates were provided via the public status page and email communications.

    The root cause was a network switch failure at the hosting provider’s data centre. This led to excessive reconnection attempts that exceeded the other servers’ capacity for new connections. To prevent recurrence, we are eliminating single points of failure through duplicated connectivity, increasing system capacity and redundancy, and improving failover behaviour and escalation processes with the hosting provider.


    Root Cause Analysis report 

    All systems were running normally until the incident.

    Fault

    At 18:52 (UTC+3), a network switch at the hosting data centre failed, isolating a critical MatchMaker server and blocking all traffic to and from it. Clients attempted to reconnect to other MatchMaker (MM) servers, triggering a sudden load spike. Those servers became unresponsive in sequence, resulting in all MM servers becoming unavailable.
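
    The sequence described above, where one isolated server pushes its clients onto the others until they fail in turn, can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not a description of the actual MatchMaker implementation: the server names, client counts, and per-server capacity are hypothetical, and displaced clients are assumed to spread evenly over the surviving servers.

```python
def simulate_cascade(clients_per_server, capacity, failed):
    """Return the order in which servers become unavailable.

    clients_per_server: server name -> clients connected before the fault
    capacity: clients one server can carry (hypothetical, same for all)
    failed: name of the server isolated by the initial fault
    """
    load = dict(clients_per_server)
    down_order = []
    pending = [failed]
    while pending:
        server = pending.pop(0)
        down_order.append(server)
        displaced = load.pop(server)
        if not load:
            break  # no server left to take the displaced clients
        # Assumption: displaced clients retry and spread evenly over survivors.
        share = displaced / len(load)
        for name in load:
            load[name] += share
        # Any survivor pushed past its capacity is queued to fail next.
        pending.extend(
            n for n in load if load[n] > capacity and n not in pending
        )
    return down_order


# Hypothetical fleet: four servers at 80% of capacity before the fault.
fleet = {"mm-a": 800, "mm-b": 800, "mm-c": 800, "mm-d": 800}
print(simulate_cascade(fleet, capacity=1000, failed="mm-a"))
```

    With these made-up numbers, losing one server overloads the remaining three, and each further failure raises the load on whatever is left. With more spare capacity per server (for example 1100 instead of 1000), the same model stops after the first failure.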

    Impact
    Following the switch failure, the remaining MM servers received a surge of reconnection attempts and progressively hung. Active MM connections dropped until recovery commenced about five hours later, once the isolated server was restored.

    Customers could not start new VPN connections during the outage. Existing connections stayed up. Impact was greater for US-based users due to local business hours.

    Timeline

    18:52 – Network switch failed; critical MatchMaker server isolated (UTC+3).

    19:00 – Monitoring alert to on‑call; investigation started (UTC+3).

    19:27 – Public status page reported widespread MatchMaker unavailability (UTC+3).

    22:43 – Customer notification sent via email (UTC+3).

    23:08 – Network configuration corrected on the critical server (UTC+3).

    23:56 – All servers accepting connections again (UTC+3).

    2026-04-16 (UTC+3)

    00:10 – Connections stabilised; public status page updated (UTC+3).

    00:16 – Recovery notification emailed to customers (UTC+3).

    08:45 – Public status page incident closed (UTC+3).
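
    The elapsed times implied by the timeline can be checked with a few lines of arithmetic. The sketch below copies the timestamps from the entries above. The event labels are shorthand introduced here, and dating the 00:10 entry to 16 April is an inference from its position after 23:56.

```python
from datetime import datetime

# Timestamps from the timeline (all UTC+3). Labels are shorthand, and the
# 16 April date on the last entry is inferred from the midnight rollover.
EVENTS = [
    ("switch failed", "2026-04-15 18:52"),
    ("on-call alerted", "2026-04-15 19:00"),
    ("status page updated", "2026-04-15 19:27"),
    ("customer email sent", "2026-04-15 22:43"),
    ("network configuration corrected", "2026-04-15 23:08"),
    ("all servers accepting connections", "2026-04-15 23:56"),
    ("connections stabilised", "2026-04-16 00:10"),
]


def minutes_since_fault(events):
    """Map each event label to whole minutes elapsed since the first event."""
    parsed = [
        (label, datetime.strptime(ts, "%Y-%m-%d %H:%M")) for label, ts in events
    ]
    start = parsed[0][1]
    return {label: int((t - start).total_seconds() // 60) for label, t in parsed}


for label, minutes in minutes_since_fault(EVENTS).items():
    print(f"{minutes:4d} min  {label}")
```

    On these figures, the alert reached on-call 8 minutes after the fault, the status page followed at 35 minutes, the first customer email at 231 minutes, and connections stabilised at 318 minutes, which matches the "approximately five hours" stated in the summary.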

     

    Follow-up

    • Capacity & resilience: Add MM capacity and headroom; improve load balancing.

    • Failover: Improve network connectivity; validate automatic failover under load.

    • Process & comms: Speed up external comms and clarify roles and processes; ensure timely status page and email updates.
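
    As a rough illustration of what "headroom" means in the capacity item above, the sketch below sizes a fleet so that it still carries peak load after losing a given number of servers. The function name, the figures, and the even-load assumption are all hypothetical. It also considers only steady load, not the burst of simultaneous reconnection attempts seen in this incident.

```python
import math


def servers_needed(peak_clients, capacity_per_server, tolerated_failures=1):
    """Smallest fleet that carries peak load after losing some servers.

    Assumes load spreads evenly and every server has the same capacity.
    """
    if capacity_per_server <= 0:
        raise ValueError("capacity_per_server must be positive")
    needed_for_load = math.ceil(peak_clients / capacity_per_server)
    return needed_for_load + tolerated_failures


# Hypothetical figures: 3200 clients at peak, 1000 clients per server.
print(servers_needed(3200, 1000))      # tolerate one failed server
print(servers_needed(3200, 1000, 2))   # tolerate two failed servers
```

    With these made-up figures, four servers are enough for the load itself, so a fifth is needed before a single failure can be absorbed.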

  • Resolved

    This incident has been resolved. The outage was caused by a network failure at our data center provider, which took a critical server offline. A full RCA is in the works and due for delivery later today.

  • Monitoring

    The issue should now be resolved, and we are monitoring the aftermath.

  • Identified

    We have identified the root cause and are working on restoring all connectivity.

  • Investigating

    We are investigating issues with the MatchMaker infrastructure that are impacting the creation of new connections.