Official Root Cause Analysis (RCA) for SentinelOne Global Service Interruption – May 29, 2025

UPDATE 3 (20:47 UTC, May 31, 2025): A Root Cause Analysis of the May 29 service disruption has been completed and can be found below.

Official RCA for May 29 Service Disruption

On May 29, 2025, SentinelOne experienced a global service disruption affecting multiple customer-facing services. During this period, customer endpoints remained protected, but security teams were unable to access the management console and related services, which significantly impacted their ability to manage their security operations and access important data. We apologize for the disruption caused by this service interruption.

The root cause of the disruption was a software flaw in an infrastructure control system that removed critical network routes, causing widespread loss of network connectivity within the SentinelOne platform. It was not a security-related event. The majority of SentinelOne services experienced full or partial downtime due to this sudden loss of network connectivity to critical components in all regions.

We’d like to assure our commercial customers that their endpoints were protected throughout the service disruption and that no SentinelOne security data was lost during the event. Protected endpoint systems themselves did not experience downtime due to this incident. A core design principle of the SentinelOne architecture is to ensure that protection and prevention capabilities continue uninterrupted, without constant cloud connectivity or human dependency for detection and response, even in the case of service interruptions of any kind, including events like this one.

It is also worth noting that this incident did not impact Federal environments (i.e., customers on GovCloud) in any way. Federal customers were alerted as part of the response plan for situational awareness and full transparency.

Below is a comprehensive summary of the incident, our response, and the actions we’re taking to address the root cause while strengthening future preparedness.

Reason and Impact

On May 29, the majority of SentinelOne services experienced full or partial downtime due to a sudden and widespread loss of network connectivity to essential components across all regions. This service interruption was triggered when critical network routes and DNS resolver rules, necessary for connecting infrastructure, were deleted as a result of a software flaw in an automated process.

SentinelOne is currently in the process of transitioning our production systems to a new cloud architecture built on Infrastructure-as-Code (IaC) principles. The deletion occurred after a soon-to-be-deprecated (i.e., outgoing) control system was triggered by the creation of a new account. A software flaw in the control system’s configuration comparison function misidentified discrepancies and applied what it believed to be the appropriate configuration state, overwriting previously established network settings. Because this outgoing control system is no longer our source of truth for network configurations, its stored state no longer reflected the live environment, and it restored an empty route table.

This issue stemmed from historical practices where network configurations were managed manually outside of IaC, a process already being phased out as workloads transition to the more modern, code-driven infrastructure approach. The combined effects of manual configuration and a latent software flaw in the outgoing tool ultimately led to the service interruption.
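To illustrate this class of failure in general terms, the sketch below shows how a naive drift-reconciliation step can remove valid configuration when its stored "desired state" is stale or empty: every live route looks like drift, so every live route is flagged for deletion. The function and data names are hypothetical and do not represent SentinelOne's actual control system.

```python
# Hypothetical sketch of a drift-reconciliation step (not SentinelOne's code).
# Failure mode: the tool's stored "desired" state is stale and empty, so every
# live route appears to be drift and is scheduled for removal.

def reconcile_routes(desired_routes: set, live_routes: set) -> dict:
    """Compute the changes needed to make the live route table match 'desired'."""
    return {
        "add": desired_routes - live_routes,     # routes the tool thinks are missing
        "remove": live_routes - desired_routes,  # routes the tool thinks are extra
    }

# Live routes were configured manually outside the tool, so the tool's own
# record of "desired" state was never updated and is now empty.
live = {"10.0.0.0/16", "10.1.0.0/16", "10.2.0.0/16"}
stale_desired = set()

plan = reconcile_routes(stale_desired, live)
print(plan["remove"])  # every live route is flagged for deletion
```

A common safeguard against this class of bug is to refuse to apply any reconciliation plan whose desired state is empty or that would remove every live entry.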

During the service interruption, customers were unable to log in to the SentinelOne management console and access their SentinelOne data or manage their SentinelOne services. Programmatic access to our services was also interrupted. Unified Asset Management/Inventory and Identity services were also unavailable, preventing customers from viewing assets and vulnerabilities or accessing identity consoles. MDR alerts and data ingestion from third-party services may have been affected.

Event Timeline

All timestamps are in UTC.

On May 29, 2025, at 13:37, a software flaw in an outgoing infrastructure control system triggered an automatic function that removed critical network routes.

At 13:50, SentinelOne Engineering began receiving monitoring alerts that network connectivity to Customer Management Consoles and backend cloud services was failing, and began investigating the issue.

At 13:55, customer reports of service disruptions began to come into SentinelOne Support.

At 14:27, SentinelOne Engineering identified missing routes on Transit Gateways in SentinelOne’s production infrastructure. The Initial Incident Response Task Force began restoring route tables and core connectivity while evaluating customer impact and potential resolution time. The team also began ongoing efforts to notify customers of the issue.

At 14:50, an announcement was published to the SentinelOne Customer Portal on our Cases page, acknowledging the ongoing incident and our investigation.

At 15:07, an additional service interruption announcement was published on the SentinelOne Customer Portal Homepage for broader reach.

At 16:30, additional resources were added to the Incident Response Task Force to support alignment on ongoing workstreams across all relevant departments and to ensure core response functions received updates on the investigation, impact, and potential resolution time.

At 17:36, an official response was posted on SentinelOne’s subreddit addressing related threads and questions pertaining to the service interruption.

At 17:46, a new announcement was published to the SentinelOne Customer Portal highlighting our team’s ongoing efforts to restore services and our initial root cause analysis suggesting this was not a security incident.

At 18:27, an email was sent to all SentinelOne customers addressing the ongoing incident and our team’s efforts to restore services, and a blog post was published sharing the same message. An announcement was also published on the Partner Portal.

At 20:05, SentinelOne Engineering confirmed that the manual restoration of all routes was completed and began validating customer console access.

At 20:11, a new announcement was published on the SentinelOne Customer Portal and the Partner Portal, highlighting that console access was restored and that we were working to validate the health of all services.

At 20:50, a second email was sent to all customers and partners, sharing the news that access to consoles had been restored and services were coming back online.

Validation work confirming that most services were fully restored and available was completed by 23:44, but back-end services still had a backlog of data to ingest before it was available for customer consumption.

By 10:00 on May 30, 2025, the data ingestion backlog had been fully processed and service was fully restored. Threat Detection & Response services were also confirmed to be fully operational at this time.

Causes and Corrective Steps

Root Cause: The outgoing cloud management function contained a flaw that restored an empty backup of the AWS Transit Gateway route table.

  • Response: The SentinelOne cloud engineering team is auditing EventBridge and other equivalent sources of automatically triggered functions to ensure outgoing control code cannot be triggered while we complete our transition to the new cloud architecture (a sketch of this kind of audit appears after this list).

Root Cause: Customer systems are split between our outgoing and new Infrastructure-as-Code (IaC) cloud architectures, creating risk and slowing recovery efforts.

  • Response: SentinelOne is accelerating our transition efforts to complete porting all systems and customers to the new IaC customer architecture.

Contributing Cause: Recovery was hampered by the need to manually restore some AWS Transit Gateway routes.

  • Response: SentinelOne teams have backed up the current state of all Transit Gateways (TGWs) and are improving and testing recovery automation (see the backup sketch after this list).

Contributing Cause: Communication with customers and partners was hampered by the lack of a central, well-known location for system status that is not tied to production AWS infrastructure. Additionally, due to internal process gaps in incident response notification, external Communications teams experienced delays in receiving the updates and details needed to keep customers and partners continuously informed.

  • Response: Existing plans for an independently operated, public status page have been accelerated. High-severity incident playbooks have been updated to formalize the inclusion of Customer and External Communications leaders at all critical steps within an evolving incident.
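As a rough illustration of the first corrective step above (auditing automatically triggered functions), the sketch below enumerates EventBridge rules and their targets so that any rule still wired to outgoing automation can be reviewed or disabled. It uses standard boto3 calls; the keyword-based "legacy" filter is an illustrative assumption, not SentinelOne's actual audit criteria.

```python
# Minimal sketch: list EventBridge rules and their targets so that rules still
# pointing at legacy (outgoing) automation can be reviewed or disabled.
# The keyword filter below is an illustrative assumption.
import boto3

events = boto3.client("events")

def audit_rules(legacy_markers=("legacy", "outgoing")):
    findings = []
    for page in events.get_paginator("list_rules").paginate():
        for rule in page["Rules"]:
            for target in events.list_targets_by_rule(Rule=rule["Name"])["Targets"]:
                if any(marker in target["Arn"].lower() for marker in legacy_markers):
                    findings.append((rule["Name"], target["Arn"]))
    return findings

if __name__ == "__main__":
    for rule_name, target_arn in audit_rules():
        print(f"Review rule {rule_name} -> {target_arn}")
```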
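Similarly, for the Transit Gateway corrective step, the sketch below shows one way to snapshot the active and blackhole routes of every Transit Gateway route table into a file that can later be diffed against live state or used to drive restoration. The API calls are standard EC2 operations; the output format and file name are assumptions for illustration.

```python
# Minimal sketch: export the routes of every Transit Gateway route table to a
# JSON file that can be used later for comparison or restoration.
import json
import boto3

ec2 = boto3.client("ec2")

def backup_tgw_route_tables(path="tgw_routes_backup.json"):
    backup = {}
    tables = ec2.describe_transit_gateway_route_tables()["TransitGatewayRouteTables"]
    for table in tables:
        table_id = table["TransitGatewayRouteTableId"]
        routes = ec2.search_transit_gateway_routes(
            TransitGatewayRouteTableId=table_id,
            Filters=[{"Name": "state", "Values": ["active", "blackhole"]}],
        )["Routes"]
        backup[table_id] = routes
    with open(path, "w") as f:
        json.dump(backup, f, indent=2, default=str)

if __name__ == "__main__":
    backup_tgw_route_tables()
```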

UPDATE 2 (19:41 UTC, May 29, 2025): Access to consoles has been restored for all customers following today’s platform outage and service interruption. We continue to validate that all services are fully operational.

UPDATE 1 (18:10 UTC, May 29, 2025): Services are actively being restored and consoles are coming online.


Original Post from May 29, 2025

On May 29, 2025, SentinelOne experienced an outage that is impacting commercial customer consoles. The following message has been sent to all customers and partners. Communications are being updated in real time in our support portal and will be updated here as necessary.

We are aware of ongoing console outages affecting commercial customers globally and are currently restoring services. Customer endpoints are still protected at this time, but managed response services will not have visibility. Threat data reporting is delayed, not lost. Our initial RCA suggests this is not a security incident. We apologize for the inconvenience and appreciate your patience as we work to resolve the issue.