Denise Dubie
Senior Editor

Top 7 outages of 2023

News
Feb 02, 2024 | 10 mins
Cloud Computing, Network Management Software, Network Monitoring

Seven significant outages last year offer insights into how everything from small network changes to power supplies can cause global disruptions, according to ThousandEyes’ annual roundup.

The most notable outages of 2023 led to service degradations and network disruptions for top tech providers such as Microsoft and AWS, proving that even the most sophisticated environments are not immune from downtime, according to ThousandEyes.

In 2023, seven significant outages wreaked havoc across networks, impacting end users and customers with poor performance and traffic slowdowns, according to analysis from ThousandEyes, a Cisco-owned network intelligence company that tracks internet and cloud traffic. The digital experience company noted that, as the 2023 outages showed, small changes can bring about big disruptions across global networks, causing companies to scramble to restore full service.

“2023 saw many outages across SaaS applications, ISPs, and other supporting infrastructures. These outages leave important lessons that can help teams minimize the impact of future disruptions, as well as proactively optimize their services and applications for more predictable performance,” ThousandEyes said in a blog sharing details about the 2023 outages. While the most common outages happened with ISPs, ThousandEyes noted that cloud service provider (CSP) outages were the second most common type of disruption in 2023, a sign that businesses are relying more heavily on cloud infrastructure each year.

Here are the top seven outages of the year organized chronologically.

Microsoft loses connectivity across its 365 apps: Jan. 25

Microsoft users experienced global connectivity issues with Microsoft services including Azure, Teams, Outlook, and SharePoint for about 90 minutes on January 25, 2023. High levels of packet loss across the network made Microsoft and other services unreachable, and users experienced HTTP and DNS timeouts. According to ThousandEyes’ outage analysis, a significant number of Border Gateway Protocol (BGP) route changes immediately precipitated the packet loss. BGP determines which route network traffic will take, and if the routing information is inaccurate, traffic can take the wrong path. Attempts to change routes to find the best possible path for traffic were repeated several times, “resulting in significant churn (route table instability).” In this case, ThousandEyes determined an automated process could be involved “due to the rapid nature of the changes.”

“Rapid changes in traffic paths coupled with a large-scale shift of traffic through transit provider networks, would have led to the level of loss seen throughout this incident,” ThousandEyes reported. And while the outage was significant in terms of global impact and affected users, ThousandEyes credited Microsoft for its fast remediation efforts. “It was apparent that Microsoft did indeed quickly begin mitigation methods, signaling that they had ample visibility of the problem as well as rollback and remediation plans. The length of this outage is likely a result of an operations team ensuring that they are doing the right thing given the scope of the outage they were facing.”
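The route table churn described in this incident can be illustrated with a short sketch. The Python below is not ThousandEyes’ method; it is a minimal example that assumes a hypothetical feed of (timestamp, prefix, AS path) updates and flags any prefix whose path changes several times within a short window. The function name, window length, threshold, prefixes, and AS numbers are all illustrative assumptions.

```python
from collections import defaultdict, deque


def detect_churn(updates, window_s=60, threshold=5):
    """Return prefixes whose AS path changed at least `threshold` times
    within any `window_s`-second window.

    `updates` is an iterable of (timestamp_s, prefix, as_path) tuples,
    sorted by timestamp. The feed format is an assumption for this sketch.
    """
    last_path = {}                  # prefix -> most recent AS path seen
    recent = defaultdict(deque)     # prefix -> timestamps of recent changes
    flagged = set()
    for ts, prefix, path in updates:
        if last_path.get(prefix) == path:
            continue                # same path re-announced: not a change
        if prefix in last_path:     # skip the first announcement of a prefix
            changes = recent[prefix]
            changes.append(ts)
            # Drop changes that fell out of the sliding window.
            while changes and ts - changes[0] > window_s:
                changes.popleft()
            if len(changes) >= threshold:
                flagged.add(prefix)
        last_path[prefix] = path
    return flagged


# Illustrative data: one prefix flaps between two paths every 5 seconds,
# the other keeps a single stable path. Documentation prefixes and private
# AS numbers are used so that nothing here refers to a real network.
path_a = (64500, 64501)
path_b = (64500, 64502, 64501)
flapping = [(i * 5, "203.0.113.0/24", path_a if i % 2 == 0 else path_b)
            for i in range(8)]
stable = [(0, "198.51.100.0/24", path_a), (30, "198.51.100.0/24", path_a)]
updates = sorted(flapping + stable, key=lambda u: u[0])

churning = detect_churn(updates)
print(churning)
```

In this toy data, only the flapping prefix crosses the threshold. A real detector would also need to account for route withdrawals and for multiple vantage points, which this sketch leaves out.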

Outlook users lose access to the Microsoft service: Feb. 7

Shortly after the January incident, Microsoft Outlook users experienced another outage on February 7, 2023. Microsoft customers across North America, Europe, and Asia experienced issues accessing Outlook for several hours, with the greatest impact felt in the U.S.

While the outage was global in nature, unlike the previous incident, ThousandEyes determined the network might not be the root cause of the issue, as there was no significant packet loss, latency, or unusual routing behavior observed during the incident. “During the outage, ThousandEyes vantage points observed symptoms indicative of application-related issues, including elevated service response timeouts and increased page load times,” ThousandEyes reported.

Two outages impact service for Virgin Media UK: April 4

BGP routing appeared to be the primary cause of two outages that impacted Virgin Media UK on April 4, 2023. The outages affected the reachability of the Virgin Media UK network and its services from the global internet. The two incidents occurred on the same day, each lasting several hours, and together they spanned most of the day. According to ThousandEyes, “a lack of viable BGP routes appeared to cause most of the observed traffic loss.”

ThousandEyes determined the two outages had similar characteristics, including the withdrawal of routes to the provider’s network, traffic loss, and intermittent periods of service restoration. “Given that the initial incident began in a period of time typical of maintenance activities (half past midnight local time), it may have resulted from a change to the network state by the service provider,” ThousandEyes said. “Recurrence of a near identical incident later in the day could indicate that the triggering mechanism for the first incident was either not fully understood or was not completely resolved.”

AWS incident impacts services for 2 hours: June 13

On June 13, 2023, Amazon Web Services (AWS) experienced a more than two-hour incident that impacted a number of services on the East Coast of the U.S. The disruption began in the evening and was resolved a couple of hours later. ThousandEyes did not observe any significant network issues, such as high latency or packet loss, on paths to AWS servers. The network observability provider did, however, notice elevated response times, server timeouts, and HTTP server errors impacting the availability of applications hosted within AWS.

“The incident appears to have manifested as elevated response times, timeouts, and HTTP 5XX server errors for users attempting to access impacted applications,” ThousandEyes said. Shortly after the incident began, AWS identified the source of the issue as a capacity management subsystem that was impacting the availability of many of its services, including Lambda, AWS Management Console, and more. According to ThousandEyes, AWS confirmed that these affected services were experiencing “increased error rates and latencies,” which caused service availability issues for applications using these AWS services, “regardless of where they were hosted or where they were serving users.”

“This incident illustrates the complex web of interdependencies that applications and services rely on today. Many of these dependencies may be indirect, or ‘hidden,’ from the organizations, as they may be dependencies of the services they are directly consuming,” ThousandEyes said in its analysis of the incident.
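The idea of “hidden” dependencies can be made concrete with a small sketch. The Python below assumes a hypothetical, hand-written dependency map (none of the service names come from the AWS incident) and walks it to find the services an application relies on only indirectly.

```python
def transitive_deps(graph, service):
    """Return every service that `service` depends on, directly or
    indirectly, by walking the dependency map depth-first."""
    seen = set()
    stack = list(graph.get(service, ()))
    while stack:
        dep = stack.pop()
        if dep in seen:
            continue                      # already visited; also guards against cycles
        seen.add(dep)
        stack.extend(graph.get(dep, ()))  # follow this dependency's own dependencies
    return seen


# Hypothetical dependency map, for illustration only.
graph = {
    "storefront": ["payments-api", "cdn"],
    "payments-api": ["serverless-fn"],
    "serverless-fn": ["capacity-mgmt"],
}

direct = set(graph["storefront"])
hidden = transitive_deps(graph, "storefront") - direct
print(sorted(hidden))
```

In this made-up map, the storefront team consumes only two services directly, yet an issue in the capacity management layer two hops away would still reach it.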

Slack suffers usability issues: Aug. 2

For Slack, a significant incident caused performance problems but not a complete outage. Still, the disruption made it difficult for Slack users to complete desired tasks. On Aug. 2, 2023, Slack users spent about two hours dealing with problems uploading files and with images appearing blurry. The same issue caused some users to see problems in other functions of the service, such as prolonged page load times, an inability to log in, and general instability. This disruption, according to ThousandEyes, is an example of how a service can be available but not necessarily usable.

ThousandEyes’ initial observations found an increase in HTTP 500 server errors and higher-than-normal page load times for global users trying to reach Slack. Further investigation revealed that the Slack web client was loading just 15 objects when it typically loads around 28 to function. “Given this, there were signs early on that Slack’s issues likely related to problems with the application backend,” ThousandEyes reported.
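The distinction between available and usable can be sketched as a simple check. The Python below is an illustration rather than how ThousandEyes tests work: it assumes a monitor that records an HTTP status code and the number of page objects loaded, and it compares that count against a known baseline. The object counts of 15 and 28 come from the Slack incident; the function name and the 90% cutoff are assumptions.

```python
def classify_page_load(status_code, objects_loaded, baseline_objects, min_ratio=0.9):
    """Classify a page load as "unavailable", "degraded", or "healthy".

    A 5xx response counts as unavailable. A successful response that loads
    too few objects relative to the baseline counts as degraded, meaning
    the page is reachable but probably not usable.
    `min_ratio` is an illustrative cutoff, not a published threshold.
    """
    if status_code >= 500:
        return "unavailable"
    if baseline_objects > 0 and objects_loaded / baseline_objects < min_ratio:
        return "degraded"
    return "healthy"


# The page responds, but only 15 of the usual 28 or so objects load.
result = classify_page_load(200, 15, 28)
print(result)
```

A check that looks only at the status code would report this page as up, which is why counting loaded objects against a baseline catches the kind of partial failure Slack users saw.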

ThousandEyes highlighted two interesting points about the Slack service disruption: the first is that it occurred at the top of an hour, which typically indicates a scheduled job; and the second comes from Slack’s own post-incident report detailing that the root cause was work on one part of the service, “a routine database cluster migration,” which accidentally reduced database capacity.

“The scheduled job, combined with the normal operational needs of users, saw database requests gradually build up to the point at which they choked the queue,” ThousandEyes said. With the migration having unintentionally reduced the capacity of the database cluster, the backlog of requests led to errors in some actions in Slack.
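The way a request queue builds up when demand outpaces capacity can be shown with a toy model. The Python below is a simplified sketch, not a reconstruction of Slack’s systems: every number is invented for illustration, and the model assumes a fixed number of requests served per time step.

```python
def simulate_backlog(arrivals, capacity):
    """Return the queue length after each time step.

    `arrivals` lists the requests arriving per step, and `capacity` is the
    number of requests the database can serve per step. Whatever cannot be
    served carries over to the next step.
    """
    backlog = 0
    history = []
    for arriving in arrivals:
        backlog = max(0, backlog + arriving - capacity)
        history.append(backlog)
    return history


# Illustrative numbers only: 80 user requests per step against a capacity
# of 100 leaves headroom, so nothing queues.
normal = simulate_backlog([80] * 5, capacity=100)

# A scheduled job adds 40 requests per step while a migration has cut
# capacity to 90, so the queue grows by 30 requests every step.
incident = simulate_backlog([80 + 40] * 5, capacity=90)

print(normal)
print(incident)
```

Neither change would cause trouble on its own in this model: 120 requests against the original capacity of 100 queue slowly, and 80 requests against the reduced capacity of 90 do not queue at all. It is the combination that makes the backlog climb quickly.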

Square disruption prevents transaction processing: Sept. 8

Contactless payments terminal and service provider Square experienced a more than 18-hour disruption on Sept. 8, 2023, that prevented customers from processing transactions. Multiple Square services were affected globally, and the cause was determined to be backend connectivity issues. The impact of the outage could have been greater than its duration suggests because the timeframe doesn’t “appear to take into account flow-on impacts to funds that were transferring and other payments processing,” according to ThousandEyes.

Square users reported problems such as payments appearing to complete but then not showing up in business accounts, as well as terminal connections dropping out. ThousandEyes reported that it had observed “intermittent dropouts and 503 ‘service unavailable’ errors. The degradation pattern suggested the root cause may have been an internal routing or similar backend system.” In its post-incident report, Square confirmed that a backend system, specifically DNS, caused the issue.

“While making several standard changes to our internal network software, the combination of updates prevented our systems from properly communicating with each other, and ultimately caused the disruption,” Square reported.

Power loss leads to Workday and Cloudflare outages: Nov. 2

Workday and Cloudflare both experienced service outages on Nov. 2, 2023, which ThousandEyes said it believed were related. According to ThousandEyes, the common link between the outages “appears to be a partial mains power outage at Flexential data center in Portland, Ore.” Cloudflare indicated this was the cause of its disruption in its post-mortem report, and Workday also pointed to a data center in Portland as the source of the issue.

“A combination of post-mortems, OSINT [open source intelligence tools], and ThousandEyes observations” point to the two incidents being related. While Cloudflare issued a detailed post-mortem report, Workday provided fewer details, but stated, “Due to issues with backup power failures, as well as an unstable power environment resulting in additional challenges, service restoration has taken longer than is typical.”

ThousandEyes observed “page content did not match” errors, which occur when the interaction between the client and server breaks down, as well as an immediate redirect at the login request to a static maintenance page. “Another element to note is that ThousandEyes tests show that the static content is being served out of AWS; prior to the outage, Workday content was being served via Cloudflare,” ThousandEyes reported.