Crypto giant Coinbase has published a post-mortem report describing the incidents that led to the May 7 outage.
The platform said on Monday that the service disruption, which lasted for several hours, was caused by the failure of multiple chiller units in a single Amazon Web Services (AWS) data center. The outage caused trading, depositing, withdrawing and other key processes on the platform to fail.
However, the incident isn’t an isolated one. According to historical status logs, Coinbase has experienced four major, platform-wide outages that severely impacted core trading and account access since its inception.
What happened during the outage?
For several hours, users of the Coinbase platform were unable to access many of the main features.
Beginning around 7:48 PM ET on May 7, retail customers faced issues buying and selling crypto, sending and receiving funds, and making deposits or withdrawals across most Coinbase products.
The disruption also affected Coinbase Prime clients, with institutional trading services experiencing problems in routing orders efficiently.
The event reportedly began when a number of cooling units malfunctioned at an AWS data center in the us-east-1 region, which Coinbase relies on.
With temperatures reaching critical levels, AWS’ safety mechanisms kicked in and shut off several server clusters and storage services to protect their hardware from potential damage. This led to the shutdown of essential Coinbase services, resulting in the outage.
Coinbase points to two major issues
According to Coinbase’s report, the outage dragged on for several hours because the exchange ran into two significant technical hurdles while trying to restore services.
At the center of the problem was Coinbase’s matching engine, which is the core system that processes and executes trades. The engine was heavily concentrated within a single AWS facility to keep trading speeds as fast as possible.
However, after key nodes failed, there were not enough functional nodes left for the system to run correctly. The problem was made worse because the system had no automatic mechanism to fail over to an alternative AWS availability zone.
Consequently, Coinbase engineers had to perform the failover manually, modifying configurations and building a new node cluster. While they restored the required system quorum at 12:06 AM ET, trading did not resume immediately because further steps were needed before the markets could reopen.
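To illustrate why concentrating nodes in one facility makes quorum fragile, here is a minimal sketch in Python. It assumes a simple majority quorum rule and uses hypothetical zone names and node counts; the report does not describe how Coinbase's matching engine actually computes quorum or how its nodes are placed.

```python
from typing import Dict


def has_quorum(healthy: int, total: int) -> bool:
    """Assumed rule: strictly more than half of all nodes must be healthy."""
    return healthy >= total // 2 + 1


def survives_zone_loss(placement: Dict[str, int], failed_zone: str) -> bool:
    """Check whether the cluster keeps quorum after losing one whole zone.

    `placement` maps a (hypothetical) zone name to the number of nodes in it.
    """
    total = sum(placement.values())
    healthy = total - placement.get(failed_zone, 0)
    return has_quorum(healthy, total)


# Hypothetical layouts for a five-node cluster.
concentrated = {"az-a": 5, "az-b": 0, "az-c": 0}  # everything in one zone
spread = {"az-a": 2, "az-b": 2, "az-c": 1}        # split across three zones

print(survives_zone_loss(concentrated, "az-a"))  # False: no nodes remain
print(survives_zone_loss(spread, "az-a"))        # True: 3 of 5 remain
```

Under these assumptions, the spread layout tolerates the loss of any single zone, while the concentrated layout trades that resilience for lower latency between nodes.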
The second reason for the market stoppage was an issue with the AWS service Managed Streaming for Apache Kafka (MSK). The incident report describes how a problem in the MSK control plane disrupted the automated election of new partition leaders.
As a result, certain clusters became clogged and event stream pipelines were unable to handle data properly.
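The effect of a stalled leader election can be shown with a toy model in Python. This is not Kafka's or MSK's actual election logic; the partition names, broker names, and the `control_plane_healthy` flag are all hypothetical, and the model only captures the general idea that a partition without a leader cannot accept writes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Partition:
    name: str
    leader: Optional[str]        # broker currently serving writes, if any
    in_sync_replicas: List[str]  # brokers eligible to become leader


def fail_broker(partitions: List[Partition], broker: str) -> None:
    """Remove a failed broker from every partition it took part in."""
    for p in partitions:
        if broker in p.in_sync_replicas:
            p.in_sync_replicas.remove(broker)
        if p.leader == broker:
            p.leader = None


def elect_leaders(partitions: List[Partition], control_plane_healthy: bool) -> None:
    """Promote a surviving replica, but only if election automation works."""
    if not control_plane_healthy:
        return  # election is stalled; leaderless partitions stay leaderless
    for p in partitions:
        if p.leader is None and p.in_sync_replicas:
            p.leader = p.in_sync_replicas[0]


def writable(partitions: List[Partition]) -> List[str]:
    """Names of partitions that currently have a leader."""
    return [p.name for p in partitions if p.leader is not None]


partitions = [
    Partition("orders-0", "broker-1", ["broker-1", "broker-2"]),
    Partition("orders-1", "broker-2", ["broker-2", "broker-3"]),
]
fail_broker(partitions, "broker-1")

elect_leaders(partitions, control_plane_healthy=False)
print(writable(partitions))  # ['orders-1']

elect_leaders(partitions, control_plane_healthy=True)
print(writable(partitions))  # ['orders-0', 'orders-1']
```

In this sketch, a healthy replica exists for the affected partition the whole time, yet the partition stays unavailable until the election step can run again, which is one way a control plane problem could leave event pipelines backed up.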
Coinbase admits issues with infrastructure
According to the report, the system operated below the reliability threshold the firm expects of its infrastructure. In addition, the firm said its design was intended to withstand the loss of a single AWS availability zone; however, this incident exposed gaps in that design.
Coinbase also thanked the engineers at both companies for their work in bringing the platform back into operation.
The incident comes as large platforms grow increasingly dependent on third-party infrastructure. Infrastructure failures can have outsized effects on trading systems that need to run continuously and at high speed.
The report also points out that these kinds of outages can become particularly serious during volatile market conditions, when users need instant, uninterrupted access to trade and manage positions.
