
Coinbase (Nasdaq: COIN) has once again shown cryptocurrency traders how a cloud hardware failure can hobble even a fast exchange. It also appears that the company’s strategy of pivoting to AI-driven operations may prove to be its worst move yet.
A cooling failure at Amazon Web Services, the cloud arm of Amazon (Nasdaq: AMZN), helped cause a multi-hour outage that disrupted trading, exchange access, and balance updates across Coinbase’s platform, the company said on Friday.
The issue began around 23:50 UTC on May 7, when internal monitors detected widespread quoting failures within the company’s systems.
At that point, engineers opened several Sev1 incidents, and customers were already being affected across services including spot trading, Coinbase Prime, the international and derivatives markets, and the retail, advanced, and institutional exchanges.
Coinbase CEO Brian Armstrong wrote on X that his company “experienced an outage” and that such an event was “completely unacceptable.” According to him, the cause was “overheating of the room in the AWS data center due to the failure of several chillers.”
According to Armstrong, all of the company’s services are designed not to go offline if an AWS Availability Zone fails. The majority of services are organized this way, with the exception of the exchange itself, which runs on separate infrastructure because of its strict latency requirements.
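For readers unfamiliar with that pattern, here is a minimal sketch of multi-AZ failover: each service keeps replicas in several availability zones, and traffic is routed away from any zone that fails health checks. The zone names, hostnames, and routing logic below are illustrative assumptions, not Coinbase’s actual setup.

```python
# A hypothetical sketch of the multi-AZ failover pattern Armstrong described:
# replicas in several availability zones, with traffic routed away from any
# zone that fails health checks. All names here are illustrative.
import random

REPLICAS = {
    "us-east-1a": ["api-1a.internal"],
    "us-east-1b": ["api-1b.internal"],
    "us-east-1c": ["api-1c.internal"],
}

def pick_endpoint(healthy_zones: set[str]) -> str:
    # Serve from any healthy zone; losing one AZ degrades capacity,
    # not availability.
    candidates = [host
                  for az, hosts in REPLICAS.items() if az in healthy_zones
                  for host in hosts]
    if not candidates:
        raise RuntimeError("no healthy availability zone")
    return random.choice(candidates)

# Zone us-east-1a fails its health checks; requests continue via 1b and 1c.
print(pick_endpoint({"us-east-1b", "us-east-1c"}))
```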
Coinbase blames failed AWS chillers as pricing systems fail before midnight UTC
As Cryptopolitan reported earlier, Coinbase plans to lay off 700 workers, roughly 14% of its total workforce, with the aim of replacing manual processes with artificial intelligence.
Rob Witoff, who leads infrastructure at Coinbase, provided technical details on the issue. According to him, the multi-hour outage affected “trading, access to the exchange and balance updates.”
The initial warning came at 23:50 UTC, when internal systems began reporting quoting failures, and Sev1 incident response started immediately. According to Witoff, the trigger was a “thermal event” in a small percentage of racks at a single facility in AWS us-east-1.
That single-zone design is deliberate: Coinbase keeps the exchange’s infrastructure in one availability zone because the industry prizes speed, Witoff said.
In addition, the company maintains a distributed backup of the exchange infrastructure for exactly this kind of scenario. But this particular failure did not stay contained within one component, which prolonged the recovery.
Two components failed. The hardware underneath the matching engine malfunctioned, so recoveries and failovers had to be carried out before anything else.
The Kafka distributed cluster, which moves data between systems across the organization, also went down. Rebuilding the Kafka partitions on new hardware meant recovering up to a terabyte of data.
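As a rough picture of what that recovery entails, the sketch below uses the confluent_kafka admin client to flag partitions that are leaderless or under-replicated after a broker loss; those are the partitions that must re-replicate onto healthy hardware. The broker address is an assumption, and this is an operator’s-eye illustration rather than Coinbase’s tooling.

```python
# A minimal sketch of how an operator might spot unhealthy Kafka partitions
# after a hardware failure. Assumes the confluent_kafka package and a
# reachable bootstrap broker; the address below is illustrative.
from confluent_kafka.admin import AdminClient

admin = AdminClient({"bootstrap.servers": "broker-1.internal:9092"})

# list_topics() returns cluster metadata: topics -> partitions -> replica state.
metadata = admin.list_topics(timeout=10)

for topic_name, topic in metadata.topics.items():
    for pid, partition in topic.partitions.items():
        replicas = set(partition.replicas)
        isrs = set(partition.isrs)
        if partition.leader == -1:
            # No leader elected: the partition cannot serve reads or writes.
            print(f"{topic_name}[{pid}]: OFFLINE (no leader)")
        elif isrs < replicas:
            # The in-sync replica set has shrunk: data must re-replicate
            # onto healthy brokers before the partition is fully recovered.
            print(f"{topic_name}[{pid}]: under-replicated, ISR={sorted(isrs)}")
```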
Engineers rebuild quorum and bring Coinbase markets back with cancellation-only and auction modes
The matching engine accounted for the largest share of the trading outage. It processes orders and maintains order books, and it runs as a distributed cluster that requires a quorum before it can elect a leader and execute trades safely.
Because the data-center failure left too few nodes intact, a quorum could not be reached, which blocked trading on the retail, advanced, and institutional exchanges.
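A toy example makes the constraint concrete: under a strict-majority rule, losing two of three nodes stalls leader election entirely. The node names and cluster size below are hypothetical, not Coinbase’s actual topology.

```python
# A hypothetical majority-quorum check of the kind a clustered matching
# engine might apply before electing a leader. Names are illustrative.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    healthy: bool

def has_quorum(nodes: list[Node]) -> bool:
    # A strict majority of all configured nodes must be healthy;
    # otherwise leader election stalls and trading cannot resume safely.
    healthy = sum(1 for n in nodes if n.healthy)
    return healthy > len(nodes) // 2

cluster = [
    Node("engine-a", healthy=True),
    Node("engine-b", healthy=False),  # rack lost in the thermal event
    Node("engine-c", healthy=False),  # rack lost in the thermal event
]

print(has_quorum(cluster))  # False: 1 of 3 healthy, no leader can be elected
```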
Witoff said the on-call engineering and support teams had to execute the company’s disaster recovery procedures, re-establish quorums, and assess system health under degraded infrastructure conditions.
According to him, the team had to develop, test, deploy, and validate fixes while managing the broader outage. Kafka required extensive manual recovery because its partitioned architecture handles thousands of terabytes per day.
Balance updates lagged while Kafka caught up; Witoff said those issues cleared once replication re-synced. According to Coinbase, no data was lost.
When the matching engine returned to service, markets were not re-enabled all at once. Coinbase first switched all products to cancellation-only mode, checked product statuses, moved all markets into auction mode, and finally enabled trading on Coinbase Exchange.
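That staged sequence can be pictured as a one-way state machine. The sketch below encodes the order Witoff described (cancel-only, then auction, then trading); the class and method names are hypothetical, not Coinbase’s actual code.

```python
# A toy state machine illustrating the staged market reopening described
# in the article. States and their order come from Coinbase's account;
# everything else here is a hypothetical illustration.
from enum import Enum

class MarketMode(Enum):
    HALTED = "halted"
    CANCEL_ONLY = "cancel_only"  # resting orders may be cancelled, not matched
    AUCTION = "auction"          # orders collect and prices form, no continuous matching
    TRADING = "trading"          # normal continuous trading

# Only forward transitions along the recovery path are permitted.
ALLOWED = {
    MarketMode.HALTED: {MarketMode.CANCEL_ONLY},
    MarketMode.CANCEL_ONLY: {MarketMode.AUCTION},
    MarketMode.AUCTION: {MarketMode.TRADING},
    MarketMode.TRADING: set(),
}

class Market:
    def __init__(self, symbol: str):
        self.symbol = symbol
        self.mode = MarketMode.HALTED

    def transition(self, target: MarketMode) -> None:
        if target not in ALLOWED[self.mode]:
            raise ValueError(f"{self.symbol}: cannot go {self.mode.value} -> {target.value}")
        self.mode = target

market = Market("BTC-USD")
for step in (MarketMode.CANCEL_ONLY, MarketMode.AUCTION, MarketMode.TRADING):
    market.transition(step)  # in practice, health checks would gate each step
print(market.mode)           # MarketMode.TRADING
```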
Furthermore, Witoff stressed that customers should never be locked out of their accounts, even temporarily. Coinbase said it will publish a detailed account of the incident within the next several weeks.
However, Josh Ellethorpe pushed back on the rumors after reading Witoff’s post. As he put it, “No one coded something that failed. No ‘non-engineer’ pushed the production code and took the trading engine out. It wasn’t intentional. It wasn’t because Coinbase failed to design a failover system. Things happen at scale, don’t let armchair quarterbacks tell you tall tales.”