This article is also published on LinkedIn

I was reading the Cloudflare outage blog and thought I’d jot down what stood out to me.

The issue started when Cloudflare switched from using a single system account for database access to giving each user their own account. The idea was to “improve security and reliability so that each user could have their own query limits and access grants, and one bad subquery wouldn’t mess things up for everyone else.”

Cloudflare uses ClickHouse as their database, which stores data across multiple servers (shards). Their Bot Management system depends on a “feature file” that lists traits for their ML models to detect bots. This file is generated by querying the database and gets updated every few minutes.

Earlier, all DB queries would use the system account and just show results from the default database. But after the change, users started using their own accounts. Since users already had implicit access to the underlying tables, Cloudflare made that access explicit, so users could see the metadata for those tables too. The idea was that all distributed subqueries would run under the initial user, so query limits and access grants could be more fine-grained.

But here’s where things went sideways: the query that generated the feature file didn’t specify which database to use. After the change, the query started returning duplicate columns, one set from each database. This more than doubled the number of features in the file.
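To make the duplication concrete, here is a minimal Rust sketch that models the metadata rows in memory. It is not Cloudflare's query or schema: the database names ("default", "underlying"), the table name, the feature names, and every function here are made up for illustration. It only shows how a lookup that filters on table name alone picks up extra rows once a second database becomes visible.

```rust
/// One row of column metadata, roughly what a metadata query might return.
/// Every name in this sketch is hypothetical.
#[derive(Debug, Clone, PartialEq)]
struct ColumnRow {
    database: &'static str,
    table: &'static str,
    column: &'static str,
}

/// Builds sample metadata: the same table and columns exist in two databases.
fn sample_metadata() -> Vec<ColumnRow> {
    let mut rows = Vec::new();
    for database in ["default", "underlying"] {
        for column in ["feature_a", "feature_b", "feature_c"] {
            rows.push(ColumnRow { database, table: "features", column });
        }
    }
    rows
}

/// Models access grants: only rows from databases the account can see.
fn visible_rows(all: &[ColumnRow], visible_dbs: &[&str]) -> Vec<ColumnRow> {
    all.iter()
        .filter(|row| visible_dbs.iter().any(|db| *db == row.database))
        .cloned()
        .collect()
}

/// The fragile lookup: filters on table name only, no database.
fn columns_by_table(rows: &[ColumnRow], table: &str) -> Vec<&'static str> {
    rows.iter()
        .filter(|row| row.table == table)
        .map(|row| row.column)
        .collect()
}

/// The explicit lookup: filters on both database and table.
fn columns_by_database_and_table(
    rows: &[ColumnRow],
    database: &str,
    table: &str,
) -> Vec<&'static str> {
    rows.iter()
        .filter(|row| row.database == database && row.table == table)
        .map(|row| row.column)
        .collect()
}
```

With only "default" visible, `columns_by_table` returns three columns. Once "underlying" is visible too, the same call returns six, while `columns_by_database_and_table` keeps returning three. Nothing about the lookup changed; only what the account was allowed to see.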

And then, the Rust code that checks the feature count had a Result::unwrap() in it. With the oversized file, the check returned an error, the unwrap() panicked, and the panic surfaced as 5xx errors. All because of an unhandled error.
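Here is a small Rust sketch of the difference between unwrapping a failed check and handling it. This is not Cloudflare's code: the limit, the error type, the `Engine` struct, and the choice to keep the last known-good feature list are all assumptions made for illustration.

```rust
/// Hypothetical cap on the number of features; not the real limit.
const MAX_FEATURES: usize = 4;

/// Hypothetical error type for a rejected feature file.
#[derive(Debug, PartialEq)]
enum ConfigError {
    TooManyFeatures { found: usize, limit: usize },
}

/// Checks the feature count and returns it, or an error if over the cap.
fn validate_features(features: &[String]) -> Result<usize, ConfigError> {
    if features.len() > MAX_FEATURES {
        return Err(ConfigError::TooManyFeatures {
            found: features.len(),
            limit: MAX_FEATURES,
        });
    }
    Ok(features.len())
}

/// Hypothetical holder of the currently active feature list.
struct Engine {
    active: Vec<String>,
}

impl Engine {
    /// Panicking variant: unwrap() turns a bad input file into a crash.
    fn reload_or_panic(&mut self, candidate: Vec<String>) {
        validate_features(&candidate).unwrap();
        self.active = candidate;
    }

    /// Defensive variant: on error, keep the previous list and report it.
    fn reload(&mut self, candidate: Vec<String>) -> Result<(), ConfigError> {
        validate_features(&candidate)?;
        self.active = candidate;
        Ok(())
    }
}
```

Calling `reload` with an oversized list returns `Err(ConfigError::TooManyFeatures { .. })` and leaves `active` untouched, so the caller can log the problem and keep serving with the old list. Calling `reload_or_panic` with the same input panics. Whether falling back to the last good file is the right policy depends on the system, but the point is that the caller gets to decide.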

When my previous org, Conviva, adopted Rust, the architects would actually run defensive coding sessions so that these lessons were ingrained in every developer. Incidents like this are a reminder: when you’re working with systems languages like Rust or C++, defensive coding isn’t just a nice-to-have, it’s a necessity. The intent behind a change can be good, but it’s the practices that help ensure success.

Still amazed to think how a seemingly harmless unwrap on Result can bring the whole internet down.

Read the full story here: https://lnkd.in/gwiurEK4

