Shuttle
One of the coolest technologies I have had the pleasure of using recently is the shuttle library. It's a concurrency verification tool developed by some of my former coworkers at AWS.
Motivation
Writing correct concurrent code involving locks, message passing, and atomics can be difficult.
A big contributor to the difficulty is that testing and debugging concurrent code is non-trivial. It's common to just write single-threaded unit tests that exercise simple serial behavior.
A more robust technique that I've employed a lot in the past is making a multi-threaded unit test and executing it in a loop hundreds or thousands of times. The idea is that any weird concurrency bugs will be probabilistically exercised.
While you can get a fair amount of mileage out of this technique, it's definitely not foolproof.
A certain subset of possible execution permutations may be much more likely than others due to system-specific details.
Some concurrency bugs that happen to be rare on your particular system sneak through your testing process. Then — inevitably — they show up in hard-to-reproduce production bugs after weeks or months of lurking undetected.
Reproducibility is another issue. Even if you do manage to hit some assertion error after running a concurrent test a hundred thousand times, it's hard to exercise the same path consistently enough to effectively debug.
Solution
The problem can be solved with controlled thread scheduling.
The shuttle library vends drop-in replacements of all the standard library's concurrency primitives.
The user runs their concurrent test inside a shuttle harness that controls the scheduling behavior of those concurrency primitives.
The harness explores different execution permutations by running the test repeatedly and manipulating the scheduling behavior.
This is basically the same technique used by the loom library. The main difference between the two is that loom does exhaustive exploration and shuttle does random exploration.
You might think that exhaustive exploration is strictly superior, but that's not always the case.
The number of possible execution permutations grows factorially with the number of threads and synchronization points. That means that exhaustive exploration is only feasible for very simple cases.
Random exploration enables exploring arbitrarily complicated execution permutations and is trivially parallelizable. You could spin up thousands of EC2 instances testing some business-critical concurrent code if you wanted.
In practice, nearly all of the bugs that shuttle catches for me happen within a few hundred iterations — far from the effectively-infinite number of possible execution permutations any complicated code has.
Worked example
Let's say we wanted to write a Semaphore object. Our first attempt might look like this:
/// A basic semaphore that only allows a fixed number of
/// outstanding permits.
pub struct Semaphore {
    remaining_permits: Arc<AtomicU64>,
}

impl Semaphore {
    pub fn new(permits: u64) -> Self {
        Self {
            remaining_permits: Arc::new(AtomicU64::new(permits)),
        }
    }

    /// Acquire a `Permit`, or return `None` if there are no available
    /// permits.
    pub fn try_acquire(&self) -> Option<Permit> {
        // If we still have remaining permits, subtract one
        // and vend a `Permit`.
        if self.remaining_permits.load(Ordering::SeqCst) > 0 {
            self.remaining_permits.fetch_sub(1, Ordering::SeqCst);
            Some(Permit(Arc::clone(&self.remaining_permits)))
        } else {
            None
        }
    }
}

pub struct Permit(Arc<AtomicU64>);

/// Increment the remaining permits count on `drop`
impl Drop for Permit {
    fn drop(&mut self) {
        self.0.fetch_add(1, Ordering::SeqCst);
    }
}
Then, being good engineers, we write a shuttle test:
#[test]
fn shuttle_test() {
    shuttle::check_random(|| {
        // In this test, we set up a `Semaphore` with
        // only 2 permits. Then we spin up 5 threads
        // to acquire a permit each. Only 2 should succeed.
        let semaphore = Arc::new(Semaphore::new(2));

        // Use this barrier to make all threads wait
        // until the others have tried to acquire before
        // they exit (and drop their permits).
        let wait_barrier = Arc::new(Barrier::new(5));

        let successful_acquisitions = Arc::new(AtomicU64::new(0));
        let threads = (0..5)
            .map(|_| {
                let semaphore = Arc::clone(&semaphore);
                let wait_barrier = Arc::clone(&wait_barrier);
                let successful_acquisitions = Arc::clone(&successful_acquisitions);
                thread::spawn(move || {
                    let permit = semaphore.try_acquire();
                    if permit.is_some() {
                        successful_acquisitions.fetch_add(1, Ordering::SeqCst);
                    }
                    wait_barrier.wait();
                })
            })
            .collect::<Vec<_>>();

        for thread in threads {
            thread.join().unwrap();
        }

        assert_eq!(successful_acquisitions.load(Ordering::SeqCst), 2);
    }, 100);
}
Upon running it, we get an assertion failure!
test panicked in task 'main-thread'
failing schedule: "91033fc398c79ec5c8b9c6420002608464"
pass that string to `shuttle::replay` to replay the failure
thread 'tests::shuttle_test' panicked at 'assertion failed: `(left == right)`
  left: `5`,
 right: `2`', src/main.rs:117:9
We see that shuttle has found some execution that results in all 5 threads acquiring permits. It even gives us a way to re-run the exact execution schedule.
So we re-run the same execution after inserting some tactical dbg! statements, and we find the problem.
It's possible for multiple threads to all simultaneously confirm that the remaining permit count is positive and proceed into the body of the if-statement.
This is a classic newbie concurrency error.
It's a little harder to mess this up in Rust — the clunky AtomicU64 methods make it obvious that something interesting is going on with regard to concurrency. Nonetheless, this class of error has doubtlessly bitten anyone getting started with multi-threaded programming.
So let's try to fix it with a little optimistic concurrency:
pub fn try_acquire(&self) -> Option<Permit> {
    // Optimistically subtract a permit, but add it back if we
    // were actually already at 0
    let previous = self.remaining_permits.fetch_sub(1, Ordering::SeqCst);
    if previous > 0 {
        Some(Permit(Arc::clone(&self.remaining_permits)))
    } else {
        self.remaining_permits.fetch_add(1, Ordering::SeqCst);
        None
    }
}
Re-running shuttle, we get a new error:
test panicked in task 'main-thread'
failing schedule: "910333a4f99ef198f1fae"
pass that string to `shuttle::replay` to replay the failure
thread 'tests::shuttle_test' panicked at 'assertion failed: `(left == right)`
  left: `4`,
 right: `2`', src/main.rs:111:9
Repeating the debugging process, the new problem becomes clear.
When the remaining permit count is at 0, optimistically subtracting 1 causes it to underflow back to u64::MAX. If a second thread checks that the permit count is positive before the original thread rolls back the optimistic subtraction, the second thread will succeed (thinking that there are an absurdly high number of permits available).
We try again, with a compare-and-swap technique reminiscent of a spinlock.
pub fn try_acquire(&self) -> Option<Permit> {
    // Read the current value. If it's 0, fail.
    // Otherwise, do a compare-and-swap that will
    // only subtract 1 if the value hasn't changed
    // since the last time we looked.
    // If it did change, just loop back around and
    // try again.
    loop {
        let current = self.remaining_permits.load(Ordering::SeqCst);
        if current == 0 {
            return None;
        }
        if self.remaining_permits.compare_exchange(
            current,
            current - 1,
            Ordering::SeqCst,
            Ordering::SeqCst,
        ).is_ok() {
            return Some(Permit(Arc::clone(&self.remaining_permits)));
        }
    }
}
Finally, shuttle gives us a passing grade.
Integration
Replacing all of your std::sync primitives with shuttle::sync primitives is a minor complication. The easy way to handle it is with conditional compilation and re-exports. Here's the basic idea:
#[cfg(feature = "shuttle")]
mod sync {
    pub use shuttle::sync::*;
}

#[cfg(not(feature = "shuttle"))]
mod sync {
    pub use std::sync::*;
}

use crate::sync::Arc;
use crate::sync::atomic::{AtomicU64, Ordering};
You should also put your shuttle tests behind the same feature flag so they won't be run with non-shuttle concurrency primitives.
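The Cargo side of this might look something like the following sketch — the version number is an assumption, and the `dep:` feature syntax requires a reasonably recent Cargo:

```toml
[dependencies]
# Only pulled in when the `shuttle` feature is enabled.
shuttle = { version = "0.7", optional = true }

[features]
shuttle = ["dep:shuttle"]
```

With that in place, the shuttle tests run via `cargo test --features shuttle`, while a plain `cargo test` exercises the real std primitives.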
Conclusion
Testing and debugging concurrent code with shuttle is almost trivially easy. I can't imagine going back to writing serious concurrent code without it.
S3 uses shuttle to verify the correctness of extremely complicated concurrent software storing exabytes of data.
My coworkers wrote a paper about our experience using shuttle and wound up winning Best Paper at SOSP '21.