Stragglers vs. Failures - Do you know the difference?
➤ A failure is a request that doesn’t complete.
➤ A straggler is a request that completes but takes too long, often caused by a backend garbage collection (GC) pause, a hot partition, or a kernel scheduling blip.
From the caller’s perspective, both damage p99. However, they require fundamentally different architectural solutions.
Read Prathamesh Bhope’s #InfoQ article for a deeper dive: https://bit.ly/4uDWKg1
#DistributedSystems #CloudComputing #SoftwareEngineering #Performance
Stragglers, Not Failures: How Adaptive Hedged Requests Reduce p99 Latency by 74 Percent
In fan-out microservice architectures, slow-but-completing requests accumulate across services and drive p99 latency far higher than per-service metrics suggest. This article presents an adaptive hedging mechanism that uses DDSketch for real-time quantile estimation, windowed rotation to handle distribution drift, and a token-bucket budget to prevent load amplification.
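For a rough feel of how those pieces fit together, here is a minimal Python sketch. It is not the article's implementation. It assumes a plain sliding window of exact samples as a stand-in for DDSketch with windowed rotation, and every name in it (`TokenBucket`, `WindowedQuantile`, `hedged_call`, `make_request`) is hypothetical. The idea: send the request, wait for the recent p95 latency, and if nothing has come back and the budget allows it, send one hedge and take whichever response arrives first.

```python
import asyncio
import time
from collections import deque


class TokenBucket:
    """Caps how often hedges may fire, so hedging cannot amplify load unboundedly."""

    def __init__(self, rate_per_sec, capacity):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def try_take(self):
        # Refill in proportion to elapsed time, then spend one token if available.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class WindowedQuantile:
    """Exact quantile over the last `size` samples.

    Assumption: a simple stand-in for DDSketch plus windowed rotation.
    """

    def __init__(self, size=1000):
        self.samples = deque(maxlen=size)  # old samples fall out as new ones arrive

    def add(self, value):
        self.samples.append(value)

    def quantile(self, q, default):
        if not self.samples:
            return default
        ordered = sorted(self.samples)
        return ordered[min(len(ordered) - 1, int(q * len(ordered)))]


async def hedged_call(make_request, latencies, budget, q=0.95, default_delay=0.05):
    """Run make_request(); hedge once if it outlasts the recent q-quantile latency.

    Returns (result, hedged). Errors from the first task to finish propagate;
    handling real failures (retries, timeouts) is deliberately out of scope.
    """
    start = time.monotonic()
    primary = asyncio.create_task(make_request())
    delay = latencies.quantile(q, default_delay)
    done, _ = await asyncio.wait({primary}, timeout=delay)

    tasks = {primary}
    hedged = False
    if not done and budget.try_take():
        # The primary looks like a straggler and the budget allows a second attempt.
        tasks.add(asyncio.create_task(make_request()))
        hedged = True

    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # drop the slower attempt
    latencies.add(time.monotonic() - start)
    return done.pop().result(), hedged
```

A caller would keep one `WindowedQuantile` and one `TokenBucket` per backend and wrap each outgoing call in `await hedged_call(...)`. Note that hedging only makes sense for requests that are safe to send twice.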