Summary: | Providing strong fault-tolerance guarantees for the modern cloud is difficult, as application developers must
coordinate between independent stateful services and ephemeral compute, and handle various failure-induced
anomalies. We propose Composable Resilient Steps (CReSt), a new abstraction for resilient cloud applications.
CReSt uses fault-tolerant steps as its core building block, which allow participants to receive, process, and send
messages as a single uninterruptible atomic unit. Composability and reliability are orthogonally achieved by
reusable CReSt implementations that, for example, leverage reliable message queues. Thus, CReSt application
builders focus solely on translating application logic into steps, and infrastructure builders focus on efficient
CReSt implementations. We propose one such implementation, called DARQ (for Deduplicated Asynchronously
Recoverable Queues). At its core, DARQ is a storage service that encapsulates CReSt participant state and
enforces CReSt semantics; developers attach ephemeral compute nodes to DARQ instances to implement
stateful distributed components. Services built with DARQ are resilient by construction, and CReSt-compatible
services naturally compose without loss of resilience. For performance, we propose a novel speculative
execution scheme to execute CReSt steps without waiting for message persistence in DARQ, effectively eliding
cloud persistence overheads; our scheme maintains CReSt’s fault-tolerance guarantees and automatically
restores consistent system state upon failure. We showcase the generality of CReSt and DARQ using two
applications: cloud streaming and workflow processing. Experiments show that DARQ achieves
low latency and high throughput across these use cases, often outperforming state-of-the-art customized
solutions.
|