- FLP: kill a process at the right time and you can break consensus
  - Moreover, FLP assumes deterministic processes; real computers aren't
  - Ben-Or 1983, "Another advantage of free choice": randomized algorithms can reach consensus with probability 1
- Lower bounds: with at least two proposers, or one malicious proposer, you need N > 2F + M nodes (tolerating F crash faults and M malicious faults), and it takes at least two message delays to learn a chosen value
- A leader crash at the wrong time can leave replicas inconsistent
- Mutual exclusion is a concurrency-control property introduced to prevent race conditions
- Queues: Kafka, Kestrel, Rabbit, IronMQ, ActiveMQ, HornetQ, Beanstalk, SQS, Celery, ...
  - Journal work to disk on multiple nodes for redundancy
  - Useful when you need to acknowledge work now and actually do it later
    - This is how financial companies and retailers do it!
  - Send data reliably between stateless services
  - Queues do not provide total event ordering when consumers are concurrent
    - Your consumers are almost definitely concurrent
  - Likewise, queues don't guarantee event order with async consumers
    - Because consumer side effects could take place out of order
  - Queues can offer at-most-once or at-least-once delivery
    - Anyone claiming otherwise is trying to sell you something
    - Recovering exactly-once delivery requires careful control of side effects (see the consumer sketch after this list)
  - Distributed queues also improve fault tolerance (if they don't lose data)
    - If you don't need the fault tolerance or large buffering, just use TCP
    - Lots of people use a queue with six disk writes and fifteen network hops where a single TCP connection would do
  - I don't have good answers for you
    - Ask Jeff Hodges why it's hard: see his RICON West 2013 talk
- We use data structure stores as outsourced heaps: they're the duct tape of distributed systems
- Pair immutable values with a small mutable identity store
  - Write availability limited by the identity store
  - But reads are eminently cacheable if you only need sequential consistency
  - Can be even cheaper if you only need serializability
  - See Rich Hickey's talks on Datomic's architecture
  - See Pat Helland's 2013 RICON West keynote on Salesforce's storage
- Systems which are order-independent are easier to construct and reason about
  - CRDTs are confluent, which means we can apply updates without waiting (see the counter sketch after this list)
  - Immutable values are trivially confluent: once present, fixed
  - Excellent for commutative/monotonic systems
  - Foreign key constraints for multi-item updates
  - Can ensure convergence given arbitrary finite delay ("eventual consistency")
  - Good candidates for geographically distributed systems
  - Probably best in concert with stronger transactional systems
  - See also: COPS, Swift, Eiger, Calvin, etc.
- Consensus nontriviality: only values proposed can be learned
- Asynchronous networks
  - Most networks recover in seconds to weeks, not "never"
  - Conversely, human timescales are on the order of seconds to weeks
    - So we can't pretend the problems don't exist
  - So any proofs we have here apply to those systems too
  - Byzantine networks are allowed to mess with messages: duplication, reordering, corruption
    - Those failures mostly don't happen in real networks
    - In practice, TCP prevents duplicates and reorders within a single connection
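Two of the points above lend themselves to small sketches. First, confluence: a minimal grow-only counter CRDT, a sketch rather than any particular library's API. Each replica increments only its own slot with no coordination, and merge is an element-wise max, which is commutative, associative, and idempotent, so replicas converge no matter how updates and merges are ordered or delayed.

```python
class GCounter:
    """Grow-only counter CRDT: one slot per replica, merge by element-wise max."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica_id -> highest count observed from that replica

    def increment(self, n=1):
        # Local update: no coordination with other replicas required.
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def merge(self, other):
        # Confluent merge: commutative, associative, idempotent.
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)

    def value(self):
        return sum(self.counts.values())


a, b = GCounter("a"), GCounter("b")
a.increment(3)
b.increment(2)
a.merge(b)
b.merge(a)
assert a.value() == b.value() == 5  # both replicas converge to the same total
```

Second, the consumer sketch for at-least-once delivery: make the side effect idempotent by remembering which message ids you have already handled. The message shape and the `charge_card` side effect are hypothetical placeholders, not a real broker API; in production the seen-set and the side effect have to be committed together (e.g. in one transaction), or duplicates can still slip through across a crash.

```python
processed = set()  # message ids already handled; a durable store in production


def charge_card(account, amount):
    print(f"charging {account}: {amount}")  # stand-in for the real side effect


def handle(message):
    if message["id"] in processed:
        return  # duplicate redelivery: the side effect already happened
    charge_card(message["account"], message["amount"])
    processed.add(message["id"])


# At-least-once delivery may hand us the same message twice; the retry is a no-op.
handle({"id": "m1", "account": "alice", "amount": 10})
handle({"id": "m1", "account": "alice", "amount": 10})
```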

- Network and node timing models
  - Semi-/synchronous models: step time is bounded, somewhere in [c, 1]
  - Asynchronous model: execute independently, whenever; step time is anywhere in [0, 1]
    - Weaker than semi- or synchronous networks
    - Implies certain algorithms can't be as efficient
- Break your work into independent chunks
  - Try to align memory barriers to work unit boundaries
  - Allows the processor to cheat as much as possible within a work unit
- You'll often deploy replicated systems across something like an Ethernet LAN
  - Message latencies can be as low as 100 microseconds
  - But across any sizable network (e.g. EC2), expect low milliseconds
  - Network latency is within an order of magnitude of uncached disk seeks
    - EC2 disk latencies can routinely hit 20 ms
    - LMAO if you think anything in EC2 is real
  - But the network is waaaay slower than memory/computation
    - So don't distribute if you can help it; but there are other reasons to distribute
  - Humans can detect ~10 ms lag and will tolerate ~100 ms
    - Only way to beat the speed of light: move the service closer
- Maybe 4 rounds all the time if you have a bad Paxos implementation
- Within one machine, cores and sockets communicate over an interconnect (e.g. Intel QPI)
  - Whole complicated set of protocols in HW & microcode to make memory look coherent
- Sequential IDs require coordination: can you avoid them? (see the ID sketch after this list)
- Mutual exclusion: only one process is allowed to execute the critical section (CS) at any given time
- Rolling out software
  - Inform the load balancer that nodes are going out of rotation
  - Roll out only to a fraction of load or a fraction of users
  - Gradually ramp up the number of users on the new software
  - Either revert or roll forward when you see errors
  - Consider shadowing traffic in prod and comparing old/new versions
    - Good way to determine if new code is faster & correct
- Feature flags (see the flag sketch after this list)
  - We want incremental rollouts of a changeset after a deploy
    - Introduce features one by one to watch their impact on metrics
    - Gradually shift load from one database to another
  - We want to obtain partial availability when some services are degraded
    - Disable expensive features to speed recovery during a failure
  - Use a highly available coordination service to decide which codepaths to enable
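Two of the points above also lend themselves to sketches. The ID sketch: instead of a coordinated sequential counter, pack a timestamp, a node id, and a per-node sequence into one integer, in the flake/snowflake style. IDs are unique without cross-node coordination and roughly time-ordered, though not densely sequential. The class name and field widths are illustrative, not a standard layout; real implementations also wait out the current millisecond when the sequence overflows and guard against clocks running backwards.

```python
import time


class FlakeIds:
    """Coordination-free, roughly time-ordered IDs: time | node | sequence."""

    def __init__(self, node_id):
        self.node_id = node_id & 0x3FF  # 10 bits identify the node; assigned once
        self.last_ms = 0
        self.seq = 0

    def next_id(self):
        now_ms = int(time.time() * 1000)
        if now_ms == self.last_ms:
            self.seq = (self.seq + 1) & 0xFFF  # 12-bit sequence within one millisecond
        else:
            self.last_ms = now_ms
            self.seq = 0
        # 41 bits of time | 10 bits of node | 12 bits of sequence
        return (now_ms << 22) | (self.node_id << 12) | self.seq


gen = FlakeIds(node_id=7)
print([gen.next_id() for _ in range(3)])  # unique and increasing, but not 1, 2, 3
```

And the flag sketch: a percentage-based feature flag check. The `FLAGS` table stands in for values you would read from a highly available store (ZooKeeper, etcd, your config service); the flag name and percentage are hypothetical. Hashing the user id keeps each user's experience stable as the percentage ramps up.

```python
import hashlib

FLAGS = {"new-checkout": 25}  # feature -> percent of users who get the new codepath


def enabled(feature, user_id):
    percent = FLAGS.get(feature, 0)
    bucket = int(hashlib.md5(f"{feature}:{user_id}".encode()).hexdigest(), 16) % 100
    return bucket < percent


codepath = "new" if enabled("new-checkout", "user-123") else "old"
print(codepath)
```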