Abstract
Distributed systems are difficult to understand, design, build, and operate. They introduce exponentially more variables into a design than a single machine does, making the root cause of an application problem much harder to discover. It should be said that if an application does not have meaningful SLAs (service-level agreements) and can tolerate extended downtime and/or performance degradation, then the barrier to entry is greatly reduced. Most modern applications, however, have an expectation of resiliency from their users, and SLAs are typically measured by "the number of nines" (e.g., 99.9 or 99.99 percent availability per month). Each additional 9 becomes harder and harder to achieve.
- Allspaw, J. 2008. The Art of Capacity Planning. O'Reilly Media; http://shop.oreilly.com/product/9780596518585.do. Google ScholarDigital Library
- Bennett, C., Tseitlin, A. 2012. Netflix: Chaos Monkey released into the wild. Netflix Tech Blog; http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html.Google Scholar
- Chandra, T., Griesemer, R., Redstone, J. 2007. Paxos made live. In Proceedings of the 26th Annual ACM Symposium on Principles of Distributed Computing: 398-407; http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/paxos_made_live.pdf Google ScholarDigital Library
- Dean, J., Ghemawat, S. 2004. MapReduce: simplified data processing on large clusters. In Proceedings of 6th Symposium on Operating Systems Design and Implementation; http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/mapreduce-osdi04.pdf Google ScholarDigital Library
- DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P, and Vogels, W. 2007. Dynamo: Amazon's highly available key-value store. In Proceedings of the 21st ACM Symposium on Operating System Principles; http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf Google ScholarDigital Library
- Google App Engine. Google Developers; https://developers.google.com/appengine/.Google Scholar
- Gray, J. 1985. Why do computers stop and what can be done about it? Tandem Technical Report 85.7; http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.59.6561.Google Scholar
- Heroku Cloud Application Platform; http://www.heroku.com/.Google Scholar
- Somers, J. Heroku's ugly secret; http://rapgenius.com/James-somers-herokus-ugly-secret-lyrics.Google Scholar
- Tarreau, W. 2006. Make applications scalable with load balancing; http://1wt.eu/articles/2006_lb/.Google Scholar
Index Terms
- There’s Just No Getting around It: You’re Building a Distributed System: Building a distributed system requires a methodical approach to requirements.
Comments