ABSTRACT
Robust distributed systems commonly employ high-level recovery mechanisms enabling the system to recover from a wide variety of problematic environmental conditions such as node failures, packet drops and link disconnections. Unfortunately, these recovery mechanisms also effectively mask additional serious design and implementation errors, disguising them as latent performance bugs that severely degrade end-to-end system performance. These bugs typically go unnoticed due to the challenge of distinguishing between a bug and an intermittent environmental condition that must be tolerated by the system. We present techniques that can automatically pinpoint latent performance bugs in systems implementations, in the spirit of recent advances in model checking by systematic state space exploration. The techniques proceed by automating the process of conducting random simulations, identifying performance anomalies, and analyzing anomalous executions to pinpoint the circumstances leading to performance degradation.
By focusing our implementation, MACEPC, on the MACE toolkit, we can test our implementations directly, without modification. We have applied MACEPC to five thoroughly tested and trusted distributed systems implementations. MACEPC found significant, previously unknown, long-standing performance bugs in each of the systems, and led to fixes that significantly improved their end-to-end performance.
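As an illustration only, the three-step workflow named above (random simulations, anomaly identification, selection of anomalous executions for analysis) might be sketched as follows. This is a minimal sketch under stated assumptions, not MACEPC's actual algorithm or interface: `simulate`, `run_simulations`, and `find_anomalies` are hypothetical names, the simulated workload is a toy stand-in, and the interquartile-range outlier rule is a generic statistical choice rather than the method used in the paper.

```python
from __future__ import annotations

import random
import statistics


def simulate(seed: int) -> float:
    """Hypothetical stand-in for one randomized simulation run.

    Returns a completion time. A real harness would execute the system
    under test with a random event schedule derived from the seed.
    """
    rng = random.Random(seed)
    elapsed = rng.gauss(10.0, 1.0)
    # Assumption for illustration: a small fraction of runs hit a slow path
    # that a recovery mechanism eventually masks.
    if rng.random() < 0.05:
        elapsed += 30.0
    return elapsed


def run_simulations(num_runs: int) -> dict[int, float]:
    """Step 1: conduct random simulations, keyed by seed so runs can be replayed."""
    return {seed: simulate(seed) for seed in range(num_runs)}


def find_anomalies(times: dict[int, float], k: float = 1.5) -> list[int]:
    """Step 2: flag runs whose time exceeds Q3 + k * IQR (a generic outlier rule)."""
    q1, _, q3 = statistics.quantiles(list(times.values()), n=4)
    threshold = q3 + k * (q3 - q1)
    return sorted(seed for seed, t in times.items() if t > threshold)


if __name__ == "__main__":
    results = run_simulations(200)
    # Step 3 (not shown): replay each flagged seed and compare it against a
    # normal execution to locate where the two diverge.
    for seed in find_anomalies(results):
        print(f"seed {seed}: {results[seed]:.1f} time units (anomalous)")
```

Keying every run by its seed is the design choice that matters here: an anomalous execution can be replayed deterministically, which is what makes the subsequent analysis step feasible.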