DOI: 10.1145/1882291.1882297
research-article

Finding latent performance bugs in systems implementations

Published: 07 November 2010

ABSTRACT

Robust distributed systems commonly employ high-level recovery mechanisms enabling the system to recover from a wide variety of problematic environmental conditions such as node failures, packet drops and link disconnections. Unfortunately, these recovery mechanisms also effectively mask additional serious design and implementation errors, disguising them as latent performance bugs that severely degrade end-to-end system performance. These bugs typically go unnoticed due to the challenge of distinguishing between a bug and an intermittent environmental condition that must be tolerated by the system. We present techniques that can automatically pinpoint latent performance bugs in systems implementations, in the spirit of recent advances in model checking by systematic state space exploration. The techniques proceed by automating the process of conducting random simulations, identifying performance anomalies, and analyzing anomalous executions to pinpoint the circumstances leading to performance degradation.

Because our implementation, MACEPC, targets the MACE toolkit, it can be used to test our systems implementations directly, without modification. We have applied MACEPC to five thoroughly tested and trusted distributed systems implementations. In each of them, MACEPC found significant, previously unknown, long-standing performance bugs, and the resulting fixes significantly improved end-to-end performance.
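
The three steps named in the abstract (random simulation, anomaly identification, and analysis of anomalous runs) can be illustrated with a small sketch. This is not MACEPC's algorithm, which the abstract does not specify. The names `simulate` and `find_anomalies`, the 5% slow-path rate, the 30-second penalty, and the interquartile-range outlier rule are all assumptions chosen for illustration.

```python
import random
import statistics


def simulate(seed: int) -> float:
    """Hypothetical stand-in for one randomized simulation run.

    Returns a completion time in seconds. A small fraction of runs take an
    assumed slow path that models a bug masked by recovery: the run still
    finishes, only much later.
    """
    rng = random.Random(seed)              # seeded, so every run is replayable
    base = rng.uniform(1.0, 2.0)           # assumed normal completion time
    hit_slow_path = rng.random() < 0.05    # assumed rate of the masked bug
    return base + (30.0 if hit_slow_path else 0.0)


def find_anomalies(times: dict[int, float], k: float = 3.0) -> list[int]:
    """Return the seeds whose time exceeds Q3 + k * IQR.

    This is a generic outlier rule, used here only as an example of
    flagging performance anomalies.
    """
    q1, _, q3 = statistics.quantiles(list(times.values()), n=4)
    threshold = q3 + k * (q3 - q1)
    return sorted(seed for seed, t in times.items() if t > threshold)


# Step 1: run many random simulations. Step 2: flag the slow ones.
times = {seed: simulate(seed) for seed in range(200)}
anomalous_seeds = find_anomalies(times)
```

Because each run is driven by its seed, a flagged seed can be replayed with `simulate(seed)` to study the circumstances that led to the slowdown, which corresponds to the third step.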


Published in

FSE '10: Proceedings of the eighteenth ACM SIGSOFT international symposium on Foundations of software engineering
November 2010
302 pages
ISBN: 9781605587912
DOI: 10.1145/1882291

Copyright © 2010 ACM

Publisher

Association for Computing Machinery
New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 17 of 128 submissions, 13%
