Finding latent performance bugs in systems implementations

Authors:
Charles Killian, Purdue University, West Lafayette, IN, USA
Karthik Nagaraj, Purdue University, West Lafayette, IN, USA
Salman Pervez, Purdue University, West Lafayette, IN, USA
Ryan Braud, University of California, San Diego, La Jolla, CA, USA
James W. Anderson, University of California, San Diego, La Jolla, CA, USA
Ranjit Jhala, University of California, San Diego, La Jolla, CA, USA

FSE '10: Proceedings of the eighteenth ACM SIGSOFT international symposium on Foundations of software engineering
November 2010, Pages 17–26
https://doi.org/10.1145/1882291.1882297

Published: 07 November 2010

ABSTRACT

Robust distributed systems commonly employ high-level recovery mechanisms enabling the system to recover from a wide variety of problematic environmental conditions such as node failures, packet drops and link disconnections. Unfortunately, these recovery mechanisms also effectively mask additional serious design and implementation errors, disguising them as latent performance bugs that severely degrade end-to-end system performance. These bugs typically go unnoticed due to the challenge of distinguishing between a bug and an intermittent environmental condition that must be tolerated by the system. We present techniques that can automatically pinpoint latent performance bugs in systems implementations, in the spirit of recent advances in model checking by systematic state space exploration. The techniques proceed by automating the process of conducting random simulations, identifying performance anomalies, and analyzing anomalous executions to pinpoint the circumstances leading to performance degradation.

Because our implementation focuses on the MACE toolkit, MACEPC can be used to test our implementations directly, without modification. We have applied MACEPC to five thoroughly tested and trusted distributed systems implementations. In each system, MACEPC found significant, previously unknown, long-standing performance bugs, and the resulting fixes significantly improved end-to-end performance.
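
The simulate, detect, and analyze loop described in the abstract can be illustrated with a minimal Python sketch. It is not MACEPC's algorithm or API: the `simulate` function, its 5% slow recovery path, and the median-plus-`k`-times-IQR threshold are all assumptions chosen for illustration. The sketch only shows the general shape of the idea: run many seeded random simulations, record one performance metric per run, and flag the seeds whose metric is anomalous so that those executions can be replayed and inspected.

```python
import random
import statistics


def simulate(seed: int) -> float:
    """Toy stand-in for one random simulation; returns a completion time in seconds."""
    rng = random.Random(seed)
    latency = rng.gauss(10.0, 1.0)
    # Hypothetical latent bug: a rare event ordering triggers a slow recovery path
    # that still completes, so the run looks correct but is much slower.
    if rng.random() < 0.05:
        latency += 30.0
    return latency


def find_anomalies(results: dict[int, float], k: float = 3.0) -> list[int]:
    """Return the seeds whose metric exceeds the median by more than k * IQR."""
    values = sorted(results.values())
    q1, median, q3 = statistics.quantiles(values, n=4)
    threshold = median + k * (q3 - q1)
    return sorted(seed for seed, value in results.items() if value > threshold)


def run(num_runs: int = 200) -> list[int]:
    """Run seeded simulations and report the seeds worth replaying."""
    results = {seed: simulate(seed) for seed in range(num_runs)}
    return find_anomalies(results)


if __name__ == "__main__":
    print("seeds to replay and inspect:", run())
```

Because each run is keyed by its seed, a flagged execution can be reproduced exactly, which is what makes the later analysis step practical in a simulator-based setting.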

Index Terms

Finding latent performance bugs in systems implementations
1. Software and its engineering
  1. Software creation and management
    1. Software verification and validation
      1. Software defect analysis
        Software testing and debugging
  2. Software organization and properties
    1. Software system structures
      1. Distributed systems organizing principles

Published in
FSE '10: Proceedings of the eighteenth ACM SIGSOFT international symposium on Foundations of software engineering
November 2010
302 pages
ISBN: 9781605587912
DOI: 10.1145/1882291
General Chair: Gruia-Catalin Roman, Washington University in St. Louis, USA
Program Chair: André van der Hoek, University of California, Irvine, USA
Copyright © 2010 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
debugging
distributed systems
mace
macepc
performance
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 17 of 128 submissions, 13%