A Modular Machine Learning System for Flow-Level Traffic Classification in Large Networks

Published: 01 March 2012

Abstract

Accurate and scalable classification of network traffic is critical to a wide range of management tasks in large networks, such as tier-1 ISP networks and global enterprise networks. Guided by the practical constraints and requirements of traffic classification in such networks, this article explores the design of an accurate and scalable machine-learning-based flow-level traffic classification system. The system is trained on flow-level data that a packet-level classifier has annotated with application protocol labels. It employs a lightweight modular architecture that combines a series of simple linear binary classifiers, each of which can be efficiently implemented and trained in parallel on vast amounts of flow data. Three key mechanisms (weighted threshold sampling, logistic calibration, and intelligent data partitioning) let the system achieve scalability while attaining high accuracy. Evaluations using real traffic data from multiple locations in a large ISP show that our system accurately reproduces the labels of the packet-level classifier when run on (unlabeled) flow records, while meeting the scalability and stability requirements of large ISP networks. With training and test datasets collected two months apart at two different locations, the flow error rates are only 3% for TCP flows and 0.4% for UDP flows. We further show that these error rates can be reduced by incorporating information about the spatial distribution of flows, or collective traffic statistics, during classification. We propose a novel two-step model that integrates these collective traffic statistics into the existing traffic classification system. Experimental results show improved performance on all traffic classes and a 15% reduction in the overall error rate. In addition to its high accuracy, our implementation easily scales at runtime to classify traffic on 10 Gbps links.
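The combination of per-class linear binary classifiers and logistic calibration mentioned in the abstract can be illustrated with a minimal sketch. This is not the article's implementation: the feature layout, the class names `web` and `dns`, the hand-set weights, and the plain gradient-descent fit are all assumptions made for illustration, and weighted threshold sampling and data partitioning are not shown. Each class has its own linear scorer; a two-parameter logistic map turns the raw score into a probability so that scores from independently trained classifiers become comparable, and the flow is assigned to the class with the highest calibrated probability.

```python
import math
from typing import Dict, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def linear_score(weights: Sequence[float], bias: float, x: Sequence[float]) -> float:
    """Raw margin of one linear binary classifier (one class vs. the rest)."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def fit_logistic_calibration(
    scores: Sequence[float],
    labels: Sequence[int],
    lr: float = 0.1,
    epochs: int = 2000,
) -> Tuple[float, float]:
    """Fit p = sigmoid(a * score + b) by gradient descent on the log loss.

    Assumption: plain batch gradient descent stands in for whatever
    optimizer a production system would use.
    """
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = 0.0
        grad_b = 0.0
        for s, y in zip(scores, labels):
            p = sigmoid(a * s + b)
            grad_a += (p - y) * s
            grad_b += p - y
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


# Hypothetical model record: (weights, bias, calibration a, calibration b).
Model = Tuple[Sequence[float], float, float, float]


def classify(x: Sequence[float], models: Dict[str, Model]) -> Tuple[str, Dict[str, float]]:
    """Return the class with the highest calibrated probability, plus all probabilities."""
    probs = {}
    for label, (weights, bias, a, b) in models.items():
        probs[label] = sigmoid(a * linear_score(weights, bias, x) + b)
    best = max(probs, key=lambda k: probs[k])
    return best, probs


if __name__ == "__main__":
    # Hypothetical two-feature flow vectors and hand-set weights, for illustration only.
    toy_models: Dict[str, Model] = {
        "web": ([2.0, -1.0], -0.5, 1.0, 0.0),
        "dns": ([-1.0, 2.0], -0.5, 1.0, 0.0),
    }
    print(classify([1.0, 0.0], toy_models))
```

Because every binary classifier and its calibration map are fitted independently, the per-class work can in principle run in parallel, which is the property the modular design relies on.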


        • Published in

          ACM Transactions on Knowledge Discovery from Data, Volume 6, Issue 1
          March 2012
          137 pages
          ISSN: 1556-4681
          EISSN: 1556-472X
          DOI: 10.1145/2133360

          Copyright © 2012 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 1 March 2012
          • Revised: 1 June 2011
          • Accepted: 1 June 2011
          • Received: 1 June 2010

          Qualifiers

          • research-article
          • Research
          • Refereed
