research-article

A Modular Machine Learning System for Flow-Level Traffic Classification in Large Networks

Authors:
Yu Jin, University of Minnesota
Nick Duffield, AT&T Labs -- Research
Jeffrey Erman, AT&T Labs -- Research
Patrick Haffner, AT&T Labs -- Research
Subhabrata Sen, AT&T Labs -- Research
Zhi-Li Zhang, University of Minnesota

ACM Transactions on Knowledge Discovery from Data, Volume 6, Issue 1, Article No. 4, pp. 1–34. https://doi.org/10.1145/2133360.2133364

Published: 01 March 2012

Abstract

The ability to classify network traffic accurately and scalably is critical to a wide range of management tasks in large networks, such as tier-1 ISP networks and global enterprise networks. Guided by the practical constraints and requirements of traffic classification in large networks, in this article we explore the design of an accurate and scalable machine-learning-based flow-level traffic classification system, which is trained on a dataset of flow-level data that has been annotated with application protocol labels by a packet-level classifier. Our system employs a lightweight modular architecture that combines a series of simple linear binary classifiers, each of which can be efficiently implemented and trained on vast amounts of flow data in parallel, and embraces three key innovative mechanisms (weighted threshold sampling, logistic calibration, and intelligent data partitioning) to achieve scalability while attaining high accuracy. Evaluations using real traffic data from multiple locations in a large ISP show that our system accurately reproduces the labels of the packet-level classifier when run on (unlabeled) flow records, while meeting the scalability and stability requirements of large ISP networks. Using training and test datasets that are two months apart and collected from two different locations, the flow error rates are only 3% for TCP flows and 0.4% for UDP flows. We further show that these error rates can be reduced by incorporating information about the spatial distributions of flows, or collective traffic statistics, during classification. We propose a novel two-step model that seamlessly integrates these collective traffic statistics into the existing traffic classification system. Experimental results show performance improvement on all traffic classes and an overall error rate reduction of 15%. In addition to achieving high accuracy, our implementation easily scales at runtime to classify traffic on 10 Gbps links.
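
As an illustration of the modular idea described in the abstract, the following is a minimal sketch of one linear binary classifier per application class, each followed by a logistic calibration step so that the per-class outputs can be compared on a probability scale. It is not the authors' implementation. The abstract does not name the base learner, so a plain perceptron stands in for it here; the sigmoid is fitted by gradient descent on log loss; and the two flow features, the class names, and all function names are hypothetical. Weighted threshold sampling and data partitioning are not shown.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_linear(X, y, epochs=200, lr=0.1):
    # Perceptron stand-in for a "simple linear binary classifier"; y in {-1, +1}.
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            s = sum(wi * xi for wi, xi in zip(w, x)) + b
            if t * s <= 0:  # misclassified (or on the boundary): update
                w = [wi + lr * t * xi for wi, xi in zip(w, x)]
                b += lr * t
    return w, b

def score(model, x):
    # Raw (uncalibrated) linear score of one flow.
    w, b = model
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def fit_logistic_calibration(scores, targets, epochs=500, lr=0.1):
    # Fit p = sigmoid(a * score + c) by gradient descent on log loss; targets in {0, 1}.
    a, c = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_c = 0.0
        for s, t in zip(scores, targets):
            p = sigmoid(a * s + c)
            grad_a += (p - t) * s
            grad_c += p - t
        a -= lr * grad_a / n
        c -= lr * grad_c / n
    return a, c

def train_one_vs_rest(X, labels):
    # One independent (classifier, calibrator) pair per class; each pair
    # could be trained in parallel because it shares no state with the others.
    models = {}
    for cls in sorted(set(labels)):
        y = [1 if l == cls else -1 for l in labels]
        lin = train_linear(X, y)
        scores = [score(lin, x) for x in X]
        cal = fit_logistic_calibration(scores, [1 if t == 1 else 0 for t in y])
        models[cls] = (lin, cal)
    return models

def classify(models, x):
    # Pick the class whose calibrated probability is highest.
    probs = {cls: sigmoid(a * score(lin, x) + c)
             for cls, (lin, (a, c)) in models.items()}
    return max(probs, key=probs.get), probs

# Toy flow records with two hypothetical, already-normalized features.
X = [(0.0, 0.2), (0.3, 0.0), (0.2, 0.3),
     (4.0, 0.1), (3.8, 0.3), (4.2, 0.0),
     (0.1, 4.0), (0.3, 3.9), (0.0, 4.2)]
labels = ["web"] * 3 + ["dns"] * 3 + ["p2p"] * 3

models = train_one_vs_rest(X, labels)
label, probs = classify(models, (3.9, 0.2))
print(label, probs)
```

Because raw scores from independently trained binary classifiers are not on a common scale, taking the maximum over them directly can be misleading; mapping each score through its own fitted sigmoid is one common way to make the outputs comparable before choosing a class.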

Index Terms

1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Supervised learning
        Supervised learning by classification
    2. Machine learning approaches
      1. Classification and regression trees
2. Networks
  1. Network services
    1. Network management

Published in

ACM Transactions on Knowledge Discovery from Data Volume 6, Issue 1
March 2012
137 pages
ISSN: 1556-4681
EISSN: 1556-472X
DOI: 10.1145/2133360

Copyright © 2012 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 March 2012
- Revised: 1 June 2011
- Accepted: 1 June 2011
- Received: 1 June 2010
Author Tags
Communications network
machine learning
traffic classification
Qualifiers
- research-article
- Research
- Refereed