Abstract
The ability to classify network traffic accurately and scalably is critical to a wide range of management tasks in large networks, such as tier-1 ISP networks and global enterprise networks. Guided by the practical constraints and requirements of traffic classification in large networks, in this article we explore the design of an accurate and scalable machine-learning-based flow-level traffic classification system, which is trained on a dataset of flow-level data that has been annotated with application protocol labels by a packet-level classifier. Our system employs a lightweight modular architecture that combines a series of simple linear binary classifiers, each of which can be efficiently implemented and trained on vast amounts of flow data in parallel. It embraces three key mechanisms to achieve scalability while attaining high accuracy: weighted threshold sampling, logistic calibration, and intelligent data partitioning. Evaluations using real traffic data from multiple locations in a large ISP show that our system accurately reproduces the labels of the packet-level classifier when run on (unlabeled) flow records, while meeting the scalability and stability requirements of large ISP networks. Using training and test datasets that are two months apart and collected from two different locations, the flow error rates are only 3% for TCP flows and 0.4% for UDP flows. We further show that these error rates can be reduced by incorporating information about the spatial distributions of flows, or collective traffic statistics, during classification. We propose a novel two-step model that seamlessly integrates these collective traffic statistics into the existing traffic classification system. Experimental results show performance improvements on all traffic classes and a 15% reduction in the overall error rate. In addition to achieving high accuracy, our implementation easily scales at runtime to classify traffic on 10Gbps links.
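The modular design described in the abstract (one simple linear binary classifier per application class, with logistic calibration so that the per-class outputs are comparable) can be illustrated with a minimal sketch. This is not the authors' implementation: the feature vector, class names, weights, and calibration parameters below are all hypothetical values chosen for illustration, and the calibration uses a generic sigmoid of the form 1 / (1 + exp(-(a*s + b))) as an assumed stand-in for the article's logistic calibration step.

```python
import math

def linear_score(weights, bias, features):
    """Raw output of one linear binary classifier: w . x + bias."""
    return sum(w * x for w, x in zip(weights, features)) + bias

def calibrate(score, a, b):
    """Map a raw score to a value in (0, 1) with a logistic function.

    The parameters a and b are assumed to have been fitted per class
    on held-out data; here they are hard-coded for illustration.
    """
    return 1.0 / (1.0 + math.exp(-(a * score + b)))

def classify(features, models):
    """Score the flow with every binary classifier and pick the class
    whose calibrated output is highest (one-vs-rest decision)."""
    probs = {
        label: calibrate(linear_score(m["w"], m["bias"], features), m["a"], m["b"])
        for label, m in models.items()
    }
    best = max(probs, key=probs.get)
    return best, probs

# Hypothetical per-class models; in a real system each would be trained
# independently (and could be trained in parallel) on labeled flow records.
MODELS = {
    "web": {"w": [0.5, -0.2, 2.0], "bias": -1.0, "a": 1.5, "b": 0.0},
    "p2p": {"w": [0.3, 0.4, -1.5], "bias": -0.5, "a": 0.8, "b": -0.2},
    "dns": {"w": [-0.8, -0.5, -1.0], "bias": 2.0, "a": 1.0, "b": 0.0},
}

# Hypothetical flow features: [log bytes, log packets, uses-port-80 flag].
flow = [2.0, 1.0, 1.0]
label, probs = classify(flow, MODELS)
print(label, {k: round(v, 3) for k, v in probs.items()})
```

Because each class has its own independent model, adding or retraining a class does not require touching the others, which is the property the abstract attributes to the modular architecture. Calibration matters because raw linear scores from separately trained classifiers are not on a common scale, so comparing them directly could favor whichever classifier happens to produce larger magnitudes.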
Index Terms
- A Modular Machine Learning System for Flow-Level Traffic Classification in Large Networks