ABSTRACT
Disk-to-disk wide-area file transfers involve many subsystems and tunable application parameters that pose significant challenges for bottleneck detection, system optimization, and performance prediction. Performance models can be used to address these challenges but have not proved generally usable because they require extensive online experiments to characterize subsystems. We show here how to overcome the need for such experiments by applying machine learning methods to historical data to estimate parameters for predictive models. Starting with log data for millions of Globus transfers involving billions of files and hundreds of petabytes, we engineer features for endpoint CPU load, network interface card load, and transfer characteristics, and we use these features in both linear and nonlinear models of transfer performance. We show that the resulting models have high explanatory power. For a representative set of 30,653 transfers over 30 heavily used source-destination pairs ("edges"), totaling 2,053 TB in 46.6 million files, we obtain median absolute percentage prediction errors (MdAPE) of 7.0% and 4.6% when using distinct linear and nonlinear models per edge, respectively; when using a single nonlinear model for all edges, we obtain an MdAPE of 7.8%. Our work broadens understanding of factors that influence file transfer rate by clarifying relationships between achieved transfer rates, transfer characteristics, and competing load. Our predictions can be used for distributed workflow scheduling and optimization, and our features can also be used for optimization and explanation.
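The MdAPE figures quoted above summarize, as a median, how far each predicted transfer rate falls from the observed one in percentage terms. A minimal sketch of that metric follows, assuming the usual definition (the median of |actual - predicted| / |actual|, expressed in percent). The function name `mdape` and the sample rates are hypothetical illustrations and are not taken from the paper's data or code.

```python
from statistics import median


def mdape(actual, predicted):
    """Median absolute percentage error, in percent.

    Assumes the common definition: median over all samples of
    100 * |actual - predicted| / |actual|.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("percentage error is undefined for a zero actual value")
    errors = [100.0 * abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return median(errors)


# Hypothetical observed and predicted transfer rates (MB/s), for illustration only.
actual_rates = [100.0, 200.0, 400.0, 800.0]
predicted_rates = [110.0, 190.0, 400.0, 720.0]

# Per-transfer errors are 10%, 5%, 0%, and 10%, so the median is 7.5%.
example_mdape = mdape(actual_rates, predicted_rates)
print(f"MdAPE: {example_mdape:.1f}%")
```

Because the metric is a median rather than a mean, a small number of badly mispredicted transfers has limited influence on it, which is worth keeping in mind when reading the per-edge and all-edge figures.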