ABSTRACT
MapReduce is gaining popularity as a programming model for large-scale distributed processing. The model is most widely used in implementations built on the Hadoop Distributed File System (HDFS). This reliance on HDFS, however, precludes applying the model directly to HPC environments, which use high-performance distributed file systems. In such environments, the MapReduce model can rarely make use of all available resources, because local disks may not be available for data placement on every node. This work proposes a MapReduce implementation and design choices directly suited to such HPC environments.
Index Terms
- Adapting MapReduce for HPC environments