Abstract
At CeON ICM UW we maintain a large collection of scholarly documents that we store and process using the MapReduce paradigm. One of the main challenges is to design a simple but effective data model that fits various data access patterns and allows diverse analyses to be performed efficiently. In this paper, we describe the organization of our data and explain how it is accessed and processed by open-source tools from the Apache Hadoop ecosystem.
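To make the abstract's mention of the MapReduce paradigm concrete, here is a minimal, hypothetical sketch of a map/reduce pass over scholarly-document records. The record fields (`id`, `journal`, `year`) are illustrative assumptions, not the authors' actual data model, and the in-memory shuffle stands in for Hadoop's distributed shuffle/sort.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical scholarly-document records; the field names are illustrative only.
documents = [
    {"id": "doc1", "journal": "JML", "year": 2010},
    {"id": "doc2", "journal": "JML", "year": 2011},
    {"id": "doc3", "journal": "TCS", "year": 2011},
]

def map_phase(docs):
    # Mapper: emit one (key, 1) pair per document, keyed by journal.
    for doc in docs:
        yield doc["journal"], 1

def reduce_phase(pairs):
    # Shuffle/sort: bring pairs with equal keys together.
    pairs = sorted(pairs, key=itemgetter(0))
    # Reducer: sum the counts for each key.
    return {key: sum(v for _, v in group)
            for key, group in groupby(pairs, key=itemgetter(0))}

counts = reduce_phase(map_phase(documents))
print(counts)  # {'JML': 2, 'TCS': 1}
```

In Hadoop, the same two functions would run as distributed mapper and reducer tasks, with the framework handling the shuffle between them.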
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
Cite this chapter
Kawa, A., Bolikowski, Ł., Czeczko, A., Dendek, P.J., Tkaczyk, D. (2013). Data Model for Analysis of Scholarly Documents in the MapReduce Paradigm. In: Bembenik, R., Skonieczny, L., Rybinski, H., Kryszkiewicz, M., Niezgodka, M. (eds) Intelligent Tools for Building a Scientific Information Platform. Studies in Computational Intelligence, vol 467. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35647-6_12
DOI: https://doi.org/10.1007/978-3-642-35647-6_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35646-9
Online ISBN: 978-3-642-35647-6
eBook Packages: Engineering (R0)