Abstract
At CeON ICM UW we maintain a large collection of scholarly documents that we store and process using the MapReduce paradigm. One of the main challenges is to design a simple but effective data model that fits various data access patterns and allows diverse analyses to be performed efficiently. In this paper, we describe the organization of our data and explain how it is accessed and processed by open-source tools from the Apache Hadoop ecosystem.
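To make the abstract's mention of the MapReduce paradigm concrete, here is a minimal, hypothetical sketch of a map/reduce pass over scholarly-document records. The record fields (`id`, `journal`, `year`) are illustrative assumptions, not the authors' actual data model, and the in-memory shuffle stands in for Hadoop's distributed shuffle/sort.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical scholarly-document records; the field names are illustrative only.
documents = [
    {"id": "doc1", "journal": "JML", "year": 2010},
    {"id": "doc2", "journal": "JML", "year": 2011},
    {"id": "doc3", "journal": "TCS", "year": 2011},
]

def map_phase(docs):
    # Mapper: emit one (key, 1) pair per document, keyed by journal.
    for doc in docs:
        yield doc["journal"], 1

def reduce_phase(pairs):
    # Shuffle/sort: bring pairs with equal keys together.
    pairs = sorted(pairs, key=itemgetter(0))
    # Reducer: sum the counts for each key.
    return {key: sum(v for _, v in group)
            for key, group in groupby(pairs, key=itemgetter(0))}

counts = reduce_phase(map_phase(documents))
print(counts)  # {'JML': 2, 'TCS': 1}
```

In Hadoop, the same two functions would run as distributed mapper and reducer tasks, with the framework handling the shuffle between them.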
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
Cite this chapter
Kawa, A., Bolikowski, Ł., Czeczko, A., Dendek, P.J., Tkaczyk, D. (2013). Data Model for Analysis of Scholarly Documents in the MapReduce Paradigm. In: Bembenik, R., Skonieczny, L., Rybinski, H., Kryszkiewicz, M., Niezgodka, M. (eds) Intelligent Tools for Building a Scientific Information Platform. Studies in Computational Intelligence, vol 467. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35647-6_12
DOI: https://doi.org/10.1007/978-3-642-35647-6_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35646-9
Online ISBN: 978-3-642-35647-6
eBook Packages: Engineering (R0)