Abstract
On-line decision augmentation (OLDA) is considered a promising paradigm for real-time decision making powered by Artificial Intelligence (AI). OLDA is widely used in applications such as real-time fraud detection and personalized recommendation. On-line inference feeds real-time features, extracted from multiple time windows, into a pre-trained model that evaluates new data to support decision making. Feature extraction is usually the most time-consuming operation in OLDA data pipelines. In this work, we first study how existing in-memory databases can be leveraged to support such real-time feature extraction efficiently. However, we found that existing in-memory databases take hundreds or even thousands of milliseconds for this task, which is unacceptable for OLDA applications with strict real-time constraints. We therefore propose FEDB (Feature Engineering Database), a distributed in-memory database system designed to efficiently support on-line feature extraction. Our experimental results show that FEDB can be one to two orders of magnitude faster than state-of-the-art in-memory databases on real-time feature extraction. Furthermore, we explore the use of the Intel Optane DC Persistent Memory Module (PMEM) to make FEDB more cost-effective. Compared with FEDB using DRAM+SSD, PMEM-based FEDB with the proposed PMEM-optimized persistent skiplist can shorten tail latency by up to 19.7%, reduce recovery time by up to 99.7%, and save up to 58.4% of the total cost of a real OLDA pipeline.
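To make "features extracted from multiple time windows" concrete, the following minimal Python sketch computes per-key aggregates over several look-back windows. It is an illustration only and does not reflect FEDB's implementation or API: the names `WindowStore` and `extract_features`, the `amount` field, and the window lengths are all hypothetical, and a sorted in-memory list stands in for whatever time-ordered index a real system would use.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


class WindowStore:
    """Hypothetical per-key, time-ordered event store (not FEDB's API)."""

    def __init__(self):
        self._ts = defaultdict(list)      # key -> timestamps in ms, kept sorted
        self._amount = defaultdict(list)  # key -> amounts aligned with _ts

    def insert(self, key, ts_ms, amount):
        # Keep each key's events sorted by timestamp so that a time
        # window becomes a contiguous slice found by binary search.
        i = bisect_right(self._ts[key], ts_ms)
        self._ts[key].insert(i, ts_ms)
        self._amount[key].insert(i, amount)

    def window(self, key, now_ms, span_ms):
        # Return amounts of events with now_ms - span_ms <= ts <= now_ms.
        ts = self._ts.get(key, [])
        amounts = self._amount.get(key, [])
        lo = bisect_left(ts, now_ms - span_ms)
        hi = bisect_right(ts, now_ms)
        return amounts[lo:hi]


def extract_features(store, key, now_ms, windows_ms):
    """Compute count and sum over each named look-back window."""
    features = {}
    for name, span_ms in windows_ms.items():
        amounts = store.window(key, now_ms, span_ms)
        features[f"count_{name}"] = len(amounts)
        features[f"sum_{name}"] = sum(amounts)
    return features


# Example: three card transactions, features over 10 s and 60 s windows.
store = WindowStore()
store.insert("card_1", 1_000, 10.0)
store.insert("card_1", 50_000, 20.0)
store.insert("card_1", 59_000, 5.0)
features = extract_features(
    store, "card_1", now_ms=60_000, windows_ms={"10s": 10_000, "60s": 60_000}
)
```

The resulting feature vector would then be passed to a pre-trained model. In a sketch like this, the cost of each request is dominated by the window lookups, which is consistent with the abstract's observation that feature extraction tends to be the bottleneck.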