An Optimized NL2SQL System for Enterprise Data Mart

  • Conference paper
Machine Learning and Knowledge Discovery in Databases. Applied Data Science Track (ECML PKDD 2021)

Abstract

Natural language interfaces to databases are a growing field that enables end users to interact with relational databases without technical database skills. These interfaces address the problem of synthesizing SQL queries from the user’s natural language input. Research interest in the topic is considerable, but few systems to date have been deployed on top of an active enterprise data mart. We present our NL2SQL system designed for the banking sector, which generates a SQL query from a user’s natural language question. The system comprises the NL2SQL model we developed, together with a data simulation component and an adaptive feedback framework that continuously improve model performance. The architecture of the NL2SQL model builds on our research on WikiSQL data, which we extended to support multi-table scenarios via our unique table expand process. The data simulation and the feedback loop help the model continuously adjust to the linguistic variation introduced by domain-specific knowledge.
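
To make the task concrete, the following minimal sketch shows the kind of single-table query structure used in the WikiSQL dataset (a selected column, an aggregation operator, and a list of conditions) and how such a structure can be rendered as a SQL string. It is not the paper's model: the table name, column names, and the function `sketch_to_sql` are hypothetical, and an NL2SQL model would predict the structure from the question rather than receive it by hand.

```python
# Illustrative only: renders a WikiSQL-style query structure as SQL.
# The operator lists follow the WikiSQL dataset convention; the table,
# columns, and example question are hypothetical.
AGG_OPS = ["", "MAX", "MIN", "COUNT", "SUM", "AVG"]
COND_OPS = ["=", ">", "<"]


def sketch_to_sql(table, columns, sketch):
    """Build a SQL string from {"sel": int, "agg": int, "conds": [[col, op, value], ...]}."""
    sel = columns[sketch["sel"]]
    agg = AGG_OPS[sketch["agg"]]
    select = f"{agg}({sel})" if agg else sel
    sql = f"SELECT {select} FROM {table}"

    conds = []
    for col_idx, op_idx, value in sketch["conds"]:
        if isinstance(value, (int, float)):
            literal = str(value)
        else:
            # Quote string literals and escape embedded single quotes.
            literal = "'" + str(value).replace("'", "''") + "'"
        conds.append(f"{columns[col_idx]} {COND_OPS[op_idx]} {literal}")

    if conds:
        sql += " WHERE " + " AND ".join(conds)
    return sql


# Hypothetical question: "How many accounts at the South Bend branch
# hold a balance over 10000?"
columns = ["account_id", "branch", "balance"]
sketch = {"sel": 0, "agg": 3, "conds": [[2, 1, 10000], [1, 0, "South Bend"]]}
print(sketch_to_sql("accounts", columns, sketch))
```

A structure of this kind covers one table only; queries that span several tables need additional handling, which is the gap the multi-table extension mentioned in the abstract addresses.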

Author information

Corresponding author

Correspondence to Kaiwen Dong.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Dong, K., Lu, K., Xia, X., Cieslak, D., Chawla, N.V. (2021). An Optimized NL2SQL System for Enterprise Data Mart. In: Dong, Y., Kourtellis, N., Hammer, B., Lozano, J.A. (eds) Machine Learning and Knowledge Discovery in Databases. Applied Data Science Track. ECML PKDD 2021. Lecture Notes in Computer Science, vol 12979. Springer, Cham. https://doi.org/10.1007/978-3-030-86517-7_21

  • DOI: https://doi.org/10.1007/978-3-030-86517-7_21

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86516-0

  • Online ISBN: 978-3-030-86517-7

  • eBook Packages: Computer Science, Computer Science (R0)
