A hybrid approach of Poisson distribution LDA with deep Siamese Bi-LSTM and GRU model for semantic similarity prediction for text data

Published in: Multimedia Tools and Applications

Abstract

Predicting the semantic similarity between texts is an open and challenging research problem in natural language processing (NLP). Traditional text-similarity techniques capture lexical features but neglect the syntactic and semantic properties of the text, and they produce high-dimensional feature vectors. To overcome these issues, the present study develops a hybrid approach that integrates a deep Siamese bidirectional long short-term memory (Bi-LSTM) network with a gated recurrent unit (GRU) neural network as the training model. The proposed approach estimates the weights of the feature vectors and reduces their dimensionality before the training phase. First, a pre-processing phase removes special characters from the text and converts the text to feature vectors through vectorization; the weight values are then updated using weighted term frequency-inverse document frequency (TF-IDF), aided by a log-likelihood weight calculation method. Next, the Poisson Normal linear discriminant analysis (LDA) technique reduces the dimensionality of the feature vectors. The resulting embedded vectors are fed, as weight values, into the training model, which estimates similarity scores for the input data and performs text classification using the deep Siamese Bi-LSTM and GRU classifiers. In the performance assessment, the proposed model attains a 19% higher accuracy on the STS dataset than existing methods. The model also shows better results on the other datasets. The higher accuracy and F1 score demonstrate the efficiency of the proposed framework.
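As a rough illustration of the first two stages described in the abstract (special-character removal and TF-IDF weighting), the sketch below may help. It is not the authors' method: it uses plain smoothed TF-IDF rather than the log-likelihood-aided weighted variant, it omits the Poisson Normal LDA reduction and the Siamese Bi-LSTM/GRU model entirely, and it scores pairs with cosine similarity purely as a stand-in baseline. The function names (`preprocess`, `tfidf_vectors`, `cosine`) and the example sentences are hypothetical.

```python
import math
import re
from collections import Counter


def preprocess(text):
    """Lowercase, replace special characters with spaces, and split into tokens.

    Assumption: only ASCII letters and digits are kept; the paper does not
    specify its exact cleaning rules.
    """
    return re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document (smoothed TF-IDF)."""
    tokenized = [preprocess(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        vectors.append({
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Illustrative sentences only; these are not taken from the STS dataset.
sentences = [
    "A man is playing a guitar!",
    "A man plays the guitar.",
    "The stock market fell sharply today.",
]
vecs = tfidf_vectors(sentences)
sim_related = cosine(vecs[0], vecs[1])
sim_unrelated = cosine(vecs[0], vecs[2])
print(f"related: {sim_related:.3f}, unrelated: {sim_unrelated:.3f}")
```

A purely lexical score like this treats "playing" and "plays" as unrelated terms, which is the kind of limitation the abstract attributes to traditional techniques and which the learned Siamese model is meant to address.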

Data availability

This manuscript has no associated data.

Author information

Corresponding author

Correspondence to D. Viji.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Viji, D., Revathy, S. A hybrid approach of Poisson distribution LDA with deep Siamese Bi-LSTM and GRU model for semantic similarity prediction for text data. Multimed Tools Appl 82, 37221–37248 (2023). https://doi.org/10.1007/s11042-023-15050-4

  • DOI: https://doi.org/10.1007/s11042-023-15050-4
