A deep network model for paraphrase detection in short text messages
Introduction
Twitter has for some time been a popular means for expressing opinions about a variety of subjects. Paraphrase detection in user-generated noisy texts, such as Twitter texts, is an important component of various Natural Language Processing (NLP), information retrieval and text mining applications, including query ranking, plagiarism detection, question answering, and document summarization. Recently, the paraphrase detection task has gained significant interest in applied NLP because of the need to deal with the pervasive problem of linguistic variation.
Paraphrase detection is an NLP classification problem. Given a pair of sentences, the system determines the semantic similarity between the two sentences. If the two sentences convey the same meaning, the pair is labelled as a paraphrase; otherwise, it is labelled as a non-paraphrase. Most existing paraphrase systems perform quite well on clean text corpora, such as the Microsoft Paraphrase Corpus (MSRP) (Dolan, Quirk, & Brockett, 2004). However, detecting paraphrases in user-generated noisy Tweets is more challenging due to issues like misspellings, acronyms, style and structure (Xu, Ritter, Callison-Burch, Dolan, & Ji, 2014). In addition, measuring the semantic similarity between two short sentences is very difficult due to the lack of common lexical features (Kajiwara, Bollegala, Yoshida, & Kawarabayashi, 2017). Although little attention has been given to paraphrase detection in noisy short texts thus far, some initial work has been reported on the SemEval 2015 benchmark Twitter dataset (Dey, Shrivastava, & Kaushik, 2016; Xu, Callison-Burch, & Dolan, 2015; Xu, Ritter, Callison-Burch, Dolan, & Ji, 2014). Unfortunately, the best-performing approaches on one dataset do not seem to perform as well when evaluated on another. As we discuss later in this paper, the state-of-the-art approach for the SemEval dataset proposed by Dey et al. (2016) does not perform well (in terms of F1-score) when evaluated on the MSRP dataset. Similarly, the approach of Ji and Eisenstein (2013) is the best-performing one on the MSRP dataset, but does not perform well on the SemEval dataset. In conclusion, existing approaches are not very generic, but rather are highly dependent on the data used for training.
Focusing on the problem discussed above, the main goal of this work is to develop a robust paraphrase detection model based on deep learning techniques that is able to successfully detect paraphrasing in both noisy and clean texts. More specifically, we propose a hybrid deep neural architecture composed of a convolutional neural network (CNN) and a recurrent neural network (RNN) model, further enhanced by a novel word-pair similarity module. The proposed paraphrase detection model has two main components: (1) sentence modelling and (2) pair-wise word similarity matching. First, sentence modelling concerns building an effective model to represent the text. To do this, we build a joint CNN and RNN architecture that takes the local features extracted by the CNN as input to the RNN. We take word embeddings as input to the CNN model. Then, after convolution and pooling operations, the encoded feature maps are taken in sequence as input to the RNN model. The last hidden state learned by the RNN model is considered the sentence-level representation. The main rationale behind using both a CNN and an RNN here is that the CNN is able to learn local features in the form of important n-grams of the text, whereas the RNN takes words in sequential order and is able to learn the long-term dependencies of the text rather than local features. Second, a pair-wise similarity matching model is used to extract fine-grained similarity information between pairs of sentences. Initially, a pair-wise similarity matrix is constructed by computing the similarity of each word in a given sentence to all the words in the other sentence. We then apply a CNN to this similarity matrix to analyse the patterns in the semantic correspondence between each pair of words in the two sentences, which are intuitively useful for paraphrase identification.
The idea to apply convolutions over the similarity matrix to extract the important word-word similarity pairs is motivated by how convolutions over text can extract the most important parts of a sentence.
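The construction of the pair-wise similarity matrix described above can be illustrated with a minimal sketch. It assumes cosine similarity as the word-to-word measure and uses a tiny hand-made embedding table; the similarity measure, the function names (`cosine`, `similarity_matrix`) and the toy vectors are all assumptions made for illustration, not details taken from the paper. The sketch stops at the matrix itself and does not include the CNN that is subsequently applied to it.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat all-zero vectors as dissimilar to everything
    return dot / (norm_u * norm_v)


def similarity_matrix(sent_a, sent_b, embeddings):
    """Entry [i][j] is the similarity of word i in sent_a to word j in sent_b."""
    return [
        [cosine(embeddings[wa], embeddings[wb]) for wb in sent_b]
        for wa in sent_a
    ]


# Hypothetical 2-dimensional embeddings, chosen only to make the example readable.
toy_embeddings = {
    "great": [0.0, 1.0],
    "awesome": [0.0, 1.0],
    "film": [1.0, 0.0],
    "movie": [1.0, 0.0],
}

matrix = similarity_matrix(["great", "film"], ["awesome", "movie"], toy_embeddings)
for row in matrix:
    print(row)
```

With these toy vectors, each word matches its synonym in the other sentence and nothing else, so high values line up along the diagonal. Local patterns of this kind are what a convolution over the matrix could pick up.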
In this paper, we show how the proposed model for paraphrase detection can be enhanced by employing an extra set of statistical features extracted from the input text. To demonstrate its robustness, we evaluate the proposed approach and compare it with the state-of-the-art models using two different datasets, covering both noisy user-generated texts (the SemEval 2015 benchmark Twitter dataset) and clean texts (the Microsoft Paraphrase Corpus, MSRP).
The main contributions of this work can be summarized as follows:
1. We propose a novel deep neural network architecture leveraging coarse-grained sentence-level features and fine-grained word-level features for detecting paraphrases in noisy short text from Twitter. The model combines sentence-level and word-level semantic similarity information such that it can capture semantic information at each level. When the text is grammatically irregular or very short, the word-level similarity model can provide useful information, while the semantic representation of the sentence provides useful information otherwise. In this way, the two model components complement each other and yield good overall performance.
2. We show how the proposed pair-wise similarity model can be used to extract word-level semantic information, and demonstrate its usefulness in the paraphrase detection task.
3. We propose a method combining statistical textual features and features learned from the deep architecture.
4. We present an extensive comparative study for the paraphrase detection problem.
The rest of the paper is organized as follows: In Section 2, we formally define the problem. In Section 3, we discuss related work concerning paraphrase detection. In Section 4, we motivate our work and present our proposed solution in detail. In Section 5, we describe the experimental setup. In Section 6, we evaluate the approach and discuss the results. Finally, in Section 7, we conclude the paper and outline plans for future research.
Problem statement and goals
Let S1 and S2 be two sentences, such that S1 ≠ S2. S1 and S2 are said to be paraphrased if they convey the same meaning and are semantically equivalent. Now, assume that we have a collection of N annotated sentence pairs (S1^i, S2^i), having annotations k_i, for i = 1, 2, …, N. For a given i, k_i indicates whether the i-th sentence pair is paraphrased or non-paraphrased. The problem addressed in this paper is to develop a model, which can reliably label a previously unseen sentence pair as paraphrased
Related work
The use of deep neural networks for natural language processing (NLP) has increased considerably in recent years. Most of the previous work on paraphrase detection has focused on features like n-gram overlap features (Madnani, Tetreault, & Chodorow, 2012), syntax features (Das & Smith, 2009), linguistic features (Sahi & Gupta, 2017; Vani & Gupta, 2018), Wikipedia-based semantic networks (Jiang, Bai, Zhang, & Hu, 2017), knowledge graphs (Franco-Salvador, Rosso, & Montes-y Gómez, 2016) and
DeepParaphrase architecture
We propose a deep learning-based approach for detecting paraphrase sentences for Tweets, with the architecture depicted in Fig. 1. We first convert each sentence in a pair into a semantic representative vector, using a CNN and an RNN. Then, a semantic pair-level vector is computed by taking the element-wise difference of each vector in the sentence representations. The resulting difference is the discriminating representative vector of the pair of sentences, which is used as feature vector for
Experimental setup
Before explaining and discussing the evaluation of our proposed method for paraphrase identification, and comparing it against the state-of-the-art approaches, we first describe how our experiments have been set up, including the datasets, and performance measures that we have used.
In our comparative study, we used the results that were originally reported by the authors of the papers in which the baseline methods were proposed. This means that we did not re-run the experiments, assuming
Results and discussion
In this section we present the results from using both datasets that we presented in Section 5.1.
Conclusions
In this paper, we introduced a robust and generic paraphrase detection model based on a deep neural network, which is able to perform well on both user-generated noisy short texts, such as Tweets, and high-quality clean texts. We proposed a pair-wise word similarity model, which can capture fine-grained semantic correspondence information between each pair of words in given sentences. In addition, we used a hybrid deep neural network that extracts coarse-grained information by developing
Acknowledgements
This work was carried out during the tenure of an ERCIM ‘Alain Bensoussan Fellowship Programme’ by the first author of the paper. The work has mainly been carried out at the Telenor-NTNU AI-Lab, Norwegian University of Science and Technology (NTNU), Norway.
References (48)
- et al. Paraphrase identification and semantic text similarity analysis in Arabic news tweets using lexical, syntactic, and semantic features. Information Processing and Management (2017)
- et al. Boosting paraphrase detection through textual similarity metrics with abductive networks. Applied Soft Computing (2015)
- et al. Combining sentence similarities measures to identify paraphrases. Computer Speech and Language (2018)
- et al. A systematic study of knowledge graph analysis for cross-language plagiarism detection. Information Processing and Management (2016)
- et al. Wikipedia-based information content and semantic similarity computation. Information Processing and Management (2017)
- et al. Interpretable semantic textual similarity: Finding and explaining differences between sentences. Knowledge-Based Systems (2017)
- et al. SyMSS: A syntax-based measure for short-text semantic similarity. Data and Knowledge Engineering (2011)
- et al. A simple but tough-to-beat baseline for sentence embeddings. Proceedings of the 5th international conference for learning representations (ICLR 2017) (2017)
- et al. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics (2017)
- et al. Natural language processing (almost) from scratch. Journal of Machine Learning Research (2011)
- Paraphrase identification as probabilistic quasi-synchronous recognition. Proceedings of the joint conference of the 47th annual meeting of the ACL and the 4th international joint conference on natural language processing of the AFNLP (ACL 2009)
- A paraphrase and semantic similarity detection system for user generated short-text content on microblogs. Proceedings of the 26th international conference on computational linguistics (COLING 2016)
- Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. Proceedings of the 20th international conference on computational linguistics (COLING 2004)
- Twitter paraphrase identification with simple overlap features and SVMs. Proceedings of the 9th international workshop on semantic evaluation (SemEval@NAACL-HLT 2015)
- Learning to recognize ancillary information for automatic paraphrase identification. Proceedings of the North American chapter of the association for computational linguistics: Human language technologies (NAACL-HLT 2016)
- Modeling sentences in the latent space. Proceedings of the 50th annual meeting of the association for computational linguistics (ACL 2012)
- Tree edit models for recognizing textual entailments, paraphrases, and answers to questions. Human language technologies: The 2010 annual conference of the North American chapter of the association for computational linguistics (HLT 2010)
- Long short-term memory. Neural Computation
- Convolutional neural network architectures for matching natural language sentences
- Multi-granularity neural sentence model for measuring short text similarity. Proceedings of the 22nd international conference on database systems for advanced applications (DASFAA 2017)
- Discriminative improvements to distributional sentence similarity. Proceedings of the 2013 conference on empirical methods in natural language processing (EMNLP 2013)
- An iterative approach for the global estimation of sentence similarity. PloS One
- TKLBLIIR: detecting Twitter paraphrases with TweetingJay. Proceedings of the 9th international workshop on semantic evaluation (SemEval@NAACL-HLT 2015)
- Short text similarity with word embeddings. Proceedings of the 24th ACM international on conference on information and knowledge management