A deep network model for paraphrase detection in short text messages

https://doi.org/10.1016/j.ipm.2018.06.005

Abstract

This paper is concerned with paraphrase detection, i.e., identifying sentences that are semantically equivalent. The ability to detect similar sentences written in natural language is crucial for several applications, such as text mining, text summarization, plagiarism detection, authorship authentication and question answering. Recognizing this importance, we study in particular how to address the challenges of detecting paraphrases in user-generated short texts, such as Twitter messages, which often contain language irregularity and noise, and do not necessarily carry as much semantic information as longer, clean texts. We propose a novel deep neural network-based approach that relies on coarse-grained sentence modelling using a convolutional neural network (CNN) and a recurrent neural network (RNN), combined with a specific fine-grained word-level similarity matching model. More specifically, we develop a new architecture, called DeepParaphrase, which creates an informative semantic representation of each sentence by (1) using a CNN to extract local region information in the form of important n-grams from the sentence, and (2) applying an RNN to capture long-term dependency information. In addition, we perform a comparative study of state-of-the-art approaches to paraphrase detection. An important insight from this study is that existing paraphrase approaches perform well when applied to clean texts, but they do not necessarily deliver good performance on noisy texts, and vice versa. In contrast, our evaluation shows that the proposed DeepParaphrase-based approach achieves good results on both types of text, making it more robust and generic than the existing approaches.

Introduction

Twitter has for some time been a popular means for expressing opinions about a variety of subjects. Paraphrase detection in user-generated noisy texts, such as Twitter texts, is an important task for various Natural Language Processing (NLP), information retrieval and text mining applications, including query ranking, plagiarism detection, question answering, and document summarization. Recently, the paraphrase detection task has gained significant interest in applied NLP because of the need to deal with the pervasive problem of linguistic variation.

Paraphrase detection is an NLP classification problem: given a pair of sentences, the system determines the semantic similarity between them. If the two sentences convey the same meaning, the pair is labelled as a paraphrase; otherwise, it is labelled as a non-paraphrase. Most existing paraphrase systems perform quite well on clean text corpora, such as the Microsoft Research Paraphrase Corpus (MSRP) (Dolan, Quirk, & Brockett, 2004). However, detecting paraphrases in user-generated noisy Tweets is more challenging due to issues such as misspellings, acronyms, style and structure (Xu, Ritter, Callison-Burch, Dolan, & Ji, 2014). In addition, measuring the semantic similarity between two short sentences is very difficult due to the lack of common lexical features (Kajiwara, Bollegala, Yoshida, & Kawarabayashi, 2017). Although little attention has been given to paraphrase detection in noisy short texts thus far, some initial work has been reported on the SemEval 2015 benchmark Twitter dataset (Dey, Shrivastava, & Kaushik, 2016; Xu, Callison-Burch, & Dolan, 2015; Xu, Ritter, Callison-Burch, Dolan, & Ji, 2014). Unfortunately, the best-performing approaches on one dataset do not seem to perform as well when evaluated against another. As we discuss later in this paper, the state-of-the-art approach for the SemEval dataset proposed by Dey et al. (2016) does not achieve good performance (in terms of F1-score) when evaluated on the MSRP dataset. Similarly, the approach of Ji and Eisenstein (2013) performs best on the MSRP dataset, but does not perform well on the SemEval dataset. In conclusion, existing approaches are not very generic, but rather are highly dependent on the data used for training.

Focusing on the problem discussed above, the main goal of this work is to develop a robust paraphrase detection model, based on deep learning techniques, that can successfully detect paraphrasing in both noisy and clean texts. More specifically, we propose a hybrid deep neural architecture composed of a convolutional neural network (CNN) and a recurrent neural network (RNN), further enhanced by a novel word-pair similarity module. The proposed paraphrase detection model consists of two main components: (1) sentence modelling and (2) pair-wise word similarity matching. First, sentence modelling concerns building an effective model to represent the text. To do this, we build a joint CNN and RNN architecture that takes the local features extracted by the CNN as input to the RNN. Word embeddings are the input to the CNN model. Then, after the convolution and pooling operations, the encoded feature maps are fed in sequence to the RNN model. The last hidden state learned by the RNN is taken as the sentence-level representation. The main rationale behind using both a CNN and an RNN is that the CNN learns local features in the form of important n-grams of the text, whereas the RNN processes words in sequential order and learns the long-term dependencies of the text rather than local features. Second, a pair-wise similarity matching model is used to extract fine-grained similarity information between pairs of sentences. Initially, a pair-wise similarity matrix is constructed by computing the similarity of each word in a given sentence to all the words in the other sentence. We then apply a CNN to this similarity matrix to analyse the patterns of semantic correspondence between each pair of words in the two sentences, which are intuitively useful for paraphrase identification. The idea of applying convolutions over the similarity matrix to extract the important word-word similarity pairs is motivated by how convolutions over text can extract the most important parts of a sentence.
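To make the two components concrete, the following PyTorch sketch shows (a) a joint CNN-and-RNN sentence encoder whose last LSTM hidden state serves as the sentence representation, and (b) the construction of the pair-wise word similarity matrix that the fine-grained CNN is later applied to. The layer sizes, the filter width, and the use of cosine similarity are illustrative assumptions, not the exact configuration reported in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentenceEncoder(nn.Module):
    """Coarse-grained sentence model: a CNN over word embeddings feeds an LSTM."""

    def __init__(self, vocab_size, emb_dim=300, n_filters=128,
                 kernel_size=3, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # 1-D convolution extracts local n-gram features (here: trigram-sized windows).
        self.conv = nn.Conv1d(emb_dim, n_filters, kernel_size, padding=1)
        # The LSTM consumes the convolved feature maps in sequence and captures
        # long-term dependencies; its last hidden state is the sentence vector.
        self.rnn = nn.LSTM(n_filters, hidden_dim, batch_first=True)

    def forward(self, token_ids):                    # (batch, seq_len)
        x = self.embed(token_ids)                    # (batch, seq_len, emb_dim)
        x = F.relu(self.conv(x.transpose(1, 2)))     # (batch, n_filters, seq_len)
        x = F.max_pool1d(x, kernel_size=2)           # pool over neighbouring n-grams
        x = x.transpose(1, 2)                        # (batch, seq_len // 2, n_filters)
        _, (h_n, _) = self.rnn(x)
        return h_n[-1]                               # (batch, hidden_dim) sentence vector


def word_similarity_matrix(word_vecs_a, word_vecs_b):
    """Fine-grained input: cosine similarity of every word in sentence A against
    every word in sentence B. word_vecs_*: (sentence_length, emb_dim)."""
    a = F.normalize(word_vecs_a, dim=-1)
    b = F.normalize(word_vecs_b, dim=-1)
    # A separate 2-D CNN is then applied to this matrix to pick out informative
    # word-word correspondence patterns between the two sentences.
    return a @ b.t()                                 # (len_a, len_b) similarity matrix
```

In the full model, one such sentence vector is produced for each sentence of the pair, while the similarity matrix feeds the fine-grained matching component described above.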

In this paper, we show how the proposed model for paraphrase detection can be enhanced by employing an extra set of statistical features extracted from the input text. To demonstrate its robustness, we evaluate the proposed approach and compare it with state-of-the-art models using two different datasets, covering both noisy user-generated texts, i.e., the SemEval 2015 benchmark Twitter dataset, and clean texts, i.e., the Microsoft Research Paraphrase Corpus (MSRP).
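As a minimal illustration of this enhancement, the snippet below computes two simple pair-level statistics (Jaccard word overlap and relative length difference) and concatenates them with the representation learned by the deep architecture. These particular features and the names used are assumptions of the sketch, not necessarily the exact feature set used in the paper.

```python
import torch

def statistical_features(tokens_a, tokens_b):
    """Hand-crafted pair statistics; the concrete feature set in the paper may differ."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    overlap = len(set_a & set_b) / max(len(set_a | set_b), 1)   # Jaccard word overlap
    len_diff = abs(len(tokens_a) - len(tokens_b)) / max(len(tokens_a), len(tokens_b), 1)
    return torch.tensor([overlap, len_diff], dtype=torch.float)

# The statistical features are simply concatenated with the vector produced by the
# deep architecture before the final classification layer (hypothetical names):
# enriched = torch.cat([deep_pair_vector, statistical_features(tokens_a, tokens_b)], dim=-1)
```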

The main contributions of this work can be summarized as follows:

1. We propose a novel deep neural network architecture leveraging coarse-grained sentence-level features and fine-grained word-level features for detecting paraphrases in noisy short texts from Twitter. The model combines sentence-level and word-level semantic similarity information such that it can capture semantic information at each level: when the text is grammatically irregular or very short, the word-level similarity model provides useful information, while the semantic representation of the sentence provides useful information otherwise. In this way, the two model components complement each other and yield effective overall performance.

2. We show how the proposed pair-wise similarity model can be used to extract word-level semantic information, and demonstrate its usefulness in the paraphrase detection task.

3. We propose a method for combining statistical textual features with the features learned by the deep architecture.

4. We present an extensive comparative study of the paraphrase detection problem.

The rest of the paper is organized as follows: In Section 2, we formally define the problem. In Section 3, we discuss related work concerning paraphrase detection. In Section 4, we motivate our work and present our proposed solution in detail. In Section 5, we describe the experimental setup. In Section 6, we evaluate the approach and discuss the results. Finally, in Section 7, we conclude the paper and outline plans for future research.

Section snippets

Problem statement and goals

Let S1 and S2 be two sentences such that S1 ≠ S2. S1 and S2 are said to be paraphrases if they convey the same meaning and are semantically equivalent. Now, assume that we have a collection of N annotated sentence pairs (S1i, S2i) with annotations ki, for i = 1, 2, …, N. For a given i, ki indicates whether the ith sentence pair is a paraphrase or a non-paraphrase. The problem addressed in this paper is to develop a model, which can reliably label a previously unseen sentence pair as paraphrased
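Stated compactly, with a standard supervised-learning objective added here only for illustration (the paper's exact training objective is described later):

```latex
\[
  f_\theta : (S_1, S_2) \mapsto \{0, 1\}, \qquad
  \hat{\theta} \;=\; \arg\min_{\theta} \sum_{i=1}^{N}
     \mathcal{L}\bigl(f_\theta(S_1^{i}, S_2^{i}),\, k_i\bigr),
\]
where $k_i = 1$ if the $i$-th pair is a paraphrase and $k_i = 0$ otherwise, and
$\mathcal{L}$ is a binary classification loss such as cross-entropy.
```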

Related work

The use of deep neural networks for natural language processing (NLP) has increased considerably over recent years. Most of the previous work on paraphrase detection has focused on features such as n-gram overlap features (Madnani, Tetreault, & Chodorow, 2012), syntax features (Das & Smith, 2009), linguistic features (Sahi & Gupta, 2017; Vani & Gupta, 2018), Wikipedia-based semantic networks (Jiang, Bai, Zhang, & Hu, 2017), knowledge graphs (Franco-Salvador, Rosso, & Montes-y Gómez, 2016) and

DeepParaphrase architecture

We propose a deep learning-based approach for detecting paraphrases in Tweets, with the architecture depicted in Fig. 1. We first convert each sentence in a pair into a semantic representative vector using a CNN and an RNN. Then, a semantic pair-level vector is computed by taking the element-wise difference of the two sentence representation vectors. The resulting difference is the discriminating representative vector of the sentence pair, which is used as the feature vector for
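A minimal sketch of this pair-level step, assuming the two sentence vectors come from an encoder such as the one sketched in the introduction; whether the signed or the absolute difference is used is an assumption of the sketch:

```python
import torch

def pair_level_vector(sent_vec_a: torch.Tensor, sent_vec_b: torch.Tensor) -> torch.Tensor:
    """Discriminating pair representation from two sentence vectors via their
    element-wise difference (absolute value taken here for symmetry; the paper
    may use the signed difference)."""
    return torch.abs(sent_vec_a - sent_vec_b)
```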

Experimental setup

Before explaining and discussing the evaluation of our proposed method for paraphrase identification, and comparing it against the state-of-the-art approaches, we first describe how our experiments have been set up, including the datasets and performance measures that we have used.

In our comparative study, we used the results that were originally reported by the authors of the papers in which the baseline methods were proposed. This means that we did not re-run the experiments, assuming

Results and discussion

In this section, we present the results obtained on both of the datasets introduced in Section 5.1.

Conclusions

In this paper, we introduced a robust and generic paraphrase detection model based on a deep neural network, which performs well both on user-generated noisy short texts, such as Tweets, and on high-quality clean texts. We proposed a pair-wise word similarity model, which can capture fine-grained semantic correspondence information between each pair of words in the given sentences. In addition, we used a hybrid deep neural network that extracts coarse-grained information by developing

Acknowledgements

This work was carried out during the tenure of an ERCIM ‘Alain Bensoussan Fellowship Programme’ by the first author of the paper. The work has mainly been carried out at the Telenor-NTNU AI-Lab, Norwegian University of Science and Technology (NTNU), Norway.

References (48)

  • D. Das et al.

    Paraphrase identification as probabilistic quasi-synchronous recognition

    Proceedings of the joint conference of the 47th annual meeting of the ACL and the 4th international joint conference on natural language processing of the AFNLP (ACL 2009)

    (2009)
  • K. Dey et al.

    A paraphrase and semantic similarity detection system for user generated short-text content on microblogs

    Proceedings of the 26th international conference on computational linguistics (COLING 2016)

    (2016)
  • B. Dolan et al.

    Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources

    Proceedings of the 20th international conference on computational linguistics (COLING 2004)

    (2004)
  • A. Eyecioglu et al.

    Twitter paraphrase identification with simple overlap features and SVMs

    Proceedings of the 9th international workshop on semantic evaluation (SemEval@NAACL-HLT 2015)

    (2015)
  • S. Filice et al.

    Learning to recognize ancillary information for automatic paraphrase identification

    Proceedings of the North American chapter of the association for computational linguistics: Human language technologies (NAACL-HLT 2016)

    (2016)
  • W. Guo et al.

    Modeling sentences in the latent space

    Proceedings of the 50th annual meeting of the association for computational linguistics (ACL 2012)

    (2012)
  • M. Heilman et al.

    Tree edit models for recognizing textual entailments, paraphrases, and answers to questions

    Human language technologies: The 2010 annual conference of the North American chapter of the association for computational linguistics (HLT 2010)

    (2010)
  • S. Hochreiter et al.

    Long short-term memory

    Neural Computation

    (1997)
  • B. Hu et al.

    Convolutional neural network architectures for matching natural language sentences

    Advances in neural information processing systems (NIPS 2014)

    (2014)
  • J. Huang et al.

    Multi-granularity neural sentence model for measuring short text similarity

    Proceedings of the 22nd international conference on database systems for advanced applications (DASFAA 2017)

    (2017)
  • Y. Ji et al.

    Discriminative improvements to distributional sentence similarity

    Proceedings of the 2013 conference on empirical methods in natural language processing (EMNLP 2013)

    (2013)
  • T. Kajiwara et al.

    An iterative approach for the global estimation of sentence similarity

    PLoS ONE

    (2017)
  • M. Karan et al.

    TKLBLIIR: detecting Twitter paraphrases with TweetingJay

    Proceedings of the 9th international workshop on semantic evaluation (SemEval@NAACL-HLT 2015)

    (2015)
  • T. Kenter et al.

    Short text similarity with word embeddings

    Proceedings of the 24th ACM international on conference on information and knowledge management

    (2015)