1 Introduction

The volume of web content has increased steeply over the years, with websites, blog posts, social media posts, and news publications flooding the Internet with textual material. Information Retrieval (IR) and Information Extraction (IE) techniques are useful tools to search and extract content from the Internet according to a user's needs.

Text summarization techniques are mainly of two types, abstractive and extractive [1, 2]. Abstractive summarization presents the document's important material in a different and condensed fashion through an intricate procedure [3]. In contrast, the extractive approach identifies and picks key portions of the text for the summary [4,5,6]. Extractive summarization is mainly of two types: (i) generic [7] and (ii) query focused [8]. Generic summarization considers the general topics of the document, whereas query focused summarization (QFS) [9] treats the query as the main topic and extracts text relevant to it. The vector space model (VSM) is an IR method that represents documents/sentences as vectors in a vector space and evaluates their proximity to decide their relationship. VSM ranks a set of documents/sentences [10] based on their proximity to a piece of contextual text called the query. The centroid based summarization [11, 12] method clusters documents and finds a centroid representation for each class. The centroids are then used to classify forthcoming documents.

Different news readers may be concerned about different impacts of COVID-19 such as infection, death, economy, livelihood, and education. This work finds patterns in the words of the COVID-19 news sentences chosen by an individual reader. Using these patterns, it can classify the sentences of forthcoming reports to extract a summary for that individual. In this work, digitally published news articles on COVID-19 are collected from “The Hindu” daily. The sentences of those articles are labeled by three volunteers; each sentence is labeled with either 1 (positive) or 0 (negative). A centroid or query is formed from the frequencies of the constituent words in the train set. Finally, VSM is applied to filter sentences depending on their cosine similarity to the query in the vector space. The methodology is elaborated in Sect. 3 and a flow diagram is shown in Fig. 1. The contributions of this paper are the following:

  • Prepared a gold standard dataset by processing and labeling 4988 unique sentences from COVID-19 news for the experiments.

  • Proposed a technique to get the centroid or query from the positive class sentences.

  • Compared three approaches for generating suitable sentence embeddings.

  • Proposed a technique to find a threshold parameter that defines the decision boundary to classify the sentences.

The rest of the paper is organized as follows: a review of the literature is presented in Sect. 2. In Sect. 3 the methodologies involving the acquisition, preprocessing, and labeling of the data are discussed; it also introduces the techniques for query formation and QFS. Experimental results are reported in Sect. 4. A discussion of the obtained results is presented in Sect. 5. Finally, Sect. 6 concludes the paper.

Fig. 1 Flow diagram of the work. Module-1: finding the centroid/query from labeled sentences; Module-2: sentence classification using the centroid

2 Related works

There are existing works where researchers have reported methods to find a succinct form of documents from various sources by extracting salient sentences [13]. The famous centroid based multi-document summarization technique introduced in [11] has been one of the promising techniques in this area. Another centroid based technique by Radev et al. [12] presents a sentence ranking and extraction based summarization. The news-summarization tool NewsInEssence [14] used the techniques in [11, 12] for document clustering and sentence extraction to generate QFS. An interesting centroid based approach [15] used topic-wise key-phrases to classify sentences and documents. The popular word2vec word-embedding technique, which considers co-occurrences of words, is used to summarize documents in [16]. The centroid based model in [17] followed the law of universal gravitation to classify a document. A generic summarization technique is presented in [18] that ranks sentences depending on linguistic and statistical features of the words/terms. The Drag Pushing and Large Margin Drag Pushing techniques [19, 20] provide a solution to the common inductive bias or model misfit with a normalized centroid to classify sentences. A similar classification approach, which works on adjustments of the model, is presented in [21]. The Generalized Cluster Centroid based Classifier (GCCC) in [22] combines the powers of the K-nearest-neighbor and Rocchio classifiers for text categorization. The Border-Instance-based Iteratively Adjusted Centroid Classifier in [23] focuses on the border instances while computing the class centroids. The multiclass classifier in [24] relies on inter-class and inner-class term distributions to compute the term weights used to classify documents. An unsupervised extractive multi-document summarization method has been presented in [25] that finds relevant sentences by comparing sentence embeddings with the document centroid. Researchers have also explored hybrid techniques where the centroid based technique is combined with Random Indexing [26] and k-means clustering [27] for multi-document summarization.

The popular sentence extraction based approaches other than centroid based ones are discussed below. A Support Vector Regression (SVR) based model in [28] finds sentence scores depending on query features for extraction. The method introduced in [29] scores sentences with probabilistic models of relevance and coverage for ranking and extraction. The authors in [8] reported a manifold-ranking based approach for QFS. A similar theme clustering and sentence ranking based approach has been introduced in [30]. The method in [31] followed a combination of different sentence embeddings and selector functions for document summarization. The authors in [32] selected salient sentences by distinguishing the discriminative topics covered by each sentence. A sentence embedding based on the embeddings of POS tags, bi-grams, and tri-grams is introduced in [33]. A graph based microblog clustering algorithm for the selection of important sentences is presented in [34]. A similar graph based approach for QFS has been presented in [35]. A Bayesian topic modeling based supervised approach has been proposed in [36].

The proposed method presents a statistical approach involving the frequent words of both the positive and the negative classes to form the positive class centroid, which has not been reported in the literature. Moreover, the supervised technique designed to calculate the sentence classification threshold parameter is new to this method.

Table 1 First 10 labeled sentences; column 2 holds the sentences, columns 3, 4, and 5 the labels given by the annotators, and column 6 (final label) the majority vote of the previous three columns

3 Methodology

This section elaborates the method in a step-wise fashion. The first step of the proposed approach prepares the data required for the experiment. In the subsequent steps, the query is formed from the train set of labeled sentences and a vector is generated for each sentence. In the final step, the similarity between each sentence and the query is computed, followed by setting the similarity threshold parameter to classify the sentences.

3.1 Dataset building

Generally, for performance assessment, experiments are carried out on a standard dataset to compare the outcome with the state-of-the-art. However, there are cases where a dataset suitable for a specific experiment is unavailable and it becomes necessary to prepare one for the current requirement [37]. The dataset in this work [38] is prepared by collecting 3308 documents from “The Hindu” archive from 23rd January to 24th April 2020. The title and main text of each news item are combined to make a sample. During preprocessing, the URLs, punctuation (except ‘.’), digits, contracted words (like isn’t, we’ve, they’ll), apostrophes, and stop-words (like I, we, you, he, they) are removed. Finally, word stemming is performed using the Porter stemmer available in the nltk library [39]. Out of the 3308 collected documents containing 57668 sentences, the first 300 documents, having 4988 unique sentences, are labeled. The whole corpus and the labeled subset have 24868 and 6189 unique words respectively. The labeling is done with 1 (relevant) or 0 (irrelevant) (Table 1) for the positive and negative classes respectively. The whole corpus of 3308 documents is used to train the word2vec [40] model that generates word vectors for the WVA (Sect. 3.3.2) and the auto-encoder (Sect. 3.3.3) approaches.
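A minimal sketch of this preprocessing pipeline is given below; the exact regular expressions and the stop-word list are assumptions, as the paper only names the removed elements.

```python
# Sketch of the preprocessing described above (assumed regexes, nltk stop-words).
# nltk.download('stopwords') may be required on first use.
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(text: str) -> str:
    text = re.sub(r"https?://\S+", " ", text)    # drop URLs
    text = re.sub(r"\w+'\w+", " ", text)         # drop contracted words (isn't, we've)
    text = re.sub(r"[^a-zA-Z. ]", " ", text)     # drop digits and punctuation except '.'
    tokens = [t for t in text.lower().split() if t not in stop_words]
    return " ".join(stemmer.stem(t) for t in tokens)

# Example: preprocess("He said 5 patients weren't infected in Wuhan.")
# -> 'said patient infect wuhan.'
```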

Three research scholars (acknowledged in the Acknowledgement section) volunteered to annotate the sentences as label1, label2, and label3 respectively (Table 1). They labeled a sentence with 1 if it contains any of the information below and with 0 otherwise.

  • Spread of infection in a geographical area.

  • Death due to COVID-19 pneumonia and respiratory failure.

  • A number of persons suspected to be infected with COVID-19.

  • A number of persons infected or reinfected with COVID-19.

  • A number of persons admitted to medical center due to COVID-19 symptoms.

  • A number of persons suffering from pneumonia.

  • Some of the affected persons improving or being cured.

  • Health related updates from governments or WHO on COVID-19.

Among the 4988 sentences, 1308 are labeled ‘1’ and 3680 are labeled ‘0’ in the final label. Some sentences in the corpus are difficult to categorize due to the priority and amount of information they carry; hence, the annotators have mixed opinions on those sentences. The final label of each sentence is found by taking the majority vote among the three given labels. The following are examples of such sentences:

  1. Roadmap for research No specific treatment or vaccine against the virus exists and World Health Organization has repeatedly urged countries to share data in order to further research into the disease.

  2. World Health Organization had earlier given the virus the temporary name of 2019 nCoV acute respiratory disease and China National Health Commission this week said it was temporarily calling it novel coronavirus pneumonia or NCP.

  3. The virus has killed more than 1000 people infected over 42000 and reached some 25 countries with the World Health Organization declaring a global health emergency.

The first example is considered relevant as two of the three annotators (1 and 2) labeled it ‘1’. The second example is considered irrelevant as label2 and label3 are ‘0’. The third example contains some important quantities and events, so it is labeled ‘1’ by all three annotators. The first annotator labeled 1238 (label1), the second annotator 1337 (label2), and the third annotator 1349 (label3) sentences as ‘1’. The given labels in label1, label2, and label3 match the final label (whether ‘1’ or ‘0’) in 4814, 4919, and 4903 cases respectively. There are 1148 and 3516 cases where the label is unanimously 1 and 0 respectively, and 325 cases where the opinions differed. The inter-rater agreement score, calculated using the Fleiss’ Kappa [41] technique, is 0.8881.
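A sketch of the final-label computation and the agreement score is given below, assuming the three annotators' labels are available as a \(4988 \times 3\) array of 0/1 values (the file name is hypothetical).

```python
# Majority vote and Fleiss' kappa over the annotator labels (assumed layout).
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

labels = np.loadtxt("labels.csv", delimiter=",", dtype=int)  # hypothetical file, shape (4988, 3)

# The final label of each sentence is the majority vote of the three annotators.
final_label = (labels.sum(axis=1) >= 2).astype(int)

# Fleiss' kappa over the sentences x categories count table.
table, _ = aggregate_raters(labels)   # per-sentence counts of each category
print(fleiss_kappa(table))            # reported as 0.8881 in the paper
```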

3.2 Module-1: forming the query

The frequency of a word determines its significance in an article, and the importance of a sentence depends on the relative positions of words of different significance in it [42]. In the Term Frequency (TF) technique [43], the frequency of query terms decides the importance of a sentence: the higher the frequency of query terms, the higher the weightage. However, in this case, which sentences belong to the positive class and which to the negative class is already known. The task here is to collect the significant words that can represent the positive class. Since frequently occurring words in a document can play a significant role in classifying it [44], the frequently used words are collected from both classes in the train set. The frequent words of the positive class are collected, and those common to both classes are dropped from the collected set. The remaining frequent words form the query, such that the query is close to the positive class and distant from the negative class. To restrict the length of the query to the length of the longest sentence in the dataset (say max_len), only the top max_len frequent words are picked from each class. The 5 queries that come from the 5 train sets of the 5-fold cross-validation are given below.

  1. Death confirm first posit number south korea total die toll februari quarantin wuhan tuesday year neg outsid read citi monday

  2. Death confirm posit first number die neg itali toll total hubei wuhan year februari tuesday south outsid organ read

  3. death confirm posit number first die toll wuhan total hubei south tuesday neg februari itali korea year provinc read monday organ

  4. Death confirm first number posit die wuhan toll februari total iran south korea hubei itali neg provinc organ ship outsid novel

  5. Death confirm first number posit die toll total wuhan hubei south korea itali provinc neg organ ship person outsid sunday

In this experiment, it is assumed that news sentences having more words in common with the formed query, i.e., a lower semantic distance to it, have a higher chance of carrying significant information. The set of words in the obtained query is as per Eq. 1.

$$\begin{aligned} query = W(P) - W(N), \end{aligned}$$
(1)

where W(P) and W(N) are sets of top max_len frequent words (Fig. 2) in positive and negative classes respectively.
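A minimal sketch of Eq. 1 is given below, assuming pos_sents and neg_sents are the lists of preprocessed train-set sentences in the positive and negative classes.

```python
# Query formation per Eq. 1: top max_len frequent words of the positive class
# minus those also among the top max_len frequent words of the negative class.
from collections import Counter

def form_query(pos_sents, neg_sents, max_len):
    def top_words(sents):
        counts = Counter(w for s in sents for w in s.split())
        return {w for w, _ in counts.most_common(max_len)}
    return top_words(pos_sents) - top_words(neg_sents)   # W(P) - W(N)
```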

Fig. 2 Most frequent words in (a) the positive and (b) the negative set of sentences (to avoid congestion only 40 words are shown)

3.3 Module-2: sentence classification

In this stage, each sentence is represented by a vector based on its constituent words. The uniqueness of a sentence vector keeps it at its appropriate position in the vector space in relation to others, depending on the lexical and semantic properties of its words. As per [45], learning a good representation of the data is very important for designing an efficient classifier. The approaches in [46, 47] introduced standard deviation and relevance frequency based term weighting techniques depending on the distribution of terms in a corpus. The most popular neural network based models for learning word embeddings are word2vec and Global Vectors (GloVe) (compared in [48]). In this work, TF-IDF and word2vec are preferred for the generation of word embeddings.

The following three approaches are investigated in this work for sentence vector generation.

3.3.1 Term frequency—inverse document frequency (TF-IDF) based approach

TF-IDF [12] is a statistical measure computed as the product of two statistics, Term Frequency (TF) and Inverse Document Frequency (IDF) [49]. It calculates the weightage score of a word in a document depending on its frequency in that document and its occurrence in all other documents. In this work, a sentence is considered a document and the embedding of a sentence is generated using the TF-IDF scores of its words. The TF component ranks the candidate sentences such that the higher the frequency of the query terms, the higher the rank. The normalized term frequency (TF) is calculated with Eq. 2,

$$\begin{aligned} TF = \frac{TF_{s,t}}{n}, \end{aligned}$$
(2)

where \(TF_{s,t}\) is the frequency of term t in a sentence s and n is the number of words in s.

The IDF component gives higher weights to rare query terms than to frequent ones. If the number of sentences containing t is \(sf_t\) and N is the total number of sentences, then the IDF is calculated with Eq. 3,

$$\begin{aligned} IDF = \log \frac{N}{sf_t}, \end{aligned}$$
(3)

and the TF-IDF vector of sentence s is calculated with Eq. 4.

$$\begin{aligned} \Bigg [ \frac{TF_{s,1}}{n}\times \log \frac{N}{sf_1}, \frac{TF_{s,2}}{n}\times \log \frac{N}{sf_2}, ..., \frac{TF_{s,W}}{n}\times \log \frac{N}{sf_W} \Bigg ], \end{aligned}$$
(4)

where W is the total number of words in the vocabulary.
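A minimal sketch of these computations, treating each preprocessed sentence as a document as described above:

```python
# TF-IDF sentence vectors per Eqs. 2-4: for each word in the vocabulary,
# (frequency of the word in the sentence / sentence length) x log(N / sentence frequency).
import math
from collections import Counter

def tfidf_vectors(sentences):
    vocab = sorted({w for s in sentences for w in s.split()})
    N = len(sentences)
    sf = Counter(w for s in sentences for w in set(s.split()))  # sentence frequency of each word
    vectors = []
    for s in sentences:
        words = s.split()
        tf, n = Counter(words), max(len(words), 1)
        vectors.append([tf[w] / n * math.log(N / sf[w]) for w in vocab])
    return vectors, vocab
```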

3.3.2 Word vector averaging (WVA) based approach

In this approach [50], all the word vectors of a sentence are averaged to get the sentence vector. As per Le [51], it may be unsuitable where word ordering is important; it is suitable in this case, as this work concerns only the presence of specific words in a sentence. Suppose there are n words in a sentence and each one is embedded as a vector of size m. This forms an \(n \times m\) matrix, where each row represents a word. The average of the values in each of the m columns then yields the sentence vector of size m. Let a sentence s have n words \((W_1, W_2, ... , W_n)\) and let the word vector \(W_i \in \mathbb {R}^m\) be represented by \((x_{(i,1)}, x_{(i,2)}, ..., x_{(i,m)})\); the vector for the sentence s is given in Eq. 5

$$\begin{aligned} \Bigg ( \frac{1}{n}\sum _{i=1}^n x_{(i,1)}, \frac{1}{n}\sum _{i=1}^n x_{(i,2)}, ..., \frac{1}{n}\sum _{i=1}^n x_{(i,m)}\Bigg ). \end{aligned}$$
(5)
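A minimal sketch of Eq. 5 using a trained gensim word2vec model; skipping out-of-vocabulary words is an assumption the paper does not spell out.

```python
# WVA: the sentence vector is the element-wise mean of its word vectors.
import numpy as np

def sentence_vector(sentence, w2v):
    vecs = [w2v.wv[w] for w in sentence.split() if w in w2v.wv]  # skip unknown words
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)
```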

3.3.3 Auto-encoder based approach

The auto-encoder [52] is an unsupervised learning model implemented with neural networks. Its architecture has two sequential parts, the encoder and the decoder [53]. In the training phase, the input to this network is encoded into a low dimensional latent space, which the decoder reconstructs at the output (Fig. 3). Once the reconstruction error is minimized, the encoder part may be used to compress an input.

It is worth mentioning that the popular models for language processing, specifically for sentence embedding and semantic understanding tasks, use RNN based units [54, 55] that capture not only the constituent words but also their sequence in a piece of text (document/paragraph/sentence), which is not required in this problem. This problem demands inspection of the exact or similar words in a sentence for a possible match with the query.

Hence, the auto-encoder model is implemented here using Convolutional Neural Network (CNN) layers. The CNN layers, along with Maxpooling layers, help compress the input while preserving the important features. The auto-encoder in Fig. 3 takes a sentence as input, represented as \(X \in \mathbb {R}^{100 \times 100}\). In the first stage, the input matrix passes through the encoder part, which consists of 4 Convolutional 2D layers with relu as the activation function, interwoven with 3 Maxpooling layers condensing the input data into a \(25 \times 25\) matrix. Finally, a Flatten layer followed by a Dense layer with 100 neurons and relu as the activation function produces the compressed low dimensional latent space of the input sentence. In the second stage, the latent space goes into the decoder part, starting with a Dense layer with relu as the activation function followed by a Reshape layer producing a \(25 \times 25\) matrix. This is followed by 3 Convolutional 2D layers with relu as the activation function, interwoven with 2 Up-sampling layers, and ends with Flatten, Dense (sigmoid as activation function), and Reshape layers producing the regenerated matrix \(Y \in \mathbb {R}^{100 \times 100}\) of the sentence.
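For concreteness, a minimal Keras sketch of such an auto-encoder is given below. It assumes each sentence is packed into a \(100 \times 100\) matrix (e.g., up to 100 words, each a 100-dimensional word2vec vector, zero-padded); the filter counts, the use of two \(2 \times 2\) pooling stages (so the input condenses to the stated \(25 \times 25\)), and the convolutional output in place of the final Flatten/Dense/Reshape are assumptions, not the authors' exact configuration.

```python
# A minimal sketch of the CNN auto-encoder, not the authors' exact model.
from tensorflow.keras import layers, models

inp = layers.Input(shape=(100, 100, 1))
x = layers.Conv2D(16, 3, padding="same", activation="relu")(inp)
x = layers.MaxPooling2D(2)(x)                       # 100x100 -> 50x50
x = layers.Conv2D(8, 3, padding="same", activation="relu")(x)
x = layers.MaxPooling2D(2)(x)                       # 50x50 -> 25x25
x = layers.Conv2D(8, 3, padding="same", activation="relu")(x)
latent = layers.Dense(100, activation="relu")(layers.Flatten()(x))  # 100-d code

y = layers.Dense(25 * 25, activation="relu")(latent)
y = layers.Reshape((25, 25, 1))(y)
y = layers.Conv2D(8, 3, padding="same", activation="relu")(y)
y = layers.UpSampling2D(2)(y)                       # 25x25 -> 50x50
y = layers.Conv2D(8, 3, padding="same", activation="relu")(y)
y = layers.UpSampling2D(2)(y)                       # 50x50 -> 100x100
out = layers.Conv2D(1, 3, padding="same", activation="sigmoid")(y)

autoencoder = models.Model(inp, out)
encoder = models.Model(inp, latent)   # used to embed sentences after training
autoencoder.compile(optimizer="adam", loss="mse")
```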

Fig. 3 Details of the auto-encoder based model

Fig. 4 Plotting the classified train sentence vectors (Tables 2, 3, 4, step5) based on (a) actual labels, (b) predicted labels case1, and (c) predicted labels case2 (the query is a red square; positive and negative data points are blue and green diamonds respectively); for convenience an arrow-head indicates the query (Color figure online)

Fig. 5 Plotting the classified test sentence vectors (Tables 2, 3, 4, step5) based on (a) actual labels, (b) predicted labels case1, and (c) predicted labels case2 (the query is a red square; positive and negative data points are blue and green diamonds respectively); for convenience an arrow-head indicates the query (Color figure online)

3.3.4 Determining the cosine similarity threshold for classification

The key parameter defining the decision boundary for sentence classification is the threshold on the similarity between the query/centroid vector and each sentence vector. The best value of this parameter is learned in a supervised way: the outcome (F1 score) is calculated for a range of candidate values and the value with the best outcome is selected.

At first, the news documents are divided into 80% (240 documents) for training and 20% (60 documents) for testing. Then, the cosine similarity between the query and each sentence in the train set is determined, and the threshold value is initialized at threshold_init. The sentences having a similarity value equal to or above the threshold are labeled 1 and the rest are labeled 0. Next, the assigned labels are compared with the ground truth and the F1 score is calculated. This process is repeated, incrementing the threshold value by 0.01 each time, until it reaches threshold_limit. Finally, the threshold value responsible for the highest F1 score is taken as the parameter used to classify sentences. The best threshold value is determined for the two cases below:

  • case1: The true and predicted labels of all the sentences across all the train set documents are compared to calculate the F1 score.

  • case2: The true and predicted labels of the sentences in each document are compared to calculate the F1 scores for all 240 train set documents. Finally, the mean of the 240 F1 scores is considered to determine the best threshold value.

The design of this sentence classifier may be considered unique in its category. The identifiers used in Algorithm 1, which calculates the case1 and case2 best thresholds, are explained below.

Let there be N sentences across all the M documents. V is the set of N sentence vectors with \(v_i \in V\), \(1 \le i \le N\). L and PL are the sets of ground truth and predicted sentence labels respectively. The algorithm takes M, N, and L as inputs. D is the set of similarity values between the query and each sentence, with \(d_i \in D\), \(1 \le i \le N\). T_all_vs_F1 and T_doc_vs_F1 are the sets of threshold and corresponding F1 score pairs for case1 and case2 respectively. The algorithm outputs \(\theta _1\) and \(\theta _2\), the best thresholds for case1 and case2. The initial threshold threshold_init = 0.6 for the WVA and auto-encoder approaches and threshold_init = 0.01 for the TF-IDF approach; threshold_limit = 0.9 for the WVA and auto-encoder approaches and threshold_limit = 0.03 for the TF-IDF based approach. The function threshold_for_max_F1() returns the threshold with the highest F1 score from the lists T_all_vs_F1 and T_doc_vs_F1. The measures Precision (Eq. 6), Recall (Eq. 7), F1_score (Eq. 8), and cosine_similarity (Eq. 9) are calculated as follows:

$$\begin{aligned} Precision = \frac{TP}{TP + FP}, \end{aligned}$$
(6)
$$\begin{aligned} Recall = \frac{TP}{TP + FN}, \end{aligned}$$
(7)
$$\begin{aligned} F1\_score = 2 \times \frac{Precision \times Recall}{Precision + Recall}, \end{aligned}$$
(8)

where TP, TN, FP, and FN denote True Positive, True Negative, False Positive, and False Negative, respectively.

$$\begin{aligned} cosine\_similarity(\overrightarrow{X_1}, \overrightarrow{X_2}) = \frac{ \overrightarrow{X_1} \cdot \overrightarrow{X_2}}{ \sqrt{\Sigma {x_{1i}^2}} \times \sqrt{\Sigma x_{2i}^2} }, \end{aligned}$$
(9)

where \(x_{1i} \in \overrightarrow{X_1}\), \(x_{2i} \in \overrightarrow{X_2}\), and \(1 \le i \le vector\;length\).

Algorithm 1 Calculation of the best classification thresholds \(\theta _1\) (case1) and \(\theta _2\) (case2)
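A minimal sketch of the case1 sweep in Algorithm 1 is given below, assuming D holds the query-sentence cosine similarities and L the ground-truth labels; the case2 variant would instead average the per-document F1 scores at each threshold.

```python
# Sweep the threshold from threshold_init to threshold_limit in steps of 0.01
# and return the value with the highest F1 score (threshold_for_max_F1()).
import numpy as np
from sklearn.metrics import f1_score

def best_threshold_case1(D, L, threshold_init, threshold_limit):
    T_all_vs_F1 = []
    for t in np.arange(threshold_init, threshold_limit + 1e-9, 0.01):
        PL = (np.asarray(D) >= t).astype(int)       # predicted labels at threshold t
        T_all_vs_F1.append((t, f1_score(L, PL)))
    return max(T_all_vs_F1, key=lambda pair: pair[1])[0]
```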
Fig. 6 Threshold versus F1 score plot considering (A) case1 and (B) case2, using the three approaches (Tables 2, 3, 4, step5)

3.4 Extractive summarization

Finally, the trained threshold parameter (Sect. 3.3.4) is used to identify the important sentences in a document. The sentences are then ranked by their similarity to the centroid, and the top 10% are selected as the summary of the document.
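A minimal sketch of this extraction step, assuming sims holds the cosine similarities (Eq. 9) of a document's sentences to the query and theta is the learned threshold; taking the 10% over all sentences of the document (rather than over the retained ones) is an assumption here.

```python
# Keep sentences classified as relevant, rank by similarity, return top 10%.
import math

def summarize(sentences, sims, theta):
    kept = [(sim, s) for s, sim in zip(sentences, sims) if sim >= theta]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    k = max(1, math.ceil(0.10 * len(sentences)))
    return [s for _, s in kept[:k]]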

4 Experimental results

In this work, three sentence representation approaches (Sects. 3.3.1, 3.3.2, 3.3.3) are examined and the results are compared. Then, a threshold parameter is determined that classifies a sentence based on its cosine similarity with the positive class centroid (Sect. 3.3.4). The cosine similarity values for the train set sentences range from 0.0 to 0.40 for the TF-IDF, from 0.11 to 0.97 for the WVA, and from 0.36 to 0.92 for the auto-encoder approach (found in step5 of the 5-fold cross-validation). The 3-dimensional (3-D) views of the train and test set sentence vectors are shown in Figs. 4 and 5 respectively. The dimensions of the vectors are reduced using Principal Component Analysis (PCA) to plot the data. The 3-D plots give a good perception of the concept: it can be easily observed from the leftmost plots (marked A) of each row that the query (red square) is closer to the positively labeled sentences (blue diamonds) than to the negatively labeled sentences (green diamonds). The classification can be visually perceived by comparing the figures for case1 (B) and case2 (C) in each row with the leftmost (A) figure (data with original labels). The classification boundary drawn by the similarity threshold can also be compared between case1 and case2 for all the sentence embedding techniques. The best threshold of the cosine similarities is obtained by comparing the F1 scores for different threshold values and picking the best one. The best threshold values found after this process are 0.75 (case1) and 0.69 (case2) for the WVA approach, 0.65 (case1) and 0.62 (case2) for the auto-encoder approach, and 0.01 (both cases) for the TF-IDF approach. The threshold values and corresponding F1 scores are depicted in Fig. 6A for case1 and Fig. 6B for case2. The accuracy measure (Eq. 10) is also used to find the share of correct predictions by this method [56].

$$\begin{aligned} Accuracy = \frac{TP + TN}{TP+TN+FP+FN}. \end{aligned}$$
(10)

The k-fold cross-validation \((k = 5)\) is used here to evaluate the sentence embedding approaches. At first, the 300 labeled documents are divided into five folds of 60 documents each. In each step of the five-step process, one fold is taken as the test set and the rest as the train set. The outcomes of all the steps are averaged to get the final result. The train and test sets contain 3826 and 1162 sentences in the first step, 4035 and 953 in the second, 4132 and 856 in the third, 4100 and 888 in the fourth, and 3859 and 1129 in the fifth. In the mean result, the WVA approach returns 0.80 and 0.74 accuracy scores on train and test data respectively, and 0.64 as the F1 score for both, in case1. It returns 0.77 and 0.75 as the accuracy on train and test data respectively, and 0.47 as the F1 score for both, in case2 (Table 2). The auto-encoder based approach shows 0.78 and 0.77 as accuracy and 0.64 and 0.63 as F1 scores for train and test data respectively, in case1. It shows 0.50 and 0.51 as F1 scores and 0.74 and 0.72 as accuracy for train and test data respectively, in case2 (Table 3). The TF-IDF based approach returns 0.62 and 0.61 as F1 scores and 0.78 as accuracy for both train and test data in case1. It shows 0.44 and 0.46 as F1 scores and 0.74 and 0.75 as accuracy for train and test data respectively, in case2 (Table 4). The data presented in this section and plotted in Figs. 4, 5, and 6 are from step5 of the fivefold cross-validation process.
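A sketch of this document-level split, assuming documents is the in-order list of the 300 labeled documents; whether the folds were shuffled is not stated, so sequential folds are assumed here.

```python
# Document-level 5-fold cross-validation: folds are made over documents, so
# the per-step sentence counts differ, as reported above.
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=False)
for step, (train_idx, test_idx) in enumerate(kf.split(documents), start=1):
    train_docs = [documents[i] for i in train_idx]
    test_docs = [documents[i] for i in test_idx]
    # ... form the query from train_docs, learn the threshold, evaluate on test_docs
```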

Table 2 Outcome of WVA approach
Table 3 Outcome of auto-encoder based approach
Table 4 Outcome of TF-IDF based approach

5 Discussion

The data in this work are not highly sensitive to false-positive and false-negative predictions, so the accuracy metric (Eq. 10) gives a good performance measurement of the model. The dataset is imbalanced, having 1308 (26%) samples in the positive and 3680 (74%) in the negative class. So, the precision and recall values are very useful for evaluating the proposed method.

For every approach, the recall score is higher than the precision score (Tables 2, 3, and 4). That means the technique correctly identifies most of the positive samples among all the positive samples in a given (test) set. This is an impressive result, considering that the model is trained on an imbalanced dataset containing more negative than positive samples. The precision score, on the other hand, tells what share of the captured positive samples is correct; in this measure, all three approaches score almost equally well. It is observed that the results for case1 are better than for case2 on all the parameters. The possible reason is that in case2 the experiment is done on individual documents, each with a different ratio of positive to negative class sentences, unlike case1 where all the sentences across all the documents are considered together.

The auto-encoder approach shows the best performance considering the recall and F1 scores in both cases; the WVA and TF-IDF approaches come second and third respectively. The possible reasons behind these results are the following. The word2vec model utilizes word co-occurrence to find embeddings that support word semantics. The Convolution and Maxpooling layers in the auto-encoder capture the inter-word context among the word embeddings and condense them into a vector that is a good semantic representation of the sentence (see Sect. 3.3.3). In the WVA approach, the word embeddings of a sentence are simply averaged to get its vector representation, which makes it less efficient at capturing the inter-word context of a sentence than the previous approach (see Sect. 3.3.2). TF-IDF is a statistical approach that represents a sentence with a vector of the length of the vocabulary, where each value is calculated from the count of a word in that sentence and in all the sentences in the set (see Sect. 3.3.1); so it is the least effective at putting the inter-word context of a sentence into its vector. Hence, the relationship between the centroid/query sentence and each sentence in the test set is reflected accordingly for the three sentence representation techniques, which in turn affects the determination of the similarity threshold parameter (see Sect. 3.3.4). Moreover, it is expected that the performance of the neural network based models, word2vec (responsible for generating word embeddings) and the auto-encoder, will improve with a dataset containing more labeled samples. These are the possible rationales for the performance scores of the three sentence representation techniques, which play a significant role in the efficacious classification of sentences. The centroid based sentence classification method is important where the query sentence used in QFS is generated from a set of labeled sentences.

6 Conclusion

This article presents an important contribution to the literature on centroid based extractive summarization. Each sentence in the corpus is represented by a vector in the VSM to perform the sentence classification experiments. The results of the three sentence-vector generation approaches are compared in Tables 2, 3, and 4. A supervised approach has been presented to learn the two threshold parameters, for case1 and case2, to classify COVID-19 news sentences. The results show that the WVA approach exhibits the best performance in case1, while the auto-encoder approach performs well in both cases. In future work, the dataset will be extended by labeling more samples of the corpus. State-of-the-art transformer models may be used as sentence encoders, and features such as sentence position, length, and key term frequency may also be explored to form a better classification parameter.