Abstract
As essential information acquisition tools in our lives, mobile social networks have brought us great convenience for communication. However, misleading information such as spam emails, clickbait links, and false health information appears everywhere in mobile social networks. Prior studies have adopted various approaches to detecting this information but ignored global semantic features of the corpus and lacked interpretability. In this paper, we propose a novel end-to-end model called Topic-Aware BiLSTM (TA-BiLSTM) to handle the problems above. We first design a neural topic model for mining global semantic patterns, which encodes word relatedness into topic embeddings. Simultaneously, a detection model extracts local hidden states from text content with LSTM layers. The model then fuses these global and local representations with the Topic-Aware attention mechanism and performs misleading information detection. Experiments on three real-world datasets show that the TA-BiLSTM can jointly generate more coherent topics and improve detection performance. Furthermore, a case study and visualizations demonstrate that the proposed TA-BiLSTM can discover latent topics and help enhance interpretability.
Introduction
Mobile social networks have brought us great convenience in acquiring information. Inevitably, a vast amount of useless misleading information, such as spam emails, clickbait links, and false health information, is created. Such information can deceive us into doing things with ill consequences. Table 1 gives two examples from the Webis-Clickbait-17 dataset showing how the meaning of content misleads people, along with their impact categories. In general, misleading information is deceptive, which makes it hard to distinguish between the two kinds of posts (positive and negative). Thus, detecting misleading information effectively is challenging, and developing an efficient, high-performance approach for misleading information detection is particularly essential.
Existing work on misleading information detection can be categorized into two types: machine learning-based approaches and deep learning-based approaches. Approaches based on machine learning often build document representations depending on different feature engineering techniques [10, 26, 35]. Various algorithms such as Labeled-LDA [35] and GBDT [2] also help enhance detection accuracy. Unfortunately, these approaches rely heavily on manually designed, sophisticated features and perform poorly in complex contexts. Deep learning-based approaches extract semantic features from content through multiple nonlinear units to solve the above problems. Convolutional neural networks [1, 17], recurrent neural networks [23], and combinations of the two [22] are commonly used frameworks. Still, these approaches are limited to local semantic information and severely lack interpretability due to their complex structures.
To address the above limitations, we propose a novel model called Topic-Aware BiLSTM (TA-BiLSTM) to add corpus-level topic relatedness and enhance interpretability. Specifically, the TA-BiLSTM is decomposed into two parts: a neural topic model module and a text classification module. Assuming that a multilayer neural network can approximate the document's topic distribution, we model topics with a Wasserstein autoencoder (WAE) [37]. The neural topic model module constructs the topic distribution in latent space and reconstructs the document representation. The topic distribution is concurrently transformed into the topic embedding provided to the attention mechanism. Unlike previous variational autoencoder-based approaches [29, 36], our model minimizes the Maximum Mean Discrepancy regularizer [15] based on Optimal Transport theory [39] to reduce the Wasserstein distance between the topic distribution and the Dirichlet prior.
Furthermore, the text classification module utilizes a two-layer bidirectional LSTM with the Topic-Aware attention mechanism to extract semantic features. This attention mechanism incorporates topic relatedness information while calculating the representation. Finally, we feed the representations to the classifier for misleading information detection. To balance learning of the two tasks, we leverage a dynamic strategy to control the importance of their objectives. We concentrate on the neural topic model preferentially, then train the classification objective and the topic modeling objective simultaneously.
The main contributions of our work are as follows:

We propose a novel end-to-end framework, Topic-Aware BiLSTM, for misleading information detection.

We introduce a new Topic-Aware attention mechanism to encode the document's local semantic and global topical representations.

Experiments are conducted on three public datasets to verify the effectiveness of our Topic-Aware BiLSTM model in terms of topic coherence measures and classification metrics.

We select representative cases from different datasets for visualization, demonstrating that the Topic-Aware BiLSTM enhances interpretability compared with traditional approaches.
The remainder of the paper is organized as follows: Section 2 reviews relevant work, and Section 3 introduces preliminary techniques. Section 4 presents the methodology of the Topic-Aware BiLSTM model. Experiments and result analysis are given in Section 5. Lastly, Section 6 concludes the paper.
Related Work
Our work is related to three lines of research: misleading information detection, topic modeling, and attention mechanisms.
Misleading Information Detection
Misleading information detection models can be categorized into two streams based on implementation techniques: machine learning-based approaches and deep learning-based approaches.
Generally, machine learning-based approaches need to design specific representations of texts. For example, Liu et al. [26] employ both local and global features via Latent Dirichlet Allocation and utilize AdaBoost to detect spammers. Likewise, Chakraborty et al. [7] use multinomial Naive Bayes classifiers on pruned features of clickbait data. Different models in this branch can also result in different detection performance. Song et al. [35] propose labeled latent Dirichlet allocation to mine latent topics from user-generated comments and filter social spam. Biyani et al. [2] use Gradient Boosted Decision Trees [11] to detect clickbait in news streams. Similarly, Elhadad et al. [10] detect misleading information about COVID-19 by constructing a voting mechanism. However, approaches in this branch often require sophisticated feature engineering and cannot capture deep semantic patterns.
Thanks to the rapid development of deep representation learning, approaches such as convolutional neural networks and recurrent neural networks have been applied to extract semantic representations from text directly. Agrawal [1] and HaiTao et al. [17] utilize a convolutional neural network to detect clickbait. Kumar et al. [23] adopt a bidirectional LSTM with an attention mechanism to learn how much each word contributes to the clickbait score. Jain et al. [22] construct a deep learning architecture based on convolutional layers and long short-term memory layers. Nevertheless, deep learning-based approaches often have complex structures and severely lack interpretability. Thus, we integrate a neural topic model to provide corpus-level semantic information and enhance interpretability.
Topic Modeling
Given a collection of documents, each document discusses different topics. Topic modeling is an efficient technique that can mine latent semantic patterns from a corpus.
Latent Dirichlet Allocation (LDA) [3] is the most widely used traditional probabilistic generative model for topic mining. Unlike traditional graphical topic models, Miao et al. [29] propose NVDM, a neural topic model based on variational autoencoders (VAE). Variational autoencoders use KL divergence to measure the distance between the topic distribution and a Gaussian prior. ProdLDA [36] utilizes an approximated Dirichlet prior through Laplace approximation and improves topic quality. On the other hand, Wang et al. propose ATM [43], BAT, and Gaussian-BAT [44], which are trained in an adversarial manner. Wang et al. [42] also extend the ATM model for open event extraction. Inspired by the ATM model, Hu et al. [20] attempt to improve topic modeling with cycle-consistent adversarial training and name this approach ToMCAT. Zhou et al. [49] extend this line of work by treating documents and words as nodes in a graph. Further, autoencoders can be trained stably and reduce the dimensionality of the document representation [25] to extract the most effective information [48]. Accordingly, Nan et al. [31] incorporate adversarial training into the Wasserstein autoencoder framework and propose the W-LDA model for unsupervised topic extraction.
Attention Mechanism
The attention mechanism originates from a processing mechanism of human vision. When we look at a picture, our brain prioritizes the main content of the image, ignoring the background and other irrelevant information.
Inspired by this mechanism of the human brain, various attention mechanisms have achieved success in natural language processing tasks such as sentiment analysis [45] and machine translation [27]. The typical attention mechanism only attends to word-level dependencies and assigns weights so that the model can highlight key elements of sentences [18]. Further, the hierarchical attention mechanism [47] uses two layers of attention, applied successively at the word level and sentence level, to generate a document representation with rich semantics. Besides, Vaswani et al. [38] propose a self-attention mechanism to deal with increasing text lengths. Self-attention calculates associations between words in a sentence directly. Previous work [16, 41] has shown that topic information can improve the semantic representation of text with the help of attention mechanisms. Nevertheless, to the best of our knowledge, no relevant work has been conducted on misleading information detection, which we explore in this work.
Preliminaries
Latent Dirichlet Allocation
Latent Dirichlet Allocation (LDA) is the most commonly used generative model for topic extraction. It assumes that a document can be represented by a probability distribution over topics, and each topic can be represented by a probability distribution over words. To learn topics better, LDA utilizes the Dirichlet distribution as the prior over the latent space.
LDA uses 𝜃_{d} to denote the topic distribution of a document d and z_{n} to represent a topic allocation of the word w_{n}. Thus, the generative process of documents is shown in Algorithm 1.
Here, \(Dir(\boldsymbol {\alpha }^{\boldsymbol {\prime }})\) is the Dirichlet prior distribution, \(\boldsymbol {\alpha }^{\boldsymbol {\prime }}\) signifies the hyperparameter of the Dirichlet prior, and 𝜃_{d} is the topic distribution of document d sampled from the Dirichlet prior. z_{n} denotes the topic allocation of each position n in the document, and w_{n} is a word randomly generated from a multinomial distribution. φ_{i} is the topic-word distribution of the ith topic, and \(\varphi _{z_{n}}\) is one column of the matrix. LDA infers these parameters in an unsupervised manner. After model training, we can obtain representative words with high probabilities in each topic, and these words represent the semantic meaning of each topic.
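The generative process of Algorithm 1 can be sketched as follows; the toy topic-word matrix, vocabulary size, and hyperparameter values here are illustrative assumptions, not settings from the paper.

```python
import numpy as np

def generate_document(alpha, phi, n_words, rng=None):
    """Sample one document following LDA's generative process.

    alpha   : (K,) hyperparameter of the Dirichlet prior.
    phi     : (K, V) topic-word distributions (each row sums to 1).
    n_words : number of word positions to generate.
    Returns the sampled topic mixture theta_d and the word indices.
    """
    if rng is None:
        rng = np.random.default_rng()
    theta = rng.dirichlet(alpha)                # theta_d ~ Dir(alpha')
    words = []
    for _ in range(n_words):
        z = rng.choice(len(alpha), p=theta)     # z_n ~ Multinomial(theta_d)
        w = rng.choice(phi.shape[1], p=phi[z])  # w_n ~ Multinomial(phi_{z_n})
        words.append(int(w))
    return theta, words

# Toy run: 3 topics over a 5-word vocabulary.
rng = np.random.default_rng(0)
phi = rng.dirichlet(np.ones(5), size=3)
theta, doc = generate_document(np.full(3, 0.5), phi, n_words=8, rng=rng)
```

Inference reverses this process: given only the words, LDA recovers 𝜃_{d} and φ.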
Long Short-Term Memory
Text is sequential data, and small changes in word order can affect the meaning of an entire sentence. However, traditional feedforward neural networks cannot directly extract word dependencies from context. Thus, researchers developed sequential models such as Recurrent Neural Networks (RNN) to extract sequential and contextual features from such data [21]. The RNN comprises an input layer, a hidden layer, and an output layer. However, as sentence length increases, the training process suffers from vanishing and exploding gradients. The Long Short-Term Memory (LSTM) [19] adds a cell state to store long-term memory [13], which addresses this problem.
Assume that \(\textbf x_{j}\in \mathbb {R}^{D_{w}}\) represents the word embedding of the jth word in the content, where D_{w} is the dimension of word embeddings. The LSTM feeds in word embeddings as a sequence and calculates the hidden state \(\textbf {h}_{j}\in \mathbb {R}^{D_{h}}\) for each word, where D_{h} is the dimension of hidden states. The calculation procedure follows the equations below:
where W_{f}, W_{i}, W_{C}, W_{o}, b_{f}, b_{i}, b_{C} and b_{o} are learnable parameters, and σ(⋅) is the sigmoid function. The forget gate f_{j} determines the information that needs to be retained from the cell state C_{j− 1}. The input gate i_{j} controls the proportion of new information stored in the new candidate C_{j}. Lastly, the LSTM constrains the hidden state of the current node through the output gate o_{j}. This elaborate design enables the LSTM to learn longer dependencies and better semantic representations.
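A minimal sketch of one LSTM step following the gate equations above, assuming the common formulation in which each gate multiplies the concatenation [h_{j−1}; x_{j}] by its weight matrix:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, C_prev, params):
    """One LSTM step. Each W_* has shape (D_h, D_h + D_w) and is applied
    to the concatenated [h_{j-1}; x_j]; each b_* has shape (D_h,)."""
    z = np.concatenate([h_prev, x])
    f = sigmoid(params["W_f"] @ z + params["b_f"])        # forget gate
    i = sigmoid(params["W_i"] @ z + params["b_i"])        # input gate
    C_tilde = np.tanh(params["W_C"] @ z + params["b_C"])  # candidate state
    C = f * C_prev + i * C_tilde                          # new cell state
    o = sigmoid(params["W_o"] @ z + params["b_o"])        # output gate
    h = o * np.tanh(C)                                    # hidden state
    return h, C

# Toy dimensions: D_w = 4, D_h = 3.
D_w, D_h = 4, 3
rng = np.random.default_rng(1)
params = {f"W_{g}": rng.normal(size=(D_h, D_h + D_w)) * 0.1 for g in "fiCo"}
params.update({f"b_{g}": np.zeros(D_h) for g in "fiCo"})
h, C = lstm_step(rng.normal(size=D_w), np.zeros(D_h), np.zeros(D_h), params)
```

Running the step over a whole sequence, and once more in reverse, yields the bidirectional hidden states used later in the paper.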
Methodology
In this section, we introduce the Topic-Aware BiLSTM (TA-BiLSTM) model. As depicted in Fig. 1, our proposed TA-BiLSTM can be divided into two parts: a neural topic model and a text classification model. The topic module employs a neural topic model to discover latent topics from the text corpus. The text classification module utilizes a two-layer BiLSTM network based on the Topic-Aware attention mechanism to detect misleading information from text.
Neural Topic Model
As shown in the left panel of Fig. 1, the neural topic model is composed of an encoder and a decoder. (1) The encoder takes the V-dimensional x_{bow} of the document as input and transforms it into a K-dimensional topic distribution 𝜃 through two fully connected layers. (2) The decoder takes the encoded topic distribution 𝜃 as input, then reconstructs the document \(\hat {\textbf {x}}_{bow}\) from the reconstruction distribution x_{re}. After the first decoder layer, the topic embedding v_{t} is collected. Besides, to ensure the quality of extracted topics, we use the Wasserstein distance to conduct prior matching in the latent topic space.
Encoder Network
For each document d = {w_{1},w_{2},...,w_{m}} in the corpus C_{d} = {d_{1},d_{2},...,d_{n}}, the encoder utilizes its bag-of-words representation x_{bow} as input, where the weights are calculated by the TF-IDF formulation:
where c_{ij} indicates the number of times the word w_{i} appears in document d_{j}, and \({\sum }_{k}c_{kj}\) is the total number of words in document d_{j}. |C_{d}| indicates the total number of documents in the corpus, and \(\left |\left \{ j:w_{i}\in d_{j}\right \}\right |\) represents the number of documents containing word w_{i}.
where \(x_{bow}^{(i)}\) refers to the semantic relevance of the ith vocabulary word to document d_{j}.
According to Eqs. 7 and 8, each document could be represented as \(\textbf x_{bow}\in \mathbb {R}^{V}\), where V indicates the vocabulary size.
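The TF-IDF bag-of-words construction above can be sketched as follows; the toy corpus and vocabulary are illustrative assumptions.

```python
import math
from collections import Counter

def tfidf_bow(docs, vocab):
    """TF-IDF weighted bag-of-words matrix for a corpus.
    docs  : list of token lists.
    vocab : list of words defining the V dimensions of x_bow.
    """
    n_docs = len(docs)
    # Document frequency: |{j : w_i in d_j}|.
    df = {w: sum(1 for d in docs if w in d) for w in vocab}
    rows = []
    for d in docs:
        counts = Counter(d)
        total = len(d)                                 # sum_k c_kj
        row = []
        for w in vocab:
            tf = counts[w] / total                     # c_ij / sum_k c_kj
            idf = math.log(n_docs / df[w]) if df[w] else 0.0
            row.append(tf * idf)
        rows.append(row)
    return rows

docs = [["spam", "free", "offer"], ["meeting", "free", "today"]]
x_bow = tfidf_bow(docs, vocab=["spam", "free", "offer", "meeting", "today"])
```

Words that appear in every document (here "free") receive zero weight, so x_bow emphasizes words that distinguish a document within the corpus.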
The encoder first maps x_{bow} into the D_{s}-dimensional semantic space through the following transformation:
where \(\textbf {W}_{s}\in \mathbb {R}^{D_{s}\times V}\) and \(\textbf b_{s}\in \mathbb {R}^{D_{s}}\) are the weight matrix and bias term of the fully connected layer, h_{s} is the hidden state normalized by batch normalization BN(⋅), leak denotes the hyperparameter of LeakyReLU activation, and o_{s} represents the output of the layer.
Subsequently, the encoder projects the output vector o_{s} into a K-dimensional document-topic distribution 𝜃_{e}:
where \(\textbf {W}_{o}\in \mathbb {R}^{K\times D_{s}}\) and \(\textbf b_{o}\in \mathbb {R}^{K}\) are the weight matrix and bias term of the fully connected layer, 𝜃_{e} denotes the topic distribution corresponding to the input x_{bow}, and the kth (k ∈{1,2,...,K}) dimension \(\theta _{e}^{(k)}\) represents the proportion of the kth topic in the document.
We add noise to the document-topic distribution to extract more coherent topics. We randomly sample a noise vector 𝜃_{n} from the Dirichlet prior and merge it with 𝜃_{e}. The calculation is defined as:
where η ∈ [0,1] denotes the mixing proportion of noise.
The encoder thus transforms the bag-of-words representation into a topic distribution that captures the semantic information in the latent space.
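The encoder can be sketched as below. This is a single-sample sketch: batch normalization is omitted, the softmax producing 𝜃_e is an assumption consistent with 𝜃 being a distribution over K topics, and the convex noise-mixing form is one plausible reading of the merging step.

```python
import numpy as np

def leaky_relu(x, leak=0.1):
    return np.where(x > 0, x, leak * x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def encode(x_bow, W_s, b_s, W_o, b_o, alpha_prior, eta=0.1, rng=None):
    """Map a bag-of-words vector to a noisy K-dim topic distribution."""
    if rng is None:
        rng = np.random.default_rng()
    o_s = leaky_relu(W_s @ x_bow + b_s)           # first FC layer
    theta_e = softmax(W_o @ o_s + b_o)            # document-topic dist.
    theta_n = rng.dirichlet(alpha_prior)          # noise from the prior
    return (1.0 - eta) * theta_e + eta * theta_n  # assumed convex mixing

# Toy dimensions: V = 20, D_s = 8, K = 4.
V, D_s, K = 20, 8, 4
rng = np.random.default_rng(2)
theta = encode(rng.random(V),
               rng.normal(size=(D_s, V)) * 0.1, np.zeros(D_s),
               rng.normal(size=(K, D_s)) * 0.1, np.zeros(K),
               alpha_prior=np.full(K, 0.5), rng=rng)
```

Because 𝜃_e and 𝜃_n each sum to one, the mixed 𝜃 remains a valid distribution over the K topics.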
Decoder Network
The decoder takes the topic distribution 𝜃 as input; two fully connected layers then reconstruct the document's word representation \(\hat {\textbf {x}}_{bow}\). After the transformation of the first layer, v_{t} serves as the topic embedding of the input document and is provided to the attention mechanism.
The decoder first transforms the topic distribution 𝜃 into the D_{t}-dimensional topic embedding space:
where \(\textbf {W}_{t}\in \mathbb {R}^{D_{t}\times K}\) and \(\textbf b_{t}\in \mathbb {R}^{D_{t}}\) are the weight matrix and bias of the fully connected layer, and h_{t} is the hidden vector normalized by batch normalization BN(⋅). v_{t} is activated by LeakyReLU and then used in the Topic-Aware attention mechanism.
Subsequently, the decoder transforms the hidden vector h_{t} into the V-dimensional reconstruction distribution:
where \(\textbf {W}_{r}\in \mathbb {R}^{V\times D_{t}}\) and \(\textbf b_{r}\in \mathbb {R}^{V}\) are the weight matrix and bias, and x_{re} is the reconstruction distribution.
The decoder is an essential part of the neural topic model. After model training, it can generate the words corresponding to each topic. We input one-hot vectors into the decoder to obtain the word distribution of each topic. Here, we use the 10 highest-probability words of each topic to represent its semantic meaning. Based on the topic distribution and the semantics of topics, interpretable word-level information can be provided for classifying documents in the detection process.
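The topic-word extraction procedure can be sketched as below; the toy decoder (a fixed topic-word matrix) stands in for the trained fully connected layers.

```python
import numpy as np

def top_words_per_topic(decoder_fn, K, vocab, n_top=10):
    """Feed one-hot topic vectors through the decoder and keep the
    n_top highest-probability words for each topic."""
    topics = []
    for k in range(K):
        one_hot = np.eye(K)[k]
        x_re = decoder_fn(one_hot)            # (V,) word distribution
        best = np.argsort(x_re)[::-1][:n_top]
        topics.append([vocab[i] for i in best])
    return topics

# Toy stand-in decoder: 5 topics over a 30-word vocabulary.
rng = np.random.default_rng(3)
vocab = [f"w{i}" for i in range(30)]
W = rng.dirichlet(np.ones(30), size=5)        # assumed topic-word matrix
topics = top_words_per_topic(lambda t: t @ W, K=5, vocab=vocab, n_top=10)
```

Reading each 10-word list gives a human-interpretable label for the corresponding topic.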
Prior Distribution Matching
Since the Dirichlet distribution is commonly regarded as the prior of the multinomial distribution, choosing this prior has substantial advantages [40]. To match the encoded topic distribution to the Dirichlet prior, we add a regularizer to TA-BiLSTM. The training process minimizes a regularization term based on the Maximum Mean Discrepancy (MMD) [15] to reduce the Wasserstein distance, which measures the divergence between the topic distribution 𝜃 and random samples \(\boldsymbol {\theta }^{\boldsymbol {\prime }}\) from the prior.
Given a kernel function \(\boldsymbol {\mathrm {k}}:{\Theta }\times {\Theta }\rightarrow \mathfrak {R}\), the MMD-based regularizer can be defined as:
where \({\mathscr{H}}\) is the Reproducing Kernel Hilbert Space (RKHS) of real-valued functions mapping Θ to \(\mathfrak {R}\). k(⋅,⋅) denotes the kernel function of this space, and k(𝜃,⋅) maps 𝜃 to features in the high-dimensional space.
As distributions in the latent space are matched with the Dirichlet prior on the simplex, we choose the information diffusion kernel [24] as the kernel function. This kernel is sensitive to points near the simplex boundary and works better on sparse data. The detailed calculation equation is:
When performing distribution matching, we employ the Dirichlet distribution with hyperparameter \(\boldsymbol {\alpha }^{\boldsymbol {\prime }}\); then \(\boldsymbol {\theta }^{\boldsymbol {\prime }}\) can be sampled by the following equations:
where \({\theta }^{\prime (i)}\) denotes the value of the ith dimension of \(\boldsymbol {\theta }^{\boldsymbol {\prime }}\), \({\alpha ^{\prime }}^{(i)}\) means the hyperparameter of the ith dimension of the Dirichlet distribution, \(\boldsymbol {\theta }^{\boldsymbol {\prime }}\) represents a sample sampled from the Dirichlet prior, and \(\text {B}(\boldsymbol {\alpha }^{\boldsymbol {\prime }})=\frac {{\prod }_{i=1}^{K}{\Gamma }({\alpha ^{\prime }}^{(i)})} {\Gamma \left ({\sum }_{i=1}^{K}{\alpha ^{\prime }}^{(i)}\right )}\).
Given M encoded samples and M samples sampled from Dirichlet prior, MMD could be calculated by the following unbiased estimation:
where \(\{ \boldsymbol {\theta }_{1},\boldsymbol {\theta }_{2},...,\boldsymbol {\theta }_{M}\}\sim Q_{{\Theta }}\) are the samples collected from the encoder, and Q_{Θ} is the encoded distribution of samples. \(\{ \boldsymbol {\theta }_{1}^{\boldsymbol {\prime }},\boldsymbol {\theta }_{2}^{\boldsymbol {\prime }},...,\boldsymbol {\theta }_{M}^{\boldsymbol {\prime }}\}\sim P_{{\Theta }}\) are sampled from the prior distribution P_{Θ}.
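A sketch of the unbiased MMD estimator above. The kernel form k(a, b) = exp(−arccos²(Σ_i √(a_i b_i))) follows the information diffusion kernel on the simplex with the diffusion-time factor omitted, which is an assumption about the paper's exact parameterization.

```python
import numpy as np

def diffusion_kernel(t1, t2):
    """Information diffusion kernel on the probability simplex
    (assumed form, diffusion-time factor omitted)."""
    inner = np.clip(np.sqrt(t1 * t2).sum(), 0.0, 1.0)
    return np.exp(-np.arccos(inner) ** 2)

def mmd_unbiased(enc, prior):
    """Unbiased MMD estimate from M encoded and M prior samples."""
    M = len(enc)
    s_qq = sum(diffusion_kernel(enc[i], enc[j])
               for i in range(M) for j in range(M) if i != j)
    s_pp = sum(diffusion_kernel(prior[i], prior[j])
               for i in range(M) for j in range(M) if i != j)
    s_qp = sum(diffusion_kernel(a, b) for a in enc for b in prior)
    # Diagonal terms are excluded from the within-sample sums (unbiased).
    return (s_qq + s_pp) / (M * (M - 1)) - 2.0 * s_qp / (M * M)

rng = np.random.default_rng(4)
enc = rng.dirichlet(np.ones(10), size=8)        # stand-in encoded samples
prior = rng.dirichlet(np.full(10, 0.5), size=8)  # samples from the prior
mmd = mmd_unbiased(enc, prior)
```

Since both arguments lie on the simplex, Σ√(a_i b_i) ≤ 1 by Cauchy–Schwarz, so the arccos is always well defined.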
Text Classification Model
In this subsection, we introduce the text classification module. As depicted in the right panel of Fig. 1, we utilize a two-layer BiLSTM based on the Topic-Aware attention mechanism. Because of the complex context of misleading information, we incorporate corpus-level topic features through this mechanism to obtain richer semantic representations. Then, we use a classifier with two fully connected layers to detect misleading information.
BiLSTM
The bag-of-words representation is sparse, and a typical solution to the sparsity problem is computational intelligence [46] such as word embedding technology. Word2vec [30] and GloVe [32] utilize words as the smallest unit for training, while fastText [4] splits words into n-gram subwords to construct vectors.
Considering that there are many out-of-vocabulary words in misleading information, we use an embedding layer initialized by pre-trained fastText. Suppose the word sequence of a document is d = {w_{1},w_{2},...,w_{m}}, where w_{i} represents the ith word in the content. After transforming each word into a one-hot vector, the embedding layer maps words to their corresponding vectors \(\textbf x_{embed} \in \mathbb {R}^{D_{w}}\), where D_{w} is the dimension of the embedding space.
Then, we utilize a two-layer BiLSTM to extract semantic features, where each layer contains bidirectional LSTM units. This bidirectional structure implements the semantic contextual representation of misleading information. The network takes x_{embed} in content order as input and obtains each word's hidden state. If the LSTM unit is abbreviated as LSTM(⋅), the hidden state \(\textbf {h}^{\prime }\) of each word can be calculated by:
where \(\textbf {h}_{f1},\textbf {h}_{f2} \in \mathbb {R}^{D_{h}}\) are vectors calculated by the forward LSTM, and \(\textbf {h}_{b1},\textbf {h}_{b2} \in \mathbb {R}^{D_{h}}\) are vectors calculated by the backward LSTM. \(\textbf {h}^{\boldsymbol {\prime }} \in \mathbb {R}^{2\times D_{h}+D_{w}}\) is the hidden state that combines the word embedding and the bidirectional LSTM.
Topic-Aware Attention Mechanism
Generally, the attention mechanism is similar to human behavior when reading a sentence: it evaluates how important each word is by assigning a weight to each part [50]; the higher the value, the more important the word. In the typical attention-based model, the alignment score of each word is calculated as:
where \(\textbf {q}\in \mathbb {R}^{D_{h}}\) are learnable parameters.
However, typical attention mechanisms cannot utilize external information, so we design the Topic-Aware attention mechanism to incorporate topic features while calculating the misleading information representation. In this way, we integrate the neural topic module and the text classification module to train the entire model end-to-end.
The attention weights a for each word are calculated based on the similarity between the topic embedding v_{t} and the hidden states \(H = \{\textbf {h}_{1}^{\boldsymbol {\prime }}, \textbf {h}_{2}^{\boldsymbol {\prime }}, ..., \textbf {h}_{L}^{\boldsymbol {\prime }}\}\) in the last layer of the BiLSTM, where L represents the maximum sentence length in a batch.
Specifically, TA-BiLSTM computes the attention weight a_{i} based on the alignment score between the hidden state \(\textbf {h}_{i}^{\boldsymbol {\prime }}\) and the topic embedding v_{t}, where i ∈{1,2,...,L}. We set D_{t} = D_{h} and use the following equation to calculate the alignment score:
where \(\textbf {W}_{a}\in \mathbb {R}^{D_{h}\times D_{h}}\) and \(\textbf b_{a}\in \mathbb {R}^{D_{h}}\) are learnable parameters. The larger the value of \(f(\textbf {h}^{\boldsymbol {\prime }},\textbf v_{t})\), the greater the probability of misleading information implied by the corresponding word. Then, the document representation could be summarized based on the alignment scores above:
where a^{(i)} is the weight of the hidden state \(\textbf {h}_{i}^{\boldsymbol {\prime }}\) of the ith word, and \(\textbf v_{d}\in \mathbb {R}^{D_{h}}\) contains both semantics of hidden states and topic information embedded by the neural topic model.
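The Topic-Aware attention computation can be sketched as below. The concrete score form f(h′, v_t) = v_t · tanh(W_a h′ + b_a) is an assumption consistent with the learnable W_a and b_a described above, and the sketch uses D_h for all vectors for simplicity.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def topic_aware_attention(H, v_t, W_a, b_a):
    """Weight each hidden state by its alignment with the topic embedding.
    H   : (L, D_h) hidden states from the last BiLSTM layer.
    v_t : (D_h,) topic embedding from the neural topic model.
    """
    # Assumed score form: f(h, v_t) = v_t . tanh(W_a h + b_a).
    scores = np.array([v_t @ np.tanh(W_a @ h + b_a) for h in H])
    a = softmax(scores)                   # attention weights over words
    v_d = (a[:, None] * H).sum(axis=0)    # document representation
    return a, v_d

# Toy dimensions: L = 6 words, D_h = 4.
L, D_h = 6, 4
rng = np.random.default_rng(5)
a, v_d = topic_aware_attention(rng.normal(size=(L, D_h)),
                               rng.normal(size=D_h),
                               rng.normal(size=(D_h, D_h)) * 0.5,
                               np.zeros(D_h))
```

The resulting v_d mixes the words' local hidden states in proportion to how strongly each word aligns with the document's global topic embedding.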
Classifier
In this paper, text which contains misleading information is taken as a positive example. We apply two fully connected layers and a sigmoid activation function to convert the document representation v_{d} into a probability for classification. Therefore, the higher the output value, the more likely the document contains misleading information. The prediction process is defined as:
where \(\textbf {W}_{d}\in \mathbb {R}^{D_{m}\times D_{h}}\), \(\textbf b_{d}\in \mathbb {R}^{D_{m}}\), \(\textbf {W}_{c}\in \mathbb {R}^{D_{m}}\) and \(b_{c}\in \mathbb {R}\) are learnable parameters, and \(\hat {y}\) is the predicted probability.
Training Objective
In a multi-task learning framework, models are optimized for multiple objectives jointly. Our proposed framework has two main training objectives: the neural topic modeling objective and the misleading information detection objective.
For neural topic modeling, the objective includes the reconstruction term and the MMD-based regularization term. It is defined as follows:
where c(x_{bow},x_{re}) is the reconstruction loss, \(x_{bow}^{(i)}\) denotes the weight of the ith word in the vocabulary, and \(x_{re}^{(i)}\) denotes the probability of the ith word in the reconstruction distribution. In our implementation, we follow W-LDA and multiply by a scaling factor \(\mu =1/(l\log V)\) to balance the two terms, where l indicates the average sentence length in each batch and V indicates the vocabulary size.
For the classification objective, we measure the binary cross-entropy between the target label and the predicted output:
where y_{i} is the ground truth, \(\hat {y_{i}}\) represents the predicted probability of the ith document, and N is the total number of documents in the corpus.
To balance the two task-specific objectives, we adopt a dynamic strategy to control their weights. We mainly focus on the neural topic model in the early stage, and then pay more attention to training the classification objective. Thus, the total training objective is formed as:
where λ is a hyperparameter that dynamically balances the two objectives.
We set λ to a small value in the early stage, allowing the framework to train the neural topic model preferentially. Later, we change λ to 1, shifting the focus to multi-task learning and training the classifier and the neural topic model jointly.
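The dynamic weighting schedule can be sketched as follows; the exact combination L = L_ntm + λ·L_cls and the epoch-based switch are assumptions based on the description above (the warm-up value and 20-epoch window mirror the experimental settings reported later).

```python
def total_loss(loss_ntm, loss_cls, epoch, total_epochs, warm_lambda=1e-8):
    """Combine the two objectives with a dynamic lambda.

    lambda stays tiny while the topic model warms up, then switches
    to 1 for the last 20 epochs so both objectives train jointly.
    """
    lam = warm_lambda if epoch < total_epochs - 20 else 1.0
    return loss_ntm + lam * loss_cls

# Early on, the classification loss is effectively muted;
# in the final 20 epochs both losses contribute fully.
early = total_loss(2.0, 0.7, epoch=10, total_epochs=100)
late = total_loss(2.0, 0.7, epoch=95, total_epochs=100)
```

In practice the switch could equally be driven by a validation signal rather than a fixed epoch count; the fixed schedule is the simplest reading of the paper's strategy.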
Experiments and Results Analysis
Experimental Setup
Datasets
We conduct experiments on three public datasets about misleading information to evaluate the effectiveness of the proposed TA-BiLSTM model.
Enron Spam
[28] is a public English spam dataset compiled in 2006. Ham emails are collected from the mailboxes of six employees of the Enron Corporation. Spam messages are obtained from four sources: the SpamAssassin corpus, the Honeypot project, the spam collection of Bruce Guenter, and spam collected by third parties. These emails were sent and received between 2001 and 2005. The dataset consists of six sub-datasets, which we combine into one dataset for the experiments.
2007 TREC Public Spam
[9]. The Text Retrieval Conference (TREC) is a series of workshops focusing on problems and challenges in information retrieval research. The 2007 TREC conference held a spam filtering competition and published this dataset. The dataset includes complete mail information such as sending and receiving addresses, timestamps, and HTML code. In the experiments, we retain the content of the main body and ignore other information.
Webis-Clickbait-17
[33] contains a total of 19,538 Twitter posts with links from 27 major news publishers in the United States. These posts were published between November 2016 and June 2017. Five annotators from Amazon Mechanical Turk marked whether articles in these links were misleading information. We use the content of articles linked in the post for detection.
Since the three datasets contain noisy data such as blank and duplicate documents, we preprocess them; the statistics of the preprocessed datasets are listed in Table 2. We use 2/3 of the data as the training set and 1/3 as the test set.
Model Configuration
In the experiments, we use the enchant package to check the spelling of words in all datasets. Each word is reverted to its base form, with no inflectional suffixes, by the en_core_web_lg model of the spacy package. We utilize the gensim package to obtain the word embedding matrix and initialize the embedding layer.
For the neural topic model, we set the number of topics K to 50 and the dimension D_{s} of the fully connected layer in the encoder to 256. The dimension D_{t} of the topic embedding is equal to the dimension D_{h} of the hidden state \(\textbf {h}^{\boldsymbol {\prime }}\). We make the Dirichlet prior as sparse as possible and set the Dirichlet hyperparameter \(\boldsymbol {\alpha }^{\boldsymbol {\prime }}\) to 0.001. The proportion of noise η added to the topic distribution is set to 0.1.
For the text classification model, we apply 300-dimensional pre-trained fastText word embeddings [14], that is, D_{w} is set to 300. The dropout of the BiLSTM layer is 0.3, and the dimension D_{m} in the classifier is 64. The weight matrices in the BiLSTM are initialized by orthogonal initialization, and the parameters in the Topic-Aware attention mechanism are initialized by uniform initialization.
During model training, the hyperparameter λ is initially set to 1e-8, and for the last 20 epochs, λ is set to 1. We use the Adam optimizer with a learning rate of 1e-4 for the parameters of the neural topic model and 5e-5 for the other parameters. The batch size is 16. The CPU is an Intel Xeon (Skylake) Platinum 8163, and the operating system is Ubuntu 20.04 64-bit. All models are implemented in PyTorch and run on an NVIDIA V100 32G graphics card.
Baselines
We choose four machine learning models for comparison: Naive Bayes, Support Vector Machine, Decision Tree, and Random Forest.
Naive Bayes
[28] is a probabilistic model. By learning the joint probability distribution of the input and output of the training data, the model computes the label with the largest posterior probability of the predicted data.
SVM
[8] is a linear binary classification model defined in feature space. It uses a kernel function to find a hyperplane separating the two categories and maximizes the margin between the data and the plane.
Decision Tree
[6] adopts a tree structure and uses layered inferences on the data to achieve the final classification, so it has good interpretability.
Random Forest
[5] is an ensemble learning method containing multiple decision trees. The model trains each decision tree independently, and the result is determined by majority vote over the trees' outputs.
Besides, we also compare our model with the following deep learning-based baselines.
BiLSTM
uses a BiLSTM network without an attention mechanism. The hidden states of words in the document are averaged as the classifier's input.
Attention-BiLSTM
uses a BiLSTM network with a traditional attention mechanism and feeds the weighted sum of each word's hidden state into the classifier.
For topic modeling, we compare our model with the following neural topic models.
LDA
^{Footnote 1} [3] extracts topics based on the co-occurrence information of words in the document. We use the gensim package to implement this model.
NVDM
[29] comprises an encoder network and a decoder network, inspired by the variational autoencoder with a Gaussian prior distribution.
WLDA
[31] is the prototype of our model, which uses a Wasserstein autoencoder and a Dirichlet prior distribution to mine topic information.
BAT
[44] applies bidirectional adversarial training with Dirichlet prior for neural topic modeling.
The last three neural topic models mentioned above adopt a neural network structure similar to our model.
Evaluation Metrics
In the experiments, we mainly evaluate the classification performance of the text classification model and the topic quality of the neural topic model.
For classification, we compare three widely used performance metrics: accuracy, precision, and F1-score. Accuracy refers to the proportion of correctly classified samples among the total number of samples:

$$\text{Accuracy} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{I}(\hat{y}_i = y_i)$$
where N is the total number of samples, \(\hat{y}_i\) and \(y_i\) are the predicted and true labels, and \(\mathbb {I}(\cdot )\) is the indicator function, equal to 1 when its argument is true and 0 otherwise. In binary classification, the combinations of predicted labels and ground truths fall into four types: True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN). True or False indicates whether the prediction is correct; Positive or Negative indicates whether the predicted result is a positive or negative sample. Each type counts the samples meeting its condition, so the four values sum to N. Based on the above, precision is defined as:

$$\text{Precision} = \frac{TP}{TP + FP}$$
Precision is the number of correctly predicted positives divided by the number of all predicted positives, and recall is the fraction of true positive samples that are predicted positive:

$$\text{Recall} = \frac{TP}{TP + FN}$$

Precision and recall are thus a pair of contradictory measures. To consider both comprehensively, we also evaluate effectiveness with the F1-score, defined as:

$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
Under the same experimental conditions, the higher these metrics, the better the classification performance.
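As a minimal, self-contained sketch (the function name is ours), the metrics above can be computed directly from the TP/FP/TN/FN counts:

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1-score for binary labels
    (positive class = 1), following the TP/FP/TN/FN definitions above."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    n = tp + fp + tn + fn                       # equals N, the sample count
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1
```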
For topic quality, we utilize two standard topic coherence metrics, C_{V} and C_{A} [34]. Here we choose 10 representative words for each topic as word sets and compute C_{V} to measure the semantic support for each word in a set. In contrast, C_{A} compares pairs of single words in each topic's set to evaluate the coherence between words. Applying both metrics lets us quantify the quality of topic modeling comprehensively.
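To give a flavor of such pairwise coherence measures, a simplified version can be sketched from scratch. Note this is NOT the exact C_{A} or C_{V} of Röder et al. [34], which use sliding windows and indirect confirmation measures; it only averages NPMI over word pairs using document-level co-occurrence:

```python
import math
from itertools import combinations

def pairwise_npmi_coherence(top_words, documents, eps=1e-12):
    """Average pairwise NPMI over a topic's top words.

    documents: list of tokenized documents (lists of words).
    NPMI is in [-1, 1]; higher means the topic's words co-occur
    more often than chance, i.e. a more coherent topic.
    """
    n = len(documents)
    doc_sets = [set(d) for d in documents]

    def p(*words):  # fraction of documents containing all the given words
        return sum(all(w in ds for w in words) for ds in doc_sets) / n

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1, p2, p12 = p(w1), p(w2), p(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # never co-occur: minimal NPMI
            continue
        pmi = math.log(p12 / (p1 * p2))
        scores.append(pmi / (-math.log(p12) + eps))  # normalize PMI
    return sum(scores) / len(scores)
```

In practice, C_{V} and C_{A} are usually computed with an existing toolkit (e.g. gensim's `CoherenceModel` or Palmetto) rather than reimplemented.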
Results and Analysis
In this section, we present the experimental results and corresponding analysis of the proposed TABiLSTM model in terms of classification performance and topic quality.
Classification Performance
Table 3 lists the classification performance on the three public datasets compared with the different baselines. We observe that TABiLSTM obtains better results in accuracy, precision, and F1-score.
Specifically, the bag-of-words representation limits the traditional machine learning approaches. The precision of Random Forest on the Clickbait17 dataset is higher because the model selects only confidently positive samples to minimize the number of FP. As a result, the accuracy of Random Forest is not high, and its F1-score is lower than those of the other approaches.
Moreover, we conduct an ablation study by comparing BiLSTM and AttentionBiLSTM to verify the effectiveness of the TopicAware attention mechanism. Their results are better than those of the machine learning-based approaches, indicating that richer semantic feature representations, especially context information, improve classification performance. Compared with BiLSTM, AttentionBiLSTM shows slight improvements, indicating that the attention mechanism assigns more weight to specific words to provide a more suitable document representation.
Furthermore, comparing AttentionBiLSTM and TABiLSTM, we observe that accuracy increases by 0.64%, 1.12%, and 3.11% and the F1-score increases by 0.63%, 0.99%, and 4.95% for the latter on the three datasets, respectively. These significant improvements show that the TopicAware attention mechanism incorporates topic information into the classification module, and this topic information indeed helps TABiLSTM provide more suitable representations for misleading information detection.
Topic Quality Comparison
The calculation of the attention mechanism often incorporates a supervision signal from the document, which helps mine latent semantic patterns during topic modeling. Thus, we also evaluate the quality of topics in this subsection. Table 4 presents the topic coherence metrics C_{A} and C_{V} compared with the other topic modeling baselines on the three datasets.
Compared with the topics extracted by WLDA on the Enron Spam dataset, the C_{A} of TABiLSTM increases by 5.81% and the C_{V} by 11.53%. On the 2007 TREC dataset, C_{A} is almost the same as WLDA's, but C_{V} increases by 13%. We also compare with BAT: it scores slightly higher than WLDA and LDA on Clickbait17, but our model still improves C_{A} and C_{V} by 2.31% and 3.06%, respectively.
Excluding NVDM for its poor performance, Table 5 lists the top-10 words with the highest probability for each topic on the three datasets, allowing an intuitive comparison of topic quality. Compared with the other models, the topics generated by TABiLSTM contain fewer irrelevant words and show higher semantic coherence.
The topic words of NVDM are less consistent because it employs a Gaussian prior to mimic the Dirichlet distribution in topic space. As the proposed TABiLSTM uses a Dirichlet prior in topic space, it obtains more coherent topics than NVDM. Meanwhile, the supervision signal also helps TABiLSTM surpass LDA, WLDA, and BAT in the topic modeling evaluation.
Hyper-Parameter Analysis
To further validate the robustness of TABiLSTM, we conduct a hyper-parameter analysis in this subsection. Concretely, we analyze three parameters: the number of topics K, the dimension of hidden states \(\textbf {h}^{\boldsymbol {\prime }}\), and the proportion of noise η.
Firstly, the number of topics K is set to 30, 50, 80 and 100, respectively. The quantitative results on three datasets are reported in Table 6 and visualized in Fig. 2.
For the Enron Spam and 2007 TREC datasets, we observe that TABiLSTM performs fairly stably on the three metrics. For the Clickbait17 dataset, classification performance is more sensitive to changes of K, which may be caused by the complexity of the dataset. It is worth mentioning that the optimal number of topics differs across datasets (50 on Enron Spam, 80 on 2007 TREC, and 50 on Clickbait17). If this number is too large, the model loses interpretability; if it is too small, model training is negatively affected [12]. Thus, we set the number of topics K to 50 in our experiments.
Similarly, we conduct a parameter analysis on the dimension of hidden states \(\textbf {h}^{\boldsymbol {\prime }}\), setting it to 25, 50, 75, 100, and 150, respectively; the corresponding statistics are listed in Table 7. Comparing the results, we observe that simpler models perform better on the Enron Spam and 2007 TREC datasets, while on Clickbait17 classification performance improves with increasing model complexity. This may also be caused by the complexity of the Clickbait17 dataset, which needs a more complicated model to fit the data.
We further investigate the impact of different proportions of noise η on performance. In detail, we compute the classification and topic modeling metrics separately under five proportion settings [0, 0.1, 0.2, 0.3, 0.4]. The detailed comparison is shown in Table 8. It can be concluded that adding a proper proportion of noise to the topic distribution improves the quality of topic modeling on all datasets. However, the setting that is optimal for topic mining is not necessarily optimal for classification: topic coherence is better when the proportion is set to 0.1 or 0.2, while less noise helps the TopicAware attention mechanism preserve topic features for prediction. Hence, we set the proportion of noise to 0.1 for the best overall results in the experiments.
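One simple way to inject a proportion η of noise into a document-topic distribution is to mix it with a Dirichlet sample. This is a sketch for illustration only; the paper's exact noise scheme is not specified here, and the function name and `alpha` parameter are our own assumptions:

```python
import numpy as np

def add_topic_noise(theta, eta, alpha=1.0, rng=None):
    """Mix a document-topic distribution theta with Dirichlet noise.

    output = (1 - eta) * theta + eta * dirichlet_sample,
    so eta = 0 returns theta unchanged and larger eta adds more noise
    while keeping the result a valid probability distribution.
    """
    rng = rng or np.random.default_rng(0)
    k = len(theta)
    noise = rng.dirichlet(np.full(k, alpha))       # random point on the simplex
    mixed = (1 - eta) * np.asarray(theta) + eta * noise
    return mixed / mixed.sum()                     # renormalize against fp error
```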
Case Study and Visualization
To validate that the proposed TABiLSTM indeed improves model interpretability, we conduct a case study and visualization in this subsection.
Figure 3a shows an advertising email for an online pharmacy in the Enron Spam dataset. As Topic 8 represents drugs, we can infer that this email discusses related topics; indeed, various drug names appear in its text content. Likewise, Fig. 3b depicts a web page from Clickbait17 that entices people to buy cosmetics. We can also find relevant words in Topic 15 and Topic 45, such as 'carpet', 'fashion', 'beauty', and 'makeup'.
Thus, the two examples above show that corpus-level topic relatedness can indeed improve model interpretability.
Conclusion
In this paper, we proposed the TopicAware BiLSTM (TABiLSTM) model, an end-to-end framework. TABiLSTM comprises a neural topic model and a text classification model and explores corpus-level topic relatedness to enhance misleading information detection. Meanwhile, the supervision signal is incorporated into the topic modeling process to further improve topic quality. Experiments on three English misleading information datasets demonstrate the superiority of TABiLSTM over baseline approaches. Additionally, we analyze multiple hyper-parameters in detail and select specific topic examples for visualization. Classification and topic modeling on short texts remain challenging tasks; our future work will pay more attention to detecting misleading information in short texts on social media platforms.
References
 1.
Agrawal A (2016) Clickbait Detection Using Deep Learning. In: 2016 2nd International Conference on Next Generation Computing Technologies (NGCT). IEEE, pp 268–272
 2.
Biyani P, Tsioutsiouliklis K, Blackmer J (2016) 8 Amazing Secrets for Getting More Clicks: Detecting Clickbaits in News Streams Using Article Informality. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol 30
 3.
Blei DM, Ng AY, Jordan MI (2003) Latent dirichlet allocation. J Mach Learn Res 3 (Jan):993–1022
 4.
Bojanowski P, Grave E, Joulin A, Mikolov T (2017) Enriching word vectors with subword information. Transa Assoc Comput Linguist 5:135–146
 5.
Breiman L (2001) Random Forests. Mach Learn 45:5–32
 6.
Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and Regression Trees. CRC press
 7.
Chakraborty A, Paranjape B, Kakarla S, Ganguly N (2016) Stop clickbait: Detecting and preventing clickbaits in online news media. In: 2016 IEEE/ACM International conference on advances in social networks analysis and mining (ASONAM). IEEE, pp. 9–16
 8.
Chang CC, Lin CJ (2011) LIBSVM: A library for support vector machines. ACM Trans Intell Syst Technol (TIST) 2(3):1–27
 9.
Cormack GV (2007) TREC 2007 Spam Track Overview. In: The Sixteenth Text REtrieval Conference (TREC 2007) Proceedings
 10.
Elhadad MK, Li KF, Gebali F (2020) Detecting Misleading Information on COVID19. IEEE Access 8:165201–165215
 11.
Friedman JH (2002) Stochastic gradient boosting. Comput Stat Data Anal 38(4):367–378
 12.
Gao H, Qin X, Barroso RJD, Hussain W, Xu Y, Yin Y (2020) Collaborative Learning-based Industrial IoT API Recommendation for Software-defined Devices: The Implicit Knowledge Discovery Perspective. IEEE Transactions on Emerging Topics in Computational Intelligence
 13.
Gao H, Huang W, Duan Y (2021) The Cloud-edge-based Dynamic Reconfiguration to Service Workflow for Mobile E-commerce Environments: A QoS Prediction Perspective. ACM Trans Internet Technol (TOIT) 21(1):1–23
 14.
Grave E, Bojanowski P, Gupta P, Joulin A, Mikolov T (2018) Learning word vectors for 157 languages. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association (ELRA)
 15.
Gretton A, Borgwardt KM, Rasch MJ, Schölkopf B, Smola A (2012) A Kernel Two-Sample Test. J Mach Learn Res 13(25):723–773
 16.
Gui L, Jia L, Zhou J, Xu R, He Y (2020) MultiTask Learning with mutual learning for joint sentiment classification and topic detection. IEEE Trans Knowl Data Eng:1–1
 17.
Zheng HT, Chen JY, Yao X, Sangaiah AK, Jiang Y, Zhao CZ (2018) Clickbait convolutional neural network. Symmetry 10(5):138
 18.
Han X, Li B, Wang Z (2019) An attentionbased neural framework for uncertainty identification on social media texts. Tsinghua Sci Technol 25(1):117–126
 19.
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
 20.
Hu X, Wang R, Zhou D, Xiong Y (2020) Neural topic modeling with cycle-consistent adversarial training. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp 9018–9030
 21.
Huang Y, Xu H, Gao H, Ma X, Hussain W (2021) Ssur: an approach to optimizing virtual machine allocation strategy based on user requirements for cloud data center. IEEE Trans Green Commun Netw 5(2):670–681
 22.
Jain G, Sharma M, Agarwal B (2019) Spam detection in social media using convolutional and long short term memory neural network. Ann Math Artif Intell 85(1):21–44
 23.
Kumar V, Khattar D, Gairola S, Kumar Lal Y, Varma V (2018) Identifying Clickbait: A MultiStrategy Approach Using Neural Networks. In: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp 1225–1228
 24.
Lafferty J, Lebanon G (2002) Information Diffusion Kernels. In: Proceedings of the 15th International Conference on Neural Information Processing Systems, pp 391–398
 25.
Li G, Peng S, Wang C, Niu J, Yuan Y (2018) An energyefficient data collection scheme using denoising autoencoder in wireless sensor networks. Tsinghua Sci Technol 24(1):86–96
 26.
Liu L, Lu Y, Luo Y, Zhang R, Itti L, Lu J (2016) Detecting “Smart” Spammers on Social Network: A Topic Model Approach. In: Proceedings of the NAACL Student Research Workshop, pp 45–50
 27.
Luong MT, Pham H, Manning CD (2015) Effective Approaches to Attention-based Neural Machine Translation. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp 1412–1421
 28.
Metsis V, Androutsopoulos I, Paliouras G (2006) Spam filtering with naive Bayes - which naive Bayes? In: CEAS, vol 17, pp 28–69. Mountain View, CA
 29.
Miao Y, Yu L, Blunsom P (2016) Neural variational inference for text processing. In: International Conference on Machine Learning. PMLR, pp 1727–1736
 30.
Mikolov T, Sutskever I, Chen K, Corrado G, Dean J (2013) Distributed Representations of Words and Phrases and their Compositionality. In: Proceedings of the 26th International Conference on Neural Information Processing Systems, vol 2, pp 3111–3119
 31.
Nan F, Ding R, Nallapati R, Xiang B (2019) Topic modeling with wasserstein autoencoders. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics. Florence, Italy, pp 6345–6381
 32.
Pennington J, Socher R, Manning CD (2014) GloVe: Global Vectors for Word Representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp 1532–1543
 33.
Potthast M, Gollub T, Komlossy K, Schuster S, Wiegmann M, Fernandez EPG, Hagen M, Stein B (2018) Crowdsourcing a large corpus of clickbait on twitter. In: Proceedings of the 27th International Conference on Computational Linguistics, pp 1498–1507
 34.
Röder M, Both A, Hinneburg A (2015) Exploring the Space of Topic Coherence Measures. In: Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, pp 399–408
 35.
Song L, Lau RYK, Kwok RCW, Mirkovski K, Dou W (2017) Who are the spoilers in social media marketing? Incremental learning of latent semantics for social spam detection. Electron Commerce Res 17(1):51–81
 36.
Srivastava A, Sutton C (2017) Autoencoding variational inference for topic models. In: International Conference on Learning Representations
 37.
Tolstikhin I, Bousquet O, Gelly S, Schoelkopf B (2018) Wasserstein AutoEncoders. In: International Conference on Learning Representations
 38.
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is All you Need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, pp 6000–6010
 39.
Villani C (2003) Topics in Optimal Transportation, vol 58. American Mathematical Society
 40.
Wallach HM, Mimno D, McCallum A (2009) Rethinking LDA: Why Priors Matter. In: Proceedings of the 22nd International Conference on Neural Information Processing Systems, pp 1973–1981
 41.
Wang C, Wang B (2020) An End-to-end Topic-Enhanced Self-Attention Network for Social Emotion Classification. In: Proceedings of The Web Conference 2020, pp 2210–2219
 42.
Wang R, Zhou D, He Y (2019a) Open Event Extraction from Online Text using a Generative Adversarial Network. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp 282–291
 43.
Wang R, Zhou D, He Y (2019b) ATM: Adversarial-Neural Topic Model. Inf Process Manag 56(6):102098
 44.
Wang R, Hu X, Zhou D, He Y, Xiong Y, Ye C, Xu H (2020) Neural topic modeling with bidirectional adversarial training. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics. Online, pp 340–350
 45.
Wang Y, Huang M, Zhu X, Zhao L (2016) Attention-based LSTM for Aspect-level Sentiment Classification. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp 606–615
 46.
Yang X, Zhou S, Cao M (2020) An approach to alleviate the sparsity problem of hybrid collaborative filtering based recommendations: The productattribute perspective from user reviews. Mob Netw Appl 25(2)
 47.
Yang Z, Yang D, Dyer C, He X, Smola A, Hovy E (2016) Hierarchical attention networks for document classification. Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp 1480–1489
 48.
Yin Y, Cao Z, Xu Y, Gao H, Li R, Mai Z (2020) Qos Prediction for Service Recommendation With Features Learning in Mobile Edge Computing Environment. IEEE Trans Cogn Commun Netw 6(4):1136–1145
 49.
Zhou D, Hu X, Wang R (2020) Neural topic modeling by incorporating document relationship graph. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp 3790–3796
 50.
Zhu Y, Zhang W, Chen Y, Gao H (2019) A novel approach to workload prediction using attention-based LSTM encoder-decoder network in cloud environment. EURASIP J Wirel Commun Netw 2019(1):1–18
Acknowledgments
This work was supported in part by the National Key Research and Development Program (2019YFB2101704 and 2018YFB0803403), National Natural Science Foundation of China (No.61872194, No.62072252 and No.62102192).
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
S. Chang and R. Wang contributed equally to this work.
Cite this article
Chang, S., Wang, R., Huang, H. et al. TABiLSTM: An Interpretable TopicAware Model for Misleading Information Detection in Mobile Social Networks. Mobile Netw Appl (2021). https://doi.org/10.1007/s11036-021-01847-w
Keywords
 Misleading information detection
 Deep representation learning
 Neural topic model
 Attention mechanism
 Mobile social networks