Some time back, WhatsApp announced that it was deleting 1.5 million accounts every month to prevent the spread of fake news. Of the millions of accounts flagged for deletion, how many are misclassified? How did WhatsApp even create and manage the process of deleting millions of fake accounts? And have you ever wondered how to detect fake news using Machine Learning?

Fake news is one of the plagues of our digitally connected world. That is no exaggeration. It is no longer limited to little squabbles; fake news spreads like wildfire and impacts millions of people every day.

How do you deal with such a sensitive issue? Millions of articles are churned out on the internet every day. How do you tell the real from the fake? Manual fact-checks are typically done on a story-by-story basis and cannot keep up with that volume. Can we turn to Machine Learning algorithms?

In this Machine Learning tutorial, we will study a process for separating fake news from real news using an LSTM (Long Short-Term Memory) network in Python.

As this article makes use of both Machine Learning and Deep Learning techniques, we will first provide a brief intuition of these terms.

Machine Learning is the study of training machines to learn patterns from historical data and make predictions on new data. The computer is first trained on historical data, which could be labeled or unlabeled depending on the problem statement, and once it performs well on the training data, it is evaluated on a held-out test set. The metrics used for evaluation can be misleading, so it is necessary to define the KPIs and the evaluation metrics beforehand, keeping the business objective in mind.
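To make this concrete, here is a minimal sketch of the train-then-evaluate workflow using scikit-learn's built-in Iris data (a stand-in toy dataset, not the news data used later in this article):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load a small labeled dataset and hold out a test split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=42)

# Train on the historical (training) data ...
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# ... then evaluate on unseen test data with a metric chosen up front
print(accuracy_score(y_test, clf.predict(X_test)))
```

The key discipline is that the test set is never touched during training; its score is the honest estimate of how the model will behave on new data.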

Text Classification in Python

One of the applications of Natural Language Processing is text classification. It is the process by which raw text is classified into categories such as good/bad, positive/negative, spam/not spam, and so on. Even a news article can be classified into various categories with this method.
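As a toy illustration of text classification (the corpus, labels, and model choice below are invented purely for the example):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny invented corpus with spam/not-spam labels
texts = ["win a free prize now",
         "claim your free money",
         "meeting rescheduled to monday",
         "please review the attached report"]
labels = ["spam", "spam", "not spam", "not spam"]

# Turn raw text into numeric features, then fit a simple classifier
vec = TfidfVectorizer()
X = vec.fit_transform(texts)
clf = MultinomialNB()
clf.fit(X, labels)

print(clf.predict(vec.transform(["free prize money"])))
```

The same pattern, with a larger corpus and a different model, underlies the fake/real news classifier built below.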

In this Machine Learning introduction article, we will classify news articles as fake or real using Python. There are a total of 20,080 labeled articles, and we need to separate them into fake and real. Below are the code snippets and a description of each block used to build the text classification model.

The process for building a fake news detector in Python is as follows –

  • The first step for any Data Science problem is importing the necessary libraries.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-learn utilities for splitting data and encoding labels
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Keras building blocks for the LSTM model
from keras.models import Model
from keras.layers import LSTM, Activation, Dense, Dropout, Input, Embedding
from keras.optimizers import RMSprop
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence
from keras.utils import to_categorical
from keras.callbacks import EarlyStopping
%matplotlib inline

Apart from traditional libraries like Pandas and NumPy, we have also imported LSTM, or Long Short-Term Memory, which is a type of Recurrent Neural Network used in Deep Learning. It is one of the most popular techniques in Deep Learning frameworks and is used across a variety of applications such as speech recognition and time series analysis. We will use an LSTM network to classify news articles as fake or real.

  • The read_csv() method of pandas is used to load the data (head() shows the first five rows); we then combine the title and text columns into a single News column.
data = pd.read_csv('train.csv')
data.head()
data['News'] = data['title'] + ' ' + data['text']  # space avoids merging the last and first words
data_final = data[['News','label']]

Now we are left with labeled data of two columns: one with the 0 and 1 labels (referring to real and fake news respectively) and the other with the textual data. Let's visualize the dataset to see how many real and fake news articles are present in it, using the countplot() function of the seaborn module in Python. Seaborn is built on top of Matplotlib and offers a higher-level interface with a wider range of styling options.

sns.countplot(data_final['label'])
plt.title('Fake vs Real News')
plt.show()

As expected, there is more real news than fake news. In the next step, we create vectors of our features and the target variable. We do this because machines cannot interpret textual data directly, so it needs to be converted into numbers. The sklearn module of Python has a LabelEncoder() class which encodes categorical labels as integer codes (0, 1, ...), assigned to the distinct classes in sorted order.
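A quick sanity check of how LabelEncoder behaves on toy labels (classes receive integer codes in sorted, i.e. alphabetical, order):

```python
from sklearn.preprocessing import LabelEncoder

# 'fake' sorts before 'real', so fake -> 0 and real -> 1
le = LabelEncoder()
codes = le.fit_transform(['real', 'fake', 'fake', 'real'])
print(codes)        # -> [1 0 0 1]
print(le.classes_)  # -> ['fake' 'real']
```

Since our label column already holds 0/1 integers, the encoder here is mostly a formality, but it makes the pipeline robust to string labels as well.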

X = data_final.News
Y = data_final.label
le = LabelEncoder()
Y = le.fit_transform(Y)
Y = Y.reshape(-1,1)


The model learns from our training set and is evaluated on the test data. We use 85% of the initial data for training and leave the remaining 15% for testing.

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.15,
                                                    random_state = 2)  # fixed seed for reproducibility


Data Pre-processing is the most time-consuming but important part of a Machine Learning project. Some of the pre-processing techniques used in text analysis are tokenizing, normalization, and so on.

max_words = 1000
max_len = 150
tok = Tokenizer(num_words = max_words)
tok.fit_on_texts(X_train)   # build the vocabulary before converting texts
sequences = tok.texts_to_sequences(X_train)
sequences_matrix = sequence.pad_sequences(sequences, maxlen = max_len)
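Conceptually, the tokenization and padding steps can be sketched in plain Python (a simplified illustration of the idea, not Keras's actual implementation):

```python
def build_vocab(texts, max_words):
    # Count word frequencies across the corpus
    counts = {}
    for text in texts:
        for word in text.lower().split():
            counts[word] = counts.get(word, 0) + 1
    # Most frequent words get the smallest ids (0 is reserved for padding)
    ranked = sorted(counts, key=lambda w: -counts[w])[:max_words]
    return {word: i + 1 for i, word in enumerate(ranked)}

def texts_to_padded(texts, vocab, max_len):
    rows = []
    for text in texts:
        ids = [vocab[w] for w in text.lower().split() if w in vocab]
        ids = ids[-max_len:]                            # truncate long sequences
        rows.append([0] * (max_len - len(ids)) + ids)   # left-pad short ones
    return rows

vocab = build_vocab(["the cat sat", "the dog sat down"], max_words=10)
print(texts_to_padded(["the cat ran"], vocab, max_len=5))  # -> [[0, 0, 0, 1, 3]]
```

Unknown words ("ran" above) are simply dropped, and every article ends up as a fixed-length row of integers, which is exactly the shape the Embedding layer below expects.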


Once the data is pre-processed, it needs to be fed to our model for training. We define a Recurrent Neural Network using the LSTM architecture.

def fake_news_detector():
    inputs = Input(name='inputs', shape = [max_len])                 # padded word-id sequences
    layer = Embedding(max_words, 50, input_length = max_len)(inputs) # 50-dim word embeddings
    layer = LSTM(64)(layer)                                          # summarizes each sequence
    layer = Dense(256, name = 'FC1')(layer)
    layer = Activation('relu')(layer)
    layer = Dropout(0.5)(layer)                                      # regularization
    layer = Dense(1, name = 'out_layer')(layer)
    layer = Activation('sigmoid')(layer)                             # probability of the positive class
    model = Model(inputs=inputs, outputs=layer)
    return model


The model is compiled with binary_crossentropy as the loss function and accuracy as the evaluation metric.

model = fake_news_detector()
model.compile(loss = 'binary_crossentropy',
              optimizer = RMSprop(), 
              metrics = ['accuracy'])



The training set is then fit into the model. We hold out part of it as a validation set (a split of 20% is assumed here) and use the EarlyStopping callback imported earlier to halt training once the validation loss stops improving.

model.fit(sequences_matrix, Y_train, batch_size=128, epochs=10,
          validation_split=0.2,
          callbacks=[EarlyStopping(monitor='val_loss', min_delta=0.0001)])



Given its accuracy on the validation set, we take this as our final model. Before evaluating it on the test data, we apply the same tokenization and padding to the test set.

test_sequences = tok.texts_to_sequences(X_test)
test_sequences_matrix = sequence.pad_sequences(test_sequences,
                                               maxlen = max_len)


Finally, we compute the loss and the accuracy on the test data.

accr = model.evaluate(test_sequences_matrix, Y_test)
print("Test set \n Loss : {:0.3f} \n Accuracy: {:0.3f}".format(accr[0], accr[1]))

Output:
3031/3031 [==============================] - 3s 872us/step
Test set 
  Loss : 0.574 
  Accuracy: 0.853

There are several text classification algorithms; in this article, we used an LSTM network in Python to separate real news articles from fake ones.
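Before reaching for deep learning, a simpler baseline such as Logistic Regression over TF-IDF features is often worth trying first; here is a minimal sketch on invented toy headlines (the texts, labels, and headline query are assumptions for illustration only):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented toy headlines; 1 = fake, 0 = real (matching the labels above)
texts = ["shocking miracle cure discovered",
         "you won't believe this trick",
         "parliament passes budget bill",
         "central bank holds interest rates"]
labels = [1, 1, 0, 0]

# TF-IDF features + Logistic Regression as a fast, interpretable baseline
baseline = make_pipeline(TfidfVectorizer(), LogisticRegression())
baseline.fit(texts, labels)
print(baseline.predict(["miracle trick discovered"]))
```

On a real dataset such as train.csv, a baseline like this gives a reference score that the LSTM must beat to justify its extra complexity.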

Download data set here – test.csv   AND    train.csv

Practice your code here – Jupyter Notebook

Conclusion –

Understanding and manipulating raw data is gradually becoming a part of every organization's work. It is therefore necessary to know the nitty-gritty of Natural Language Processing and apply its fundamentals to use cases such as the one shown in this article.

Explore Machine Learning Projects –

Learn to build Recommendation system in Python | Machine Learning Projects | Data science Projects | EduGrad
Learn Predictive Regression models in Machine Learning | Machine Learning Projects | EduGrad
Build classification model in Python | Data science projects | EduGrad

