COVID Fake News Detection with a Very Simple Logistic Regression

Adamrosariosg
4 min read · Dec 19, 2020

This time, we are going to build a simple logistic regression model that classifies COVID news as either true or fake, using the data I collected a while ago.

The process is surprisingly simple. We will clean and pre-process the text data, perform feature extraction using the NLTK library, build and deploy a logistic regression classifier using the Scikit-Learn library, and evaluate the model’s accuracy at the end.

The Data

The data set contains 586 true news articles and 578 fake ones, an almost 50/50 split. Because of bias in how the data was collected, I decided not to use “source” as a feature. Instead, I will combine “title” and “text” into a single feature, “title_text”.

fake_news_logreg_start.py
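
As a minimal sketch of this step, the snippet below builds a tiny stand-in DataFrame and combines the two columns. The rows, the “label” column name, and its values are assumptions made up for illustration; only “title”, “text”, “source”, and “title_text” come from the description above.

```python
import pandas as pd

# Tiny made-up stand-in for the real data set (rows and "label" are hypothetical).
df = pd.DataFrame({
    "title": ["Vaccine trial enters phase three",
              "Garlic water cures COVID overnight"],
    "text": ["Researchers reported interim results on Monday.",
             "A viral post claims a home remedy works."],
    "source": ["example-news.test", "example-blog.test"],
    "label": ["TRUE", "FAKE"],
})

# Combine title and text into one feature, separated by a space.
df["title_text"] = df["title"] + " " + df["text"]

# Keep only the combined text and the label; "source" is left out on purpose.
features = df[["title_text", "label"]]
print(features.loc[0, "title_text"])
```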

Pre-processing

Let’s have a look at an example of the title and text combination:

df['title_text'][50]

In the example above, the title and text are fairly clean, so simple text pre-processing will do the job: we strip any HTML tags and punctuation, and convert everything to lowercase.

fake_news_logreg_preprocessing.py
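
One possible way to write such a cleaning function is sketched below. The function name `preprocess` and the exact regular expressions are assumptions for illustration, not necessarily what the notebook uses.

```python
import re
import string

def preprocess(text):
    """Strip HTML tags and punctuation, lowercase, and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)                # replace HTML tags with a space
    text = text.translate(str.maketrans("", "", string.punctuation))  # drop punctuation
    text = text.lower()                                 # lowercase everything
    return re.sub(r"\s+", " ", text).strip()            # tidy up leftover whitespace

cleaned = preprocess("<p>COVID-19: Is the Vaccine <b>SAFE</b>?</p>")
print(cleaned)  # covid19 is the vaccine safe
```

With a DataFrame like the one described earlier, this would typically be applied with `df['title_text'].apply(preprocess)`.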

The following code combines tokenization and stemming in one function, which we will apply to “title_text” later.

from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]
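
To see what the tokenizer does, here is a small self-contained usage sketch. The sample sentence is made up; the point is only that the text is split on whitespace and each word is reduced to its Porter stem.

```python
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

def tokenizer_porter(text):
    # Split on whitespace, then reduce each word to its Porter stem.
    return [porter.stem(word) for word in text.split()]

tokens = tokenizer_porter("runners like running")
print(tokens)  # ['runner', 'like', 'run']
```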

TF-IDF

Here we transform “title_text” feature into TF-IDF vectors.

  • Because we have already converted “title_text” to lowercase, here we set lowercase=False.
  • Because we have already pre-processed “title_text”, here we set preprocessor=None.
  • We override the string tokenization step with our combination of tokenization and stemming we defined earlier.
  • Set use_idf=True to enable inverse-document-frequency reweighting.
  • Set smooth_idf=True to avoid zero divisions.

fake_news_logreg_tfidf.py
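
The settings listed above can be sketched as follows, using three made-up documents in place of the real “title_text” column. With smooth_idf=True, scikit-learn computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t; the last lines check that formula by hand for one term.

```python
import numpy as np
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

porter = PorterStemmer()

def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

# Made-up, already cleaned documents standing in for "title_text".
docs = [
    "vaccine trial shows promising results",
    "miracle cure kills virus overnight",
    "vaccine rollout begins next week",
]

tfidf = TfidfVectorizer(
    lowercase=False,             # text is already lowercase
    preprocessor=None,           # text is already pre-processed
    tokenizer=tokenizer_porter,  # whitespace split + Porter stemming
    use_idf=True,                # inverse-document-frequency reweighting
    smooth_idf=True,             # add one to document counts to avoid zero divisions
)
X = tfidf.fit_transform(docs)    # sparse matrix, one row per document

# Hand-check the smoothed idf for "vaccine", which appears in 2 of the 3 documents.
n_docs = len(docs)
vaccine_col = tfidf.vocabulary_[porter.stem("vaccine")]
expected_idf = np.log((1 + n_docs) / (1 + 2)) + 1
print(tfidf.idf_[vaccine_col], expected_idf)
```

Each row of `X` is L2-normalized by default, so documents of different lengths end up on a comparable scale.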

Logistic Regression for Document Classification

  • Instead of tuning the C parameter manually, we can use the LogisticRegressionCV estimator, which selects it by cross-validation.
  • We specify the number of cross validation folds cv=5 to tune this hyperparameter.
  • We score the model by classification accuracy.
  • By setting n_jobs=-1, we dedicate all the CPU cores to solve the problem.
  • We increase the maximum number of iterations of the optimization algorithm so that it has room to converge.
  • We use pickle to save the model.

fake_news_logreg_model.py
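
A rough sketch of these steps on a toy corpus is shown below. The documents, the max_iter=300 value, the random_state, and the file name are assumptions chosen for illustration, and the default TfidfVectorizer is used here only to keep the example short.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegressionCV

# Toy stand-in corpus (made up); 1 = true news, 0 = fake news.
true_docs = [f"health agency reports {i} new cases after testing" for i in range(20)]
fake_docs = [f"secret miracle cure number {i} kills the virus overnight" for i in range(20)]
texts = true_docs + fake_docs
labels = [1] * 20 + [0] * 20

X = TfidfVectorizer().fit_transform(texts)

clf = LogisticRegressionCV(
    cv=5,                # 5-fold cross-validation to choose C
    scoring="accuracy",  # pick the C with the best accuracy
    n_jobs=-1,           # use all CPU cores
    max_iter=300,        # assumed value; gives the solver room to converge
    random_state=0,
)
clf.fit(X, labels)

# Save the fitted model with pickle (temporary directory, hypothetical file name).
model_path = os.path.join(tempfile.mkdtemp(), "saved_model.sav")
with open(model_path, "wb") as f:
    pickle.dump(clf, f)
```

In practice the fitted vectorizer needs to be saved as well, since new text must be transformed with the same vocabulary before the model can score it.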

Model Evaluation

  • Use pickle to load our saved model.
  • Use the model to compute the accuracy score on data it has never seen before.

fake_news_logreg_eva.py
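
The evaluation step might look like the sketch below, again on a made-up toy corpus. The 75/25 split, the hyperparameters, and the in-memory pickle round trip (standing in for a saved file) are all assumptions for illustration.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy stand-in corpus (made up); 1 = true news, 0 = fake news.
true_docs = [f"health agency reports {i} new cases after testing" for i in range(20)]
fake_docs = [f"secret miracle cure number {i} kills the virus overnight" for i in range(20)]
texts = true_docs + fake_docs
labels = [1] * 20 + [0] * 20

# Hold out a quarter of the data that the model never sees during training.
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# Fit the vectorizer on the training text only, then reuse it for the test text.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

clf = LogisticRegressionCV(cv=5, scoring="accuracy", max_iter=300, random_state=0)
clf.fit(X_train, y_train)

# Round-trip through pickle, as if the model had been saved to and loaded from disk.
saved = pickle.dumps(clf)
loaded = pickle.loads(saved)

test_accuracy = accuracy_score(y_test, loaded.predict(X_test))
print(f"Test accuracy: {test_accuracy:.3f}")
```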

The Jupyter notebook can be found on GitHub. Enjoy the rest of the week.
