COVID Fake News Detection with a Very Simple Logistic Regression

This time, we are going to create a simple logistic regression model to classify COVID news as either true or fake, using the data I collected a while ago.

The process is surprisingly simple. We will clean and pre-process the text data, perform feature extraction using the NLTK library, build and deploy a logistic regression classifier using the Scikit-Learn library, and evaluate the model’s accuracy at the end.

The Data



Looking at the above example of title and text, they are pretty clean, so simple text pre-processing will do the job. We will strip off any HTML tags, remove punctuation, and convert everything to lower case.
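As a rough sketch of those three cleaning steps, something like the following would work. The function name clean_text and the sample string are made up for illustration, and the regular expressions are one reasonable choice rather than the exact ones from the notebook.

```python
import re

def clean_text(text):
    # Replace anything that looks like an HTML tag with a space
    text = re.sub(r"<[^>]*>", " ", text)
    # Convert to lower case
    text = text.lower()
    # Replace every character that is not a word character or whitespace
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace left behind by the substitutions
    return re.sub(r"\s+", " ", text).strip()

sample = "<p>Hello, World!</p>"  # hypothetical input
print(clean_text(sample))
```

On a pandas DataFrame, the same function could be applied to a text column with .apply(clean_text).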

The following code combines tokenization and stemming in a single function, which we will apply to “title_text” later.

from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()

def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]  # split on whitespace, then stem each token


  • Because we have already converted “title_text” to lowercase earlier, here we set lowercase=False.
  • Because we have already pre-processed “title_text”, here we set preprocessor=None.
  • We override the string tokenization step with our combination of tokenization and stemming we defined earlier.
  • Set use_idf=True to enable inverse-document-frequency reweighting.
  • Set smooth_idf=True to avoid zero divisions.
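The settings listed above match the parameters of Scikit-Learn's TfidfVectorizer, so a minimal sketch of the vectorizer could look like this. The three documents are hypothetical stand-ins for the cleaned “title_text” column, and the tokenizer is repeated here only so that the snippet runs on its own.

```python
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

porter = PorterStemmer()

def tokenizer_porter(text):
    # Whitespace tokenization followed by Porter stemming
    return [porter.stem(word) for word in text.split()]

tfidf = TfidfVectorizer(
    lowercase=False,             # text is assumed to be lowercased already
    preprocessor=None,           # text is assumed to be cleaned already
    tokenizer=tokenizer_porter,  # used instead of the default regex tokenizer
    use_idf=True,                # reweight term counts by inverse document frequency
    smooth_idf=True,             # add one to document frequencies
)

# Hypothetical examples standing in for the real data
docs = [
    "new vaccines are being tested",
    "vaccine testing continues this week",
    "masks reduce the spread of the virus",
]

X = tfidf.fit_transform(docs)  # sparse matrix, one row per document
print(X.shape)
print(sorted(tfidf.vocabulary_))
```

With smooth_idf=True, Scikit-Learn computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. The added ones are what prevent a division by zero. Recent Scikit-Learn versions may warn that token_pattern is unused when a custom tokenizer is supplied; the warning is harmless here.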

Logistic Regression for Document Classification

  • We set the number of cross-validation folds to cv=5 for hyperparameter tuning.
  • The model is scored by its classification accuracy.
  • By setting n_jobs=-1, we dedicate all the CPU cores to solving the problem.
  • We set the maximum number of iterations for the optimization algorithm.
  • We use pickle to save the model.
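The steps above can be sketched with Scikit-Learn's LogisticRegressionCV, which tunes the regularization strength by cross-validation. Treat this as an assumption about the setup rather than the notebook's exact code: the toy corpus, the max_iter=300 value, random_state, and the file name saved_model.pkl are all invented for illustration.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegressionCV

# Hypothetical toy corpus; 1 = true, 0 = fake
true_docs = [
    "health agency publishes vaccine trial results",
    "hospital reports new case numbers",
] * 5
fake_docs = [
    "miracle cure secretly hidden by doctors",
    "drinking hot water kills the virus instantly",
] * 5
docs = true_docs + fake_docs
labels = [1] * len(true_docs) + [0] * len(fake_docs)

X = TfidfVectorizer().fit_transform(docs)

clf = LogisticRegressionCV(
    cv=5,                # 5-fold cross-validation
    scoring="accuracy",  # pick the best setting by accuracy
    n_jobs=-1,           # use all available CPU cores
    max_iter=300,        # assumed value; raise it if the solver does not converge
    random_state=0,
)
clf.fit(X, labels)

# Save the fitted model with pickle, then load it back
model_path = os.path.join(tempfile.mkdtemp(), "saved_model.pkl")
with open(model_path, "wb") as f:
    pickle.dump(clf, f)
with open(model_path, "rb") as f:
    restored = pickle.load(f)
```

Only unpickle model files that come from a source you trust, since loading a pickle can execute arbitrary code.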

Model Evaluation

  • Use the model to look at the accuracy score on the data it has never seen before.
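To illustrate what “data it has never seen before” means in practice, here is a small sketch that holds out part of the data before training and scores the model only on that held-out part. The corpus, the 70/30 split, and the model settings are assumptions made for this example, so the printed score says nothing about the real data set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus; 1 = true, 0 = fake
docs = [
    "health agency publishes vaccine trial results",
    "hospital reports new case numbers",
] * 5 + [
    "miracle cure secretly hidden by doctors",
    "drinking hot water kills the virus instantly",
] * 5
labels = [1] * 10 + [0] * 10

# Hold out 30% of the documents; the model never sees them during training
docs_train, docs_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.3, stratify=labels, random_state=0
)

# Fit the vectorizer on the training documents only, then reuse it on the test set
tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(docs_train)
X_test = tfidf.transform(docs_test)

clf = LogisticRegressionCV(cv=5, scoring="accuracy", max_iter=300, random_state=0)
clf.fit(X_train, y_train)

accuracy = accuracy_score(y_test, clf.predict(X_test))
print(f"Test accuracy: {accuracy:.3f}")
```

Fitting the vectorizer on the training split alone keeps vocabulary and IDF statistics from leaking in from the test documents.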

The Jupyter notebook can be found on GitHub. Enjoy the rest of the week.