COVID Fake News Detection with a Very Simple Logistic Regression

Dec 19, 2020

This time, we are going to build a simple logistic regression model to classify COVID news as either true or fake, using the data I collected a while ago.

The process is surprisingly simple. We will clean and pre-process the text data, perform feature extraction with the NLTK library, build and save a logistic regression classifier with the Scikit-Learn library, and evaluate the model’s accuracy at the end.

The Data

The data set contains 586 true news articles and 578 fake ones, an almost 50/50 split. Because of data collection bias, I decided not to use “source” as one of the features; instead, I will combine “title” and “text” into a single feature, “title_text”.
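A minimal sketch of the combination step, assuming the data lives in a pandas DataFrame with “title”, “text”, and “label” columns (the sample rows here are hypothetical stand-ins, not rows from the actual data set):

```python
import pandas as pd

# Hypothetical sample rows standing in for the collected data set.
df = pd.DataFrame({
    "title": ["Vaccine approved", "Miracle cure found"],
    "text": ["Regulators approved the vaccine.", "Drinking bleach cures COVID."],
    "label": [1, 0],  # 1 = true news, 0 = fake news
})

# Combine title and text into one feature, "title_text".
df["title_text"] = df["title"] + " " + df["text"]
print(df["title_text"].iloc[0])
```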


Let’s have a look at an example of the title-text combination:


Looking at the above example of title and text, the data is pretty clean, so simple text pre-processing will do the job: we strip out any HTML tags and punctuation, and convert everything to lowercase.
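A minimal pre-processing sketch along those lines (the regular expressions here are my own assumption, not necessarily the article’s exact cleaning code):

```python
import re

def preprocessor(text):
    """Strip HTML tags and punctuation, then lowercase."""
    text = re.sub(r"<[^>]*>", "", text)          # remove HTML tags
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)  # replace punctuation with spaces
    return text.lower().strip()

print(preprocessor("<p>BREAKING: Vaccine Approved!</p>"))
```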

The following code combines tokenization and stemming, which we will later apply to “title_text”.

from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]


Here we transform the “title_text” feature into TF-IDF vectors.

  • Because we have already converted “title_text” to lowercase earlier, we set lowercase=False here.
  • Because we have already applied preprocessing to “title_text”, we set preprocessor=None here.
  • We override the string tokenization step with our combination of tokenization and stemming we defined earlier.
  • Set use_idf=True to enable inverse-document-frequency reweighting.
  • Set smooth_idf=True to avoid zero divisions.
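Putting the bullet points above together, the vectorizer setup might look like this (the tokenizer is the one defined earlier; the sample documents are placeholders):

```python
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

porter = PorterStemmer()

def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

tfidf = TfidfVectorizer(
    lowercase=False,             # text is already lowercased
    preprocessor=None,           # preprocessing already applied
    tokenizer=tokenizer_porter,  # our tokenizer + stemmer
    use_idf=True,                # inverse-document-frequency reweighting
    smooth_idf=True,             # avoid zero divisions
)

docs = ["fake covid cure claims", "officials confirm covid vaccine"]
X = tfidf.fit_transform(docs)
print(X.shape)  # (2, number of stemmed terms)
```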

Logistic Regression for Document Classification

  • Instead of tuning the C parameter manually, we can use the LogisticRegressionCV estimator.
  • We specify cv=5 cross-validation folds to tune this hyperparameter.
  • The model is scored on classification accuracy.
  • By setting n_jobs=-1, we dedicate all CPU cores to the search.
  • We raise the optimization algorithm’s iteration cap so it has room to converge.
  • We use pickle to save the model.
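The steps above can be sketched as follows, with toy stand-in data (the real X comes from the TF-IDF step) and a hypothetical file name saved_model.sav; cv is lowered to 2 only so the toy example runs:

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegressionCV

# Toy stand-in data; in the article, X is the TF-IDF matrix of "title_text".
docs = ["officials confirm vaccine works", "miracle cure conspiracy hoax",
        "health agency releases report", "secret cure they hide"]
y = [1, 0, 1, 0]  # 1 = true, 0 = fake
X = TfidfVectorizer().fit_transform(docs)

clf = LogisticRegressionCV(
    cv=2,                # cross-validation folds (the article uses cv=5)
    scoring="accuracy",  # tune C for classification accuracy
    n_jobs=-1,           # use all CPU cores
    max_iter=1000,       # raise the optimizer's iteration cap
).fit(X, y)

# Persist the fitted model with pickle.
with open("saved_model.sav", "wb") as f:
    pickle.dump(clf, f)
```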

Model Evaluation

  • Use pickle to load our saved model.
  • Use the model to check the accuracy score on data it has never seen before.
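A self-contained sketch of those two steps; the toy training block at the top only exists so the example runs on its own, and the file name saved_model.sav is a hypothetical:

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Toy stand-in: fit and save a model so the example is self-contained.
vec = TfidfVectorizer()
X_train = vec.fit_transform(["vaccine approved by regulators", "miracle cure hoax"])
with open("saved_model.sav", "wb") as f:
    pickle.dump(LogisticRegression().fit(X_train, [1, 0]), f)

# Load the saved model and score it on unseen data.
with open("saved_model.sav", "rb") as f:
    loaded_model = pickle.load(f)

X_test = vec.transform(["regulators approved vaccine", "secret miracle cure"])
y_test = [1, 0]
acc = accuracy_score(y_test, loaded_model.predict(X_test))
print(acc)
```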

The Jupyter notebook can be found on GitHub. Enjoy the rest of the week.