UFCFFY-15-M¶

Cyber Security Analytics¶

08: Case Study - Text Analysis and Misinformation¶

Prof. Phil Legg¶

08: Text Analytics and Fake News¶

In this session we will cover:

  • Scraping text, counting occurrences, and TF-IDF
  • Case study on Fake News
  • Advanced text analytics concepts

Scraping text and counting occurrences¶

In [28]:
import requests
import pandas as pd
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup

page = requests.get('https://en.wikipedia.org/wiki/Computer_security')
soup = BeautifulSoup(page.content, 'html.parser')
text = soup.text.replace('\n', ' ').split(" ")
In [45]:
# Word-level occurrences: plot the 50 most frequent words longer than 4 characters
out = [w for w in text if len(w) > 4]
plt.figure(figsize=(20,5))
pd.Series(out).value_counts().head(50).plot(kind='bar')
plt.show()
In [44]:
# Character-level occurrences: count each letter a-z across the page text
s = soup.text.lower().replace('\n', ' ')
out = [c for c in s if c in 'abcdefghijklmnopqrstuvwxyz']
plt.figure(figsize=(20,5))
pd.Series(out).value_counts().sort_index().plot(kind='bar')
plt.show()

Term Frequency – Inverse Document Frequency¶

TF-IDF accounts for the occurrence of a term within a document, whilst weighting against its occurrence across the whole document set. TF-IDF is calculated as TF * IDF, where:

  • TF(t) = number of times t appears in a document / total number of terms in the document
  • IDF(t) = log(total number of documents / number of documents containing t)

Example: “Heartbleed” and “Cyber”

  • Suppose “Heartbleed” occurs 10 times within a 100-word document: TF(Heartbleed) = 10/100 = 0.1. Suppose we have 1000 documents and “Heartbleed” occurs in 10 of these: IDF(Heartbleed) = log(1000 / 10) = 2. Then, TF-IDF = TF * IDF = 0.1 * 2 = 0.2.
  • Suppose now we look at the word “Cyber”, and assume it occurs 30 times within the same 100-word document: TF(Cyber) = 30/100 = 0.3, but it also occurs in 750 of the 1000 documents: IDF(Cyber) = log(1000 / 750) ≈ 0.125. Then, TF-IDF = TF * IDF = 0.3 * 0.125 ≈ 0.037.

Here, Heartbleed scores higher than Cyber because, in relation to the overall document set, it is deemed of greater significance.
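
We can verify the worked example with a few lines of Python. This is a minimal sketch that hard-codes the counts from the example and uses log base 10; in practice a library such as scikit-learn's TfidfVectorizer computes these weights (with some variations in smoothing and normalisation) across a whole corpus.

In [ ]:
import math

def tf_idf(term_count, doc_length, num_docs, docs_with_term):
    # TF: how often the term appears in this document
    tf = term_count / doc_length
    # IDF: how rare the term is across the document set (log base 10)
    idf = math.log10(num_docs / docs_with_term)
    return tf * idf

# 'Heartbleed': 10 occurrences in a 100-word document, found in 10 of 1000 documents
print(tf_idf(10, 100, 1000, 10))   # 0.2
# 'Cyber': 30 occurrences in the same document, found in 750 of 1000 documents
print(tf_idf(30, 100, 1000, 750))  # ~0.037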

Case study example - Fake News¶

Text Analytics Lab
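
As a flavour of the lab, a common approach to fake news detection (and the basis of the Ahmed et al. paper in the further reading) is to combine TF-IDF features with a standard classifier. The sketch below is illustrative rather than the lab solution: it assumes a labelled CSV with 'text' and 'label' columns, and the file name, column names, and choice of LogisticRegression are all assumptions.

In [ ]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Hypothetical labelled dataset: 'text' holds the article body, 'label' is fake/real
df = pd.read_csv('news.csv')
X_train, X_test, y_train, y_test = train_test_split(
    df['text'], df['label'], test_size=0.2, random_state=42)

# Convert raw text into TF-IDF feature vectors
vectoriser = TfidfVectorizer(stop_words='english', max_features=10000)
X_train_tfidf = vectoriser.fit_transform(X_train)
X_test_tfidf = vectoriser.transform(X_test)

# Train a simple linear classifier on the TF-IDF features
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_tfidf, y_train)
print(classification_report(y_test, clf.predict(X_test_tfidf)))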

Advanced Concepts¶

  • Recurrent Neural Networks (RNNs).
    • Deep learning for sequential/temporal data - 'The cat sat on the ...'.
    • Long Short-Term Memory (LSTM) networks are the most popular form of RNN - they address the 'vanishing gradient' problem.
  • One-hot encoding works well for smaller dictionaries - how do we deal with larger sets?
    • Word2Vec provides a compact 'embedding', like an autoencoder (a minimal sketch follows this list).
    • Two approaches:
      • 'continuous bag of words' - given a set of words, what word fits with the set?
      • 'skip-grams' - given a single word, what set of words would fit with this?
    • Doc2Vec - similar concept but for a set of documents rather than a set of words.
  • Are we learning a single output (a vector) or a sequence of outputs? Models can be thought of as: vec2vec, vec2seq, seq2vec, and seq2seq.
    • Language translation is a good candidate for seq2seq, since the input and output may both be of variable length.
  • Generative Pre-trained Transformer 3 (GPT-3) was released in 2020 by OpenAI in their research paper “Language Models are Few-Shot Learners”.
    • It uses 175 billion parameters in its learning model, yet achieves near-human accuracy. Whilst it can perform traditional text tasks such as sentence completion, the authors also demonstrate its effectiveness at tasks that require a deeper understanding of text, such as executing commands described by a human: building applications from a written description, smart assistants that can recognise tasks and provide recommendations, and many other examples that are available online.
  • Two videos that explain this further: 2 minute papers and Half Ideas.
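
To make the Word2Vec ideas above a little more concrete, the sketch below trains a tiny embedding model using the gensim library (an assumption - any comparable library would do). The sg parameter selects between the continuous bag-of-words (sg=0) and skip-gram (sg=1) training approaches, and the toy sentences are purely illustrative.

In [ ]:
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens (illustrative only)
sentences = [
    ['the', 'cat', 'sat', 'on', 'the', 'mat'],
    ['the', 'dog', 'sat', 'on', 'the', 'rug'],
    ['heartbleed', 'is', 'a', 'security', 'vulnerability'],
]

# sg=0 uses continuous bag-of-words; sg=1 would use skip-grams
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

print(model.wv['cat'].shape)         # a 50-dimensional embedding vector
print(model.wv.most_similar('cat'))  # nearest neighbours in the embedding space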

Further reading¶

  • Ahmed, H., Traore, I., Saad, S. Detecting opinion spams and fake news using text classification. Security and Privacy. 2018;1:e9. https://doi.org/10.1002/spy2.9
  • Security Data Analytics and Visualisation - Text Analytics - more on Recommender Systems and Bayes Theorem