UFCFFY-15-M

Cyber Security Analytics

08: Case Study - Text Analysis and Misinformation

Prof. Phil Legg¶

08: Text Analytics and Fake News

In this session we will cover:

Scraping text, counting occurrences, and TF-IDF
Case study on Fake News
Advanced text analytics concepts

Scraping Text and Counting Occurrences

In [28]:

import requests
import pandas as pd
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup

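# Download the page and extract its visible text
page = requests.get('https://en.wikipedia.org/wiki/Computer_security')
soup = BeautifulSoup(page.content, 'html.parser')
# split() with no argument splits on any whitespace and drops empty strings
text = soup.text.split()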

In [45]:

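# Word-level occurrences: keep only words longer than 4 characters,
# a crude way of dropping short, common words
out = [w for w in text if len(w) > 4]
plt.figure(figsize=(20,5))
# Series.value_counts() replaces the deprecated top-level pd.value_counts()
pd.Series(out).value_counts().head(50).plot(kind='bar')
plt.show()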

In [44]:

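# Character-level occurrences: count each letter a-z, ignoring case
s = soup.text.lower().replace('\n', ' ')
out = [c for c in s if c in 'abcdefghijklmnopqrstuvwxyz']
plt.figure(figsize=(20,5))
# Series.value_counts() replaces the deprecated top-level pd.value_counts()
pd.Series(out).value_counts().sort_index().plot(kind='bar')
plt.show()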

Term Frequency – Inverse Document Frequency

TF-IDF scores how often a term occurs in a document, weighted against how common that term is across the whole document set. TF-IDF is calculated as TF * IDF, where:

TF(t) = number of times t appears in a document / total number of terms in the document
IDF(t) = log10(total number of documents / number of documents containing t)

Example: “Heartbleed” and “Cyber”

Suppose “Heartbleed” occurs 10 times within a 100 word document: TF(Heartbleed) = 10/100 = 0.1. Suppose we have 1000 documents and “Heartbleed” occurs in 10 of these: IDF(Heartbleed) = log10(1000 / 10) = 2. Then, TF-IDF = TF * IDF = 0.1 * 2 = 0.2.
Suppose now we look at the word “Cyber”, and assume it occurs 30 times within the 100 word document: TF(Cyber) = 30/100 = 0.3, but it also occurs in 750 of the 1000 documents: IDF(Cyber) = log10(1000 / 750) ≈ 0.125. Then, TF-IDF = TF * IDF = 0.3 * 0.125 ≈ 0.0375.

Here, Heartbleed scores higher than Cyber: although it occurs less often in this document, it is rare across the overall document set and is therefore deemed of greater significance.
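The calculation above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming base-10 logarithms as in the worked example; the function names and the toy corpus are hypothetical, and library implementations (which often add smoothing or use natural logs) will give different numbers.

```python
import math

def tf(term, document):
    """Term frequency: occurrences of term / total terms in the document."""
    return document.count(term) / len(document)

def idf(term, documents):
    """Inverse document frequency, base-10 log as in the worked example.

    Assumes the term occurs in at least one document (no smoothing).
    """
    containing = sum(1 for doc in documents if term in doc)
    return math.log10(len(documents) / containing)

def tf_idf(term, document, documents):
    """TF-IDF score of a term for one document within a document set."""
    return tf(term, document) * idf(term, documents)

# Hypothetical toy corpus of three tokenised documents
corpus = [
    "cyber attack exploits heartbleed bug in cyber systems".split(),
    "cyber security teams monitor cyber threats".split(),
    "new cyber policy announced".split(),
]

for term in ("heartbleed", "cyber"):
    print(term, round(tf_idf(term, corpus[0], corpus), 4))
```

In this toy corpus "cyber" appears in every document, so its IDF is log10(3/3) = 0 and its TF-IDF is zero however often it occurs, whereas "heartbleed" appears in only one document and scores higher.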

Case study example - Fake News

Text Analytics Lab

Advanced Concepts

Recurrent Neural Networks (RNNs).
- Deep learning for sequential/temporal data, e.g. predicting the next word in 'The cat sat on the ...'.
- Long Short-Term Memory (LSTM) networks are the most popular form of RNN; they are designed to mitigate the 'vanishing gradient' problem.
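
To give a feel for the 'vanishing gradient' problem, here is a minimal sketch using a scalar toy RNN, h_t = tanh(w * h_{t-1} + u * x_t). The function name, weights, and inputs are illustrative assumptions; this is not an LSTM, only a demonstration of why a plain RNN struggles with long sequences.

```python
import math

def gradient_through_time(w, u, inputs, h0=0.0):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w*h_{t-1} + u*x_t)."""
    h = h0
    grad = 1.0
    for x in inputs:
        h = math.tanh(w * h + u * x)
        # Chain rule: d h_t / d h_{t-1} = w * (1 - h_t^2)
        grad *= w * (1.0 - h * h)
    return grad

inputs = [0.5] * 50  # hypothetical input sequence
short_seq = gradient_through_time(0.9, 1.0, inputs[:5])
long_seq = gradient_through_time(0.9, 1.0, inputs)
print("gradient over 5 steps: ", short_seq)
print("gradient over 50 steps:", long_seq)
```

Each step multiplies the gradient by a factor smaller than 1 in magnitude (here |w| < 1 and 1 - h² ≤ 1), so the influence of early inputs shrinks exponentially with sequence length. LSTMs add a gated cell state that gives gradients a more direct path back through time.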

One-hot encoding works well for smaller dictionaries - how do we deal with larger sets?
- Word2Vec provides a compact 'embedding', like an autoencoder.
- Two approaches:
  - 'continuous bag of words' - given a set of words, what word fits with the set?
  - 'skip-grams' - given a single word, what set of words would fit with this?
- Doc2Vec - similar concept but for a set of documents rather than a set of words.
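
The two Word2Vec approaches differ in how training examples are drawn from text. The sketch below only generates the (input, output) pairs each approach would learn from; it does not train an embedding. The function names, the window size, and the completion of the sentence with 'mat' are assumptions made for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """(centre word, context word) pairs: given one word, predict its neighbours."""
    pairs = []
    for i, centre in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs

def cbow_pairs(tokens, window=2):
    """(context words, target word) pairs: given the neighbours, predict the word."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, target))
    return pairs

sentence = "the cat sat on the mat".split()
print(skipgram_pairs(sentence, window=1)[:4])
print(cbow_pairs(sentence, window=1)[:3])
```

In practice a library such as gensim handles both pair generation and training; the point here is only that continuous bag of words maps many words to one, while skip-gram maps one word to many.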

Are we learning a single output (vector) or a sequence of outputs? Models can be thought of as: vec2vec, vec2seq, seq2vec, and seq2seq.
- Language translation is a good candidate for seq2seq, since the input and output may both be of variable length.

Generative Pre-trained Transformer 3 (GPT-3) was introduced in 2020 by OpenAI in the research paper “Language Models are Few-Shot Learners”.
- The model has 175 billion parameters, and the paper reports strong performance on many language tasks given only a few examples in the prompt. Beyond traditional text tasks like sentence completion, demonstrations show its effectiveness at appearing to understand text, such as carrying out tasks described by a human: building applications from a written description, smart assistants that can recognise tasks and provide recommendations, and many other examples that are available online.
- Two videos that explain this further: 2 minute papers and Half Ideas.
- This line of research led to the release of ChatGPT.