In this session we will cover:
import requests
import pandas as pd
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
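# Download the page; the timeout and status check make network failures visible early
page = requests.get('https://en.wikipedia.org/wiki/Computer_security', timeout=30)
page.raise_for_status()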
soup = BeautifulSoup(page.content, 'html.parser')
text = soup.text.replace('\n', ' ').split(" ")
# Word-level Occurrences
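out = [w for w in text if len(w) > 4]  # keep only words longer than four characters
plt.figure(figsize=(20,5))
# pd.value_counts is deprecated in recent pandas; count via a Series instead
pd.Series(out).value_counts().head(50).plot(kind='bar')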
plt.show()
# Character-level Occurrences
s = soup.text.lower().replace('\n', ' ')
out = [c for c in s if c in 'abcdefghijklmnopqrstuvwxyz']
plt.figure(figsize=(20,5))
# Count each letter, then sort alphabetically so the bars run from a to z
pd.Series(out).value_counts().sort_index().plot(kind='bar')
plt.show()
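TF-IDF (term frequency-inverse document frequency) accounts for how often a term occurs in a document, whilst weighting against how often it occurs across the whole document set. TF-IDF is calculated as TF * IDF, where TF is the frequency of the term within a single document and IDF is the inverse of the proportion of documents in the set that contain the term.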
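One common way to write the two factors is given below. This is an assumed formulation, since variants differ in log base, smoothing and normalisation. Here f_{t,d} is the count of term t in document d, D is the document set and N is the number of documents in D (these symbols are introduced for illustration).

```latex
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t, D) = \log \frac{N}{\lvert \{\, d \in D : t \in d \,\} \rvert}, \qquad
\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)
```

Under this form, a term that appears in every document has an IDF of log(1) = 0, so it contributes nothing however often it occurs.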
Example: “Heartbleed” and “Cyber”
Here, Heartbleed scores higher than Cyber because it occurs in comparatively few documents of the overall set, so it is deemed of greater significance to the documents in which it does appear.
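The Heartbleed versus Cyber comparison can be reproduced with a minimal sketch. It assumes the plain variant of TF-IDF (relative term frequency multiplied by the natural log of N over document frequency, with no smoothing). The three-document corpus and the helper names `tf`, `idf` and `tf_idf` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus: "cyber" occurs in every document, "heartbleed" in only one
docs = [
    "cyber attack exploits heartbleed bug heartbleed leaks memory",
    "cyber security policy for cyber defence",
    "cyber crime report on cyber fraud",
]
tokenized = [d.split() for d in docs]


def tf(term, tokens):
    # Term frequency: share of the document's tokens that equal the term
    return Counter(tokens)[term] / len(tokens)


def idf(term, corpus):
    # Inverse document frequency: log(N / number of documents containing the term)
    df = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / df) if df else 0.0


def tf_idf(term, tokens, corpus):
    # TF-IDF is the product of the two factors
    return tf(term, tokens) * idf(term, corpus)


# Score both terms against the first document
heartbleed_score = tf_idf("heartbleed", tokenized[0], tokenized)
cyber_score = tf_idf("cyber", tokenized[0], tokenized)
print(f"heartbleed: {heartbleed_score:.3f}, cyber: {cyber_score:.3f}")
```

In this toy corpus "cyber" appears in all three documents, so its IDF is log(3/3) = 0 and its score is zero, while "heartbleed" appears in only one document and receives a positive score. Library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalisation by default, so their numbers will differ from this sketch.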