In this notebook, we give a brief overview of Python and Pandas, looking at the tooling that you will need for the Cyber Security Analytics module. Most of the details here should already be familiar to you, which is why the explanations are kept brief, so you should be able to read through and understand the operations being performed. If any aspects look unfamiliar, play around with the notebook: change the code and see what happens. Notebooks are a fantastic resource for modifying and trying out techniques.
For students using the UWEcyber Virtual Machine, to get started with JupyterLab, open the Terminal application and type:
For those who choose to use Windows, it is recommended that you install Anaconda as a Python environment and package management system.
It is strongly recommended that you try the code samples in your own notebook instance to fully understand the examples.
To begin, let's show a very quick example of generating a dataset and plotting it. Study the code cell below, where we create two NumPy arrays, a and b, each holding two rows of normally distributed random values:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
a = np.random.normal(1.0, 0.3, [2, 1000])
b = np.random.normal(1.8, 0.2, [2, 100])
If we wish to examine what the variable content is, we can simply type the variable name into an empty cell:
a
array([[1.43895926, 1.01081998, 0.73561982, ..., 0.27903605, 1.17242986, 1.68223149],
       [1.10485665, 0.58160588, 0.87423119, ..., 1.04769919, 0.74284941, 0.87881302]])
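Printing a large array only shows a few of its values, so it often helps to check its shape and some summary statistics as well. The following is a minimal sketch of that idea. The seeded generator `rng` and the seed value 42 are assumptions added here so that repeated runs give the same numbers; the cell above uses the unseeded np.random.normal.

```python
import numpy as np

# Assumption: a seeded generator so repeated runs give the same values
rng = np.random.default_rng(42)

# Same parameters as the cell above: mean 1.0, standard deviation 0.3,
# arranged as 2 rows by 1000 columns
a = rng.normal(1.0, 0.3, [2, 1000])

print(a.shape)         # (rows, columns)
print(a.mean(axis=1))  # one mean per row, each close to 1.0
print(a.std(axis=1))   # one standard deviation per row, each close to 0.3
```

Each row holds 1000 samples, so a[0] and a[1] can be treated as the X and Y coordinates of 1000 points.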
We can then plot the variables as points on a scatterplot, where we treat the data as coordinate points (where each point has two attributes, X and Y).
plt.scatter(a[0], a[1])
plt.scatter(b[0], b[1])
<matplotlib.collections.PathCollection at 0x206039e3e50>
Basic variables in Python include numerical values (integers and floats; Python's float is already double precision, so there is no separate double type) and string values (text). Basic data structures include lists and dictionaries.
number = 1
text = 'hello_everyone'
l = [1,2,3,4,5]
d = {'name':'bob', 'value':100}
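If we are unsure which type a variable holds, the built-in type function will tell us. The sketch below reuses the four variables from the cell above and adds one floating point value for comparison.

```python
number = 1
text = 'hello_everyone'
l = [1, 2, 3, 4, 5]
d = {'name': 'bob', 'value': 100}

# Print the type name next to each value
for value in (number, text, l, d, 18.4732):
    print(type(value).__name__, value)
```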
First, let's look at some basics of Python and get used to manipulating data using the built-in variable types.
# First some simple variables
an_integer = 12
a_floating_point_number = 18.4732
# We can do some simple maths on these variables and see the output
an_integer + a_floating_point_number
30.4732
# We can also write functions to do simple maths
def multiply(number1, number2):
    return number1 * number2
multiply(an_integer, a_floating_point_number)
221.67839999999998
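Notice that the result is not exactly 221.6784. Floating point numbers are stored in binary, so many decimal values can only be approximated. A small sketch of two common ways to handle this, using only the standard library, is to round for display or to compare with a tolerance:

```python
import math

result = 12 * 18.4732

print(result)            # the raw floating point result
print(round(result, 4))  # rounded to 4 decimal places for display

# Compare with a tolerance rather than with ==
print(math.isclose(result, 221.6784))  # True
```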
# We can also create text variables just as easily
a_string = 'Hello there!'
print (a_string)
Hello there!
# We can do some simple manipulation of text
my_name = 'Phil'
message = a_string + ' My name is ' + my_name
print (message)
# Including splitting sentences to corrupt a message
imposter_name = 'Dave'
s_m = message.split(" ")
new_message = ' '.join(s_m[:-1]) + ' ' + imposter_name
print (new_message)
Hello there! My name is Phil
Hello there! My name is Dave
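The same manipulation can be written more compactly. As a sketch of one alternative (not the approach used in the cell above), an f-string builds the message and str.replace swaps the name:

```python
my_name = 'Phil'
imposter_name = 'Dave'

# An f-string inserts the variable directly into the text
message = f'Hello there! My name is {my_name}'

# replace returns a new string; the original message is unchanged
new_message = message.replace(my_name, imposter_name)

print(message)
print(new_message)
```

Note that replace swaps every occurrence of the name, whereas the split and join version above only changes the final word.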
# We also have lists that can store collections of values
fruits = ['apple','banana','orange','lemon']
print (fruits)
# We can access subsets of the list using slices
print (fruits[0:2])
print (fruits[:-1])
# We can append items to the list, and remove items from the list
fruits.append('mango')
print (fruits)
fruits.remove('banana')
print (fruits)
['apple', 'banana', 'orange', 'lemon']
['apple', 'banana']
['apple', 'banana', 'orange']
['apple', 'banana', 'orange', 'lemon', 'mango']
['apple', 'orange', 'lemon', 'mango']
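Slicing can be confusing at first, so here is a short sketch of the rules used above: the start index is included, the stop index is excluded, and negative indexes count from the end of the list. A slice is also a new list, so changing it does not affect the original.

```python
fruits = ['apple', 'banana', 'orange', 'lemon']

first_two = fruits[0:2]     # start index included, stop index excluded
all_but_last = fruits[:-1]  # everything except the final item
last_item = fruits[-1]      # a negative index counts from the end

# A slice is a copy, so appending to it leaves fruits unchanged
first_two.append('mango')

print(fruits)
print(first_two)
print(all_but_last)
print(last_item)
```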
# We can also create dictionary objects
# This is helpful for storing related variables about an object
person = {}
person['name'] = 'bob'
person['age'] = 23
person['height'] = 185
person['email'] = 'bob@bobmail.com'
print (person)
{'name': 'bob', 'age': 23, 'height': 185, 'email': 'bob@bobmail.com'}
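Once a dictionary exists, we can read values back by key and loop over its contents. In the sketch below, the 'phone' key is hypothetical and is only there to show what happens when a key is missing.

```python
person = {'name': 'bob', 'age': 23, 'height': 185, 'email': 'bob@bobmail.com'}

# Square brackets read a value by key (a missing key raises KeyError)
print(person['name'])

# get returns a default value instead of raising an error
# ('phone' is a hypothetical key that this dictionary does not contain)
print(person.get('phone', 'unknown'))

# items lets us loop over each key together with its value
for key, value in person.items():
    print(key, '->', value)
```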
# Like earlier, we could use a function to create 'person' objects
people = []
def create_person(name, age, height, email):
    global people
    new_person = {'name':name,
                  'age':age,
                  'height':height,
                  'email':email}
    people.append(new_person)
create_person('bob', 23, 177, 'bob@bobmail.com')
create_person('john', 41, 185, 'john@johnmail.com')
create_person('sophie', 31, 157, 'sophie@sophiemail.com')
create_person('wendy', 19, 164, 'wendy@wendymail.com')
create_person('amanda', 29, 174, 'wendy@wendymail.com')
create_person('daisy', 42, 176, 'wendy@wendymail.com')
create_person('michael', 30, 162, 'wendy@wendymail.com')
# Here we store our person objects in our people list
# to make a group of 'persons' - a.k.a. people!
print (people)
[{'name': 'bob', 'age': 23, 'height': 177, 'email': 'bob@bobmail.com'}, {'name': 'john', 'age': 41, 'height': 185, 'email': 'john@johnmail.com'}, {'name': 'sophie', 'age': 31, 'height': 157, 'email': 'sophie@sophiemail.com'}, {'name': 'wendy', 'age': 19, 'height': 164, 'email': 'wendy@wendymail.com'}, {'name': 'amanda', 'age': 29, 'height': 174, 'email': 'wendy@wendymail.com'}, {'name': 'daisy', 'age': 42, 'height': 176, 'email': 'wendy@wendymail.com'}, {'name': 'michael', 'age': 30, 'height': 162, 'email': 'wendy@wendymail.com'}]
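The function above changes a global list as a side effect. One alternative design, sketched here with a hypothetical function name make_person, is to return the new dictionary and let the caller decide where to store it. This tends to be easier to test and reuse.

```python
def make_person(name, age, height, email):
    """Hypothetical variant of create_person that returns the dictionary."""
    return {'name': name, 'age': age, 'height': height, 'email': email}

# Build the list directly from the returned dictionaries
people = [
    make_person('bob', 23, 177, 'bob@bobmail.com'),
    make_person('john', 41, 185, 'john@johnmail.com'),
]

# The built-in max can compare people by any key we choose
tallest = max(people, key=lambda p: p['height'])
print(tallest['name'])
```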
We have covered a lot very quickly here. You have now used the main built-in types of Python that store numerical and text data, along with data structures such as lists (ordered, resizable sequences, similar to arrays in other languages) and dictionaries (key-value mappings, similar to simple objects or records). Let's now explore this further by introducing some of the data science libraries.
# We can import libraries using the following
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Our people list of dictionaries is difficult for us to read clearly
# Pandas DataFrames help manipulate tabular data like this very easily
data = pd.DataFrame(people)
data
| | name | age | height | email |
|---|---|---|---|---|
| 0 | bob | 23 | 177 | bob@bobmail.com |
| 1 | john | 41 | 185 | john@johnmail.com |
| 2 | sophie | 31 | 157 | sophie@sophiemail.com |
| 3 | wendy | 19 | 164 | wendy@wendymail.com |
| 4 | amanda | 29 | 174 | wendy@wendymail.com |
| 5 | daisy | 42 | 176 | wendy@wendymail.com |
| 6 | michael | 30 | 162 | wendy@wendymail.com |
# We can access individual columns of the data now
data['age']
0    23
1    41
2    31
3    19
4    29
5    42
6    30
Name: age, dtype: int64
# Who is the tallest of the users? Let's find out
data[data['height'] == np.max(data['height'])]
| | name | age | height | email |
|---|---|---|---|---|
| 1 | john | 41 | 185 | john@johnmail.com |
# Who is the shortest of the users? Let's find out
data[data['height'] == np.min(data['height'])]
| | name | age | height | email |
|---|---|---|---|---|
| 2 | sophie | 31 | 157 | sophie@sophiemail.com |
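Pandas also has its own methods for this kind of question. As a sketch using a smaller table with the same values as some of the rows above, idxmax and idxmin return the index label of the largest and smallest value, and sort_values orders the whole table.

```python
import pandas as pd

# A smaller table using values from the rows above
data = pd.DataFrame({
    'name': ['bob', 'john', 'sophie', 'wendy'],
    'height': [177, 185, 157, 164],
})

# idxmax / idxmin give the index label of the largest / smallest value
tallest = data.loc[data['height'].idxmax()]
shortest = data.loc[data['height'].idxmin()]
print(tallest['name'], shortest['name'])

# sort_values orders every row by the chosen column
print(data.sort_values('height', ascending=False))
```

One difference to be aware of: idxmax returns only the first matching row, whereas the boolean comparison used above returns every row that shares the maximum value.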
# What if we want to plot this data quickly?
data.plot()
plt.show()
plt.bar(range(len(data)), data['age'])
plt.show()
plt.bar(range(len(data)), data['height'], color='orange')
plt.show()
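The bar charts above use the row number on the x-axis, which makes it hard to tell who each bar belongs to. A minimal sketch of a labelled version, again using a smaller table with values taken from the rows above, passes the names as the x-axis values and adds axis labels and a title:

```python
import matplotlib.pyplot as plt
import pandas as pd

# A smaller table using values from the rows above
data = pd.DataFrame({
    'name': ['bob', 'john', 'sophie', 'wendy'],
    'age': [23, 41, 31, 19],
})

fig, ax = plt.subplots()

# Using the names as x values labels each bar with the person it represents
ax.bar(data['name'], data['age'])
ax.set_xlabel('name')
ax.set_ylabel('age')
ax.set_title('Age of each person')
plt.show()
```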
Spend some time exploring Pandas, Matplotlib, and NumPy. These libraries are core to manipulating numerical and tabular data and to visualizing the results. There are many great examples online for getting started with them.