UFCFFY-15-M Cyber Security Analytics

Practical Lab 0: Python Primer


In this notebook, we give a brief overview of Python and Pandas, looking at the kind of tooling that you will need for the Cyber Security Analytics module. Most of the details here should already be familiar to you, which is why the explanations are kept brief, so you should be able to read through and understand the operations being performed. If any aspects look unfamiliar, play around with the notebook: change the code and see what happens. Notebooks are a fantastic resource for modifying and trying out techniques.

For students using the UWEcyber Virtual Machine, to get started with JupyterLab, open the Terminal application and type:

  • python -m jupyterlab

For those who choose to use Windows, it is recommended that you install Anaconda as a Python environment and package management system.

*It is strongly recommended that you try the code samples in your own notebook instance to fully understand the examples.*

To begin, let's look at a very quick example of generating a dataset and plotting it. Study the code cell below, where we create two Python variables, a and b:

  • Which library are we using to generate our data?
  • Which function are we using to generate our data?
  • What do the parameters of the function mean?
In [14]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

a = np.random.normal(1.0, 0.3, [2, 1000])
b = np.random.normal(1.8, 0.2, [2, 100])

If we wish to examine the content of a variable, we can simply type its name into an empty cell:

In [17]:
a
Out[17]:
array([[1.43895926, 1.01081998, 0.73561982, ..., 0.27903605, 1.17242986,
        1.68223149],
       [1.10485665, 0.58160588, 0.87423119, ..., 1.04769919, 0.74284941,
        0.87881302]])
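
As a hedged illustration of the questions above: np.random.normal takes the mean of the distribution (loc), its standard deviation (scale), and the shape of the array to return (size). The sketch below passes these as keyword arguments and fixes a random seed; both choices, and the variable name sample, are additions for illustration and are not part of the original cell.

```python
import numpy as np

np.random.seed(42)  # assumption: fixed seed so the run is repeatable

# loc = mean, scale = standard deviation, size = shape of the output array
sample = np.random.normal(loc=1.0, scale=0.3, size=[2, 1000])

print(sample.shape)   # two rows, 1000 values in each
print(sample.mean())  # should be close to loc
print(sample.std())   # should be close to scale
```

With only a finite number of draws, the sample mean and standard deviation will be close to, but not exactly equal to, the requested values.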

We can then plot the variables as points on a scatterplot, where we treat the data as coordinate points (where each point has two attributes, X and Y).

  • Which library are we using to plot the data?
  • Which function are we using to plot the data?
  • How do we specify the X values of variable a?
  • How many coordinate points are in b?
In [18]:
plt.scatter(a[0], a[1])
plt.scatter(b[0], b[1])
Out[18]:
<matplotlib.collections.PathCollection at 0x206039e3e50>
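
The scatterplot above can be made easier to read by labelling each group and the axes. The following is a minimal sketch, assuming the same two arrays as before; the label, alpha, axis label, and legend settings are illustrative additions rather than part of the original cell.

```python
import numpy as np
import matplotlib.pyplot as plt

a = np.random.normal(1.0, 0.3, [2, 1000])
b = np.random.normal(1.8, 0.2, [2, 100])

# Row 0 holds the X values and row 1 holds the Y values
print(a[0].shape, b.shape[1])  # number of points in a, number of points in b

fig, ax = plt.subplots()
ax.scatter(a[0], a[1], label='a', alpha=0.5)  # alpha makes overlapping points visible
ax.scatter(b[0], b[1], label='b', alpha=0.5)
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.legend()
plt.show()
```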

Basic Operations in Python

Basic variables in Python include numerical values (integers, floats, complex numbers, etc.) and string values (text). Python has no separate double type; its float is already double precision. Basic data structures include lists and dictionaries.

  • Lists are essentially like arrays. We can create a dynamic group of (mixed) variables. We can append and remove from this list, and we can access elements from the list.
  • Dictionaries are like objects. We can create name-value pairs to reference attributes that make up an object (e.g., properties of a car).
  • We can create lists of dictionaries, and we can have lists within dictionaries. We can also have a list of lists (nested lists), and we can have dictionaries as values in a dictionary.
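
The nesting described in the bullets above can be sketched as follows. All of the names and values here (car, garage, grid) are hypothetical and chosen only for illustration.

```python
# A dictionary whose values include a list and another dictionary
car = {
    'make': 'Ford',
    'owners': ['alice', 'bob'],
    'engine': {'fuel': 'petrol', 'litres': 1.6},
}

# A list of dictionaries
garage = [
    car,
    {'make': 'Mini', 'owners': [], 'engine': {'fuel': 'electric', 'litres': 0}},
]

# A list of lists (nested list)
grid = [[1, 2, 3], [4, 5, 6]]

# Chain the lookups from the outside in
print(car['owners'][0])       # first owner in the list stored under 'owners'
print(car['engine']['fuel'])  # value inside the inner dictionary
print(garage[1]['make'])      # 'make' of the second car in the list
print(grid[1][2])             # third item of the second inner list
```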
In [26]:
number = 1
text = 'hello_everyone'
l = [1,2,3,4,5]
d = {'name':'bob', 'value':100}
  • What kind of data type is the variable d?
  • What kind of data type is the variable l?
  • How would I retrieve the value 'bob' from variable d?
  • How would I retrieve the 4th element of variable l?
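
One way to check your answers to the questions above is to run a short cell such as the sketch below. It assumes the variables l and d defined earlier; the use of type() and dict.get() with a fallback value is an illustrative addition.

```python
l = [1, 2, 3, 4, 5]
d = {'name': 'bob', 'value': 100}

print(type(l))  # reports the data type of l
print(type(d))  # reports the data type of d

# Lists are indexed from zero, so the 4th element is at index 3
print(l[3])

# Dictionary values are looked up by key
print(d['name'])

# get() returns a fallback instead of raising an error for a missing key
print(d.get('colour', 'unknown'))
```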

Let's walk through some Python examples

First, let's look at some basics of Python and get used to manipulating data using the built-in variable types.

In [19]:
# First some simple variables

an_integer = 12
a_floating_point_number = 18.4732
In [20]:
# We can do some simple maths on these variables and see the output

an_integer + a_floating_point_number
Out[20]:
30.4732
In [21]:
# We can also write functions to do simple maths

def multiply(number1, number2):
    return number1 * number2

multiply(an_integer, a_floating_point_number)
Out[21]:
221.67839999999998
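
The trailing digits in the output above come from the way floating-point numbers are stored in binary: many decimal fractions cannot be represented exactly. The sketch below shows two common ways of handling this, math.isclose for comparisons and round for display. The variable name result is an illustrative addition.

```python
import math

def multiply(number1, number2):
    return number1 * number2

result = multiply(12, 18.4732)

# Exact equality on floats can fail because of tiny rounding errors
print(result == 221.6784)

# Comparing within a small tolerance is more robust
print(math.isclose(result, 221.6784))

# Rounding is useful when presenting a value
print(round(result, 4))
```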
In [22]:
# We can also create text variables just as easily

a_string = 'Hello there!'
print (a_string)
Hello there!
In [23]:
# We can do some simple manipulation of text

my_name = 'Phil'
message = a_string + ' My name is ' + my_name
print (message)

# Including splitting sentences to corrupt a message
imposter_name = 'Dave'
s_m = message.split(" ")
new_message = ' '.join(s_m[:-1]) + ' ' + imposter_name
print (new_message)
Hello there! My name is Phil
Hello there! My name is Dave
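
To see how the message was altered, it can help to print each intermediate step. This is a minimal sketch of the same split, slice, and join sequence; the variable names words and kept are illustrative additions.

```python
message = 'Hello there! My name is Phil'

# Break the sentence into a list of words at each space
words = message.split(' ')
print(words)

# Keep every word except the last one
kept = words[:-1]
print(kept)

# Add the replacement name and glue the words back together
new_message = ' '.join(kept + ['Dave'])
print(new_message)
```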
In [24]:
# We also have lists of data that can store variables
fruits = ['apple','banana','orange','lemon']
print (fruits)
# We can access sets of variables using indexes
print (fruits[0:2])
print (fruits[:-1])
# We can append items to the list, and remove items from the list
fruits.append('mango')
print (fruits)
fruits.remove('banana')
print (fruits)
['apple', 'banana', 'orange', 'lemon']
['apple', 'banana']
['apple', 'banana', 'orange']
['apple', 'banana', 'orange', 'lemon', 'mango']
['apple', 'orange', 'lemon', 'mango']
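
A few further details of list indexing are worth trying for yourself. The sketch below is illustrative only; the names first_two and numbers are hypothetical.

```python
fruits = ['apple', 'banana', 'orange', 'lemon']

print(fruits[-1])   # negative indexes count from the end of the list
print(fruits[1:3])  # the start index is included, the end index is excluded
print(fruits[::2])  # a third value sets the step size

# Slicing returns a new list, so changing it leaves the original untouched
first_two = fruits[0:2]
first_two.append('kiwi')
print(fruits)

# remove() deletes only the first matching value
numbers = [1, 2, 1, 3]
numbers.remove(1)
print(numbers)
```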
In [25]:
# We can also create dictionary objects 
# This is helpful for storing related variables about an object

person = {}
person['name'] = 'bob'
person['age'] = 23
person['height'] = 185
person['email'] = 'bob@bobmail.com'
print (person)
{'name': 'bob', 'age': 23, 'height': 185, 'email': 'bob@bobmail.com'}
In [30]:
# Like earlier, we could use a function to create 'person' objects
people = []

def create_person(name, age, height, email):
    global people
    new_person = {'name':name,
                 'age':age,
                 'height':height,
                 'email':email}
    people.append(new_person)

create_person('bob', 23, 177, 'bob@bobmail.com')
create_person('john', 41, 185, 'john@johnmail.com')
create_person('sophie', 31, 157, 'sophie@sophiemail.com')
create_person('wendy', 19, 164, 'wendy@wendymail.com')
create_person('amanda', 29, 174, 'wendy@wendymail.com')
create_person('daisy', 42, 176, 'wendy@wendymail.com')
create_person('michael', 30, 162, 'wendy@wendymail.com')

# Here we store our person objects in our people list
# to make a group of 'persons' - a.k.a. people!
print (people)
[{'name': 'bob', 'age': 23, 'height': 177, 'email': 'bob@bobmail.com'}, {'name': 'john', 'age': 41, 'height': 185, 'email': 'john@johnmail.com'}, {'name': 'sophie', 'age': 31, 'height': 157, 'email': 'sophie@sophiemail.com'}, {'name': 'wendy', 'age': 19, 'height': 164, 'email': 'wendy@wendymail.com'}, {'name': 'amanda', 'age': 29, 'height': 174, 'email': 'wendy@wendymail.com'}, {'name': 'daisy', 'age': 42, 'height': 176, 'email': 'wendy@wendymail.com'}, {'name': 'michael', 'age': 30, 'height': 162, 'email': 'wendy@wendymail.com'}]
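
An alternative design, sketched here as a suggestion rather than as the module's required approach, is to have the function return the new dictionary and let the caller decide where to store it. This avoids relying on a global list. (Appending to a list does not actually require the global statement; it is only needed when a function reassigns a global name.) The names make_person and staff are hypothetical.

```python
def make_person(name, age, height, email):
    """Return a new person dictionary without touching any global state."""
    return {'name': name, 'age': age, 'height': height, 'email': email}

# The caller builds the list, so the function can be reused for any group
staff = [
    make_person('bob', 23, 177, 'bob@bobmail.com'),
    make_person('john', 41, 185, 'john@johnmail.com'),
]

for person in staff:
    print(person['name'], person['age'])
```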

Introducing data science libraries

We have covered a lot very quickly here. You have now used the main built-in types of Python, which allow you to store numerical and text data, along with data structures such as lists (which are essentially arrays) and dictionaries (which are essentially objects). Let's now explore this further by introducing some of the data science libraries.

In [31]:
# We can import libraries using the following
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
In [32]:
# Our people list of dictionaries is difficult for us to read clearly
# Pandas DataFrames help manipulate tabular data like this very easily

data = pd.DataFrame(people)
data
Out[32]:
name age height email
0 bob 23 177 bob@bobmail.com
1 john 41 185 john@johnmail.com
2 sophie 31 157 sophie@sophiemail.com
3 wendy 19 164 wendy@wendymail.com
4 amanda 29 174 wendy@wendymail.com
5 daisy 42 176 wendy@wendymail.com
6 michael 30 162 wendy@wendymail.com
In [33]:
# We can access individual columns of the data now
data['age']
Out[33]:
0    23
1    41
2    31
3    19
4    29
5    42
6    30
Name: age, dtype: int64
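
Beyond selecting a single column, a few other access patterns are used constantly. The sketch below uses a small, hypothetical subset of the people data so that it runs on its own.

```python
import pandas as pd

# Assumption: a shortened copy of the people list, for illustration only
people = [
    {'name': 'bob', 'age': 23, 'height': 177},
    {'name': 'john', 'age': 41, 'height': 185},
    {'name': 'sophie', 'age': 31, 'height': 157},
]
data = pd.DataFrame(people)

# Several columns: pass a list of column names
print(data[['name', 'age']])

# Summary statistic of one column
print(data['age'].mean())

# Rows where a condition holds (boolean mask)
print(data[data['age'] > 30])

# A single cell, by row label and column name
print(data.loc[0, 'name'])
```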
In [34]:
# Who is the tallest of the users? Let's find out
data[data['height'] == np.max(data['height'])]
Out[34]:
name age height email
1 john 41 185 john@johnmail.com
In [35]:
# Who is the shortest of the users? Let's find out
data[data['height'] == np.min(data['height'])]
Out[35]:
name age height email
2 sophie 31 157 sophie@sophiemail.com
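
The same questions can also be answered with pandas' own methods. The sketch below, which uses a hypothetical subset of the data, relies on idxmax and idxmin to find the row labels of the largest and smallest values. Note one difference from the boolean-mask approach above: idxmax and idxmin return only the first matching row, whereas the mask returns every row that ties.

```python
import pandas as pd

# Assumption: a shortened copy of the people data, for illustration only
data = pd.DataFrame({
    'name': ['bob', 'john', 'sophie', 'wendy'],
    'height': [177, 185, 157, 164],
})

# idxmax/idxmin give the index label of the max/min value
tallest = data.loc[data['height'].idxmax()]
shortest = data.loc[data['height'].idxmin()]
print(tallest['name'], shortest['name'])

# Sorting gives the full ranking rather than just the extremes
ranked = data.sort_values('height', ascending=False)
print(ranked)
```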
In [36]:
# What if we want to plot this data quickly?
data.plot()
plt.show()
  • What would be a better way to represent this data? (Hint: one option is shown in the next cell.)
In [51]:
plt.bar(range(len(data)), data['age'])
plt.show()

plt.bar(range(len(data)), data['height'], color='orange')
plt.show()

What next?

Spend some time researching Pandas, Matplotlib, and NumPy. These libraries are core to manipulating numerical and tabular data and to visualising the results. There are many great examples online of getting started with them.

  • Pandas documentation
  • Matplotlib documentation
  • NumPy documentation