For this “Hello World” example, you are working on a problem given to you by the manager of the Security Operations Center (SOC). It seems the SOC analysts are becoming inundated with “trivial” alerts ever since a new data set of indicators was introduced into the Security Information and Event Management (SIEM) system. They have asked for your help in reducing the number of “trivial” alerts without sacrificing visibility.
This is a good problem to tackle through data analysis: after some exploratory data analysis we should be able to form a solid, practical question to ask, and hopefully arrive at an answer that helps out the SOC.
This example is adapted from the textbook "Data-Driven Security: Analysis, Visualization and Dashboards" by Jay Jacobs and Bob Rudis.
Data: You will need to access the Lab 01 Dataset available on Blackboard to complete this task.
It is strongly recommended that you try the code samples in your own notebook instance to fully understand the examples.
import pandas as pd
import sys
# read in the data into a pandas data frame
avRep = "./example_data/reputation.data"
av = pd.read_csv(avRep, sep="#", header=None)
# make smarter column names
av.columns = ["IP","Reliability","Risk","Type","Country", "Locale","Coords","x"]
av
 | IP | Reliability | Risk | Type | Country | Locale | Coords | x |
---|---|---|---|---|---|---|---|---|
0 | 222.76.212.189 | 4 | 2 | Scanning Host | CN | Xiamen | 24.4797992706,118.08190155 | 11 |
1 | 222.76.212.185 | 4 | 2 | Scanning Host | CN | Xiamen | 24.4797992706,118.08190155 | 11 |
2 | 222.76.212.186 | 4 | 2 | Scanning Host | CN | Xiamen | 24.4797992706,118.08190155 | 11 |
3 | 5.34.246.67 | 6 | 3 | Spamming | US | NaN | 38.0,-97.0 | 12 |
4 | 178.94.97.176 | 4 | 5 | Scanning Host | UA | Merefa | 49.8230018616,36.0507011414 | 11 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
258621 | 179.244.194.219 | 4 | 2 | Spamming | BR | NaN | -10.0,-55.0 | 12 |
258622 | 216.99.159.166 | 4 | 2 | Scanning Host | US | Walnut | 34.0115013123,-117.853500366 | 11 |
258623 | 216.99.159.169 | 3 | 2 | Scanning Host | US | Walnut | 34.0115013123,-117.853500366 | 11 |
258624 | 216.99.159.176 | 3 | 2 | Scanning Host | US | Walnut | 34.0115013123,-117.853500366 | 11 |
258625 | 216.99.159.117 | 3 | 3 | Scanning Host | US | Walnut | 34.0115013123,-117.853500366 | 11 |
258626 rows × 8 columns
Before going any deeper, let's just look at the data so that we know what we are working with:
Reliability, Risk, and x are integers.
IP, Type, Country, Locale, and Coords are character strings.
The IP address is stored in the dotted-quad notation, not in hostnames or decimal format.
Each record is associated with a unique IP address, so there are 258,626 IP addresses (in this download).
Each IP address has been geo-located into the latitude and longitude pair in the Coords field, but they are in a single field separated by a comma. You will have to parse that further if you want to use that field.
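Before moving on, we can verify those types directly, and sketch one way of parsing the Coords field into two numeric columns (the Lat and Long names below are our own choice, not part of the data set):
# confirm the column types pandas inferred
print(av.dtypes)
# split the comma-separated Coords field into two float columns
coords = av['Coords'].str.split(',', expand=True).astype(float)
coords.columns = ['Lat', 'Long']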
What do we mean by descriptive statistics? As the name suggests, these essentially describe the properties of our data. They help with summarisation, and provide an easier means of comparison between two groups of data. You will likely be familiar with some of these concepts, but it is nevertheless important to think about how they can be used to summarise a data set, whether there are any potential issues with using them, and how we can overcome those issues.
Commonly used descriptive statistics include: the count of observations; the mean (average); the standard deviation; the minimum and maximum; and the quartiles (the 25%, 50% and 75% points, where the 50% point is the median).
Whilst we can calculate these "in code" ourselves, or use built-in functions such as np.mean(), Pandas provides a convenient describe() function that computes all of them at once.
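As a quick sanity check, here is a sketch of computing a few of these by hand; the results should agree with the describe() output below:
import numpy as np
# mean, median and standard deviation computed directly
print(np.mean(av['Reliability']))    # matches describe()'s mean
print(np.median(av['Reliability']))  # matches the 50% quartile
print(av['Reliability'].std())       # matches describe()'s std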
av['Reliability'].describe()
count    258626.000000
mean          2.798040
std           1.130419
min           1.000000
25%           2.000000
50%           2.000000
75%           4.000000
max          10.000000
Name: Reliability, dtype: float64
av['Risk'].describe()
count    258626.000000
mean          2.221362
std           0.531571
min           1.000000
25%           2.000000
50%           2.000000
75%           2.000000
max           7.000000
Name: Risk, dtype: float64
Above, note how we can select the specific column using av['Reliability'] or av['Risk'].
An important note to make (from the AlienVault documentation) is that Risk and Reliability are scored 1-10; however, these are ordinal values rather than numerical ones.
What does this mean? Essentially, ordinal values denote order, but they are not quantities. Therefore, a Risk score of 4 is not twice the risk of a score of 2; it is simply greater.
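If we wanted to make that ordinal nature explicit in Pandas (a sketch, and not required for the rest of this lab), we could store the scores as ordered categoricals, which permit comparisons but not arithmetic:
# treat Risk as ordered categories 1..10 rather than as numbers
risk_ordinal = pd.Categorical(av['Risk'], categories=range(1, 11), ordered=True)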
def factor_col(col):
    # treat the column as a categorical "factor", then count the
    # occurrences of each level, keeping the levels in their natural order
    factor = pd.Categorical(col)
    return pd.value_counts(factor, sort=False)
rel_ct = pd.value_counts(av['Reliability'])
risk_ct = pd.value_counts(av['Risk'])
type_ct = pd.value_counts(av['Type'])
country_ct = pd.value_counts(av['Country'])
print ("--- Reliability ---")
print (factor_col(av['Reliability']))
print ("\n ")
print ("--- Risk ---")
print (factor_col(av['Risk']))
print ("\n ")
print ("--- Type ---")
print (factor_col(av['Type']).head(n=10))
print ("\n ")
print ("--- Country ---")
print (factor_col(av['Country']).head(n=10))
print ("\n ")
--- Reliability ---
1       5612
2     149117
3      10892
4      87040
5          7
6       4758
7        297
8         21
9        686
10       196
dtype: int64

--- Risk ---
1        39
2    213852
3     33719
4      9588
5      1328
6        90
7        10
dtype: int64

--- Type ---
APT;Malware Domain                  1
C&C                               610
C&C;Malware Domain                 31
C&C;Malware IP                     20
C&C;Scanning Host                   7
Malicious Host                   3770
Malicious Host;Malware Domain       4
Malicious Host;Malware IP           2
Malicious Host;Scanning Host      163
Malware Domain                   9274
dtype: int64

--- Country ---
A1     267
A2       2
AE    1827
AL       4
AM       6
AN       3
AO     256
AR    3046
AT      51
AU     155
dtype: int64
Above, we define our own function called factor_col() that identifies all possible values within a given column and counts the number of occurrences of each. A similar result can be achieved in Pandas using groupby.
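For instance, a sketch of the equivalent counts using groupby:
# group on the column and count the rows in each group;
# this mirrors what factor_col() returns
av.groupby('Reliability').size()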
import matplotlib.pyplot as plt
import seaborn as sb
# count alerts per country (value_counts sorts by frequency, descending)
country_ct = pd.value_counts(av['Country'])
# plot the data
plt.axes(frameon=0) # reduce chart junk
country_ct[:20].plot(kind='bar', rot=0, title="Summary By Country", figsize=(20,10)).grid(False)
plt.show()
plt.axes(frameon=0) # reduce chart junk
factor_col(av['Reliability']).plot(kind='bar', rot=0, title="Summary By 'Reliability'", figsize=(20,10)).grid(False)
plt.show()
plt.axes(frameon=0) # reduce chart junk
factor_col(av['Risk']).plot(kind='bar', rot=0, title="Summary By 'Risk'", figsize=(20,10)).grid(False)
plt.show()
Above, we have created bar plots that show the values for each attribute, allowing us to examine these in greater detail.
top10 = pd.value_counts(av['Country'])[0:9]
# calculate the proportion of all records for each of these
top10.astype(float) / len(av['Country'])
CN    0.265182
US    0.194826
TR    0.053970
DE    0.038484
NL    0.030666
RU    0.024537
GB    0.024332
IN    0.021189
FR    0.021069
Name: Country, dtype: float64
Note that above, we have divided through by the length of the Country column, giving the proportion of all records attributed to each country rather than absolute counts. Also note that the slice [0:9] actually returns only the first nine entries; a true top ten would need [0:10] or .head(10).
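Incidentally, Pandas can compute these proportions directly; a sketch using the normalize flag of value_counts:
# proportions instead of raw counts, in one step
av['Country'].value_counts(normalize=True).head(10)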
Perhaps we want to look at both risk and reliability together? We can use a crosstab to achieve this.
from matplotlib import cm
from numpy import arange
print(pd.crosstab(av['Risk'], av['Reliability']).to_string())
Reliability     1       2     3      4  5     6    7   8    9  10
Risk
1               0       0    16      7  0     8    8   0    0   0
2             804  149114  3670  57653  4  2084   85  11  345  82
3            2225       3  6668  22168  2  2151  156   7  260  79
4            2129       0   481   6447  0   404   43   2   58  24
5             432       0    55    700  1   103    5   1   20  11
6              19       0     2     60  0     8    0   0    1   0
7               3       0     0      5  0     0    0   0    2   0
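If the raw counts are hard to compare across rows, crosstab can also normalise for us; a sketch:
# express each cell as a proportion of its row using crosstab's normalize option
pd.crosstab(av['Risk'], av['Reliability'], normalize='index')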
# graphical view of contingency table (swapping risk/reliability)
xtab = pd.crosstab(av['Reliability'], av['Risk'])
fig = plt.figure(figsize=(5,5))
plt.pcolormesh(xtab, cmap=cm.Greens)
plt.yticks(arange(0.5,len(xtab.index), 1),xtab.index)
plt.xticks(arange(0.5,len(xtab.columns), 1),xtab.columns)
plt.colorbar()
plt.title("Risk ~ Reliability")
plt.show()
OK, so this starts to highlight some interesting details about risk and reliability; however, it is lacking context. Can we identify risk/reliability against each type of alert as observed in the SOC? Let's try below.
# create new column as a copy of the Type column
av['newtype'] = av['Type']
# replace multi-Type entries with "Multiples"
# (note: this assignment overwrites every column of the matching rows,
# which is why "Multiples" also appears below as its own
# Reliability/Risk level in the crosstab)
av[av['newtype'].str.contains(";")] = "Multiples"
# setup new crosstab structures
typ = av['newtype']
rel = av['Reliability']
rsk = av['Risk']
# compute crosstab making it split on the
# new “type” column
xtab = pd.crosstab(typ, [ rel, rsk ], rownames=['typ'], colnames=['rel', 'rsk'])
print(xtab.to_string())
rel                      1                              2          3
rsk                        2     3     4    5   6  7         2  3    1     2     3    4   5  6
typ
C&C                        0     0     1    2   1  0         0  0    0     0     0  313  22  2
Malicious Host             0     6    51   41   8  1         0  0    1   206  2250    7   2  0
Malware Domain            12     1     0    0   0  0      7309  0    2   246    55    2   1  0
Malware IP                 0    23    11   15  10  2         0  3   12   415  4091   71   6  0
Malware distribution       0     0     0    0   0  0         0  0    0     0     1    0   0  0
Multiples                  0     0     0    0   0  0         0  0    0     0     0    0   0  0
Scanning Host            790  2189  2056  366   0  0    141543  0    1  2685   159   35  13  0
Spamming                   1     2     9    7   0  0         1  0    0    22     9   17   6  0

rel                      4                                    5          6
rsk                      1      2      3     4    5   6  7    2  3  5    1    2    3    4   5  6
typ
C&C                      0      0      0    15   22   4  1    0  0  1    0    0    0   98  60  5
Malicious Host           0    152    512   336  138  30  2    1  0  0    1    3    8    8   4  0
Malware Domain           0     60     18     2    0   0  0    2  1  0    2  921  273   26   2  0
Malware IP               1    132    205   122   45  13  2    0  1  0    3   10  793  133  11  3
Malware distribution     0      0      0     0    0   0  0    0  0  0    0    0    0    0   0  0
Multiples                0      0      0     0    0   0  0    0  0  0    0    0    0    0   0  0
Scanning Host            6  55654  21325  5931  488  13  0    1  0  0    2  611  107   23   1  0
Spamming                 0   1536     40    21    4   0  0    0  0  0    0  512  931  106  17  0

rel                      7                8             9                         10            Multiples
rsk                      1   2    3   4  5    2  3  4  5    2    3   4   5  6  7    2   3   4  5  Multiples
typ
C&C                      0   0    0   7  3    0  0  1  1    0    0  19  16  1  1    0   1   8  5          0
Malicious Host           0   0    0   0  0    0  0  0  0    0    2   0   0  0  0    0   0   0  0          0
Malware Domain           3  72   13   0  0    7  1  1  0  135   38   6   0  0  0   54   7   2  0          0
Malware IP               5   0  140  35  0    0  6  0  0    1   74  10   0  0  0    0  53  11  2          0
Malware distribution     0   0    0   0  0    0  0  0  0    0    0   0   0  0  0    0   0   0  0          0
Multiples                0   0    0   0  0    0  0  0  0    0    0   0   0  0  0    0   0   0  0        834
Scanning Host            0   0    0   0  0    2  0  0  0  150   22   7   0  0  0    0   0   0  0          0
Spamming                 0   4    1   0  2    1  0  0  0   52  120  15   3  0  0   24  17   3  4          0
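As an aside, the row-wide assignment above is why "Multiples" appears as its own Reliability/Risk level. If we only wanted to re-label the type while keeping each row's actual Reliability and Risk, the idiomatic approach would be a .loc assignment (a sketch; note this would produce a slightly different crosstab from the one shown):
# replace only the newtype cell of the matching rows, leaving the other columns untouched
av.loc[av['newtype'].str.contains(";"), 'newtype'] = "Multiples"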
This data is difficult to digest in tabular form: as discussed, there is simply too much of it, and the nested column structure makes it challenging to follow.
Instead, let's consider a bar chart.
xtab.plot(kind='bar', legend=False, title="Risk ~ Reliability | Type", figsize=(20,10)).grid(False)
plt.show()
Excellent! We have a bar chart that shows the combined risk/reliability measures against each type of SOC alert. This starts to look useful. However, perhaps we want to exclude Scanning Host: we expect this behaviour on our network, and its dominance makes it harder to observe other details in the data. Let's exclude it next.
# Here we remove Scanning Host
rrt_df = av[av['newtype'] != "Scanning Host"]
# And then we do the chart again
typ = rrt_df['newtype']
rel = rrt_df['Reliability']
rsk = rrt_df['Risk']
xtab = pd.crosstab(typ, [ rel, rsk ], rownames=['typ'], colnames=['rel', 'rsk'])
xtab.plot(kind='bar', legend=False, title="Risk ~ Reliability | Type", figsize=(20,10)).grid(False)
plt.show()
OK, this looks more interesting now. We see Malware Domain and Malware distribution cropping up, which makes sense; however, we may not necessarily be interested in these for this particular story. Let's exclude them and see what we are left with.
rrt_df = rrt_df[rrt_df['newtype'] != "Malware distribution" ]
rrt_df = rrt_df[rrt_df['newtype'] != "Malware Domain" ]
typ = rrt_df['newtype']
rel = rrt_df['Reliability']
rsk = rrt_df['Risk']
xtab = pd.crosstab(typ, [ rel, rsk ], rownames=['typ'], colnames=['rel', 'rsk'])
print ("Count: %d; Percent: %2.1f%%" % (len(rrt_df), (float(len(rrt_df)) / len(av)) * 100))
## Count: 15171; Percent: 5.9%
xtab.plot(kind='bar', legend=False, title="Risk ~ Reliability | Type", figsize=(20,10)).grid(False)
plt.show()
xtab
Count: 15171; Percent: 5.9%
rel | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 2 | 3 | 3 | ... | 9 | 9 | 9 | 9 | 9 | 10 | 10 | 10 | 10 | Multiples
rsk | 2 | 3 | 4 | 5 | 6 | 7 | 2 | 3 | 1 | 2 | ... | 3 | 4 | 5 | 6 | 7 | 2 | 3 | 4 | 5 | Multiples
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
C&C | 0 | 0 | 1 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 19 | 16 | 1 | 1 | 0 | 1 | 8 | 5 | 0
Malicious Host | 0 | 6 | 51 | 41 | 8 | 1 | 0 | 0 | 1 | 206 | ... | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
Malware IP | 0 | 23 | 11 | 15 | 10 | 2 | 0 | 3 | 12 | 415 | ... | 74 | 10 | 0 | 0 | 0 | 0 | 53 | 11 | 2 | 0
Multiples | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 834
Spamming | 1 | 2 | 9 | 7 | 0 | 0 | 1 | 0 | 0 | 22 | ... | 120 | 15 | 3 | 0 | 0 | 24 | 17 | 3 | 4 | 0
5 rows × 50 columns
We have now managed to filter our data down to reveal aspects of malware IPs and malicious hosts, as well as command-and-control servers, spamming addresses, and multiples (which would require a separate investigation). Importantly, we can now gain more insight into these, having worked logically through filtering out information unnecessary to our story; we next want to learn more about the malware IPs, since there are a great number of these. It is also worth noting that we are now working with just 5.9% of our original data (15,171 rows), making it much more manageable to explore and find relevant details rather than being inundated with irrelevant information. Our SOC team was struggling with the number of alerts they were dealing with; this workflow would allow them to manage the alerts much more effectively and concentrate on the key details of interest.
Having developed this in a notebook for the purpose of exploration, we could easily export it as a Python script that would run periodically to filter our alerts as needed.
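As a sketch of what such a script might look like (the function name and the excluded types below simply mirror the exploration above, and would be adjusted to the SOC's needs):
import pandas as pd

def filter_alerts(df):
    # collapse multi-type entries into a single "Multiples" label,
    # touching only the Type column this time
    df = df.copy()
    df.loc[df['Type'].str.contains(";"), 'Type'] = "Multiples"
    # drop the alert types the SOC considers trivial for this story
    trivial = ["Scanning Host", "Malware distribution", "Malware Domain"]
    return df[~df['Type'].isin(trivial)]

if __name__ == "__main__":
    av = pd.read_csv("./example_data/reputation.data", sep="#", header=None)
    av.columns = ["IP", "Reliability", "Risk", "Type", "Country", "Locale", "Coords", "x"]
    filtered = filter_alerts(av)
    print("Count: %d; Percent: %2.1f%%" % (len(filtered), float(len(filtered)) / len(av) * 100))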
This example should help to demonstrate the benefit of interactive analysis of the data, and how this can be used to rapidly design a suitable analysis workflow for deployment.