In this session we will cover:
A decision model that is not explicitly programmed, but is instead learned by training on example data.
Machine learning is an application of artificial intelligence (AI) that gives systems the ability to learn and improve automatically from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves. We can essentially think of three components: an input (the data), a function, and an output.
The machine is trying to learn the function that maps inputs to some form of output, which is achieved by minimising some error measure (sometimes described as maximising a fitness function). One possible approach is to take the difference between the expected output and the predicted output as the error measure; training is then an iterative process that reduces this measure step by step.
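As a minimal sketch of this loop (an illustration added here, assuming mean squared error as the error measure and gradient descent as the iterative step), consider learning a one-dimensional linear function:

import numpy as np
# Toy data: the target function is y = 2x + 1 (chosen purely for illustration)
X = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * X + 1.0
w, b = 0.0, 0.0    # parameters of the candidate function y = w*x + b
lr = 0.05          # learning rate (an illustrative value)
for step in range(2000):
    pred = w * X + b                  # predicted output
    error = pred - y                  # predicted minus expected output
    w -= lr * np.mean(2 * error * X)  # step w against the gradient of the MSE
    b -= lr * np.mean(2 * error)      # step b against the gradient of the MSE
print(w, b, np.mean(error ** 2))      # w approaches 2, b approaches 1, error approaches 0

Each pass reduces the mean squared error a little; after enough iterations the learned function matches the one that generated the data.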
Supervised learning is where data is provided with training labels that tell the machine what each example is. For example, if we have a set of images that we wish to classify, we can state that the output is what the image depicts (for example, is it a cat or a dog?). The input to the system is the raw image data, represented as colour pixel values. The output of the system is a label of either 'cat' or 'dog'. The function that converts the pixel input vector to a label output vector is what the machine attempts to learn. This is often described as classification.
Unsupervised learning is where data is provided but no training labels are given; we want to learn some underlying properties of the data. Examples include clustering, where we aim to find hidden distributions within a larger set of data; association rules, such as those used to identify shopping trends and similarities between consumer groups; and autoencoders, which aim to learn suitable compression and decompression mechanisms for data representation and feature engineering. The examples below work through both settings in scikit-learn, starting with supervised linear regression.
# https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
X = np.array([ [1, 1], [1, 2], [2, 2], [2, 3] ])
#X = np.random.random([100,2])
y = np.dot(X, np.array([1, 2])) + 3  # targets generated as y = 1*x1 + 2*x2 + 3
reg = LinearRegression().fit(X, y)   # fit an ordinary least squares model
print("X: ", X)
print("y: ", y)
print("Score: ", reg.score(X, y))    # R^2, the coefficient of determination
print("Coefficients: ", reg.coef_)
print("Intercept: ", reg.intercept_)
print("Prediction: ", reg.predict(np.array([[3, 5]])))
#plt.scatter(X[:,0], X[:,1])
#plt.show()
X:  [[1 1]
 [1 2]
 [2 2]
 [2 3]]
y:  [ 6  8  9 11]
Score:  1.0
Coefficients:  [1. 2.]
Intercept:  3.0000000000000018
Prediction:  [16.]
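The fitted model recovers the generating function exactly, so the prediction for [3, 5] can be checked by hand: 1 × 3 + 2 × 5 + 3 = 16.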
import matplotlib.pyplot as plt
import numpy as np
fig = plt.figure()
ax = fig.add_subplot(projection='3d')  # 3D axes: the two input features plus the target
xs = X[:,0]
ys = X[:,1]
zs = y
ax.scatter(xs, ys, zs)
ax.set_xlabel('X Label')
ax.set_ylabel('Y Label')
ax.set_zlabel('Z Label')
plt.show()
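Because y was generated as a linear function of the two inputs, all four points lie exactly on a plane (z = x + 2y + 3 in the plot's axes), which is why the regression score above is a perfect 1.0.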
# https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
preds = np.array([[2, 1], [9, 3]])                     # unseen points to assign to clusters
kmeans = KMeans(n_clusters=2, random_state=42).fit(X)  # partition X into two clusters
print("Labels: ", kmeans.labels_)                      # cluster assignment for each training point
print("Predictions: ", kmeans.predict(preds))
print("Centroids: ", kmeans.cluster_centers_)
plt.scatter(X[:,0], X[:,1])
plt.scatter(preds[:,0], preds[:,1])
plt.show()
Labels:  [1 1 1 0 0 0]
Predictions:  [1 0]
Centroids:  [[10.  2.]
 [ 1.  2.]]
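Each centroid is simply the mean of the points assigned to its cluster: the three points with x = 1 average to [1, 2], and the three points with x = 10 average to [10, 2].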
# https://scikit-learn.org/stable/modules/svm.html
from sklearn import svm
X = np.array([[0, 0], [1, 1]])
y = np.array([0, 1])
clf = svm.SVC()   # support vector classifier (default RBF kernel)
clf.fit(X, y)
print("Prediction: ", clf.predict(np.array([[2., 2.]])))
plt.scatter(X[:,0], X[:,1])
plt.show()
Prediction: [1]
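As a small follow-up (an addition to the example above), SVC also provides decision_function, which reports the signed distance of a query point from the learned decision boundary; for this binary problem a positive value corresponds to class 1:

print("Decision value: ", clf.decision_function(np.array([[2., 2.]])))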
# https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
# https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
data, labels = load_digits(return_X_y=True)
(n_samples, n_features), n_digits = data.shape, np.unique(labels).size
print("Original data: #digits:", n_digits, " #samples:", n_samples, " #features:", n_features)
sample = 10                               # choose one digit to display
plt.imshow(data[sample,:].reshape(8,8))   # each sample is a flattened 8x8 greyscale image
plt.show()
print("Label: ", labels[sample])
Original data: #digits: 10 #samples: 1797 #features: 64
Label: 0
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
result = pca.fit_transform(data)
print ("Shape: ", result.shape)
plt.scatter(result[:,0], result[:,1], c=labels)
# plt.scatter(result[sample,0], result[sample,1], s=200, marker='X', color='red')  # show the sample within the projection
plt.show()
Shape: (1797, 2)
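A natural follow-up question (added here) is how much information the two retained components preserve; PCA exposes this through its explained_variance_ratio_ attribute:

print("Explained variance ratio: ", pca.explained_variance_ratio_)
print("Total variance captured: ", pca.explained_variance_ratio_.sum())

With only 2 of the original 64 features kept, some structure is necessarily lost, which helps explain why some digit clusters overlap in the scatter plot.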
# https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
X, y = make_classification(n_samples=100, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
clf = MLPClassifier(random_state=1, max_iter=300).fit(X_train, y_train)  # a small feed-forward neural network
print("Prediction (Probability): ", clf.predict_proba(X_test[:1]))
print("Prediction (Class Label): ", clf.predict(X_test[:5, :]))
print("Accuracy: ", clf.score(X_test, y_test))
Prediction (Probability):  [[0.00415509 0.99584491]]
Prediction (Class Label):  [1 1 1 0 1]
Accuracy:  0.96
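One detail worth checking (a short addition): each row returned by predict_proba sums to one, and predict simply returns the class with the larger probability:

import numpy as np
proba = clf.predict_proba(X_test[:5])
print(proba.sum(axis=1))         # each row sums to 1.0
print(np.argmax(proba, axis=1))  # agrees with clf.predict(X_test[:5, :])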
How should we characterise data for use with a classifier?
What about sequential data such as text?
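One common answer, sketched here as an addition (the SDAV material below covers text properly; this assumes a recent scikit-learn for get_feature_names_out): map each document to a fixed-length vector of word statistics, for example with TfidfVectorizer:

from sklearn.feature_extraction.text import TfidfVectorizer
docs = ["the cat sat on the mat", "the dog chased the cat"]   # a made-up toy corpus
vec = TfidfVectorizer()
X_text = vec.fit_transform(docs)                     # sparse document-term matrix
print("Shape: ", X_text.shape)                       # (2 documents, vocabulary size)
print("Vocabulary: ", vec.get_feature_names_out())   # the learned word features

The resulting rows can then be fed to any of the classifiers above in place of raw text.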
SDAV - more material is available on Statistical Analysis, Text Analytics, and Text Analytics Practical Examples.