This mandatory assignment was completed by Per Halvorsen, in collaboration with other members of Group 5.
In this exercise, we were to select an arbitrary classification dataset from UCI's Machine Learning Repository. Our group chose the Breast Cancer Wisconsin dataset.
The goal of this analysis is to predict whether a patient has cancer, based on a set of measured cell features.
The first step is to read in our data, making sure the values present are numerical values that our algorithm will be able to understand. Since our data file did not contain column names, we need to enter them manually.
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
data = pd.read_csv('../data/breast-cancer-wisconsin.data', header=None, index_col=0, na_values='?') #notice the na specification
data.columns = ['thickness',
                'uniform_size',
                'uniform_shape',
                'adhesion',
                'epithelial_cell_size',
                'nuclei',
                'chromatin',
                'nucleoli',
                'mitoses',
                'class']
data.dropna(inplace=True) # get rid of the rows with non-numerics
data.shape
Here, we had to tell pandas that our data set contains some rows with missing values, marked as '?'. Since the methods in scikit-learn cannot handle missing values, we chose to exclude these rows by calling DataFrame.dropna() on our data.
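If we want to see how much data this removes, one quick check (not strictly necessary for the analysis) is to re-read the raw file and count the missing entries before dropping them:
# Optional check: count the '?' entries (read in as NaN) before they are dropped
raw = pd.read_csv('../data/breast-cancer-wisconsin.data', header=None, index_col=0, na_values='?')
print(raw.isna().sum().sum(), 'missing values spread over',
      raw.shape[0] - raw.dropna().shape[0], 'rows')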
It's now time to separate our target variable from the rest of the features. The class column tells us whether the patient has cancer (4) or not (2), and will therefore be our target variable. The other columns will be our features.
target = data['class']
features = data.drop(columns='class')  # all columns except the target, leaving the original frame intact
Let's now split our data into training and testing data, to make sure our end results can be evaluated on completely new data.
from sklearn.model_selection import train_test_split
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.3)
print(f'Train set size: {len(features_train)}\nTest set size: {len(features_test)}')
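Because the split is random, it can be worth checking that both halves contain the two classes in roughly the same proportions; if they did not, we could pass stratify=target to train_test_split. A quick check could look like this:
# Compare the class balance in the training and test splits
print(target_train.value_counts(normalize=True))
print(target_test.value_counts(normalize=True))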
We'd now like to benchmark the proportion of cancer cases in our data set, so we have a lower bound that our model's accuracy must beat to be useful. To do this, we used scikit-learn's DummyClassifier.
from sklearn.dummy import DummyClassifier
dummy_clf = DummyClassifier(strategy='most_frequent')  # always predicts the majority class
dummy_clf.fit(X=features_train, y=target_train)
dummy_clf.score(features_train, target_train)
The dummy classifier scores about 65% simply by always predicting the most frequent class, so our model's accuracy must be clearly greater than 65% for the model to be useful.
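The same baseline can also be read directly from the class distribution, since the most_frequent strategy just predicts the majority class for every observation. For example:
# The baseline accuracy equals the share of the majority class in the training targets
baseline = target_train.value_counts(normalize=True).max()
print(f'Majority-class baseline: {baseline:.2%}')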
We are now ready to choose our classification algorithm. Our group decided on k-nearest-neighbors, which has the hyper-parameters weights, metric, and, most importantly, n_neighbors.
The n_neighbors parameter tells the algorithm how many of the nearest training points to consult when classifying a new data point; the prediction is the majority class among those neighbors. If this number is too low, we will end up with a model that is badly over-fit to our training data, and predictions on new data will score poorly compared to their true values. If the number of neighbors is too high, our model will be too general, and the accuracy of predictions on new data will again fall. To find the sweet spot between too high and too low, we will need to test a range of n_neighbors values. We will utilize a cross-validation parameter selection method to do this.
Before implementing cross-validation, let's quickly see what range of k's we should consider for this data set, so we can limit how many values we have to test during the cross-validation and avoid unnecessary computation.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
ks = range(2, 200)
models = [KNeighborsClassifier(n_neighbors=k).fit(features_train, target_train)
          for k in ks]
scores = [accuracy_score(target_train, m.predict(features_train)) for m in models]
plt.plot(ks, scores)
From the plot above, we can see that once the number of neighbors gets much larger than 10, the accuracy of the models starts falling rapidly. Therefore, we'll stick to values of n_neighbors between 2 and 10.
ks = range(2, 10)
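Before wiring this into a full parameter search, it can be instructive to see what a plain 5-fold cross-validation looks like for a single candidate value (here k = 5, picked arbitrarily): the training data is split into five folds, the model is fit on four of them and scored on the fifth, and this is repeated for each fold.
# Example: plain 5-fold cross-validation for one arbitrary candidate k
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), features_train, target_train, cv=5)
print(cv_scores, cv_scores.mean())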
Since there are two other parameters the kNN algorithm can take, namely weights and the metric type, we will use a grid-search cross-validation, GridSearchCV. This runs a regular n-fold cross-validation for each possible combination of parameters, and returns the optimal value for each of the three parameters. With these, we will be able to build the most accurate classification model for our data set.
from sklearn.model_selection import GridSearchCV
grid_params = {
    'n_neighbors': ks,
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan']
}
gs = GridSearchCV(KNeighborsClassifier(),
                  grid_params,
                  cv=5,
                  n_jobs=-1)  # n_jobs=-1 runs the search in parallel on all available processors
results = gs.fit(features_train, target_train)
best_params = results.best_params_
print(best_params)
results.best_score_
Here, we see our grid search found parameters that gave roughly 97% accuracy. After re-splitting the original data and re-running the scripts above a few times, we saw the optimal number of neighbors vary between 4 and 8, and the optimal metric and weights parameters also changed with different splits. Despite this variation, the accuracy was roughly 97% every time. This tells us two things: the model is not very sensitive to the exact hyper-parameter values within this range, and the roughly 97% accuracy estimate is stable across different splits.
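If we want more than just the single best combination, the full table of cross-validated scores is also available from the fitted grid search, for example:
# Inspect all parameter combinations, sorted by mean cross-validated accuracy
cv_results = pd.DataFrame(results.cv_results_)
print(cv_results[['params', 'mean_test_score', 'std_test_score']]
      .sort_values('mean_test_score', ascending=False)
      .head())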
To visualize this variation, I'll include the code for a plot below, but leave it commented out so the notebook runs quickly. The output resembles a histogram where the x-axis shows the optimal number of neighbors and the y-axis shows how many times that value of k was found to be optimal over the 20 iterations.
# diff_k = []
# for x in range(20):
#     features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.3)
#     gs = GridSearchCV(KNeighborsClassifier(), grid_params, cv=5, n_jobs=-1)
#     res = gs.fit(features_train, target_train)
#     diff_k.append(res.best_params_['n_neighbors'])
# b = max(diff_k) - min(diff_k) + 1   # one bin per distinct k value
# plt.hist(diff_k, bins=b)
We can now build the optimal k-nearest-neighbors model, using the best_params
found above.
model = KNeighborsClassifier().set_params(**best_params).fit(features_train, target_train)
accuracy_score(target_train, model.predict(features_train))
Our model gives a (near) perfect score on predicting its own data, which is exactly what we would have expected.
Now we will use our model to estimate target values from features_test
, and compare these to their true values, represented in target_test
.
accuracy_score(target_test, model.predict(features_test))
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
ConfusionMatrixDisplay.from_estimator(model, features_test, target_test)  # plot_confusion_matrix was removed in newer scikit-learn versions
confusion_matrix(target_test, model.predict(features_test))
len(target_test)
As we can see, the model was quite good at classifying the presence of cancer in the test patients: only 5 of the 205 observations in the test set were misclassified, 3 false positives and 2 false negatives. The overall accuracy was about 97.5%.
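Since false negatives (missed cancers) are far more serious than false positives in this setting, it can also be worth looking at precision and recall per class rather than accuracy alone, for example via scikit-learn's classification_report:
# Per-class precision and recall; recall for the cancer class reflects how many cases we miss
from sklearn.metrics import classification_report
print(classification_report(target_test, model.predict(features_test),
                            target_names=['no cancer (2)', 'cancer (4)']))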
In reality, the true accuracy may be slightly lower, for a few reasons: