Reproducibility Assignment

This mandatory assignment was completed by Per Halvorsen, in collaboration with other members of Group 5.

Exercise 1

In this exercise, we were to select an arbitrary classification dataset from UCI's Machine Learning Repository. Our group chose to complete this exercise using the Breast Cancer Wisconsin (Original) dataset.

The goal of this analysis is to predict whether a patient has cancer, based on a set of cell measurements provided as features.

Read in the data

The first step is to read in our data, making sure all values are numeric so our algorithm can work with them. Since the data file does not contain column names, we need to supply them manually.

In [1]:
import pandas as pd
import sklearn
import matplotlib.pyplot as plt

data = pd.read_csv('../data/breast-cancer-wisconsin.data', header=None, index_col=0, na_values='?') #notice the na specification
data.columns =['thickness',
               'uniform_size',
               'uniform_shape',
               'adhesion',
               'epithelial_cell_size',
               'nuclei',
               'chromatin',
               'nucleoli',
               'mitoses',
               'class']

data.dropna(inplace=True) # get rid of the rows with non-numerics
data.shape
Out[1]:
(683, 10)

Here, we had to tell pandas that our data set contained some missing values, encoded as '?'. Since the methods in scikit-learn don't handle missing values, we chose to exclude these rows from our data by calling DataFrame.dropna().
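As a quick sanity check (a small sketch, not part of the original notebook), we can count how many rows the '?' values removed; for the standard UCI file this should be 699 - 683 = 16 rows:

# Re-read the raw file and compare row counts before and after dropping missing values.
raw = pd.read_csv('../data/breast-cancer-wisconsin.data', header=None, index_col=0, na_values='?')
print(f'Rows dropped: {len(raw) - len(data)}')  # expected: 16 for the standard UCI file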

Assigning features and target variables

It's now time to separate our target variable from the rest of the features. The class column tells us whether the patient has cancer or not (2 = benign/cancer-free, 4 = malignant/cancer) and will therefore be our target variable. The other columns will be our features.

In [2]:
target = data['class']
features = data           # note: this is a reference, not a copy, so the pop below also removes 'class' from data
features.pop('class')     # drop the target column from the features; the popped column is shown below
Out[2]:
0
1000025    2
1002945    2
1015425    2
1016277    2
1017023    2
1017122    4
1018099    2
1018561    2
1033078    2
1033078    2
1035283    2
1036172    2
1041801    4
1043999    2
1044572    4
1047630    4
1048672    2
1049815    2
1050670    4
1050718    2
1054590    4
1054593    4
1056784    2
1059552    2
1065726    4
1066373    2
1066979    2
1067444    2
1070935    2
1070935    2
          ..
1350423    4
1352848    4
1353092    2
1354840    2
1354840    2
1355260    2
1365075    2
1365328    2
1368267    2
1368273    2
1368882    2
1369821    4
1371026    4
1371920    2
466906     2
466906     2
534555     2
536708     2
566346     2
603148     2
654546     2
654546     2
695091     4
714039     2
763235     2
776715     2
841769     2
888820     4
897471     4
897471     4
Name: class, Length: 683, dtype: int64

Split into train and test sets

Let's now split our data into training and test sets, so that our final results can be evaluated on completely unseen data.

In [31]:
from sklearn.model_selection import train_test_split
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.3)
print(f'Train set size: {len(features_train)}\n Test set size: {len(features_test)}')
Train set size: 478
 Test set size: 205

Benchmarking

We'd now like to establish a baseline from the distribution of cancer cases in our data set, so we have an estimated lower bound on the accuracy our model should beat. To do this, we used scikit-learn's DummyClassifier.

In [49]:
from sklearn.dummy import DummyClassifier

# Always predict the most frequent class; its accuracy equals the majority-class share of the training set
dummy_clf = DummyClassifier(strategy='most_frequent')
dummy_clf.fit(X=features_train, y=target_train)
dummy_clf.score(features_train, target_train)
Out[49]:
0.6548117154811716

This gives a baseline accuracy of about 65%; our model must score clearly above this to be useful.
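An equivalent, quicker check (a sketch, not in the original notebook) is to look at the class proportions in the training target directly; the majority-class share is exactly the DummyClassifier accuracy reported above:

# Relative frequency of each class in the training target; the larger value (~0.65)
# matches the 'most_frequent' dummy accuracy above.
print(target_train.value_counts(normalize=True))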

Choice of classification algorithm

We are now ready to choose our classification algorithm. Our group decided to use k-nearest-neighbors, which has the hyper-parameters weights, metric, and, most importantly, n_neighbors.

The n_neighbors parameter tells the algorithm how many neighbors to consider when assigning a data point to one of the two possible classes. If this number is too low, we end up with a model that is badly over-fit to our training data, so predictions on new data will score poorly compared to their true values. If the number of neighbors is too high, our model becomes too general, and the accuracy of predictions on new data again falls. To find the sweet spot, neither too high nor too low, we will need to test a range of n_neighbors values. We will use a cross-validation parameter selection method to do this; a brief sketch of the idea follows.
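To make this trade-off concrete, a short sketch (not part of the original notebook) can compare training accuracy with 5-fold cross-validated accuracy for a few illustrative values of k; training accuracy is highest for very small k, while the cross-validated accuracy peaks somewhere in between and drops off for very large k:

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

for k in (1, 5, 25, 100):  # illustrative values only
    knn = KNeighborsClassifier(n_neighbors=k).fit(features_train, target_train)
    train_acc = knn.score(features_train, target_train)                          # accuracy on the data it was fit on
    cv_acc = cross_val_score(knn, features_train, target_train, cv=5).mean()     # held-out accuracy
    print(f'k={k:3d}  train={train_acc:.3f}  cv={cv_acc:.3f}')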

Finding probable k's

Before implementing cross validation, let's quickly see what range of k's we should consider for this data set, to limit how many k's we have to test during the cross validation (and so avoid unnecessary computation).

In [5]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

ks = range(2, 200)

# Fit one kNN model per candidate k and record its accuracy on the training data itself
models = [KNeighborsClassifier(n_neighbors=k).fit(features_train, target_train)
          for k in ks]

scores = [accuracy_score(target_train, m.predict(features_train)) for m in models]

plt.plot(ks, scores)
Out[5]:
[<matplotlib.lines.Line2D at 0x1b19bc51dd8>]

From the plot above, we can see that once the number of neighbors grows much larger than 10, the training accuracy starts falling rapidly. Therefore, we'll stick to values of n_neighbors between 2 and 10.

In [6]:
ks = range(2, 10)

Tuning parameters

Since the kNN algorithm takes two other hyper-parameters as well, weights and metric, we will use a grid search with cross-validation, GridSearchCV. This runs a regular n-fold cross-validation for each possible combination of parameter values and returns the best-scoring combination of the three parameters. With these optimal values, we can build the most accurate kNN model for our data set.

In [33]:
from sklearn.model_selection import GridSearchCV

grid_params = {
    'n_neighbors': ks,
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan']
}

gs = GridSearchCV(KNeighborsClassifier(),
                 grid_params,
                 cv=5,
                 n_jobs=-1) # n_jobs=-1 tells GridSearchCV to run the search in parallel on all available processors

results = gs.fit(features_train, target_train)
In [34]:
best_params = results.best_params_
print(best_params)
results.best_score_
{'metric': 'manhattan', 'n_neighbors': 6, 'weights': 'distance'}
Out[34]:
0.979122807017544

Here, we see our grid search found parameters that gave close to 98% cross-validated accuracy. After re-splitting the original data and re-running the scripts above a few times, we saw the optimal number of neighbors vary between 4 and 8, and the optimal metric and weights parameters also changed with different splits. Even so, the best accuracy was roughly 97-98% every time. This tells us two things:

  • the metric and weights parameters have little effect on the model
  • the optimal number of neighbors lies within a range, $k \in [3, 9]$

To visualize this range, we include the code for a histogram of the optimal k values below, but leave it commented out so the notebook runs quickly. Its output resembles a histogram (the embedded image is omitted here) whose x-axis shows the optimal number of neighbors and whose y-axis shows how many of the 20 iterations produced that value of k.

In [46]:
# diff_k = []
# for x in range(20):
#     features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.3)
#     gs = GridSearchCV(KNeighborsClassifier(), grid_params, cv=5, n_jobs=-1)
#     res = gs.fit(features_train, target_train)
#     diff_k.append(res.best_params_['n_neighbors'])
# b = max(diff_k)-min(diff_k)
# plt.hist(diff_k, bins=b)
Out[46]:
(array([4., 6., 2., 2., 2., 4.]),
 array([3., 4., 5., 6., 7., 8., 9.]),
 <a list of 6 Patch objects>)

Build optimal model

We can now build the optimal k-nearest-neighbors model, using the best_params found above.

In [47]:
model = KNeighborsClassifier().set_params(**best_params).fit(features_train, target_train)

accuracy_score(target_train, model.predict(features_train))
Out[47]:
1.0

Our model gives a perfect score when predicting its own training data. This is expected here: the grid search chose weights='distance', so each training point is its own nearest neighbor at zero distance and is always classified correctly.

Evaluate model

Now we will use our model to estimate target values from features_test, and compare these to their true values, represented in target_test.

In [50]:
accuracy_score(target_test, model.predict(features_test))
Out[50]:
0.975609756097561
In [56]:
from sklearn.metrics import plot_confusion_matrix, confusion_matrix
plot_confusion_matrix(model, features_test, target_test)
confusion_matrix(target_test, model.predict(features_test))
Out[56]:
array([[129,   2],
       [  3,  71]], dtype=int64)
In [57]:
len(target_test)
Out[57]:
205

As we can see, the model was quite good at classifying the presence of cancer in the test patients. Only 5 of the 205 test observations were misclassified: 2 benign cases predicted as malignant (false positives) and 3 malignant cases predicted as benign (false negatives). The overall accuracy was about 97.6%.
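Since a false negative (a missed cancer case) is the costlier error in this setting, it is worth looking beyond plain accuracy. A short sketch (not in the original notebook) using scikit-learn's classification_report shows per-class precision and recall, where recall on class 4 is the sensitivity of the screen:

from sklearn.metrics import classification_report

# Per-class precision/recall/F1 on the held-out test set.
print(classification_report(target_test, model.predict(features_test),
                            target_names=['benign (2)', 'malignant (4)']))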

In reality, the true accuracy may be somewhat lower, for a few reasons:

  1. This model is trained on feature columns whose values fall within a certain range. There is no reason to believe the true extreme values for these features are actually included in our data set, so new patient data with feature values outside this range may throw our model off and produce false classifications.
  2. As mentioned above, different splits give us different optimal hyper-parameters, which hints at some instability in our model.
  3. The number of folds used in the cross-validation also affects how reliable these optimal parameters are. To get the most out of this data set, we could have used leave-one-out cross-validation (roughly sketched below), but we deemed this unnecessary since our score was already so high.
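For reference, a leave-one-out evaluation of the tuned parameters might look like the following sketch (not run in the original notebook, and noticeably slower than 5-fold cross-validation, since it fits one model per training sample):

from sklearn.model_selection import LeaveOneOut, cross_val_score

# Leave-one-out CV on the training data with the parameters found by the grid search;
# each of the len(features_train) splits holds out a single patient.
loo_scores = cross_val_score(KNeighborsClassifier(**best_params),
                             features_train, target_train, cv=LeaveOneOut())
print(loo_scores.mean())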