In this note we review what we did last week, show how to use pipelines in sklearn, use sklearn's cross_val_score, look at forward stepwise feature selection, and get a first look at statsmodels, a Python library with an R-style formula interface.
Although the course presents some very formal probability theory, with measures and the like, we are free to be as formal or informal as we prefer. The point of presenting the formal definitions is to show exactly what the different concepts mean.
Most important for this course is that we learn how to calculate probabilities on the computer, and can use computers as statistical tools for analysis.
Another goal is to show how ML is done in industry: feeding data through pipelines instead of working with loose numpy arrays.
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
%matplotlib inline
#!tail -n 14 data/adult.names  # shell command; doesn't work on Windows
colnames = [
    'age',
    'workclass',
    'fnlwgt',
    'education',
    'education-num',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'cap-gain',
    'cap-loss',
    'hours-per-week',
    'native-country',
    'income'
]
def read_data(n=None):
    colnames = [
        'age',
        'workclass',
        'fnlwgt',
        'education',
        'education-num',
        'marital-status',
        'occupation',
        'relationship',
        'race',
        'sex',
        'cap-gain',
        'cap-loss',
        'hours-per-week',
        'native-country',
        'income'
    ]
    df = pd.read_csv('IN-STK5000-Notebooks-2020/data/adult.data.gz', names=colnames)
    # Here you would usually do more checks:
    # is n bigger than our data set, how big a sample size is needed, and so on.
    # For now we just assume that n is acceptable.
    if n:
        df = df.sample(n)
        df.index = range(n)
    target = (df.income == ' >50K') * 1  # binary target: 1 if income above 50K
    df.pop('income')                     # remove the target column from the features
    return df, target
features, target = read_data(2000)
Sometimes your training data may not include all the possible values of a categorical feature. If such an unseen category then shows up in your test set, your model (or the encoding step) may blow up.
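One way to guard against this (a small sketch using an option of sklearn's OneHotEncoder, not something from the original notebook) is handle_unknown='ignore', which encodes an unseen category as all zeros instead of raising an error:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit([['a'], ['b']])           # training data only contains the categories 'a' and 'b'
enc.transform([['c']]).toarray()  # the unseen category 'c' becomes an all-zero row instead of an error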
Let's look at using a pipeline in sklearn.
cat_columns = ['sex', 'education', 'race']
cont_columns = ['age', 'education-num']
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import OneHotEncoder, StandardScaler
cat_trans = Pipeline(steps=[('onehot', OneHotEncoder(drop='first'))])
# If you did not drop the first column it would still work, but one dummy column
# per category would then be redundant (perfectly predictable from the others),
# so dropping one level per category is a reasonable habit.
cont_trans = Pipeline(steps=[('scale', StandardScaler())])
feature_trans = ColumnTransformer(
    transformers=[('categorical', cat_trans, cat_columns),
                  ('continuous', cont_trans, cont_columns)])
classifier = Pipeline(steps=[('feature_transform', feature_trans),
                             ('classifier', KNeighborsClassifier(n_neighbors=35))])
classifier
from sklearn import set_config #this is somewhat a gimmick
set_config(display='diagram')
classifier
This could be pretty useful for seeing the structure of your pipelines.
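The notebook goes straight to cross-validation, but the pipeline can of course also be fitted and used directly; a minimal sketch (fitting on the full sample just to show the API):
classifier.fit(features, target)      # fits the column transformers and the kNN model in one call
classifier.predict(features.head(5))  # the same transformations are applied automatically before predicting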
from sklearn.model_selection import cross_val_score
# ?cross_val_score
cv_scores = cross_val_score(classifier, features, target, cv=20)
cv_scores
cv_scores.mean()
plt.hist(cv_scores, bins=5)
from sklearn.utils import resample
plt.hist([resample(cv_scores).mean() for _ in range(500)])
This can be run over and over; each run gives slightly different results, since the bootstrap resampling is random.
plt.hist([resample(cv_scores).mean() for _ in range(500)])
None  # same histogram as above; the trailing None just suppresses the printed return value of plt.hist
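The spread of these bootstrap means can also be summarised as a rough confidence interval; a sketch using numpy's percentile function:
import numpy
boot_means = [resample(cv_scores).mean() for _ in range(500)]
numpy.percentile(boot_means, [2.5, 97.5])  # approximate 95% interval for the mean CV accuracy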
ks = list(range(10, 100, 10))
score_results = [cross_val_score(Pipeline(steps=[('feature_transform', feature_trans),
                                                 ('classifier', KNeighborsClassifier(n_neighbors=k))]),
                                 features, target, cv=5)
                 for k in ks]
score_results
plt.plot(ks, [s.mean() for s in score_results])
plt.plot(ks, [s.mean() for s in score_results])
plt.plot(ks, [s.min() for s in score_results])
plt.plot(ks, [s.max() for s in score_results])
Green: max, Blue: mean, Orange: min.
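Instead of reading the best k off the plot, we can also pick it programmatically from the same scores (a small sketch):
import numpy
ks[int(numpy.argmax([s.mean() for s in score_results]))]  # k with the highest mean CV score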
from sklearn.model_selection import GridSearchCV
param_grid = {'classifier__n_neighbors':ks} # the name of the step and the parameter
grid_search = GridSearchCV(classifier, param_grid, cv=10)
fit_result = grid_search.fit(features, target)
This searches over the parameter grid and keeps the parameter values that give the best cross-validated score.
fit_result.best_estimator_.get_params()['steps']
Here we see that the grid search chose n_neighbors=70 as the best number of neighbors.
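The fitted grid search also exposes the chosen parameters and their cross-validated score directly:
fit_result.best_params_  # the winning setting, e.g. {'classifier__n_neighbors': 70} here
fit_result.best_score_   # its mean cross-validated accuracy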
dummies = pd.get_dummies(features)
dummies
for column in dummies:
    corr = dummies[column].corr(target)
    print(f'{column}: {corr}')
Correlation works as a quick screen when the relationship between a feature and the target is roughly linear; it can miss non-linear relationships.
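A quicker way to rank the dummy-encoded features by absolute correlation with the target, as a sketch using pandas' corrwith:
dummies.corrwith(target).abs().sort_values(ascending=False).head(10)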
from sklearn.tree import DecisionTreeClassifier
columns = list(features.columns)
selected_features = []
scores = []
N = 10
while len(selected_features) < N:
    best_score = pd.Series([0])
    best_feature = None
    for feature in columns:
        # score the already-selected features plus one candidate feature
        score = cross_val_score(DecisionTreeClassifier(max_depth=10),
                                pd.get_dummies(features[selected_features + [feature]]), target)
        if score.mean() > best_score.mean():
            best_feature = feature
            best_score = score
    print(f'{best_feature}: {best_score} ({best_score.mean()})')
    columns.remove(best_feature)
    selected_features.append(best_feature)
    scores.append(best_score)
plt.plot([s.mean() for s in scores])
plt.plot([s.min() for s in scores])
plt.plot([s.max() for s in scores])
Let's see what results we get for a completely random data set. Hint: there is no real relationship between the features and the target, so any apparent predictive power is an artefact of the selection procedure!
import numpy
N = 200
random_features = pd.DataFrame(numpy.random.normal(size=(N,N)))
random_target = numpy.random.choice([0, 1], size=N)
from sklearn.model_selection import train_test_split
Xtr, Xte, ytr, yte = train_test_split(random_features, random_target, test_size=0.2)
columns = list(Xtr.columns)
selected_features = []
scores = []
N = 5
while len(selected_features) < N:
    best_score = pd.Series([0])
    best_feature = None
    for feature in columns:
        score = cross_val_score(DecisionTreeClassifier(max_depth=10),
                                pd.get_dummies(Xtr[selected_features + [feature]]), ytr)
        if score.mean() > best_score.mean():
            best_feature = feature
            best_score = score
    print(f'{best_feature}: {best_score} ({best_score.mean()})')
    columns.remove(best_feature)
    selected_features.append(best_feature)
    scores.append(best_score)
plt.plot([s.mean() for s in scores])
model = DecisionTreeClassifier(max_depth=10).fit(Xtr[selected_features], ytr)
from sklearn.metrics import accuracy_score
accuracy_score(yte, model.predict(Xte[selected_features]))
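For comparison, always predicting the majority class of the (random) test labels gives roughly 0.5; a small sketch of that baseline:
max(yte.mean(), 1 - yte.mean())  # majority-class accuracy on the random test labels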
When doing feature selection, the evaluation must use data that was not involved in selecting the features. If we cross-validate on the same data that the selection saw (as below), the scores look far better than the honest held-out accuracy above:
cv_scores_r = cross_val_score(DecisionTreeClassifier(max_depth=10), random_features[selected_features], random_target, cv=30)
cv_scores_r
plt.hist(cv_scores_r)
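One way to avoid this kind of leakage (a sketch, not part of the original notebook) is to put the selection step inside the pipeline, so each cross-validation fold selects features using only its own training part:
from sklearn.feature_selection import SelectKBest, f_classif
leak_free = Pipeline(steps=[('select', SelectKBest(f_classif, k=5)),
                            ('classifier', DecisionTreeClassifier(max_depth=10))])
cross_val_score(leak_free, random_features, random_target, cv=30)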
import statsmodels.formula.api as smf
features
X = features.copy()
X['target'] = target
X
sm_fit = smf.logit("target ~ age + Q('education-num') + sex + race", data=X).fit() #logistic regression
sm_fit.summary()
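The logit coefficients are on the log-odds scale; exponentiating them gives odds ratios (a sketch using the result object's params attribute):
import numpy
numpy.exp(sm_fit.params)  # odds ratios for each term in the formula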