In this note we will review what we did last week, show how to use pipelines in sklearn, use sklearn's cross_val_score, look at forward stepwise feature selection, and get an introduction to statsmodels, a Python library modeled on R.

Although the course presents very formal mathematics from probability theory, with measures and the like, we are free to be as formal or informal as we prefer. The point of presenting the formal definitions is to show exactly what these different concepts mean.

The most important thing for this course is that we learn how to calculate probabilities on the computer, and can use computers as statistical tools for analysis.

Another goal is to show how ML is done in industry: feeding data through pipelines instead of passing around loose numpy arrays.

Review

In [2]:
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
%matplotlib inline
In [3]:
#!tail -n 14 data/adult.names  # shell command; doesn't work for me on Windows
colnames = [
    'age',
    'workclass',
    'fnlwgt',
    'education',
    'education-num',
    'martial-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'cap-gain',
    'cap-loss',
    'hours-per-week',
    'native-country',
    'income'
]
In [4]:
def read_data(n=None):
    colnames = [
        'age',
        'workclass',
        'fnlwgt',
        'education',
        'education-num',
        'martial-status',
        'occupation',
        'relationship',
        'race',
        'sex',
        'cap-gain',
        'cap-loss',
        'hours-per-week',
        'native-country',
        'income'
    ]
    df = pd.read_csv('IN-STK5000-Notebooks-2020/data/adult.data.gz', names=colnames)
    # Here you would usually do more checks:
    # is n bigger than our data set?
    # how big a sample size is needed? and so on.
    # For now we just assume the general case that n is acceptable.
    if n:
        df = df.sample(n)
        df.index = range(n)
        
    target = (df.income == ' >50K')*1
    df.pop('income')
    return df, target
In [5]:
features, target = read_data(2000)

Sometimes your training data may not include all the possible values of a categorical feature. If an unseen category then shows up in your test set, your model may blow up.
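One common guard (a minimal sketch, assuming a reasonably recent sklearn) is OneHotEncoder's handle_unknown='ignore' option, which encodes categories unseen during fit as all zeros instead of raising an error:

from sklearn.preprocessing import OneHotEncoder
import pandas as pd

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(pd.DataFrame({'sex': [' Male', ' Female']}))
# a value never seen at fit time becomes an all-zero row, not an exception
enc.transform(pd.DataFrame({'sex': [' Other']})).toarray()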

Pipelines

Let's look at using a pipeline in sklearn.

In [6]:
cat_columns = ['sex', 'education', 'race']
cont_columns = ['age', 'education-num']
In [7]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import OneHotEncoder, StandardScaler
In [8]:
cat_trans = Pipeline(steps=[('onehot', OneHotEncoder(drop='first'))])
# dropping the first dummy column removes a redundant (perfectly collinear) column;
# the model would still work without it, but note that some sklearn versions
# restrict which other options (e.g. handle_unknown='ignore') can be combined with drop='first'
In [9]:
cont_trans = Pipeline(steps=[('scale', StandardScaler())])

feature_trans = ColumnTransformer(
    transformers=[('categorical', cat_trans, cat_columns),
                  ('continuous', cont_trans, cont_columns)])

classifier = Pipeline(steps=[('feature_transform', feature_trans),
                             ('classifier', KNeighborsClassifier(n_neighbors=35))])
In [10]:
classifier
Out[10]:
Pipeline(steps=[('feature_transform',
                 ColumnTransformer(transformers=[('categorical',
                                                  Pipeline(steps=[('onehot',
                                                                   OneHotEncoder(drop='first'))]),
                                                  ['sex', 'education', 'race']),
                                                 ('continuous',
                                                  Pipeline(steps=[('scale',
                                                                   StandardScaler())]),
                                                  ['age', 'education-num'])])),
                ('classifier', KNeighborsClassifier(n_neighbors=35))])
In [11]:
from sklearn import set_config  # this display option is somewhat of a gimmick
set_config(display='diagram')
In [12]:
classifier
Out[12]:
Pipeline(steps=[('feature_transform',
                 ColumnTransformer(transformers=[('categorical',
                                                  Pipeline(steps=[('onehot',
                                                                   OneHotEncoder(drop='first'))]),
                                                  ['sex', 'education', 'race']),
                                                 ('continuous',
                                                  Pipeline(steps=[('scale',
                                                                   StandardScaler())]),
                                                  ['age', 'education-num'])])),
                ('classifier', KNeighborsClassifier(n_neighbors=35))])
ColumnTransformer(transformers=[('categorical',
                                 Pipeline(steps=[('onehot',
                                                  OneHotEncoder(drop='first'))]),
                                 ['sex', 'education', 'race']),
                                ('continuous',
                                 Pipeline(steps=[('scale', StandardScaler())]),
                                 ['age', 'education-num'])])
['sex', 'education', 'race']
OneHotEncoder(drop='first')
['age', 'education-num']
StandardScaler()
KNeighborsClassifier(n_neighbors=35)

This could be pretty useful for seeing the structure of your pipelines.
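Once assembled, the whole pipeline behaves like any other sklearn estimator. A minimal sketch, using a simple train/test split rather than the cross-validation below:

from sklearn.model_selection import train_test_split

Xtr, Xte, ytr, yte = train_test_split(features, target, test_size=0.2)
classifier.fit(Xtr, ytr)      # fits the encoder, the scaler and the kNN in one go
classifier.score(Xte, yte)    # accuracy on the held-out data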

Cross-validation

In [13]:
from sklearn.model_selection import cross_val_score
# ?cross_val_score
In [14]:
cv_scores = cross_val_score(classifier, features, target, cv=20)
In [15]:
cv_scores
Out[15]:
array([0.79, 0.78, 0.77, 0.79, 0.76, 0.82, 0.85, 0.81, 0.82, 0.77, 0.8 ,
       0.8 , 0.84, 0.83, 0.79, 0.8 , 0.76, 0.81, 0.8 , 0.83])
In [66]:
cv_scores.mean()
Out[66]:
0.8045
In [67]:
plt.hist(cv_scores, bins=5)
Out[67]:
(array([1., 8., 5., 5., 1.]),
 array([0.74 , 0.766, 0.792, 0.818, 0.844, 0.87 ]),
 <a list of 5 Patch objects>)
In [54]:
from sklearn.utils import resample
In [63]:
plt.hist([resample(cv_scores).mean() for _ in range(500)])
Out[63]:
(array([  4.,   9.,  55.,  85.,  96., 111.,  66.,  52.,  17.,   5.]),
 array([0.7865 , 0.79005, 0.7936 , 0.79715, 0.8007 , 0.80425, 0.8078 ,
        0.81135, 0.8149 , 0.81845, 0.822  ]),
 <a list of 10 Patch objects>)

This can be run over and over, and every time it will give slightly different results.

In [65]:
plt.hist([resample(cv_scores).mean() for _ in range(500)])
None   # this does the same thing; the trailing None just hides the extra information plt.hist returns
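If we want a number rather than a picture, here is a small sketch of turning the bootstrap distribution into a rough 95% interval for the mean CV accuracy (assumes cv_scores and resample from above):

import numpy as np

boot_means = [resample(cv_scores).mean() for _ in range(500)]
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f'bootstrap 95% interval for mean accuracy: [{lo:.3f}, {hi:.3f}]')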
In [69]:
ks = list(range(10, 100, 10))
In [74]:
score_results = [cross_val_score(Pipeline(steps=[('feature_transform', feature_trans),
                             ('classifier', KNeighborsClassifier(n_neighbors=k))]), features, target, cv=5)
                for k in ks]
In [76]:
score_results
Out[76]:
[array([0.8025, 0.805 , 0.8   , 0.82  , 0.7925]),
 array([0.7875, 0.8075, 0.795 , 0.8025, 0.7875]),
 array([0.8   , 0.81  , 0.8175, 0.8025, 0.785 ]),
 array([0.8   , 0.8125, 0.82  , 0.81  , 0.785 ]),
 array([0.8025, 0.8275, 0.8175, 0.8125, 0.7875]),
 array([0.7975, 0.8225, 0.81  , 0.82  , 0.785 ]),
 array([0.8   , 0.8175, 0.8025, 0.815 , 0.785 ]),
 array([0.7975, 0.815 , 0.81  , 0.815 , 0.7775]),
 array([0.7925, 0.8175, 0.81  , 0.8125, 0.7825])]
In [77]:
plt.plot(ks, [s.mean() for s in score_results])
Out[77]:
[<matplotlib.lines.Line2D at 0x2a5beec4400>]
In [78]:
plt.plot(ks, [s.mean() for s in score_results])
plt.plot(ks, [s.min() for s in score_results])
plt.plot(ks, [s.max() for s in score_results])
Out[78]:
[<matplotlib.lines.Line2D at 0x2a5bef25eb8>]

Blue: mean, Orange: min, Green: max (in plotting order)

In [81]:
from sklearn.model_selection import GridSearchCV
In [83]:
param_grid = {'classifier__n_neighbors':ks}   # the name of the step and the parameter
grid_search = GridSearchCV(classifier, param_grid, cv=10)
In [84]:
fit_result = grid_search.fit(features, target)

Grid search fits the model for every parameter value in the grid and keeps the parameters that give the best cross-validation score.

In [87]:
fit_result.best_estimator_.get_params()['steps']
Out[87]:
[('feature_transform', ColumnTransformer(transformers=[('categorical',
                                   Pipeline(steps=[('onehot',
                                                    OneHotEncoder(drop='first'))]),
                                   ['sex', 'education', 'race']),
                                  ('continuous',
                                   Pipeline(steps=[('scale', StandardScaler())]),
                                   ['age', 'education-num'])])),
 ('classifier', KNeighborsClassifier(n_neighbors=70))]

Here we see that the grid search chose n_neighbors=70 as the best number of neighbors.
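The same information is available more directly through standard GridSearchCV attributes:

fit_result.best_params_   # {'classifier__n_neighbors': 70}
fit_result.best_score_    # the mean CV accuracy of that setting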

In [88]:
dummies = pd.get_dummies(features)
In [89]:
dummies
Out[89]:
[DataFrame output truncated: 2000 rows × 96 columns — the numeric columns age, fnlwgt, education-num, cap-gain, cap-loss, hours-per-week, plus one dummy column per category value, from workclass_ ? through native-country_ Vietnam]

In [90]:
for column in dummies:
    corr = dummies[column].corr(target)
    print(f'{column}: {corr}')
age: 0.22814016735704712
fnlwgt: -0.024799150492211916
education-num: 0.3397232980285314
cap-gain: 0.2359292611074695
cap-loss: 0.13823150888236002
hours-per-week: 0.22025295827104138
workclass_ ?: -0.0914829218285114
workclass_ Federal-gov: 0.04609024105854122
workclass_ Local-gov: 0.05675236260087804
workclass_ Private: -0.07325341833027024
workclass_ Self-emp-inc: 0.09392855499315302
workclass_ Self-emp-not-inc: 0.05650833472330099
workclass_ State-gov: 0.005844073191832
education_ 10th: -0.07168458661673176
education_ 11th: -0.10344112056712088
education_ 12th: -0.05177660860998759
education_ 1st-4th: -0.03570984456152662
education_ 5th-6th: -0.03943900760308151
education_ 7th-8th: -0.05676922981994202
education_ 9th: -0.05601580434755568
education_ Assoc-acdm: 0.041953938030011635
education_ Assoc-voc: 0.011651741957930185
education_ Bachelors: 0.1406096267895298
education_ Doctorate: 0.16107487701848658
education_ HS-grad: -0.14269313160032704
education_ Masters: 0.18289761028809984
education_ Preschool: -0.021840331616252977
education_ Prof-school: 0.15175772772310883
education_ Some-college: -0.03722865542451165
martial-status_ Divorced: -0.11765572251583212
martial-status_ Married-AF-spouse: -0.01782809293085055
martial-status_ Married-civ-spouse: 0.4628958131050045
martial-status_ Married-spouse-absent: -0.04377942632029975
martial-status_ Never-married: -0.32841365665784134
martial-status_ Separated: -0.07991232567626222
martial-status_ Widowed: -0.09581061789613336
occupation_ ?: -0.0914829218285114
occupation_ Adm-clerical: -0.09085273949848977
occupation_ Craft-repair: -0.042108217759389055
occupation_ Exec-managerial: 0.2261996796998491
occupation_ Farming-fishing: -0.015331657876613933
occupation_ Handlers-cleaners: -0.0891524113898124
occupation_ Machine-op-inspct: -0.02825497614377936
occupation_ Other-service: -0.15750977512984243
occupation_ Priv-house-serv: -0.028209876541202716
occupation_ Prof-specialty: 0.19831363843512714
occupation_ Protective-serv: 0.011642240962801883
occupation_ Sales: -0.015277216614338108
occupation_ Tech-support: 0.028750115069611688
occupation_ Transport-moving: -0.028940600354439825
relationship_ Husband: 0.4061575493594984
relationship_ Not-in-family: -0.19885417797573507
relationship_ Other-relative: -0.06501594009971849
relationship_ Own-child: -0.23757498804813038
relationship_ Unmarried: -0.14024050608850192
relationship_ Wife: 0.1420604346445593
race_ Amer-Indian-Eskimo: -0.008333353816972588
race_ Asian-Pac-Islander: 0.025033194367765875
race_ Black: -0.0948525355873095
race_ Other: -0.03510761894304983
race_ White: 0.07806766451714084
sex_ Female: -0.2328088422359661
sex_ Male: 0.2328088422359661
native-country_ ?: 0.0016522491510572663
native-country_ Canada: -0.013599553769360636
native-country_ China: 0.006196012124905217
native-country_ Columbia: -0.025225359280069606
native-country_ Cuba: 0.014514922783564569
native-country_ Dominican-Republic: -0.01782809293085056
native-country_ Ecuador: -0.017828092930850536
native-country_ El-Salvador: -0.030910119320210403
native-country_ England: -0.004799190614893518
native-country_ France: -0.01782809293085055
native-country_ Germany: 0.021324006502969322
native-country_ Guatemala: -0.012603211844650792
native-country_ Honduras: -0.01782809293085054
native-country_ Hong: 0.019159651738963952
native-country_ Hungary: -0.01260321184465079
native-country_ India: 0.027109411010531183
native-country_ Italy: 0.0013335626599739485
native-country_ Jamaica: -0.02522535928006971
native-country_ Japan: 0.03857552347711625
native-country_ Mexico: -0.07410292231576877
native-country_ Nicaragua: -0.02184033161625304
native-country_ Outlying-US(Guam-USVI-etc): -0.012603211844650792
native-country_ Philippines: 0.016773030554250296
native-country_ Poland: -0.025225359280069637
native-country_ Portugal: -0.017828092930850522
native-country_ Puerto-Rico: -0.01718814095077636
native-country_ Scotland: -0.01260321184465079
native-country_ South: -0.00953382656764428
native-country_ Taiwan: 0.01915965173896403
native-country_ Thailand: -0.01260321184465079
native-country_ Trinadad&Tobago: 0.019159651738963866
native-country_ United-States: 0.04836719047781544
native-country_ Vietnam: -0.030910119320210382

Correlation only picks up relationships that look roughly linear.
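To find the strongest of these (roughly linear) signals, we can rank the dummy columns by absolute correlation; a quick sketch, using pandas' corrwith:

corrs = dummies.corrwith(target)            # same numbers as the loop above
corrs.abs().sort_values(ascending=False).head(10)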

Forward stepwise feature selection

In [94]:
from sklearn.tree import DecisionTreeClassifier  
In [102]:
columns = list(features.columns)
selected_features = []
scores = []
N = 10
while len(selected_features) < N:
    best_score = pd.Series([0])   # placeholder so .mean() works on the first comparison
    best_feature = None
    for feature in columns:
        # score the already-selected features plus one candidate feature
        score = cross_val_score(DecisionTreeClassifier(max_depth=10),
                               pd.get_dummies(features[selected_features + [feature]]), target)
        if score.mean() > best_score.mean():
            best_feature = feature
            best_score = score
    print(f'{best_feature}: {best_score} ({best_score.mean()})')
    columns.remove(best_feature)           # greedily commit to the best candidate
    selected_features.append(best_feature)
    scores.append(best_score)
cap-gain: [0.8225 0.835  0.7975 0.7975 0.8   ] (0.8105)
cap-loss: [0.8325 0.8375 0.815  0.8    0.81  ] (0.8190000000000002)
martial-status: [0.8325 0.845  0.82   0.805  0.8125] (0.8230000000000001)
occupation: [0.8425 0.855  0.8525 0.825  0.8475] (0.8445)
sex: [0.84   0.8475 0.855  0.825  0.85  ] (0.8434999999999999)
relationship: [0.8375 0.8475 0.8425 0.8275 0.85  ] (0.841)
workclass: [0.8375 0.855  0.8375 0.835  0.84  ] (0.841)
race: [0.8375 0.8625 0.8375 0.8275 0.81  ] (0.8350000000000002)
native-country: [0.835  0.865  0.8275 0.8225 0.81  ] (0.8320000000000001)
age: [0.835  0.8425 0.835  0.8075 0.8225] (0.8285)
In [104]:
plt.plot([s.mean() for s in scores])
plt.plot([s.min() for s in scores])
plt.plot([s.max() for s in scores])
Out[104]:
[<matplotlib.lines.Line2D at 0x2a5c08323c8>]

Let's see what results we get on a completely random data set. Hint: there should be no real relationship between the features and the target!

In [105]:
import numpy
In [107]:
N = 200
random_features = pd.DataFrame(numpy.random.normal(size=(N,N)))
random_target = numpy.random.choice([0, 1], size=N)
In [109]:
from sklearn.model_selection import train_test_split
Xtr, Xte, ytr, yte = train_test_split(random_features, random_target, test_size=0.2)
In [112]:
columns = list(Xtr.columns)
selected_features = []
scores = []
N = 5
while len(selected_features) < N:
    best_score = pd.Series([0])
    best_feature = None
    for feature in columns:
        score = cross_val_score(DecisionTreeClassifier(max_depth=10),
                               pd.get_dummies(Xtr[selected_features + [feature]]), ytr)
        if score.mean() > best_score.mean():
            best_feature = feature
            best_score = score
    print(f'{best_feature}: {best_score} ({best_score.mean()})')
    columns.remove(best_feature)
    selected_features.append(best_feature)
    scores.append(best_score)
91: [0.71875 0.75    0.4375  0.71875 0.46875] (0.61875)
9: [0.6875  0.59375 0.46875 0.65625 0.65625] (0.6125)
141: [0.5625  0.6875  0.46875 0.78125 0.6875 ] (0.6375)
121: [0.65625 0.71875 0.53125 0.6875  0.6875 ] (0.65625)
108: [0.6875  0.71875 0.53125 0.78125 0.6875 ] (0.68125)
In [113]:
plt.plot([s.mean() for s in scores])
Out[113]:
[<matplotlib.lines.Line2D at 0x2a5c0f12f60>]
In [114]:
model = DecisionTreeClassifier(max_depth=10).fit(Xtr[selected_features], ytr)
In [115]:
from sklearn.metrics import accuracy_score
In [117]:
accuracy_score(yte, model.predict(Xte[selected_features]))
Out[117]:
0.475

When doing feature selection:

  • Use a pipeline (see the sketch after this list)
  • Cross-validate
  • Use your brain to see if the features make sense
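A minimal sketch of the first two points, putting a feature selector (here univariate SelectKBest, for simplicity, rather than the stepwise loop above) inside the pipeline, so that each CV fold selects features on its own training portion only:

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif

select_pipe = Pipeline(steps=[('select', SelectKBest(f_classif, k=10)),
                              ('tree', DecisionTreeClassifier(max_depth=10))])
cross_val_score(select_pipe, pd.get_dummies(features), target, cv=5).mean()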
In [120]:
cv_scores_r = cross_val_score(DecisionTreeClassifier(max_depth=10), random_features[selected_features], random_target, cv=30)
In [121]:
cv_scores_r
Out[121]:
array([0.71428571, 0.28571429, 0.28571429, 0.42857143, 0.42857143,
       0.14285714, 0.28571429, 0.71428571, 0.42857143, 0.57142857,
       0.71428571, 0.57142857, 0.71428571, 0.57142857, 0.57142857,
       0.42857143, 0.85714286, 0.71428571, 0.28571429, 0.28571429,
       0.66666667, 0.66666667, 0.5       , 0.33333333, 0.33333333,
       0.5       , 0.16666667, 1.        , 0.5       , 0.66666667])
In [122]:
plt.hist(cv_scores_r)
Out[122]:
(array([2., 5., 2., 4., 3., 4., 8., 0., 1., 1.]),
 array([0.14285714, 0.22857143, 0.31428571, 0.4       , 0.48571429,
        0.57142857, 0.65714286, 0.74285714, 0.82857143, 0.91428571,
        1.        ]),
 <a list of 10 Patch objects>)

An R-ish library: statsmodels

In [126]:
import statsmodels.formula.api as smf
In [127]:
features
Out[127]:
[DataFrame output truncated: 2000 rows × 14 columns — age, workclass, fnlwgt, education, education-num, martial-status, occupation, relationship, race, sex, cap-gain, cap-loss, hours-per-week, native-country]

In [129]:
X = features.copy()
In [132]:
X['target'] = target
In [133]:
X
Out[133]:
[DataFrame output truncated: 2000 rows × 15 columns — the same 14 feature columns plus the appended target column]

In [138]:
sm_fit = smf.logit("target ~ age + Q('education-num') + sex + race", data=X).fit()  # logistic regression; Q() quotes column names that aren't valid Python identifiers
Optimization terminated successfully.
         Current function value: 0.436169
         Iterations 7
In [139]:
sm_fit.summary()
Out[139]:
Logit Regression Results
Dep. Variable: target No. Observations: 2000
Model: Logit Df Residuals: 1992
Method: MLE Df Model: 7
Date: Fri, 04 Sep 2020 Pseudo R-squ.: 0.2102
Time: 16:31:32 Log-Likelihood: -872.34
converged: True LL-Null: -1104.5
Covariance Type: nonrobust LLR p-value: 3.873e-96
coef std err z P>|z| [0.025 0.975]
Intercept -7.5958 0.782 -9.710 0.000 -9.129 -6.063
sex[T. Male] 1.4136 0.150 9.412 0.000 1.119 1.708
race[T. Asian-Pac-Islander] 0.0090 0.754 0.012 0.990 -1.468 1.486
race[T. Black] -0.5895 0.735 -0.803 0.422 -2.029 0.850
race[T. Other] -0.7432 1.048 -0.709 0.478 -2.797 1.311
race[T. White] -0.1541 0.694 -0.222 0.824 -1.513 1.205
age 0.0406 0.005 8.879 0.000 0.032 0.050
Q('education-num') 0.3711 0.027 13.873 0.000 0.319 0.424
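The fitted model can also be used for prediction: with the formula API, predict takes a DataFrame with the original column names and returns the predicted probability of target = 1. A small sketch:

sm_fit.predict(X.head())   # predicted P(income > 50K) for the first few rows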