In this note we review what we did last week, show how to use pipelines in sklearn, use sklearn's cross_val_score, look at forward stepwise feature selection, and get a first look at statsmodels, a Python library with an R-style formula interface.
Although the course presents some very formal probability theory, with measures and the like, we are free to be as formal or informal as we prefer. The point of presenting the formal definitions is to show exactly what the different concepts mean.
Most important for this course is that we learn how to calculate probabilities on the computer, and can use computers as statistical tools for analysis.
Another goal is to show how ML is done in industry: feeding data through pipelines instead of working with loose numpy arrays.
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
%matplotlib inline
#!tail -n 14 data/adult.names  # shell command; doesn't work on Windows
colnames = [
    'age',
    'workclass',
    'fnlwgt',
    'education',
    'education-num',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'cap-gain',
    'cap-loss',
    'hours-per-week',
    'native-country',
    'income'
]
def read_data(n=None):
    colnames = [
        'age',
        'workclass',
        'fnlwgt',
        'education',
        'education-num',
        'marital-status',
        'occupation',
        'relationship',
        'race',
        'sex',
        'cap-gain',
        'cap-loss',
        'hours-per-week',
        'native-country',
        'income'
    ]
    df = pd.read_csv('IN-STK5000-Notebooks-2020/data/adult.data.gz', names=colnames)
    # Here you would usually do more checks:
    # is n bigger than our data set, how big a sample size is needed, and so on.
    # For now we just assume that n is acceptable.
    if n:
        df = df.sample(n)
        df.index = range(n)
    target = (df.income == ' >50K') * 1  # binary target: 1 if income above 50K
    df.pop('income')                     # remove the target column from the features
    return df, target
features, target = read_data(2000)
Sometimes your training data may not include all the possible values of a categorical feature. If such an unseen category then shows up in your test set, your model (or the encoding step) may blow up.
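One way to guard against this (a small sketch using an option of sklearn's OneHotEncoder, not something from the original notebook) is handle_unknown='ignore', which encodes an unseen category as all zeros instead of raising an error:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit([['a'], ['b']])           # training data only contains the categories 'a' and 'b'
enc.transform([['c']]).toarray()  # the unseen category 'c' becomes an all-zero row instead of an error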
Let's look at using a pipeline in sklearn.
cat_columns = ['sex', 'education', 'race']
cont_columns = ['age', 'education-num']
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import OneHotEncoder, StandardScaler
cat_trans = Pipeline(steps=[('onehot', OneHotEncoder(drop='first'))])
# If you did not drop the first column it would still work, but one dummy column
# per category would then be redundant (perfectly predictable from the others),
# so dropping one level per category is a reasonable habit.
cont_trans = Pipeline(steps=[('scale', StandardScaler())])
feature_trans = ColumnTransformer(
    transformers=[('categorical', cat_trans, cat_columns),
                  ('continuous', cont_trans, cont_columns)])
classifier = Pipeline(steps=[('feature_transform', feature_trans),
                             ('classifier', KNeighborsClassifier(n_neighbors=35))])
classifier
from sklearn import set_config #this is somewhat a gimmick
set_config(display='diagram')
classifier
This could be pretty useful for seeing the structure of your pipelines.
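The notebook goes straight to cross-validation, but the pipeline can of course also be fitted and used directly; a minimal sketch (fitting on the full sample just to show the API):
classifier.fit(features, target)      # fits the column transformers and the kNN model in one call
classifier.predict(features.head(5))  # the same transformations are applied automatically before predicting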
from sklearn.model_selection import cross_val_score
# ?cross_val_score
cv_scores = cross_val_score(classifier, features, target, cv=20)
cv_scores
cv_scores.mean()
plt.hist(cv_scores, bins=5)
from sklearn.utils import resample
plt.hist([resample(cv_scores).mean() for _ in range(500)])
This can be run over and over; each run gives slightly different results, since the bootstrap resampling is random.
plt.hist([resample(cv_scores).mean() for _ in range(500)])
None  # same histogram as above; the trailing None just suppresses the printed return value of plt.hist
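The spread of these bootstrap means can also be summarised as a rough confidence interval; a sketch using numpy's percentile function:
import numpy
boot_means = [resample(cv_scores).mean() for _ in range(500)]
numpy.percentile(boot_means, [2.5, 97.5])  # approximate 95% interval for the mean CV accuracy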
ks = list(range(10, 100, 10))
score_results = [cross_val_score(Pipeline(steps=[('feature_transform', feature_trans),
                                                 ('classifier', KNeighborsClassifier(n_neighbors=k))]),
                                 features, target, cv=5)
                 for k in ks]
score_results
plt.plot(ks, [s.mean() for s in score_results])
plt.plot(ks, [s.mean() for s in score_results])
plt.plot(ks, [s.min() for s in score_results])
plt.plot(ks, [s.max() for s in score_results])
Green: max, Blue: mean, Orange: min.
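Instead of reading the best k off the plot, we can also pick it programmatically from the same scores (a small sketch):
import numpy
ks[int(numpy.argmax([s.mean() for s in score_results]))]  # k with the highest mean CV score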
from sklearn.model_selection import GridSearchCV
param_grid = {'classifier__n_neighbors':ks} # the name of the step and the parameter
grid_search = GridSearchCV(classifier, param_grid, cv=10)
fit_result = grid_search.fit(features, target)
This searches over the parameter grid and keeps the parameter values that give the best cross-validated score.
fit_result.best_estimator_.get_params()['steps']
Here we see that the grid search chose n_neighbors=70 as the best number of neighbors.
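The fitted grid search also exposes the chosen parameters and their cross-validated score directly:
fit_result.best_params_  # the winning setting, e.g. {'classifier__n_neighbors': 70} here
fit_result.best_score_   # its mean cross-validated accuracy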
dummies = pd.get_dummies(features)
dummies
for column in dummies:
    corr = dummies[column].corr(target)
    print(f'{column}: {corr}')
Correlation works as a quick screen when the relationship between a feature and the target is roughly linear; it can miss non-linear relationships.
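A quicker way to rank the dummy-encoded features by absolute correlation with the target, as a sketch using pandas' corrwith:
dummies.corrwith(target).abs().sort_values(ascending=False).head(10)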
from sklearn.tree import DecisionTreeClassifier
columns = list(features.columns)
selected_features = []
scores = []
N = 10
while len(selected_features) < N:
    best_score = pd.Series([0])
    best_feature = None
    for feature in columns:
        # score the already-selected features plus one candidate feature
        score = cross_val_score(DecisionTreeClassifier(max_depth=10),
                                pd.get_dummies(features[selected_features + [feature]]), target)
        if score.mean() > best_score.mean():
            best_feature = feature
            best_score = score
    print(f'{best_feature}: {best_score} ({best_score.mean()})')
    columns.remove(best_feature)
    selected_features.append(best_feature)
    scores.append(best_score)
plt.plot([s.mean() for s in scores])
plt.plot([s.min() for s in scores])
plt.plot([s.max() for s in scores])
Let's see what results we get for a completely random data set. Hint: there is no real relationship between the features and the target, so any apparent predictive power is an artefact of the selection procedure!
import numpy
N = 200
random_features = pd.DataFrame(numpy.random.normal(size=(N,N)))
random_target = numpy.random.choice([0, 1], size=N)
from sklearn.model_selection import train_test_split
Xtr, Xte, ytr, yte = train_test_split(random_features, random_target, test_size=0.2)
columns = list(Xtr.columns)
selected_features = []
scores = []
N = 5
while len(selected_features) < N:
    best_score = pd.Series([0])
    best_feature = None
    for feature in columns:
        score = cross_val_score(DecisionTreeClassifier(max_depth=10),
                                pd.get_dummies(Xtr[selected_features + [feature]]), ytr)
        if score.mean() > best_score.mean():
            best_feature = feature
            best_score = score
    print(f'{best_feature}: {best_score} ({best_score.mean()})')
    columns.remove(best_feature)
    selected_features.append(best_feature)
    scores.append(best_score)
plt.plot([s.mean() for s in scores])
model = DecisionTreeClassifier(max_depth=10).fit(Xtr[selected_features], ytr)
from sklearn.metrics import accuracy_score
accuracy_score(yte, model.predict(Xte[selected_features]))
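For comparison, always predicting the majority class of the (random) test labels gives roughly 0.5; a small sketch of that baseline:
max(yte.mean(), 1 - yte.mean())  # majority-class accuracy on the random test labels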
When doing feature selection, the evaluation must use data that was not involved in selecting the features. If we cross-validate on the same data that the selection saw (as below), the scores look far better than the honest held-out accuracy above:
cv_scores_r = cross_val_score(DecisionTreeClassifier(max_depth=10), random_features[selected_features], random_target, cv=30)
cv_scores_r
plt.hist(cv_scores_r)
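One way to avoid this kind of leakage (a sketch, not part of the original notebook) is to put the selection step inside the pipeline, so each cross-validation fold selects features using only its own training part:
from sklearn.feature_selection import SelectKBest, f_classif
leak_free = Pipeline(steps=[('select', SelectKBest(f_classif, k=5)),
                            ('classifier', DecisionTreeClassifier(max_depth=10))])
cross_val_score(leak_free, random_features, random_target, cv=30)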
import statsmodels.formula.api as smf
features
X = features.copy()
X['target'] = target
X
sm_fit = smf.logit("target ~ age + Q('education-num') + sex + race", data=X).fit() #logistic regression
sm_fit.summary()
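The logit coefficients are on the log-odds scale; exponentiating them gives odds ratios (a sketch using the result object's params attribute):
import numpy
numpy.exp(sm_fit.params)  # odds ratios for each term in the formula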