These notes were written while following a lecture for IN-STK5000 H20
Using sklearn and pandas to convert categorical data to dummies (0 & 1), split into training and test sets, and scale the data
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
%matplotlib inline
Always keep in mind that in an industry setting, things will be much messier than how they are presented here.
# data = pd.read_csv('IN-STK5000-Notebooks-2020/data/adult.names')
# cannot read this as csv, since the file is not a csv, nor a zipped csv
! cat IN-STK5000-Notebooks-2020/data/adult.names
We'll need the last 14 lines of the file to get the actual column names out. This is done with a tail command (and awk to format them as quoted strings).
!tail -n 14 IN-STK5000-Notebooks-2020/data/adult.names | awk -F: '{print "\x22"$1"\x22,"}'
colnames = [
'age',
'workclass',
'fnlwgt',
'education',
'education-num',
'marital-status',
'occupation',
'relationship',
'race',
'sex',
'cap-gain',
'cap-loss',
'hours-per-week',
'native-country',
'income'
]
df = pd.read_csv('IN-STK5000-Notebooks-2020/data/adult.data.gz', names=colnames)
df
df.describe()
df.info()
We'll look into transforming categorical data from strings into dummy 0/1 variables so it's easier for the algorithms to work with.
df['target'] = df['income'] == ' >50K'
colors = ['r' if t else 'b' for t in df['target']]
df.plot.scatter('age', 'education-num', c=colors, alpha=0.2) # alpha adds transparency, so dense regions show up darker instead of solid red/blue
df.sample(10) # selects 10 random rows
def read_data(sample_size=None):
    colnames = [
        'age',
        'workclass',
        'fnlwgt',
        'education',
        'education-num',
        'marital-status',
        'occupation',
        'relationship',
        'race',
        'sex',
        'cap-gain',
        'cap-loss',
        'hours-per-week',
        'native-country',
        'income'
    ]
    df = pd.read_csv('IN-STK5000-Notebooks-2020/data/adult.data.gz', names=colnames)
    if sample_size:
        df = df.sample(sample_size)
        df.index = range(sample_size)  # reset the index of our sample (the original indices are random anyway)
    target = (df['income'] == ' >50K') * 1
    return df[colnames[:-1]], target
We will use the kNN algorithm to predict the target variable. (I love this course so much! <3)
kNN should definitely be tried out on dense data. The bigger k gets, the closer the predictions get to the overall average of the whole data set.
People often jump to more complex algorithms like random forests and neural networks before trying out k-nearest neighbors, even though kNN could give a similar model with less complexity. Random forest is basically kNN, but it decides how pure the neighborhoods are and chooses more based on that. (Like he said earlier: more complexity for a little bit more correctness.)
He talked a little about gerrymandering, since kNN is somewhat of a real-life example of this.
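To make the "large k" point above concrete, here is a minimal sketch on hypothetical toy data (not the adult set): as k approaches the number of training points, every prediction collapses to the majority class.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
rng = np.random.default_rng(0)
X = rng.normal(size=(201, 2))                                    # toy data, two features
y = (X[:, 0] + rng.normal(scale=0.5, size=201) > 0).astype(int)  # noisy label based on the first feature
for k in (1, 25, 201):
    preds = KNeighborsClassifier(n_neighbors=k).fit(X, y).predict(X)
    print(k, np.bincount(preds))                                 # with k=201 only one class is ever predicted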
features, target = read_data(5000) # running our function, splitting the target (income) out of the data set
features[['age', 'education-num']].iloc[43] # iloc selects by integer position - this is just row 43 of the sample
sum((features[['age', 'education-num']].iloc[42] - features[['age', 'education-num']].iloc[43])**2) # squared Euclidean distance between rows 42 and 43
Note: I get different numbers than him, since my random sample is obviously going to be different from his.
from sklearn.neighbors import DistanceMetric
DistanceMetric.get_metric('wminkowski', w=[3,1], p=2)
DistanceMetric.get_metric('wminkowski', w=[3,1], p=2).pairwise(features[['age', 'education-num']].iloc[:3])
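A quick sanity check of what the weights do (a sketch; it assumes the weighted Minkowski formula D(x, y) = (sum_i |w_i (x_i - y_i)|^p)^(1/p), i.e. the weight scales each coordinate difference, so 'age' counts three times as much here):
import numpy as np
a = features[['age', 'education-num']].iloc[0].to_numpy(dtype=float)
b = features[['age', 'education-num']].iloc[1].to_numpy(dtype=float)
w = np.array([3, 1])
np.sqrt(np.sum((w * (a - b))**2))  # should match the [0, 1] entry of the pairwise matrix above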
There are some categories between which it is impossible to actually measure a distance (Bachelors vs Masters vs HS-grad). Ontologies can be used later to help extract this info.
from sklearn.preprocessing import OneHotEncoder
features['sex'].unique()
encoder = OneHotEncoder(sparse=False, drop='first').fit(features[['sex']])
encoder.transform(features[['sex']])[:10]
In scikit-learn, this fit/transform pattern is the general interface for this kind of preprocessing, the first step toward ML.
pd.get_dummies(features[['sex']], drop_first=True) # same thing just with pandas (honestly looks easier)
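A quick check (a sketch) that the two approaches agree: both should drop the alphabetically first level (' Female') and keep a single ' Male' column.
import numpy as np
np.allclose(encoder.transform(features[['sex']]).ravel(),
            pd.get_dummies(features[['sex']], drop_first=True).to_numpy().ravel())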
def transform_features(features):
    cat_columns = ['sex', 'education', 'race']  # just some random, probably interesting columns
    cont_columns = ['age', 'education-num']     # same ones we used earlier
    return pd.get_dummies(features[cat_columns + cont_columns],
                          columns=cat_columns, drop_first=True)
transformed_features = transform_features(features)
transformed_features
Now we see how the categorical string columns were transformed into numeric dummy columns. Cool stuff!
We'll now split into training/test data sets, using a scikit-learn method.
from sklearn.model_selection import train_test_split
features_train, features_test, target_train, target_test = train_test_split(
transformed_features, target, test_size=0.3) # split into 70/30 split
Sometimes it's good to make things reproducible. If so, we should set a random seed so that we get the same results every time it's run. (Development purposes only.)
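A minimal sketch of what that looks like (42 is an arbitrary seed, and the variable names here are just for the check; this doesn't replace the split above):
reproducible_a, _, _, _ = train_test_split(transformed_features, target, test_size=0.3, random_state=42)
reproducible_b, _, _, _ = train_test_split(transformed_features, target, test_size=0.3, random_state=42)
reproducible_a.equals(reproducible_b)  # True - fixing random_state gives the same split on every run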
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(5).fit(features_train, target_train)
model.predict(features_test)
(model.predict(features_test) == target_test).mean() # score how well the model does
from sklearn import metrics
metrics.accuracy_score(target_test, model.predict(features_test)) # same results as above
metrics.plot_confusion_matrix(model, features_test, target_test) # errored for me at first - plot_confusion_matrix requires scikit-learn >= 0.22, hence the upgrade below
Results should look something like this
!pip install --upgrade scikit-learn
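If upgrading isn't an option, the underlying numbers are available without the plotting helper (a sketch using metrics.confusion_matrix, which exists in older versions too):
metrics.confusion_matrix(target_test, model.predict(features_test))  # rows: true class, columns: predicted class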
probs = model.predict_proba(features_test)
probs
plt.hist(probs[:,0])
probs[:,0] > 0.8
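A sketch of turning a custom threshold into predictions: column 1 of predict_proba is the probability of the positive class (income >50K), since the columns follow model.classes_.
custom_preds = (probs[:, 1] > 0.8).astype(int)  # predict 1 only when the model is quite sure
metrics.accuracy_score(target_test, custom_preds)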
metrics.plot_roc_curve(model, features_test, target_test)
Should look like this
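The area under that ROC curve can also be computed directly from the positive-class probabilities (a sketch):
metrics.roc_auc_score(target_test, model.predict_proba(features_test)[:, 1])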
print(metrics.classification_report(target_test, model.predict(features_test)))
https://sklearn.org/modules/preprocessing.html#preprocessing
It's much easier to understand if you use standard libraries instead of coding your own algorithms, even though coding your own might be more of a learning experience for you.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(transformed_features)
scaler.transform(transformed_features) # StandardScaler standardizes the data set (zero mean, unit variance per column)
transformed_features = pd.DataFrame(scaler.transform(transformed_features), columns=transformed_features.columns)
transformed_features.std()
transformed_features['sex_ Male']
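Note that after scaling the dummy column is no longer 0/1; it now takes two standardized values (a quick check):
transformed_features['sex_ Male'].unique()  # two values: roughly -mean/std and (1 - mean)/std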
df['age'].plot.hist()
transformed_features['age'].plot.hist()
Notice the two shapes are exactly the same, while the scale has changed.
features_train, features_test, target_train, target_test = train_test_split(
transformed_features, target, test_size=0.3) # split into 70/30 split
model = KNeighborsClassifier(20).fit(features_train, target_train)
metrics.accuracy_score(target_test, model.predict(features_test))
ks = list(range(5, 60, 5))
models = [KNeighborsClassifier(k).fit(features_train, target_train) for k in ks]
scores = [metrics.accuracy_score(target_test, m.predict(features_test)) for m in models]
plt.plot(ks, scores)
train_scores = [metrics.accuracy_score(target_train, m.predict(features_train)) for m in models]
plt.plot(ks, train_scores)
The ideal k here seems to be around 30. Too big a k will cause us to lose the structure of our data (predictions converge to the average of the entire data set). Too small, and we haven't captured what the data set is actually showing us.
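A small follow-up sketch: pick the k with the best held-out accuracy from the sweep programmatically instead of eyeballing the plot (a proper version would use cross-validation rather than this single test split).
best_k = ks[scores.index(max(scores))]  # k that maximizes test accuracy in the sweep above
best_k, max(scores)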