These notes were written while following a lecture for IN-STK5000 H20
Using sklearn and pandas to convert categorical data to dummies (0 & 1), split into training and test sets, and scale the data
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
%matplotlib inline
Always keep in mind that in an industry setting, things will be much messier than how they are presented here.
# data = pd.read_csv('IN-STK5000-Notebooks-2020/data/adult.names')
# cannot read this as csv, since the file is not a csv, nor a zipped csv
! cat IN-STK5000-Notebooks-2020/data/adult.names
We'll need the last 14 lines of the file to get the actual column names out. This is done with a tail command (and awk to format them as quoted strings).
!tail -n 14 IN-STK5000-Notebooks-2020/data/adult.names | awk -F: '{print "\x22"$1"\x22,"}'
colnames = [
'age',
'workclass',
'fnlwgt',
'education',
'education-num',
'marital-status',
'occupation',
'relationship',
'race',
'sex',
'cap-gain',
'cap-loss',
'hours-per-week',
'native-country',
'income'
]
df = pd.read_csv('IN-STK5000-Notebooks-2020/data/adult.data.gz', names=colnames)
df
df.describe()
df.info()
We'll look into transforming categorical data from strings into dummy 0/1 variables so it's easier for the algorithms to work with.
df['target'] = df['income'] == ' >50K'
colors = ['r' if t else 'b' for t in df['target']]
df.plot.scatter('age', 'education-num', c=colors, alpha=0.2) # alpha adds transparency, so dense regions show up darker instead of solid red/blue
df.sample(10) # selects 10 random rows
def read_data(sample_size=None):
    colnames = [
        'age',
        'workclass',
        'fnlwgt',
        'education',
        'education-num',
        'marital-status',
        'occupation',
        'relationship',
        'race',
        'sex',
        'cap-gain',
        'cap-loss',
        'hours-per-week',
        'native-country',
        'income'
    ]
    df = pd.read_csv('IN-STK5000-Notebooks-2020/data/adult.data.gz', names=colnames)
    if sample_size:
        df = df.sample(sample_size)
        df.index = range(sample_size)  # reset the index of our sample (the original indices are random anyway)
    target = (df['income'] == ' >50K') * 1
    return df[colnames[:-1]], target
We will use the kNN algorithm to predict the target variable. (I love this course so much! <3)
kNN should definitely be tried out on dense data. The bigger k gets, the closer the predictions get to the overall average of the whole data set.
People often jump to more complex algorithms like random forests and neural networks before trying out k-nearest neighbors, even though kNN could give a similar model with less complexity. Random forest is basically kNN, but it decides how pure the neighborhoods are and chooses more based on that. (Like he said earlier: more complexity for a little bit more correctness.)
He talked a little about gerrymandering, since kNN is somewhat of a real-life example of this.
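To make the "large k" point above concrete, here is a minimal sketch on hypothetical toy data (not the adult set): as k approaches the number of training points, every prediction collapses to the majority class.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
rng = np.random.default_rng(0)
X = rng.normal(size=(201, 2))                                    # toy data, two features
y = (X[:, 0] + rng.normal(scale=0.5, size=201) > 0).astype(int)  # noisy label based on the first feature
for k in (1, 25, 201):
    preds = KNeighborsClassifier(n_neighbors=k).fit(X, y).predict(X)
    print(k, np.bincount(preds))                                 # with k=201 only one class is ever predicted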
features, target = read_data(5000) # running our function, splitting the target (income) out of the data set
features[['age', 'education-num']].iloc[43] # iloc selects by integer position - this is just row 43 of the sample
sum((features[['age', 'education-num']].iloc[42] - features[['age', 'education-num']].iloc[43])**2) # squared Euclidean distance between rows 42 and 43
Note: I get different numbers than him, since my random sample is obviously going to be different from his.
from sklearn.neighbors import DistanceMetric
DistanceMetric.get_metric('wminkowski', w=[3,1], p=2)
DistanceMetric.get_metric('wminkowski', w=[3,1], p=2).pairwise(features[['age', 'education-num']].iloc[:3])
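A quick sanity check of what the weights do (a sketch; it assumes the weighted Minkowski formula D(x, y) = (sum_i |w_i (x_i - y_i)|^p)^(1/p), i.e. the weight scales each coordinate difference, so 'age' counts three times as much here):
import numpy as np
a = features[['age', 'education-num']].iloc[0].to_numpy(dtype=float)
b = features[['age', 'education-num']].iloc[1].to_numpy(dtype=float)
w = np.array([3, 1])
np.sqrt(np.sum((w * (a - b))**2))  # should match the [0, 1] entry of the pairwise matrix above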
There are some categories between which it is impossible to actually measure a distance (Bachelors vs Masters vs HS-grad). Ontologies can be used later to help extract this info.
from sklearn.preprocessing import OneHotEncoder
features['sex'].unique()
encoder = OneHotEncoder(sparse=False, drop='first').fit(features[['sex']])
encoder.transform(features[['sex']])[:10]
In scikit-learn, this fit/transform pattern is the general interface for this kind of preprocessing, the first step toward ML.
pd.get_dummies(features[['sex']], drop_first=True) # same thing just with pandas (honestly looks easier)
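A quick check (a sketch) that the two approaches agree: both should drop the alphabetically first level (' Female') and keep a single ' Male' column.
import numpy as np
np.allclose(encoder.transform(features[['sex']]).ravel(),
            pd.get_dummies(features[['sex']], drop_first=True).to_numpy().ravel())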
def transform_features(features):
    cat_columns = ['sex', 'education', 'race']  # just some random, probably interesting columns
    cont_columns = ['age', 'education-num']     # same ones we used earlier
    return pd.get_dummies(features[cat_columns + cont_columns],
                          columns=cat_columns, drop_first=True)
transformed_features = transform_features(features)
transformed_features
Now we see how the categorical string columns were transformed into numeric dummy columns. Cool stuff!
We'll now split into training/test data sets, using a scikit-learn method.
from sklearn.model_selection import train_test_split
features_train, features_test, target_train, target_test = train_test_split(
transformed_features, target, test_size=0.3) # split into 70/30 split
Sometimes it's good to make things reproducible. If so, we should set a random seed so that we get the same results every time it's run. (Development purposes only.)
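A minimal sketch of what that looks like (42 is an arbitrary seed, and the variable names here are just for the check; this doesn't replace the split above):
reproducible_a, _, _, _ = train_test_split(transformed_features, target, test_size=0.3, random_state=42)
reproducible_b, _, _, _ = train_test_split(transformed_features, target, test_size=0.3, random_state=42)
reproducible_a.equals(reproducible_b)  # True - fixing random_state gives the same split on every run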
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(5).fit(features_train, target_train)
model.predict(features_test)
(model.predict(features_test) == target_test).mean() # score how well the model does
from sklearn import metrics
metrics.accuracy_score(target_test, model.predict(features_test)) # same results as above
metrics.plot_confusion_matrix(model, features_test, target_test) # errored for me at first - plot_confusion_matrix requires scikit-learn >= 0.22, hence the upgrade below
Results should look something like this
!pip install --upgrade scikit-learn
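If upgrading isn't an option, the underlying numbers are available without the plotting helper (a sketch using metrics.confusion_matrix, which exists in older versions too):
metrics.confusion_matrix(target_test, model.predict(features_test))  # rows: true class, columns: predicted class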
probs = model.predict_proba(features_test)
probs
plt.hist(probs[:,0])
probs[:,0] > 0.8
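A sketch of turning a custom threshold into predictions: column 1 of predict_proba is the probability of the positive class (income >50K), since the columns follow model.classes_.
custom_preds = (probs[:, 1] > 0.8).astype(int)  # predict 1 only when the model is quite sure
metrics.accuracy_score(target_test, custom_preds)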
metrics.plot_roc_curve(model, features_test, target_test)
Should look like this
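The area under that ROC curve can also be computed directly from the positive-class probabilities (a sketch):
metrics.roc_auc_score(target_test, model.predict_proba(features_test)[:, 1])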
print(metrics.classification_report(target_test, model.predict(features_test)))
https://sklearn.org/modules/preprocessing.html#preprocessing
It's much easier to understand if you use standard libraries instead of coding your own algorithms, even though coding your own might be more of a learning experience for you.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(transformed_features)
scaler.transform(transformed_features) # StandardScaler standardizes the data set (zero mean, unit variance per column)
transformed_features = pd.DataFrame(scaler.transform(transformed_features), columns=transformed_features.columns)
transformed_features.std()
transformed_features['sex_ Male']
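Note that after scaling the dummy column is no longer 0/1; it now takes two standardized values (a quick check):
transformed_features['sex_ Male'].unique()  # two values: roughly -mean/std and (1 - mean)/std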
df['age'].plot.hist()
transformed_features['age'].plot.hist()
Notice the two shapes are exactly the same, while the scale has changed.
features_train, features_test, target_train, target_test = train_test_split(
transformed_features, target, test_size=0.3) # split into 70/30 split
model = KNeighborsClassifier(20).fit(features_train, target_train)
metrics.accuracy_score(target_test, model.predict(features_test))
ks = list(range(5, 60, 5))
models = [KNeighborsClassifier(k).fit(features_train, target_train) for k in ks]
scores = [metrics.accuracy_score(target_test, m.predict(features_test)) for m in models]
plt.plot(ks, scores)
train_scores = [metrics.accuracy_score(target_train, m.predict(features_train)) for m in models]
plt.plot(ks, train_scores)
The ideal k here seems to be around 30. Too big a k will cause us to lose the structure of our data (predictions converge to the average of the entire data set). Too small, and we haven't captured what the data set is actually showing us.
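A small follow-up sketch: pick the k with the best held-out accuracy from the sweep programmatically instead of eyeballing the plot (a proper version would use cross-validation rather than this single test split).
best_k = ks[scores.index(max(scores))]  # k that maximizes test accuracy in the sweep above
best_k, max(scores)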