Sections 2.2 and 2.3 of Machine Learning in Action go over two more examples. These examples are a bit more realistic, but as a reader I still get the feeling that they are a bit contrived. In my brief experience with machine learning, the real challenge is data wrangling. Often the data is a mess: there are missing values, not enough data points, or data that would have been useful but was never collected. That said, these two examples do illustrate some genuine data wrangling challenges, in particular normalization and image data.

The first example involves a dating dataset that includes three features:

- Number of frequent flyer miles earned per year
- Percentage of time spent playing video games
- Liters of ice cream consumed per week

There are also three classifications:

- “Didn’t Like”
- “Like in small doses”
- “Like in large doses”

The features used here illustrate the wide range of scales that data can come in, unlike the first example, which used simple x and y coordinates. Frequent flyer miles can be in the thousands, whereas the percentage of time playing video games is between 0 and 100, and most people aren't going to consume more than a few liters of ice cream a week. Without normalization, the frequent flyer miles dominate the distance calculation simply because those values are largest. We want each feature to carry equal weight, hence the normalization.

The normalization calculation is straightforward. We want all values to be between 0 and 1, so we can use the following equation:

```
norm_value = (value - min_value) / (max_value - min_value)
```
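As a quick sanity check, here is that formula applied column-wise with pandas to a toy frame (the numbers are made up for illustration):

```python
import pandas as pd

# Two features on wildly different scales (values invented for illustration)
df = pd.DataFrame({
    'miles': [40000, 8000, 26000],
    'games_pct': [8.3, 0.5, 1.4],
})

# Min-max normalization: every column ends up in [0, 1]
norm_df = (df - df.min()) / (df.max() - df.min())
print(norm_df)
```

Subtracting a Series from a DataFrame aligns on column names, so each column is normalized by its own min and range.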

The second example involves handwritten digit recognition. Each digit is a 32x32 black-and-white image. For simplicity, Harrington has extracted the image data and converted each image to a text file.
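Each text file is just 32 lines of 32 characters, each a '0' or '1'. A loader along these lines (the function name `img2vector` is borrowed from the book; the exact file layout is assumed from the description above) flattens one file into a 1,024-element vector:

```python
import numpy as np

def img2vector(filepath):
    """Flatten a 32x32 text 'image' of 0/1 characters into a length-1024 vector."""
    vector = np.zeros(1024)
    with open(filepath) as infile:
        for row, line in enumerate(infile):
            for col, char in enumerate(line.strip()):
                vector[row * 32 + col] = int(char)
    return vector
```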

Each pixel in the image is considered one feature, giving us 1,024 features to calculate. From there we can simply plug the data into the kNN algorithm to classify the digits. It ends up working very well, with an error rate of 0.011628, or about 1.2%.

Error rate is calculated as total errors divided by the number of test examples. In this case 11 errors were made out of 946 examples.
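In code that calculation is a one-liner; with hypothetical predicted and actual labels as pandas Series:

```python
import pandas as pd

# Hypothetical predictions vs. true labels
predicted = pd.Series(['7', '7', '1', '9'])
actual = pd.Series(['7', '1', '1', '9'])

# Elementwise comparison gives a boolean Series; True counts as 1 when summed
errors = (predicted != actual).sum()
error_rate = errors / len(actual)
print(error_rate)  # 1 error out of 4 -> 0.25
```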

For the dating data set example, the book plots scatter plots with matplotlib. This helps visualize possible clusters in the data and provides intuition about what may occur before running kNN. I ended up using the Python port of ggplot to graph the scatter plots, since I find that interface a bit simpler and the output better looking. Here are the graphs.

The third scatter plot of video games to frequent flyer miles shows three clear clusters, while the other two graphs are more mixed.

For examples 2.2 and 2.3, I rewrote the kNN algorithm using Pandas. The book uses pure numpy.
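For contrast, a pure-numpy `classify` looks roughly like this. This is a sketch in the spirit of the book's listing, not a verbatim copy; broadcasting stands in for the book's explicit tiling of the input:

```python
import numpy as np

def classify_np(input_data, training_set, labels, k=1):
    # Euclidean distance from the input to every training example;
    # broadcasting subtracts input_data from each row of training_set
    diff = training_set - input_data
    distances = np.sqrt((diff ** 2).sum(axis=1))
    # Indices of the k nearest neighbors
    nearest = distances.argsort()[:k]
    # Majority vote among the neighbor labels
    votes = {}
    for i in nearest:
        votes[labels[i]] = votes.get(labels[i], 0) + 1
    return max(votes, key=votes.get)
```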

Here is the Pandas implementation with comments removed:


```
def classify(input_data, training_set, labels, k=1):
    distance_diff = training_set - input_data
    distance_squared = distance_diff**2
    distance = distance_squared.sum(axis=1)**0.5
    distance_df = pd.concat([distance, labels], axis=1, keys=['distance', 'label'])
    distance_df = distance_df.sort_values(by='distance')
    top_knn = distance_df[:k]
    return top_knn['label'].value_counts().index.values[0]
```

1. For each training example, subtract the input data's features.
2. Square each feature difference.
3. Sum the squared differences for each example and take the square root, giving the Euclidean distance.
4. Combine the classification labels with the distance of each example.
5. With the classifications linked to the distances, sort the examples from closest to farthest.
6. Extract the top k closest examples.
7. Return the most common classification among those neighbors.

Here is the full kNN implementation using Pandas. The dating data set is used here:


```
from ggplot import ggplot, aes, geom_point
import numpy as np
import pandas as pd

dating_test_set = '../sample/Ch02/datingTestSet.txt'

column_names = [
    'Number of frequent flyer miles earned per year',
    'Percentage of time spent playing video games',
    'Liters of ice cream consumed per week',
    'Category'
]


def load_file(filepath):
    """
    Loads data in tab-separated format

    Args:
        filepath: Location of data file

    Returns:
        Data and labels extracted from text file
    """
    data = []
    labels = []
    with open(filepath) as infile:
        for line in infile:
            row = line.strip().split('\t')
            data.append(row[:-1])
            labels.append(row[-1])
    return data, labels


def normalize(df):
    """
    Normalizes data to give equal weight to each feature.

    General formula:
        norm_value = (value - min_value) / (max_value - min_value)

    Args:
        df: Pandas data frame with unnormalized data

    Returns:
        Normalized data frame, range of values, min values
    """
    min_values = df.min()
    max_values = df.max()
    range_values = max_values - min_values
    norm_df = (df - min_values) / range_values
    return norm_df, range_values, min_values


def classify(input_data, training_set, labels, k=1):
    """
    Uses the kNN algorithm to classify input data given a set of
    known data.

    Args:
        input_data: Pandas Series of input data
        training_set: Pandas data frame of training data
        labels: Pandas Series of classifications for training set
        k: Number of neighbors to use

    Returns:
        Predicted classification for given input data
    """
    distance_diff = training_set - input_data
    distance_squared = distance_diff**2
    distance = distance_squared.sum(axis=1)**0.5
    distance_df = pd.concat([distance, labels], axis=1, keys=['distance', 'label'])
    distance_df = distance_df.sort_values(by='distance')
    top_knn = distance_df[:k]
    return top_knn['label'].value_counts().index.values[0]


def plot(df, x, y, color):
    """
    Scatter plot with two of the features (x, y) grouped by classification (color)

    Args:
        df: Data frame of data
        x: Feature to plot on x axis
        y: Feature to plot on y axis
        color: Group by this column
    """
    print(ggplot(df, aes(x=x, y=y, color=color)) + geom_point())


def main():
    # Load data
    raw_data, raw_labels = load_file(dating_test_set)

    # Convert data to Pandas data structures
    labels = pd.Series(raw_labels, name=column_names[3])
    df = pd.DataFrame.from_records(np.array(raw_data, np.float32), columns=column_names[:3])

    # Normalize the numeric features since their ranges differ
    norm_df, range_values, min_values = normalize(df)

    # Attach the labels for plotting
    df[column_names[3]] = labels

    # Use first 10% of data for testing
    num_test_rows = int(norm_df.shape[0] * .1)

    # 90% training data
    training_df = norm_df[num_test_rows:]
    training_labels = labels[num_test_rows:]

    # 10% test data
    test_df = norm_df[:num_test_rows]
    test_labels = labels[:num_test_rows]

    # Apply kNN algorithm to all test data
    result_df = test_df.apply(lambda row: classify(row, training_df, training_labels, k=3), axis=1)

    # Count correct vs. incorrect predictions
    error_df = result_df == test_labels
    print(error_df.value_counts())


if __name__ == '__main__':
    main()
```