Like kNN, decision trees are very intuitive; the book uses 20 Questions as an analogy. Decision trees are also easy to visualize, so we can double-check a decision by following the tree. There is a whole section on how to plot trees using Matplotlib that I skipped. The book uses the well-known contact lens data set to illustrate the decision tree in action. Again, I would have liked a more interesting data set to get a better idea of when a decision tree would be a good choice. The algorithm used is ID3, which doesn't prune, so there can be issues with overfitting. In terms of implementation, this proved more difficult than kNN, partly due to the weak explanation of entropy and information gain.

On that note, Udacity recently released an Intro to Machine Learning course that does a very good job of explaining decision trees. The videos on entropy and information gain are clearer than the explanations I've read in various books on the algorithm. Udacity focuses on intuition; there's not much theory, math, or even implementation, which is fine for an introductory course. Scikit Learn is used for the hands-on exercises, which is nice since those skills can immediately be applied to real-world situations. Scikit Learn has an especially nice API design: applying a basic algorithm is essentially the same three lines every time.

**Decision trees with Scikit Learn**
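A sketch of that three-line pattern (the toy features and labels here are hypothetical stand-ins, encoding the fish table from later in this post as 1/0):

```python
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in data: [can survive without surfacing, has flippers] -> is fish
features = [[1, 1], [1, 1], [1, 0], [0, 1], [0, 0]]
labels = [1, 1, 0, 0, 0]

# The same three lines: construct, fit, predict
clf = DecisionTreeClassifier()
clf.fit(features, labels)
print(clf.predict([[1, 1]]))  # → [1]
```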

**Naive Bayes with Scikit Learn**
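Same pattern, different import (again with hypothetical toy data):

```python
from sklearn.naive_bayes import GaussianNB

# Toy stand-in data: [can survive without surfacing, has flippers] -> is fish
features = [[1, 1], [1, 1], [1, 0], [0, 1], [0, 0]]
labels = [1, 1, 0, 0, 0]

# Construct, fit, predict
clf = GaussianNB()
clf.fit(features, labels)
print(clf.predict([[1, 1]]))  # → [1]
```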

**Support Vector Classification with Scikit Learn**
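And once more for a support vector classifier; only the import and constructor change (toy data again, default kernel and parameters):

```python
from sklearn.svm import SVC

# Toy stand-in data: [can survive without surfacing, has flippers] -> is fish
features = [[1, 1], [1, 1], [1, 0], [0, 1], [0, 0]]
labels = [1, 1, 0, 0, 0]

# Construct, fit, predict
clf = SVC()
clf.fit(features, labels)
predictions = clf.predict([[1, 1], [0, 0]])
```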

With that said, the hard part is the data wrangling and feature set development. Experimenting with and learning about the multitude of parameters for each algorithm is also somewhat daunting.

Overall, Udacity's Intro to Machine Learning course is well done so far, if a bit too easy, as is generally the case with their courses. The mini projects don't require much effort, but they do highlight important nuances of the algorithms.

Back to Machine Learning in Action. I like that dead-simple examples are used when walking through the initial implementation of the algorithms. Decision trees are no different. The example given is a data set that determines whether an animal is a fish or not based on two features: ‘Can survive without coming to surface?’ and ‘Has flippers’.

| Can survive without coming to surface? | Has flippers? | Fish? |
|---|---|---|
| Yes | Yes | Yes |
| Yes | Yes | Yes |
| Yes | No | No |
| No | Yes | No |
| No | No | No |

The hard part with decision trees is understanding entropy and information gain. I think most people get the idea of how a decision tree works when the tree diagram is shown and you can follow the steps that the algorithm makes.

Entropy and information gain are important because they determine the optimal split for a tree. In general, a shallow tree is better than a deep tree: it reaches a decision in fewer steps and is less prone to overfitting.

Entropy can be described as a measure of the impurity of the data. Pure data has the same classification for all examples, giving an entropy of zero. If we had an even split of labels, the data would be maximally impure.

**Entropy equation**

`H = -Σ p(xi) log2 p(xi)`

In this equation, `p(xi)` stands for the probability of `xi`, where `xi` is a distinct label in the training set.

In the example data set, there are two classifications: fish or not fish.

First, what is the probability that one of the examples is a fish? The answer is `2/5`.

Second, what is the probability that one of the examples is not a fish? The answer is `3/5`.

Now we have enough to calculate the entropy of the entire data set and can just plug the numbers into the equation:
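Worked out with a couple of lines of Python, the entropy of the full data set comes to roughly 0.971:

```python
import math

p_fish, p_not_fish = 2/5, 3/5
entropy = -(p_fish * math.log2(p_fish) + p_not_fish * math.log2(p_not_fish))
print(round(entropy, 3))  # → 0.971
```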

**Deciding on what feature to split on**

The decision to split on a particular feature is based on information gain. We want to split on the feature that provides the most information gain.

Continuing with the fish data set, we can visualize the information gain as follows:

**Feature 1: Can survive without coming to surface?**

There are two options, Yes/No, for this feature. This means we will have two branches.

Branch 1 (Yes):

| Can survive without coming to surface? | Fish? |
|---|---|
| Yes | Yes |
| Yes | Yes |
| Yes | No |

Branch 2 (No):

| Can survive without coming to surface? | Fish? |
|---|---|
| No | No |
| No | No |
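The information gain for this feature is the entropy of the full data set minus the weighted average of the two branch entropies. Checking the arithmetic in Python, the gain works out to roughly 0.420:

```python
import math

def entropy(probs):
    """Shannon entropy of a list of label probabilities."""
    return -sum(p * math.log2(p) for p in probs if p)

base = entropy([2/5, 3/5])        # full data set: 2 fish, 3 not fish
yes_branch = entropy([2/3, 1/3])  # 3 examples: 2 fish, 1 not fish
no_branch = entropy([2/2])        # 2 examples: all not fish -> entropy 0

# Weight each branch by its share of the examples
weighted = 3/5 * yes_branch + 2/5 * no_branch
gain = base - weighted
print(round(gain, 3))  # → 0.42
```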


```python
import math


def loadDataSet(filepath):
    """
    Loads data set as a list of lists

    Args:
        filepath: Path to data file in TSV format
    Returns:
        Data set as list of lists
    """
    with open(filepath, 'r') as infile:
        return [line.strip().split('\t') for line in infile]


def calcShannonEntropy(dataSet):
    """
    Calculates Shannon entropy of classifications in data

    Args:
        dataSet: Data set
    Returns:
        Entropy of data set
    """
    total = float(len(dataSet))
    labels = {}
    for data in dataSet:
        if data[-1] not in labels:
            labels[data[-1]] = 0.0
        labels[data[-1]] += 1
    entropy = 0
    for count in labels.values():
        entropy -= (count / total) * math.log(count / total, 2)
    return entropy


def splitDataSet(dataSet, axis, value):
    """
    Splits data set on a given column containing a specified value.
    Returns the data set minus the specified column.

    Args:
        dataSet: Data set
        axis: Column of value to split on
        value: Value to split on
    Returns:
        Data set split by value and minus the specified column.
    """
    subSet = []
    for data in dataSet:
        if data[axis] == value:
            subSet.append(data[:axis] + data[axis + 1:])
    return subSet


def chooseBestFeatureToSplit(dataSet):
    """
    Split on the feature with the largest information gain.

    Args:
        dataSet: Data set
    Returns:
        Column index of best feature
    """
    total = float(len(dataSet))
    features = [{} for _ in range(len(dataSet[0][:-1]))]
    baseGain = calcShannonEntropy(dataSet)
    infoGain = [0] * len(features)
    for data in dataSet:
        result = data[-1]
        for i, feature in enumerate(data[:-1]):
            if feature not in features[i]:
                features[i][feature] = []
            features[i][feature].append((feature, result))
    for i, feature in enumerate(features):
        entropy = 0
        for branch in feature.values():
            entropy += len(branch) / total * calcShannonEntropy(branch)
        infoGain[i] = baseGain - entropy
    maxGain = max(infoGain)
    return infoGain.index(maxGain)


def createTree(dataSet, labels):
    """
    Creates decision tree using ID3 algorithm

    Args:
        dataSet: Data set
        labels: Labels of features/columns
    Returns:
        Decision tree to use for classification
    """
    classifications = [example[-1] for example in dataSet]
    # All examples share a classification: return it as a leaf
    if all(classifications[0] == c for c in classifications):
        return classifications[0]
    # No features left to split on: return the majority classification
    if len(dataSet[0]) == 1:
        return max(set(classifications), key=classifications.count)
    bestFeature = chooseBestFeatureToSplit(dataSet)
    tree = {labels[bestFeature]: {}}
    for value in set(example[bestFeature] for example in dataSet):
        subset = splitDataSet(dataSet, bestFeature, value)
        tree[labels[bestFeature]][value] = createTree(
            subset, labels[:bestFeature] + labels[bestFeature + 1:])
    return tree


def classify(inputData, tree, labels):
    """
    Classify given data using decision tree

    Args:
        inputData: Input data to classify
        tree: Decision tree
        labels: Labels for features/columns
    """
    if not isinstance(tree, dict):
        return tree
    label = next(iter(tree))  # tree.keys()[0] only works in Python 2
    labelIndex = labels.index(label)
    return classify(inputData, tree[label][inputData[labelIndex]], labels)


def main():
    filepath = '../sample/Ch03/lenses.txt'
    labels = ['age', 'prescription', 'astigmatic', 'tearRate']
    dataSet = loadDataSet(filepath)
    tree = createTree(dataSet, labels)
    print(tree)


if __name__ == '__main__':
    main()
```