Splitting datasets one feature at a time: decision trees

Posted on March 15, 2022 by Zulqarnain

Introduction


Have you ever played a game called Twenty Questions? If not, the game works like this: one person thinks of some object, and the other players try to guess it. Players are allowed to ask 20 questions and receive only yes or no answers. With each question, the players successively narrow down the set of objects the answer could be. A decision tree works just like the game Twenty Questions: you give it a bunch of data, and it learns the series of questions that leads to an answer.

The decision tree is one of the most commonly used classification techniques; recent surveys claim that it’s the most commonly used technique.1 You don’t have to know much about machine learning to understand how it works.

If you’re not already familiar with decision trees, the concept is straightforward. Chances are good that you’ve already seen a decision tree without knowing it. The figure shows a flowchart, which is a decision tree. It has decision blocks (rectangles) and terminating blocks (ovals) where some conclusion has been reached. The right and left arrows coming out of the decision blocks are known as branches, and they can lead to other decision blocks or to a terminating block. In this particular example, I made a hypothetical email classification system, which first checks the domain of the sending email address. If this is equal to myEmployer.com, it classifies the email as “Email to read when bored.”

If it isn’t from that domain, it checks to see if the body of the email contains the word hockey. If the email contains the word hockey, then this email is classified as “Email from friends; read immediately”; if the body doesn’t contain the word hockey, then it gets classified as “Spam; don’t read.”
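To make the flowchart concrete, here is a hypothetical sketch of those same decisions as nested conditions in Python. The function name, parameters, and strings are purely illustrative, not part of any real system; the point is that a decision tree is nothing more than a series of if/else questions.

    def classify_email(sender_domain, body):
        # First question: did the email come from my employer's domain?
        if sender_domain == "myEmployer.com":
            return "Email to read when bored"
        # Second question: does the body mention hockey?
        elif "hockey" in body:
            return "Email from friends; read immediately"
        else:
            return "Spam; don't read"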

Decision trees

Pros: Computationally cheap to use, easy for humans to understand learned results, tolerant of missing values.
Cons: Prone to overfitting.
Works with: Numeric values, nominal values

Now that you know a little of what decision trees are good for, we’re going to get into the process of building them from nothing but a pile of data. In the first section, we’ll discuss methods used to construct trees and start writing code to construct a tree.
Next, we’ll address some metrics that we can use to measure the algorithm’s success. Finally, we’ll use recursion to build our classifier and plot it using Matplotlib. When we have the classifier working, we’ll take some contact lens prescription data and use our classifier to try to predict which lenses people will need.

To build a decision tree, you need to make a first decision on the dataset to dictate which feature is used to split the data. To determine this, you try every feature and measure which split will give you the best results. After that, you split the dataset into subsets, which then traverse down the branches of the first decision node. If the data on a branch is all of the same class, you’ve properly classified it and don’t need to continue splitting. If it isn’t, you repeat the splitting process on that subset. In pseudocode, a recursive function called createBranch() would look like this:

			
				    """"
                    Check if every item in the dataset is in the same class:
                    If so return the class label
                    Else 
                    find the best feature to split the data
                    split the dataset 
                    create a branch node
                    for each split
                    call createBranch and add the result to the branch node
                    return branch node
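
The “split the dataset” step above can be sketched as a small helper that pulls out the records matching one value of the chosen feature and drops that feature from each record. The name splitDataSet and the list-of-lists data layout (class label stored last in each record) are assumptions for illustration:

    def splitDataSet(dataSet, axis, value):
        # Collect the records whose feature at index `axis` equals `value`,
        # removing that feature column from each returned record.
        retDataSet = []
        for featVec in dataSet:
            if featVec[axis] == value:
                reducedVec = featVec[:axis] + featVec[axis + 1:]
                retDataSet.append(reducedVec)
        return retDataSet

    # Example: splitDataSet([[1, 1, 'yes'], [0, 1, 'no']], 0, 1)  ->  [[1, 'yes']]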
				
			
			

General approach to decision trees

  • Collect: Any method.
  • Prepare: This tree-building algorithm works only on nominal values, so any continuous values will need to be quantized.
  • Analyze: Any method. You should visually inspect the tree after it is built.
  • Train: Construct a tree data structure.
  • Test: Calculate the error rate with the learned tree.
  • Use: This can be used in any supervised learning task.

Consider the data in the table: it contains five animals pulled from the sea, and for each one records whether it can survive without coming to the surface and whether it has flippers. We would like to classify these animals into two classes: fish and not fish. Now we want to decide whether we should split the data based on the first feature or the second feature. To answer this question, we need some quantitative way of determining how to split the data; we’ll discuss that next.
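Before getting to the math, here is a rough illustration of how such a dataset might be represented in Python: a list of records, with the class label stored last in each record. The function name createDataSet and the specific rows below are assumptions consistent with the description above, not data taken from the article.

    def createDataSet():
        # Columns: can survive without surfacing, has flippers, is a fish?
        dataSet = [[1, 1, 'yes'],
                   [1, 1, 'yes'],
                   [1, 0, 'no'],
                   [0, 1, 'no'],
                   [0, 1, 'no']]
        labels = ['no surfacing', 'flippers']   # feature names, in column order
        return dataSet, labels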

The change in information before and after a split is known as the information gain. Once you know how to calculate it, you can try splitting your data on every feature and see which split gives you the highest information gain; that split is your best option. Before you can measure the best split and start splitting your data, though, you need a way to measure the information of a set. That measure is known as the Shannon entropy, or just entropy for short; the name comes from the father of information theory, Claude Shannon.
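
Concretely, if a set contains n classes and p(x_i) is the proportion of items belonging to class i, the Shannon entropy of the set is

    H = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)

The function below computes this quantity for a dataset in which each record stores its class label in the last column.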

            
                    """"
                    from math import log
                    def calcShannonEnt(dataSet):
                        numEntries = len(dataSet)
                        labelCounts = {}
                        for featVec in dataSet: 
                            currentLabel = featVec[-1] 
                            if currentLabel not in labelCounts.keys():
                                labelCounts[currentLabel] = 0 
                                labelCounts[currentLabel] += 1 
                                shannonEnt = 0.0
                            for key in labelCounts:
                                prob = float(labelCounts[key])/numEntries
                                shannonEnt -= prob * log(prob,2) 
                        return shannonEnt
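
As a quick sanity check, assuming the hypothetical createDataSet sketched earlier in this section, you could call the function like this; with two 'yes' labels and three 'no' labels the entropy works out to about 0.97 bits:

    myDat, labels = createDataSet()
    print(calcShannonEnt(myDat))   # ≈ 0.971

The higher the entropy, the more mixed the classes in the set; a set containing only one class has an entropy of 0.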