Decision Trees

Used to decide what category an element, represented by a vector of attribute values, belongs to.
The attribute values must be discrete
The categories are Boolean (which can be used to model non-Boolean)

Example - "Will the movie be a success?"

                         CountryOfOrigin
                       /        |       \
                   USA/   Europe|        \Other
            _________/          |         \
     BigStar                  Genre     Failure
      /  \                  /   |   \
  yes/    \no        comedy/ romance \scifi
    /      \              /     |     \
Success  Failure    Success   Failure  Failure

Decision tree learning
- Supervised learning of trees, based on training data
- The goal is two build a tree that is as shallow as possible, to make quickest classifications (e.g., the example above has depth two although three attributes are available).
- Best known is ID3
  - Builds tree top down, selecting attribute that provides the most information gain.
- Information gain
  - Information gain is the difference between the entropy before and after a decision.
  - Entropy = -pP * log2(pP) -pN * log2(pN)
    pP is the proportion of positive (training) examples
    pN is the proportion of negative (training) examples
  - Entropy is minimal (0) when all examples are positive or negative, maximal (1) when half are positive and half are negative.
  - The entropy of a set of sets is the weighted sum of the entropies of the sets.
  - Example:
```
Film CountryOfOrigin BigStar Genre Success/Failure
--------------------------------------------------
1    USA             yes     scifi    Success
2    USA             no      comedy   Failure
3    USA             yes     comedy   Success
4    Europe          no      comedy   Success
5    Europe          yes     scifi    Failure
6    Europe          yes     romance  Failure
7    Australia       yes     comedy   Failure
8    Brazil          no      scifi    Failure
9    Europe          yes     comedy   Success
10   USA             yes     comedy   Success
```
    The entropy is 1.0, because there are 5 successes and 5 failures.
    - Information gain for CountryOfOrigin:
      Entropy(USA) = -0.75 * log2(0.75) - 0.25 * log2(0.25) = 0.811
      Entropy(Europe) = 1.0 (2 of each)
      Entropy(Other) = 0.0 (2 Failures)
      Entropy(CountryOfOrigin) = 0.4 * 0.811 + 0.4 * 1.0 + 0.2 * 0.0 = 0.7244
      InformationGain(CountryOfOrigin) = 1.0 - 0.7244 = 0.2756
    - Information gain for BigStar: 0.038
    - Information gain for Genre: 0.162
    - Best decision is CountryOfOrigin
- To build the tree, repeatedly do leaves until no entropy is left
  - For USA films, information gain for BigStar:
    Entropy(yes) = 0.0 (all Success)
    Entropy(no) = 0.0 (all Failure)
    Entropy(USA:BigStar) = 0.0
    InformationGain(USA:BigStar) = 0.811 - 0.0 = 0.811
  - For USA films, information gain for Genre:
    Entropy(scifi) = 0.0 (all Success)
    Entropy(comedy) = -0.66 * log2(0.66) -0.33 * log2(0.33) = 0.92
    Entropy(romance) = NA
    Entropy(USA:Genre) = 0.25 * 0.0 + 0.75 * 0.92 + 0.00 * NA = 0.69
    InformationGain(USA:Genre) = 0.811 - 0.69 = 0.121
  - For USA films, best decision is BigStar, and no entropy is left
  - For Europe films, best decision is Genre, and no entropy is left
- Once the decision tree is built new data can be classified (quickly).

Decision Trees

Exam Style Questions