Decision trees are particularly well suited when the number of categories, c, is large.
Example: Consider the query "Identify the fruit" over a set of c = 7 categories {watermelon, apple, grape, lemon, grapefruit, banana, cherry}.
One possible decision tree based on simple queries is the following:
(Decision tree figure for the fruit example to be inserted.)
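Since the figure is not available here, the following is a minimal Python sketch of what such a tree of simple queries might look like. The particular queries (color, size, shape) and the function name classify_fruit are our assumptions for illustration, not the tree from the original notes:

<pre>
def classify_fruit(color, size, shape):
    """A hypothetical decision tree of simple queries over the c = 7 fruit classes.
    Each internal node asks one question; each leaf declares a category."""
    if color == "green":
        return "watermelon" if size == "big" else "grape"
    if color == "yellow":
        if shape == "long":
            return "banana"
        return "grapefruit" if size == "big" else "lemon"
    # remaining (red) fruits: distinguish by size
    return "apple" if size == "medium" else "cherry"

print(classify_fruit("yellow", "small", "round"))  # lemon
</pre>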
Three crucial questions to answer
To construct a decision tree for a given classification problem, we must answer the following three questions:
1) Which question should be asked at a given node - "Query selection"
2) When should we stop asking questions and declare a node to be a leaf - "When to stop splitting"
3) Once a node is declared a leaf, what category should be assigned to it - "Leaf classification"
We shall discuss questions 1 and 2 (question 3 being trivial).
We need to define the 'impurity' of a dataset such that $ impurity = 0 $ when all the training data belong to one class, and impurity is largest when the training data contain equal proportions of each class, i.e.

$ P(\omega _i) = \frac{1}{C} $ for all $ i $
Let $ I $ denote the impurity. Impurity can be defined in the following ways:
- "Entropy Impurity":
$ I = -\sum_{j}P(\omega _j)\log_2P(\omega _j) $
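As an illustration (not part of the original notes), here is a minimal Python sketch of the entropy impurity, assuming the node's training data are given as a list of class labels; the function name entropy_impurity is ours:

<pre>
import math
from collections import Counter

def entropy_impurity(labels):
    """Entropy impurity I = -sum_j P(w_j) * log2 P(w_j) over class labels."""
    n = len(labels)
    probs = [count / n for count in Counter(labels).values()]
    return sum(-p * math.log2(p) for p in probs)

# All data in one class: impurity is 0
print(entropy_impurity(["apple"] * 4))           # 0.0
# Equal proportions of two classes: impurity is maximal, log2(2) = 1
print(entropy_impurity(["apple", "lemon"] * 2))  # 1.0
</pre>

Note that with equal proportions over C classes the impurity reaches its maximum value of $ \log_2 C $, consistent with the two properties required above.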