Data Mining Discussion 5 b

  • How are decision trees used for induction? Why are decision tree classifiers popular?

To classify with a decision tree, we start with a tuple whose class label is unknown. The tuple's attribute values are tested at each non-leaf node, tracing a path from the root to a leaf, and that leaf holds the predicted class label. Decision tree classifiers are popular because building one does not require domain knowledge or parameter setting, which makes them well suited to exploratory knowledge discovery. They are also popular because they can handle multidimensional data and because the tree representation is easy to understand compared with many other models.

  • What do we understand by "decision tree induction"?

Decision tree induction is the learning of decision trees from class-labeled training tuples. A decision tree is a flowchart-like tree structure in which each non-leaf node denotes a test on an attribute, each branch represents an outcome of that test, and each leaf node holds a class label.
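
As a rough sketch of what learning a tree from labeled training tuples can look like, the Python below grows a tree top-down, choosing the attribute with the highest information gain at each node. It assumes categorical attributes only, and the function names (entropy, info_gain, induce) and the toy weather data are made up for the example; real algorithms such as ID3, C4.5, and CART add many refinements.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Expected information (in bits) needed to classify a tuple."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def info_gain(rows, labels, attr):
    """Reduction in entropy obtained by splitting on attr."""
    n = len(rows)
    remainder = 0.0
    for value in set(r[attr] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[attr] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


def induce(rows, labels, attrs):
    """Grow a tree: dicts are non-leaf nodes, strings are leaf class labels."""
    if len(set(labels)) == 1:          # all tuples share one class: leaf
        return labels[0]
    if not attrs:                      # nothing left to test: majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    remaining = [a for a in attrs if a != best]
    node = {"attribute": best, "branches": {}}
    for value in sorted(set(r[best] for r in rows)):
        pairs = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows = [r for r, _ in pairs]
        sub_labels = [l for _, l in pairs]
        node["branches"][value] = induce(sub_rows, sub_labels, remaining)
    return node


# Hypothetical labeled training set.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "play", "play", "stay", "stay"]

induced = induce(rows, labels, ["outlook", "windy"])
print(induced)
```

On this toy data the root tests outlook, and only the rainy branch needs a second test on windy.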

  • What is "tree pruning" and how does it work?

Tree pruning addresses the problem of overfitting the training data. Pruning methods typically use statistical measures to remove the least reliable branches. There are two common approaches. Prepruning halts construction of the tree early by deciding not to split a node any further, so that node becomes a leaf. Measures such as statistical significance, information gain, or the Gini index can be used to assess a candidate split; if the split falls below a chosen threshold, the node is not expanded. Postpruning, the more common approach, removes subtrees from a fully grown tree and replaces each one with a leaf labeled with the most frequent class in that subtree.
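
To make postpruning concrete, here is a minimal Python sketch of one simple variant, reduced-error pruning. That particular variant is an assumption chosen for illustration and is not named in the answer above. It works bottom-up and replaces a subtree with its majority-class leaf whenever doing so does not increase the error on a separate pruning set. The node layout (including the "majority" field) and the sample data are hypothetical.

```python
def classify(node, record):
    """Follow branches to a leaf; fall back to the node's majority class
    when the record has an attribute value the tree has no branch for."""
    while isinstance(node, dict):
        node = node["branches"].get(record[node["attribute"]], node["majority"])
    return node


def errors(node, rows, labels):
    """Number of misclassified tuples in the pruning set."""
    return sum(classify(node, r) != l for r, l in zip(rows, labels))


def prune(node, rows, labels):
    """Return a pruned copy of node, judged against the given pruning set."""
    if not isinstance(node, dict):
        return node                    # already a leaf
    attr = node["attribute"]
    new_branches = {}
    for value, child in node["branches"].items():
        pairs = [(r, l) for r, l in zip(rows, labels) if r[attr] == value]
        new_branches[value] = prune(
            child, [r for r, _ in pairs], [l for _, l in pairs]
        )
    pruned = {"attribute": attr, "majority": node["majority"], "branches": new_branches}
    leaf = node["majority"]
    # Replace the subtree with a leaf if the leaf does no worse.
    if errors(leaf, rows, labels) <= errors(pruned, rows, labels):
        return leaf
    return pruned


# Hypothetical fully grown tree; "majority" is the most frequent training
# class at each non-leaf node.
tree = {
    "attribute": "outlook",
    "majority": "play",
    "branches": {
        "sunny": "play",
        "rainy": {
            "attribute": "windy",
            "majority": "stay",
            "branches": {"no": "play", "yes": "stay"},
        },
    },
}

# Hypothetical pruning set, held out from training.
prune_rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
prune_labels = ["play", "play", "stay", "stay"]

pruned_tree = prune(tree, prune_rows, prune_labels)
print(pruned_tree)
```

In this example the windy subtree is collapsed into a single "stay" leaf because the leaf makes fewer mistakes on the pruning set, while the root split on outlook is kept. A subtree that no pruning tuples reach is also collapsed, since nothing in the pruning set supports keeping it.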

  • What is the Gini index and how is it calculated?

The Gini index is used in the CART algorithm to measure the impurity of a set of training tuples D. To calculate it, take the probability p_i that a tuple in D belongs to each class, square each probability, add the squares together, and subtract that sum from 1: Gini(D) = 1 - (p_1^2 + p_2^2 + ... + p_m^2), where m is the number of classes.
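
A small Python sketch of the calculation, estimating each class probability as the fraction of tuples with that label. The function name gini and the 9 "yes" / 5 "no" sample are illustrative assumptions.

```python
from collections import Counter


def gini(labels):
    """Gini index of a list of class labels: 1 minus the sum of squared class probabilities."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


# Hypothetical data set with 9 "yes" tuples and 5 "no" tuples.
sample = ["yes"] * 9 + ["no"] * 5
print(round(gini(sample), 3))  # 1 - (9/14)^2 - (5/14)^2 = 0.459
```

A set containing a single class has a Gini index of 0 (perfectly pure), and for two classes the maximum is 0.5, reached when the classes are evenly split.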