Data Mining Discussion 5 a

What is classification? What is a classifier?

Classification is a data analysis task in which a model, called a classifier, is built to predict categorical class labels, such as “popular” or “unpopular”, or “safe” or “risky”.

How does classification work?

Classification is a two-step process. In the first step (the learning step), a classification model is built that describes a predetermined set of data classes or concepts. In the second step (the classification step), the model is used to predict class labels for given data.
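The two steps can be illustrated with a minimal sketch. It assumes a nearest-centroid classifier, chosen only because it is short; the function names learn and classify and the loan data are hypothetical and are not taken from the discussion.

```python
from collections import defaultdict

def learn(training_tuples):
    """Learning step: build a model (one centroid per class) from labeled tuples."""
    sums = {}
    counts = defaultdict(int)
    for features, label in training_tuples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        for i, value in enumerate(features):
            sums[label][i] += value
        counts[label] += 1
    # The model maps each class label to the mean of its training tuples.
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}

def classify(model, features):
    """Classification step: predict the label whose centroid is nearest."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))

# Hypothetical loan data: (income in $1000s, years employed) with a risk label.
training = [((20, 1), "risky"), ((25, 2), "risky"), ((70, 8), "safe"), ((80, 10), "safe")]
model = learn(training)              # step 1: learning
prediction = classify(model, (75, 9))  # step 2: classification
print(prediction)
```

Here learn is called once on labeled data, and classify can then be applied to any number of new tuples whose labels are unknown.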

Compare supervised learning vs. unsupervised learning.

In supervised learning, the class label of each training tuple is provided, so the classifier learns by being told which class each tuple belongs to. In unsupervised learning, the class labels of the training tuples are not known, and the number of classes may not be known in advance either. Instead, clustering is used to find groups of similar tuples.
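As a rough illustration of the unsupervised case, the sketch below groups unlabeled values by similarity alone. It assumes a basic one-dimensional k-means with a fixed starting point; the function name kmeans_1d and the income values are hypothetical, and k-means is only one of many clustering methods.

```python
def kmeans_1d(values, k=2, iterations=10):
    """Group unlabeled numbers into k clusters (basic k-means sketch, k >= 2)."""
    # Deterministic start: spread the initial centers across the sorted values.
    ordered = sorted(values)
    centers = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            # Assign each value to the nearest current center.
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Recompute each center as its cluster mean; keep it if the cluster is empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return clusters

# Hypothetical incomes in $1000s. No class labels are given.
incomes = [20, 25, 70, 80, 22, 75]
groups = kmeans_1d(incomes, k=2)
print(groups)
```

Unlike the supervised case, no labels such as "safe" or "risky" appear anywhere in the input. The groups are found from the values themselves, and it is left to the analyst to interpret what each group means.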

What is a training set and how is it used?

A training set is made up of database tuples and their associated class labels. The classification algorithm analyzes these training tuples during the learning step to “learn from” them and build the classifier.

What is a test set and how is it used?

A test set consists of test tuples and their associated class labels. The test tuples are independent of the training tuples, meaning they were not used to build the classifier. The test set is used to estimate the accuracy of the classifier, which is the percentage of test set tuples that it classifies correctly.
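The accuracy calculation can be sketched as follows. The classifier here is a hypothetical one-rule stand-in (income_rule) and the test tuples are made up for illustration; any function that maps a tuple to a label could be used in its place.

```python
def accuracy(classifier, test_set):
    """Return the percentage of test tuples whose predicted label matches the true label."""
    correct = sum(1 for features, label in test_set if classifier(features) == label)
    return 100.0 * correct / len(test_set)

def income_rule(features):
    """Hypothetical classifier: call an applicant safe if income is at least $50k."""
    return "safe" if features[0] >= 50 else "risky"

# Hypothetical test set: (income in $1000s, years employed) with the true label.
test_set = [
    ((30, 3), "risky"),
    ((65, 7), "safe"),
    ((55, 1), "risky"),   # the rule predicts "safe" here, so this one is misclassified
    ((90, 12), "safe"),
]
result = accuracy(income_rule, test_set)
print(result)  # 75.0
```

Three of the four test tuples are classified correctly, so the accuracy is 75%. Measuring this on the training tuples instead would tend to give an overly optimistic estimate, which is why the test tuples are kept separate.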
