Thursday, April 16, 2015

Data Mining by Classification and Prediction

A bank loans officer needs analysis of her data in order to learn which loan applicants are "safe" and which are "risky" for the bank. A marketing manager at a company that needs data analysis to help guess whether a customer with a given profile will buy a new computer. A medical researcher wants to analyze breast cancer data in order to predict which one of three specific treatments a patient should receive. In each of these examples, the data analysis task is classification, where a model or classifier is constructed to predict categorical labels, such as "safe" or "risky" for the loan application data; "yes" or "no" for the marketing data; or "treatment A," "treatment B," or "treatment C" for the medical data. These categories can be represented by discrete values, where the ordering among values has no meaning. For example, the values 1, 2, and 3 may be used to represent treatments A, B, and C, where there is no ordering implied among this group of treatment regimes.


Suppose that the marketing manager would like to predict how much a given customer will spend during a sale. This data analysis task is an example of numeric prediction, where the model constructed predicts a continuous-valued function, or ordered value, as opposed to a categorical label. This model is a predictor.


Classification- a two step process

Model construction: describing a set of predetermined classes


·         Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.
·        The set of tuples used for model construction is a training set.
·         The model is represented as classification rules, decision tress, or mathematical formulae.

Model Prediction: for classifying future or unknown objects

  • Estimate Accuracy of the model

    ·         The unknown label of test sample is compared with the classified result from the model.
    ·         Accuracy rate is the percentage of test set samples that are correctly classified by the model.
    ·         Test set is independent of training set, otherwise over-fitting will occur.

    If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known.

    Comparison of Data Mining Algorithms Based on Classification and Prediction

    Classification and Prediction can be applied as explained above to data mining algorithms like Decision Tree Induction. And its accuracy can be compared to a classification algorithm like Naive Bayesian Classification and an analysis between the two may be chalked out.


No comments:

Post a Comment