Classification in Data Mining

Arbaj Khan
6 min read · Nov 11, 2021

What is Data Mining?

Data mining, in general terms, means mining or digging deep into data in its various forms to discover patterns and to gain knowledge from those patterns. In the process of data mining, large data sets are first sorted, then patterns are identified and relationships are established to perform data analysis and solve problems.

It is one of the pivotal steps in data analytics: without it, you cannot complete a data analysis process.

What is Classification in Data Mining?

Classification is a data mining function that assigns items in a collection to target categories or classes. The goal of classification is to accurately predict the target class for each case in the data. For example, a classification model could be used to identify loan applicants as low, medium, or high credit risks.

It is a data analysis task, i.e. the process of finding a model that describes and distinguishes data classes and concepts. Classification is the problem of identifying which of a set of categories a new observation belongs to, on the basis of a training set of data containing observations whose category membership is known.

Types of Classification Techniques in Data Mining

Let’s first look at the types of classification techniques available. Primarily, we can divide classification algorithms into two categories:

1. Generative

2. Discriminative

Here’s a brief explanation of these two categories:

Discriminative: It is a very basic kind of classifier that determines just one class for each row of data. It models the decision boundary based only on the observed data, so it depends heavily on the quality of the data rather than on its underlying distributions.

Example: Logistic Regression
Acceptance of a student at a university, where both test scores and grades need to be considered: the model is trained on records of past students whose acceptance outcomes are known, and it learns a boundary that separates accepted from rejected applicants.

Generative: It models the distribution of the individual classes and tries to learn the model that generates the data behind the scenes, by estimating the assumptions and distributions of the model. It is then used to predict unseen data.

Example: Naive Bayes Classifier
Detecting spam emails by looking at previous data. Suppose there are 100 emails, divided in the ratio 1:3, i.e. Class A: 25% (spam emails) and Class B: 75% (non-spam emails). Now a user wants to check whether an email containing the word “cheap” should be flagged as spam.
In Class A (i.e. the 25% that are spam), 20 out of 25 emails contain the word “cheap”; the rest do not.
In Class B (i.e. the 75% that are not spam), only 5 out of 75 emails contain the word “cheap”; the other 70 do not.
So, if an email contains the word “cheap”, what is the probability of it being spam? (= 80%)
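The 80% figure can be checked directly with Bayes’ theorem. A quick sketch, using the hypothetical counts from the example above:

```python
# Counts from the example: 25 spam / 75 non-spam emails;
# 20 of 25 spam emails contain "cheap", 5 of 75 non-spam emails do.
p_spam = 25 / 100             # P(Spam)
p_ham = 75 / 100              # P(Not Spam)
p_cheap_given_spam = 20 / 25  # P("cheap" | Spam)
p_cheap_given_ham = 5 / 75    # P("cheap" | Not Spam)

# Bayes' theorem: P(Spam | "cheap") = P("cheap" | Spam) P(Spam) / P("cheap")
numerator = p_cheap_given_spam * p_spam
evidence = numerator + p_cheap_given_ham * p_ham
p_spam_given_cheap = numerator / evidence
print(round(p_spam_given_cheap, 2))  # 0.8
```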

Classifiers in Machine Learning:

Classification is a highly popular data mining task, and machine learning offers many classifiers:

1. Logistic regression

2. Linear regression

3. Decision trees

4. Random forest

5. Naive Bayes

6. Support Vector Machines

7. K-nearest neighbours

1. Logistic Regression

Logistic regression allows you to model the probability of a particular event or class. It uses the logistic function to model a binary dependent variable, giving you the probability of a single trial. Logistic regression was built for classification, and it helps you understand the impact of multiple independent variables on a single outcome variable.
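As a minimal sketch of how this might look in practice with scikit-learn, here is logistic regression on made-up data (hours studied vs. pass/fail; the numbers are purely illustrative):

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied (independent variable) vs. pass/fail.
hours = [[1], [2], [3], [4], [5], [6], [7], [8]]
passed = [0, 0, 0, 0, 1, 1, 1, 1]

model = LogisticRegression().fit(hours, passed)

# predict_proba returns [P(fail), P(pass)] for each input.
prob_pass = model.predict_proba([[7.0]])[0][1]
print(prob_pass > 0.5)  # True: a student who studied 7 hours likely passes
```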

2. Linear Regression

Linear regression is based on supervised learning and performs regression. It models a prediction value according to independent variables; primarily, we use it to find the relationship between the variables and the value being forecast.

It predicts a dependent variable value according to a specific independent variable. In particular, it finds the linear relationship between the independent variable and the dependent variable. It is excellent for data you can separate linearly and is highly efficient. However, it is prone to overfitting and noise, and it relies on the assumption that the independent and dependent variables are related linearly.
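A quick sketch with scikit-learn, using hypothetical data that follows the line y = 2x + 1 exactly:

```python
from sklearn.linear_model import LinearRegression

# Hypothetical data generated by y = 2x + 1.
X = [[1], [2], [3], [4]]
y = [3, 5, 7, 9]

model = LinearRegression().fit(X, y)
# The model recovers the linear relationship and extrapolates to new inputs.
print(round(model.predict([[5]])[0], 2))  # 11.0, since y = 2(5) + 1
```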

3. Decision Trees

The decision tree is one of the most robust classification techniques in data mining. It is a flowchart-like tree structure in which every internal node represents a test on a condition, each branch stands for an outcome of that test (true or false), and every leaf node holds a class label.

You can split the data into different classes according to the decision tree, and the tree predicts which class a new data point belongs to by following its branches from root to leaf.
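A minimal sketch with scikit-learn, on a small hypothetical weather dataset (the features and labels are invented for illustration):

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: [temperature, humidity] -> play outside? (1 = yes)
X = [[30, 85], [27, 90], [21, 70], [20, 95], [18, 65], [25, 60]]
y = [0, 0, 1, 0, 1, 1]

# The tree tests conditions at internal nodes (e.g. a humidity threshold)
# and assigns the class label of the reached leaf to a new point.
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(tree.predict([[22, 68]])[0])  # 1: low-humidity points were class 1
```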

4. Random forest

The random forest classifier fits multiple decision trees on different sub-samples of the dataset and averages their predictions to enhance accuracy and manage overfitting. Each sub-sample is the same size as the input sample; however, the samples are drawn with replacement.

A particular advantage of the random forest classifier is that it reduces overfitting, and it is usually significantly more accurate than a single decision tree. However, it is a slower algorithm for real-time prediction and is more complicated, hence more challenging to tune and interpret effectively.
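The bootstrap-and-average idea can be sketched in a few lines with scikit-learn (the one-feature dataset is hypothetical):

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical one-feature data: small values are class 0, large are class 1.
X = [[i] for i in range(10)]
y = [0] * 5 + [1] * 5

# Each of the 100 trees is fit on a bootstrap sample (drawn with replacement,
# same size as the input); predictions are averaged across the trees.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(forest.predict([[0]])[0], forest.predict([[9]])[0])  # 0 1
```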

5. Naive Bayes

The Naive Bayes algorithm assumes that every feature is independent of the others and that all features contribute equally to the outcome.

It has many applications in today’s world, such as spam filtering and document classification. Naive Bayes requires only a small quantity of training data to estimate the necessary parameters, and a Naive Bayes classifier is significantly faster than more sophisticated and advanced classifiers.
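A small sketch of Naive Bayes spam filtering with scikit-learn (the emails and labels are invented; real filters train on far more data):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training emails: 1 = spam, 0 = not spam.
emails = [
    "cheap pills buy now",
    "win cheap prizes now",
    "meeting moved to noon",
    "project report attached",
]
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(emails)  # word-count features

nb = MultinomialNB().fit(counts, labels)
new_counts = vectorizer.transform(["cheap offer just for you"])
print(nb.predict(new_counts)[0])  # 1: "cheap" appears only in spam training mail
```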

6. Support Vector Machine

The support vector machine algorithm, also known as SVM, represents the training data as points in space, separated into categories by as wide a gap as possible. New data points are then mapped into the same space, and their categories are predicted according to which side of the gap they fall on. This algorithm is especially useful in high-dimensional spaces and is quite memory efficient because its decision function uses only a subset of the training points (the support vectors).

This algorithm lags in providing probability estimates; these must be calculated through an expensive five-fold cross-validation.
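A minimal sketch with scikit-learn, on two hypothetical, well-separated clusters:

```python
from sklearn.svm import SVC

# Two hypothetical clusters in 2-D space.
X = [[0, 0], [1, 1], [1, 0], [8, 8], [9, 9], [8, 9]]
y = [0, 0, 0, 1, 1, 1]

svm = SVC(kernel="linear").fit(X, y)
print(svm.predict([[1, 2]])[0], svm.predict([[8, 7]])[0])  # 0 1

# Only the support vectors (the points nearest the gap) define the boundary.
print(len(svm.support_vectors_), "of", len(X), "points are support vectors")
```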

7. K-Nearest Neighbours

The k-nearest neighbours algorithm has non-linear prediction boundaries, as it is a non-linear classifier. It predicts the class of a new test data point by looking at the classes of its k nearest neighbours, which you would select using the Euclidean distance. Among those k neighbours, you count the data points in each category and assign the new data point to the category with the most neighbours.
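A quick sketch with scikit-learn, using two small hypothetical clusters:

```python
from sklearn.neighbors import KNeighborsClassifier

# Two small hypothetical clusters.
X = [[1, 1], [1, 2], [2, 1], [6, 6], [6, 7], [7, 6]]
y = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
# The 3 nearest neighbours of (2, 2) by Euclidean distance are all class 0,
# so the majority vote assigns the new point to class 0.
print(knn.predict([[2, 2]])[0])  # 0
```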

Real-Life Examples:

Market Basket Analysis:
It is a modeling technique associated with frequent transactions in which certain combinations of items are bought together.
Example: Amazon and many other retailers use this technique. While you view a product, suggestions appear for items that other customers have bought along with it in the past.

Weather Forecasting:
Changing patterns in weather conditions need to be observed based on parameters such as temperature, humidity, and wind direction. This observation also requires the use of previous records in order to predict the weather accurately.

Advantages:

Mining-based methods are cost-effective and efficient

Helps in identifying criminal suspects

Helps in predicting the risk of diseases

Helps banks and financial institutions identify defaulters so that they can decide whose cards, loans, etc. to approve

Disadvantages:

Privacy: When data is collected, there are chances that a company may give some information about its customers to other vendors or use that information for its own profit.
Accuracy Problem: An accurate model must be selected in order to get the best accuracy and results.

Applications of Classification of Data Mining Systems

Marketers use classification algorithms for audience segmentation. They classify their target audiences into different categories by using these algorithms to devise more accurate and effective marketing strategies.

Meteorologists use these algorithms to predict weather conditions according to various parameters such as humidity, temperature, etc.

Public health experts use classifiers to predict the risk of various diseases and to create strategies that mitigate their spread.

Financial institutions use classification algorithms to find defaulters and determine whose cards and loans they should approve. These algorithms also help them detect fraud.

Conclusion

Classification is among the most popular sections of data mining. As you can see, it has a ton of applications in our daily lives.

In the future, data mining will include more complex data types. In addition, for any model that has been designed, further refinement is possible by examining other variables and their relationships. Research in data mining will result in new methods to determine the most interesting characteristics in the data.
