Classification Problem in Machine Learning
Classification Problem in Machine Learning, We’ll discuss classification in this post when the response variable contains two or more values.
And, in fact, this is a very prevalent problem, perhaps even more so than regression. There’s a lot of emphasis on classification in machine learning.
For example, we’re attempting to forecast if an email is good or spam, or whether a patient will survive or die from a specific ailment.
As a result, it’s a very common and critical problem. So, we’re going to spend some time talking about classification.
The first thing we do is demonstrate the appearance of categorical variables.
Sure, you’re aware of this example, eye color. Brown, blue, and green are the three colors that appear.
That’s a set of discrete values. There is no such thing as a menu. It’s only three different numbers.
We’ve already discussed email: Spam and ham are also present. As a result, spam filters must classify messages as spam or ham.
So what is a classifier?
So, just as in regression, you’ve got a feature vector X. You now have one of these qualitative response variables, similar to the ones listed above.
The response accepts values from set C, which is a discrete set of values.
The classification task is to create a function that takes X as an input and outputs one of the set C’s items. As a result, in mathematic terminology, C of X provides your values in the set C.
In the spam/ham conundrum, for example, C of X would either return spam or ham.
Although classification problems are often phrased in this way, we’re usually more interested in evaluating the chances that X belongs to each of the categories C.
For example, having an estimate of the probability that an insurance claim is fraudulent is more helpful to an insurance firm than having a classification of fraudulent or not.
You can suppose that in one case, you might have a 0.9 chance that the claim is false. It could be 0.90 in one case and 0.98 in another. Now, in both of these circumstances, those might be enough to raise the red flag that this is a fake insurance claim.
However, if you’re going to investigate the claim and spend some time doing so, you’ll probably go for the 0.98 first, rather than the 0.9. As a result, calculating the probability is essential.
So, here’s some information. There are two variables:-
Here’s a scatter plot of balance versus income on the left. So they are two of the factors to consider.
We may also code the response variable into the plot as a color, much like we can with classification issues. As a result, we have the blue and brown points here.
Those that defaulted will be represented by the brown points. The blue points represent those who did not.
This is, of course, a fake data set. You don’t usually expect to see that many defaulters. However, we’ll return to the topic of balance in classification problems later.
So it appears that balance is the most crucial variable in this graph. Keep in mind that there’s a major difference between the blues and the browns, defaulters, and non-defaulters, OK?
When it comes to income, though, there doesn’t appear to be much of a distinction, OK?
We actually present box graphs of these two variables on the right. And we can see that there’s a large difference in the distributions-balance default or not, although there’s no difference in the income.
There doesn’t appear to be much of a difference. There is a black line that represents the median.
People who have defaulted on their payments. The quartiles are at the top of the box. That’s the 75th quartile or the 75th percentile.
The bottom of the box is on the 25th. So, overall, it’s a good summary of the income distribution. Yes, for individuals who fall into this group.
Data points that fall outside the hinges are referred to as outliers. It’s a pretty handy data display, by the way.
When you initially acquire some data to evaluate, one of the first things you should do is make scatter plots and box plots.
John Tukey is credited for inventing the box plot.
It appears that we may be able to do so. So, for the default classification task, we’re going to code the response 0 if there’s no default and 1 if there is, right?
It’s a bit arbitrary, but 0 and 1 appear to suffice. Then we could just do a linear regression of Y on X, with X being the two variables in this example, and categorize it as yes if Y is greater than 0.5, 50%, right?
0.5 is the midpoint between 0 and 1. Linear regression, which is similar to linear discriminant analysis, works well for binary outcomes. So, in the case of a two-class classification problem, such as
This, on the other hand, does a fantastic job. There’s even some theoretical support for it. Remember that in the population, regression is defined as estimating the conditional mean of Y given X.
In our coding of 0 and 1, the conditional mean of the 0 1 variable given X is just the chance that Y is 1 given X, as determined by simple probability theory.
As a result, you may believe that regression is ideal for this purpose. However, we’ll see that linear regression might give probabilities that are less than 0 or even more than 1.
As a result, we’ll show you how to use logistic regression, which is more suited. And here’s a tiny illustration to prove it.
We’ve got our balance variable here. We’ve now planned against the balance. The browns are the 0’s at the bottom, displayed as small dashes.
At the bottom, the small brown spikes are all clumped together. The 1s are plotted at the top of the page. And we can see the division. The brown 0’s are to the left of the center of gravity.
And the ones on the right are the ones. The blue line is a line of linear regression. It drops to a negative number.
So that’s not a very good probability estimate. It also doesn’t seem to go high enough on the right-hand side, where there appears to be a distinct preponderance of yes.
The foot of logistic regression may be seen in the right-hand graphic. And it appears to be performing admirably in this instance. It never goes above 0 and 1, and it always seems to go up to where it’s supposed to go up.
In terms of predicting probabilities, it appears that linear regression isn’t doing that well.
So, what happens if we have a variable with three categories?. So here’s a variable that takes on three levels and measures the patient’s state in an emergency room.
It’s a stroke if it’s 1, a drug overdose if it’s 2, and an epileptic seizure if it’s 3.
So, if we code those as 1, 2, and 3, which is an arbitrary but natural choice, this coding may indicate an ordering when there isn’t one.
And it could indicate that the one-unit difference between a stroke and a drug overdose is the same as the one-unit difference between a drug overdose and an epileptic episode.
When you have more than two categories, it seems risky to assign numbers to them arbitrarily, especially if you’re planning to utilize it in linear regression.
And it turns out that linear regression isn’t the best option. We’ll use multiclass logistic regression or discriminant analysis to solve problems like these.
Ok too much in this post, will discuss more in an upcoming post.