Predict potential customers in R
A marketing analyst might be interested in finding variables that can be used to determine whether a potential customer is a high earner.
We can build such a model using data from the 1994 United States Census, which includes records for 32,561 adults along with a binary variable indicating whether each person earns more than $50,000 per year. That binary variable is our response.
library(mdsr)
library(dplyr)

census <- read.csv(
  "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
  header = FALSE
)
names(census) <- c(
  "age", "workclass", "fnlwgt", "education", "education.num", "marital.status",
  "occupation", "relationship", "race", "sex", "capital.gain", "capital.loss",
  "hours.per.week", "native.country", "income"
)
glimpse(census)
Rows: 32,561
Columns: 15
$ age            <int> 39, 50, 38, 53, 28, 37, 49, 52, 31, 42, 37, 30, 23, 32, 40, 34,~
$ workclass      <chr> " State-gov", " Self-emp-not-inc", " Private", " Private", " Pr~
$ fnlwgt         <int> 77516, 83311, 215646, 234721, 338409, 284582, 160187, 209642, 4~
$ education      <chr> " Bachelors", " Bachelors", " HS-grad", " 11th", " Bachelors", ~
$ education.num  <int> 13, 13, 9, 7, 13, 14, 5, 9, 14, 13, 10, 13, 13, 12, 11, 4, 9, 9~
$ marital.status <chr> " Never-married", " Married-civ-spouse", " Divorced", " Married~
$ occupation     <chr> " Adm-clerical", " Exec-managerial", " Handlers-cleaners", " Ha~
$ relationship   <chr> " Not-in-family", " Husband", " Not-in-family", " Husband", " W~
$ race           <chr> " White", " White", " White", " Black", " Black", " White", " B~
$ sex            <chr> " Male", " Male", " Male", " Male", " Female", " Female", " Fem~
$ capital.gain   <int> 2174, 0, 0, 0, 0, 0, 0, 0, 14084, 5178, 0, 0, 0, 0, 0, 0, 0, 0,~
$ capital.loss   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
$ hours.per.week <int> 40, 13, 40, 40, 40, 40, 16, 45, 50, 40, 80, 40, 30, 50, 40, 45,~
$ native.country <chr> " United-States", " United-States", " United-States", " United-~
$ income         <chr> " <=50K", " <=50K", " <=50K", " <=50K", " <=50K", " <=50K", " <~
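Notice in the output above that the character values carry a leading space (e.g. " State-gov"), an artifact of the comma-space separators in the raw file. If you prefer clean values, read.csv's strip.white argument removes that whitespace at import time; a sketch (same URL as above):

```r
# Re-read the file, trimming the whitespace that follows each comma
# in the raw data so values like " State-gov" become "State-gov".
census <- read.csv(
  "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
  header = FALSE, strip.white = TRUE
)
```

We leave the data as originally read in what follows, so the printed output matches; the whitespace does not affect the models below.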
We begin by randomly partitioning the rows of the data set into two pieces, for reasons we will explain later. The training data set will consist of a random sample of 80% of the rows, and the testing data set will contain the remaining 20%.
set.seed(364)
n <- nrow(census)
test_idx <- sample.int(n, size = round(0.2 * n))
train <- census[-test_idx, ]
nrow(train)
[1] 26049
test <- census[test_idx, ]
nrow(test)
[1] 6512
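Since sample.int draws indices without replacement, the two pieces are guaranteed to partition the rows, with no observation appearing in both. The same recipe on a toy data frame makes this easy to check (the names toy and idx are our own, used only for this illustration):

```r
# The 80/20 splitting recipe on a tiny example data frame,
# to verify that the two pieces partition the rows exactly.
set.seed(364)
toy <- data.frame(id = 1:10)
idx <- sample.int(nrow(toy), size = round(0.2 * nrow(toy)))
toy_train <- toy[-idx, , drop = FALSE]
toy_test  <- toy[idx, , drop = FALSE]

# Every row is in exactly one piece, and none is in both.
stopifnot(nrow(toy_train) + nrow(toy_test) == nrow(toy))
stopifnot(length(intersect(toy_train$id, toy_test$id)) == 0)
```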
Note that only about 24% of the people in the sample earn more than $50,000. As a result, the null model, which simply predicts that everyone earns less than $50,000, is already about 76% accurate.
table(train$income)

 <=50K   >50K 
 19843   6206 

19843 + 6206
[1] 26049
19843 / 26049
[1] 0.7617567
Let’s consider the best possible split of income using only the single variable capital.gain, which records the amount each person paid in capital gains taxes.
Our tree indicates that the best split occurs at $5,119 in capital gains.
library(rpart)
rpart(income ~ capital.gain, data = train)
n= 26049 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 26049 6206 <=50K (0.76175669 0.23824331)  
  2) capital.gain< 5119 24805 5030 <=50K (0.79721830 0.20278170) *
  3) capital.gain>=5119 1244 68 >50K (0.05466238 0.94533762) *
Almost 80% of individuals who paid less than $5,119 in capital gains taxes earned less than $50,000, while roughly 95% of those who paid more than $5,119 earned more than $50,000.
Thus, this single split divides the records into two subsets that are each substantially more pure than the original.
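Tabulating a tree's predicted classes against the actual classes is exactly how such accuracy figures are computed, via predict() with type = "class". A minimal sketch on a small synthetic data set (the variable name gain and the class counts are our own invention, chosen to mimic the census pattern):

```r
library(rpart)

# Synthetic analogue of the census pattern: one numeric predictor,
# binary outcome, with high values of `gain` mostly earning >50K.
d <- data.frame(
  gain = c(rep(0, 80), rep(10000, 20)),
  income = factor(c(rep("<=50K", 76), rep(">50K", 4),    # low-gain group
                    rep("<=50K", 2),  rep(">50K", 18)))  # high-gain group
)

mod <- rpart(income ~ gain, data = d, method = "class")
income_hat <- predict(mod, newdata = d, type = "class")

# Confusion matrix: rows are the truth, columns the predictions
print(table(d$income, income_hat))
mean(d$income == income_hat)   # fraction classified correctly: 0.94 here
```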
split <- 5119
train <- train %>%
  mutate(hi_cap_gains = capital.gain >= split)

library(ggplot2)
ggplot(data = train, aes(x = capital.gain, y = income)) +
  geom_count(
    aes(color = hi_cap_gains),
    position = position_jitter(width = 0, height = 0.1),
    alpha = 0.5
  ) +
  scale_x_log10(labels = scales::dollar)
This decision tree uses a single variable, capital.gain, to divide the records into those who paid at least $5,119 in capital gains and those who did not.
For the former group, which accounts for about 95.2% of all observations, we predict earnings below $50,000 and are correct about 79.7% of the time.
For the latter group, predicting earnings above $50,000 is correct about 94.5% of the time.
As a result, our overall accuracy rises to about 80.4% (0.952 × 0.797 + 0.048 × 0.945), substantially better than the null model’s 76%.
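These accuracy figures can be checked directly from the counts printed in the rpart output above: 24,805 and 1,244 observations land in the two leaves, with 5,030 and 68 of them misclassified, respectively. A quick arithmetic check:

```r
# Counts copied from the printed tree above
n_total      <- 26049
correct_low  <- 24805 - 5030  # leaf 2: capital.gain < 5119, predicted <=50K
correct_high <- 1244 - 68     # leaf 3: capital.gain >= 5119, predicted >50K

(correct_low + correct_high) / n_total  # overall accuracy, about 0.804
```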
How did the algorithm determine that $5,119 was the best threshold?
Please put your answers in the comment section.
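As a hint: by default, rpart scans every candidate threshold of every predictor and keeps the one that most reduces the Gini impurity of the resulting subsets. A brute-force sketch of that idea on toy data (our own illustration, not rpart’s actual implementation):

```r
# Gini impurity of a vector of class labels: 1 - sum of squared proportions
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Size-weighted average impurity of the two subsets created by threshold t
split_impurity <- function(x, y, t) {
  left  <- y[x < t]
  right <- y[x >= t]
  (length(left) * gini(left) + length(right) * gini(right)) / length(y)
}

# Toy data: low capital gains mostly <=50K, high mostly >50K
x <- c(0, 0, 100, 2000, 6000, 7000, 8000, 10000)
y <- c("<=50K", "<=50K", "<=50K", "<=50K", ">50K", ">50K", "<=50K", ">50K")

# Candidate thresholds: midpoints between consecutive unique values
xs    <- sort(unique(x))
cands <- (head(xs, -1) + tail(xs, -1)) / 2
best  <- cands[which.min(sapply(cands, split_impurity, x = x, y = y))]
best  # the midpoint giving the purest subsets: 4000 for this toy data
```

rpart likewise reports the midpoint between the two observed values that straddle the best cut, which is why the census tree prints 5119 rather than an observed capital gain.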