Predict potential customers in R
A marketing analyst might be interested in finding variables that can be used to determine whether a potential customer is a high earner.
We can build such a model using data from the 1994 United States Census, which includes records for 32,561 adults along with a binary variable indicating whether each person earns more than $50,000 per year. That binary variable is our response.
library(mdsr)
library(dplyr)

census <- read.csv(
  "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
  header = FALSE
)
names(census) <- c(
  "age", "workclass", "fnlwgt", "education", "education.num", "marital.status",
  "occupation", "relationship", "race", "sex", "capital.gain", "capital.loss",
  "hours.per.week", "native.country", "income"
)
glimpse(census)
Rows: 32,561
Columns: 15
$ age            <int> 39, 50, 38, 53, 28, 37, 49, 52, 31, 42, 37, 30, 23, 32, 40, 34,~
$ workclass      <chr> " State-gov", " Self-emp-not-inc", " Private", " Private", " Pr~
$ fnlwgt         <int> 77516, 83311, 215646, 234721, 338409, 284582, 160187, 209642, 4~
$ education      <chr> " Bachelors", " Bachelors", " HS-grad", " 11th", " Bachelors", ~
$ education.num  <int> 13, 13, 9, 7, 13, 14, 5, 9, 14, 13, 10, 13, 13, 12, 11, 4, 9, 9~
$ marital.status <chr> " Never-married", " Married-civ-spouse", " Divorced", " Married~
$ occupation     <chr> " Adm-clerical", " Exec-managerial", " Handlers-cleaners", " Ha~
$ relationship   <chr> " Not-in-family", " Husband", " Not-in-family", " Husband", " W~
$ race           <chr> " White", " White", " White", " Black", " Black", " White", " B~
$ sex            <chr> " Male", " Male", " Male", " Male", " Female", " Female", " Fem~
$ capital.gain   <int> 2174, 0, 0, 0, 0, 0, 0, 0, 14084, 5178, 0, 0, 0, 0, 0, 0, 0, 0,~
$ capital.loss   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
$ hours.per.week <int> 40, 13, 40, 40, 40, 40, 16, 45, 50, 40, 80, 40, 30, 50, 40, 45,~
$ native.country <chr> " United-States", " United-States", " United-States", " United-~
$ income         <chr> " <=50K", " <=50K", " <=50K", " <=50K", " <=50K", " <=50K", " <~
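Notice in the output above that the character values carry a leading space (e.g. " State-gov"), an artifact of the comma-space separators in the raw file. If you prefer clean values, read.csv's strip.white argument removes that whitespace at import time; a sketch (same URL as above):

```r
# Re-read the file, trimming the whitespace that follows each comma
# in the raw data so values like " State-gov" become "State-gov".
census <- read.csv(
  "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
  header = FALSE, strip.white = TRUE
)
```

We leave the data as originally read in what follows, so the printed output matches; the whitespace does not affect the models below.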
We begin by randomly partitioning the rows of the data set into two pieces, for reasons we will explain later. The training data set will consist of a random sample of 80% of the rows, and the testing data set will contain the remaining 20%.
set.seed(364)
n <- nrow(census)
test_idx <- sample.int(n, size = round(0.2 * n))
train <- census[-test_idx, ]
nrow(train)
[1] 26049
test <- census[test_idx, ]
nrow(test)
[1] 6512
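Since sample.int draws indices without replacement, the two pieces are guaranteed to partition the rows, with no observation appearing in both. The same recipe on a toy data frame makes this easy to check (the names toy and idx are our own, used only for this illustration):

```r
# The 80/20 splitting recipe on a tiny example data frame,
# to verify that the two pieces partition the rows exactly.
set.seed(364)
toy <- data.frame(id = 1:10)
idx <- sample.int(nrow(toy), size = round(0.2 * nrow(toy)))
toy_train <- toy[-idx, , drop = FALSE]
toy_test  <- toy[idx, , drop = FALSE]

# Every row is in exactly one piece, and none is in both.
stopifnot(nrow(toy_train) + nrow(toy_test) == nrow(toy))
stopifnot(length(intersect(toy_train$id, toy_test$id)) == 0)
```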
Note that only about 24% of the people in the sample earn more than $50,000. As a result, the null model, which simply predicts that everyone earns less than $50,000, is already about 76% accurate.
table(train$income)

 <=50K   >50K 
 19843   6206 

19843 + 6206
[1] 26049
19843 / 26049
[1] 0.7617567
Let’s consider the best possible split of income using only the single variable capital.gain, which records the amount each person paid in capital gains taxes.
Our tree indicates that the best split occurs at $5,119 in capital gains.
library(rpart)
rpart(income ~ capital.gain, data = train)
n= 26049 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 26049 6206 <=50K (0.76175669 0.23824331)  
  2) capital.gain< 5119 24805 5030 <=50K (0.79721830 0.20278170) *
  3) capital.gain>=5119 1244 68 >50K (0.05466238 0.94533762) *
Almost 80% of individuals who paid less than $5,119 in capital gains taxes earned less than $50,000, while roughly 95% of those who paid more than $5,119 earned more than $50,000.
Thus, this single split divides the records into two subsets that are each substantially more pure than the original.
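Tabulating a tree's predicted classes against the actual classes is exactly how such accuracy figures are computed, via predict() with type = "class". A minimal sketch on a small synthetic data set (the variable name gain and the class counts are our own invention, chosen to mimic the census pattern):

```r
library(rpart)

# Synthetic analogue of the census pattern: one numeric predictor,
# binary outcome, with high values of `gain` mostly earning >50K.
d <- data.frame(
  gain = c(rep(0, 80), rep(10000, 20)),
  income = factor(c(rep("<=50K", 76), rep(">50K", 4),    # low-gain group
                    rep("<=50K", 2),  rep(">50K", 18)))  # high-gain group
)

mod <- rpart(income ~ gain, data = d, method = "class")
income_hat <- predict(mod, newdata = d, type = "class")

# Confusion matrix: rows are the truth, columns the predictions
print(table(d$income, income_hat))
mean(d$income == income_hat)   # fraction classified correctly: 0.94 here
```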
split <- 5119
train <- train %>%
  mutate(hi_cap_gains = capital.gain >= split)

library(ggplot2)
ggplot(data = train, aes(x = capital.gain, y = income)) +
  geom_count(
    aes(color = hi_cap_gains),
    position = position_jitter(width = 0, height = 0.1),
    alpha = 0.5
  ) +
  scale_x_log10(labels = scales::dollar)
This decision tree uses a single variable, capital.gain, to divide the records into those who paid at least $5,119 in capital gains and those who did not.
For the former group, which accounts for about 95.2% of all observations, we predict earnings below $50,000 and are correct about 79.7% of the time.
For the latter group, predicting earnings above $50,000 is correct about 94.5% of the time.
As a result, our overall accuracy rises to about 80.4% (0.952 × 0.797 + 0.048 × 0.945), substantially better than the null model’s 76%.
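These accuracy figures can be checked directly from the counts printed in the rpart output above: 24,805 and 1,244 observations land in the two leaves, with 5,030 and 68 of them misclassified, respectively. A quick arithmetic check:

```r
# Counts copied from the printed tree above
n_total      <- 26049
correct_low  <- 24805 - 5030  # leaf 2: capital.gain < 5119, predicted <=50K
correct_high <- 1244 - 68     # leaf 3: capital.gain >= 5119, predicted >50K

(correct_low + correct_high) / n_total  # overall accuracy, about 0.804
```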
How did the algorithm determine that $5,119 was the best threshold?
Please put your answers in the comment section.
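As a hint: by default, rpart scans every candidate threshold of every predictor and keeps the one that most reduces the Gini impurity of the resulting subsets. A brute-force sketch of that idea on toy data (our own illustration, not rpart’s actual implementation):

```r
# Gini impurity of a vector of class labels: 1 - sum of squared proportions
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Size-weighted average impurity of the two subsets created by threshold t
split_impurity <- function(x, y, t) {
  left  <- y[x < t]
  right <- y[x >= t]
  (length(left) * gini(left) + length(right) * gini(right)) / length(y)
}

# Toy data: low capital gains mostly <=50K, high mostly >50K
x <- c(0, 0, 100, 2000, 6000, 7000, 8000, 10000)
y <- c("<=50K", "<=50K", "<=50K", "<=50K", ">50K", ">50K", "<=50K", ">50K")

# Candidate thresholds: midpoints between consecutive unique values
xs    <- sort(unique(x))
cands <- (head(xs, -1) + tail(xs, -1)) / 2
best  <- cands[which.min(sapply(cands, split_impurity, x = x, y = y))]
best  # the midpoint giving the purest subsets: 4000 for this toy data
```

rpart likewise reports the midpoint between the two observed values that straddle the best cut, which is why the census tree prints 5119 rather than an observed capital gain.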