Credit Card Fraud Detection in R

In this R project, we will learn how to detect credit card fraud.

We’ll go over a variety of methods, including Gradient Boosting Classifiers, Logistic Regression, Decision Trees, and Artificial Neural Networks.

We will use the Card Transactions dataset, which includes both fraudulent and legitimate transactions, to carry out credit card fraud detection.

This R project aims to create a classifier that can recognize fraudulent credit card transactions.

We will employ a range of machine-learning techniques that can distinguish between fraudulent and non-fraudulent transactions.

By the end of this project, you will know how to use machine-learning algorithms to perform classification.

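1. Importing the Dataset

We begin by importing the dataset of credit card transactions. The file path below is specific to the author's machine, so point it at your own copy of creditcard.csv. The dimensions and column names shown in the next section appear to match the publicly available Kaggle credit card fraud dataset.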

library(ranger)
library(caret)
library(data.table)
creditcard_data <- read.csv("D:/RStudio/creditcard.csv")
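
Because the hard-coded path will not exist on most machines, a more defensive load can help. The following is a minimal sketch, not the author's code: the helper name load_transactions is hypothetical, and the temporary file only stands in for the real creditcard.csv.

```r
# Hypothetical helper (not part of the original project): read a transactions
# CSV and fail early with a clear message if the file or Class column is missing.
load_transactions <- function(path) {
  if (!file.exists(path)) {
    stop("File not found: ", path, ". Set 'path' to your local copy of the CSV.")
  }
  transactions <- read.csv(path)
  if (!"Class" %in% names(transactions)) {
    stop("Expected a 'Class' column marking fraud (1) vs. legitimate (0).")
  }
  transactions
}

# Demonstration on a tiny stand-in file; replace demo_path with the real path.
demo_path <- tempfile(fileext = ".csv")
write.csv(data.frame(Time = c(0, 1), Amount = c(149.62, 2.69), Class = c(0, 0)),
          demo_path, row.names = FALSE)
demo_data <- load_transactions(demo_path)
dim(demo_data)
```

Failing early with a readable message is usually easier to debug than the generic error read.csv() raises for a missing file.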

2. Data Exploration

In this section of the fraud detection project, we examine the data in the creditcard_data data frame.

We start by displaying the first and last rows with head() and tail(). We then look at the class counts, a summary of Amount, the column names, and the variance and standard deviation of Amount.

dim(creditcard_data)
[1] 284807     31
head(creditcard_data,6)
Time         V1          V2        V3         V4          V5          V6
1    0 -1.3598071 -0.07278117 2.5363467  1.3781552 -0.33832077  0.46238778
2    0  1.1918571  0.26615071 0.1664801  0.4481541  0.06001765 -0.08236081
3    1 -1.3583541 -1.34016307 1.7732093  0.3797796 -0.50319813  1.80049938
4    1 -0.9662717 -0.18522601 1.7929933 -0.8632913 -0.01030888  1.24720317
5    2 -1.1582331  0.87773676 1.5487178  0.4030339 -0.40719338  0.09592146
6    2 -0.4259659  0.96052304 1.1411093 -0.1682521  0.42098688 -0.02972755
           V7          V8         V9         V10        V11         V12
1  0.23959855  0.09869790  0.3637870  0.09079417 -0.5515995 -0.61780086
2 -0.07880298  0.08510165 -0.2554251 -0.16697441  1.6127267  1.06523531
3  0.79146096  0.24767579 -1.5146543  0.20764287  0.6245015  0.06608369
4  0.23760894  0.37743587 -1.3870241 -0.05495192 -0.2264873  0.17822823
5  0.59294075 -0.27053268  0.8177393  0.75307443 -0.8228429  0.53819555
6  0.47620095  0.26031433 -0.5686714 -0.37140720  1.3412620  0.35989384
         V13        V14        V15        V16         V17         V18
1 -0.9913898 -0.3111694  1.4681770 -0.4704005  0.20797124  0.02579058
2  0.4890950 -0.1437723  0.6355581  0.4639170 -0.11480466 -0.18336127
3  0.7172927 -0.1659459  2.3458649 -2.8900832  1.10996938 -0.12135931
4  0.5077569 -0.2879237 -0.6314181 -1.0596472 -0.68409279  1.96577500
5  1.3458516 -1.1196698  0.1751211 -0.4514492 -0.23703324 -0.03819479
6 -0.3580907 -0.1371337  0.5176168  0.4017259 -0.05813282  0.06865315
          V19         V20          V21          V22         V23         V24
1  0.40399296  0.25141210 -0.018306778  0.277837576 -0.11047391  0.06692808
2 -0.14578304 -0.06908314 -0.225775248 -0.638671953  0.10128802 -0.33984648
3 -2.26185709  0.52497973  0.247998153  0.771679402  0.90941226 -0.68928096
4 -1.23262197 -0.20803778 -0.108300452  0.005273597 -0.19032052 -1.17557533
5  0.80348692  0.40854236 -0.009430697  0.798278495 -0.13745808  0.14126698
6 -0.03319379  0.08496767 -0.208253515 -0.559824796 -0.02639767 -0.37142658
         V25        V26          V27         V28 Amount Class
1  0.1285394 -0.1891148  0.133558377 -0.02105305 149.62     0
2  0.1671704  0.1258945 -0.008983099  0.01472417   2.69     0
3 -0.3276418 -0.1390966 -0.055352794 -0.05975184 378.66     0
4  0.6473760 -0.2219288  0.062722849  0.06145763 123.50     0
5 -0.2060096  0.5022922  0.219422230  0.21515315  69.99     0
6 -0.2327938  0.1059148  0.253844225  0.08108026   3.67     0
tail(creditcard_data,6)
Time          V1          V2         V3         V4          V5
284802 172785   0.1203164  0.93100513 -0.5460121 -0.7450968  1.13031398
284803 172786 -11.8811179 10.07178497 -9.8347835 -2.0666557 -5.36447278
284804 172787  -0.7327887 -0.05508049  2.0350297 -0.7385886  0.86822940
284805 172788   1.9195650 -0.30125385 -3.2496398 -0.5578281  2.63051512
284806 172788  -0.2404400  0.53048251  0.7025102  0.6897992 -0.37796113
284807 172792  -0.5334125 -0.18973334  0.7033374 -0.5062712 -0.01254568
               V6         V7         V8         V9        V10        V11
284802 -0.2359732  0.8127221  0.1150929 -0.2040635 -0.6574221  0.6448373
284803 -2.6068373 -4.9182154  7.3053340  1.9144283  4.3561704 -1.5931053
284804  1.0584153  0.0243297  0.2948687  0.5848000 -0.9759261 -0.1501888
284805  3.0312601 -0.2968265  0.7084172  0.4324540 -0.4847818  0.4116137
284806  0.6237077 -0.6861800  0.6791455  0.3920867 -0.3991257 -1.9338488
284807 -0.6496167  1.5770063 -0.4146504  0.4861795 -0.9154266 -1.0404583
               V12        V13         V14         V15        V16         V17
284802  0.19091623 -0.5463289 -0.73170658 -0.80803553  0.5996281  0.07044075
284803  2.71194079 -0.6892556  4.62694202 -0.92445871  1.1076406  1.99169111
284804  0.91580191  1.2147558 -0.67514296  1.16493091 -0.7117573 -0.02569286
284805  0.06311886 -0.1836987 -0.51060184  1.32928351  0.1407160  0.31350179
284806 -0.96288614 -1.0420817  0.44962444  1.96256312 -0.6085771  0.50992846
284807 -0.03151305 -0.1880929 -0.08431647  0.04133345 -0.3026201 -0.66037665
              V18        V19         V20        V21        V22         V23
284802  0.3731103  0.1289038 0.000675833 -0.3142046 -0.8085204  0.05034266
284803  0.5106323 -0.6829197 1.475829135  0.2134541  0.1118637  1.01447990
284804 -1.2211789 -1.5455561 0.059615900  0.2142053  0.9243836  0.01246304
284805  0.3956525 -0.5772518 0.001395970  0.2320450  0.5782290 -0.03750085
284806  1.1139806  2.8978488 0.127433516  0.2652449  0.8000487 -0.16329794
284807  0.1674299 -0.2561169 0.382948105  0.2610573  0.6430784  0.37677701
                V24        V25        V26          V27         V28 Amount
284802  0.102799590 -0.4358701  0.1240789  0.217939865  0.06880333   2.69
284803 -0.509348453  1.4368069  0.2500343  0.943651172  0.82373096   0.77
284804 -1.016225669 -0.6066240 -0.3952551  0.068472470 -0.05352739  24.79
284805  0.640133881  0.2657455 -0.0873706  0.004454772 -0.02656083  67.88
284806  0.123205244 -0.5691589  0.5466685  0.108820735  0.10453282  10.00
284807  0.008797379 -0.4736487 -0.8182671 -0.002415309  0.01364891 217.00
       Class
284802     0
284803     0
284804     0
284805     0
284806     0
284807     0
table(creditcard_data$Class)
   0      1 
284315    492 
summary(creditcard_data$Amount)
 Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
    0.00     5.60    22.00    88.35    77.17 25691.16 
names(creditcard_data)
[1] "Time"   "V1"     "V2"     "V3"     "V4"     "V5"     "V6"     "V7"    
 [9] "V8"     "V9"     "V10"    "V11"    "V12"    "V13"    "V14"    "V15"   
[17] "V16"    "V17"    "V18"    "V19"    "V20"    "V21"    "V22"    "V23"   
[25] "V24"    "V25"    "V26"    "V27"    "V28"    "Amount" "Class" 
var(creditcard_data$Amount)
[1] 62560.07
sd(creditcard_data$Amount)
[1] 250.1201
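
The table() output shows how imbalanced the classes are, which matters when choosing an evaluation metric. A short worked calculation, using only the counts printed above (the variable names are illustrative):

```r
# Class counts copied from the table() output above
n_legit <- 284315
n_fraud <- 492
n_total <- n_legit + n_fraud

fraud_rate <- n_fraud / n_total          # share of fraudulent transactions
baseline_accuracy <- n_legit / n_total   # accuracy of always predicting "legitimate"

round(100 * fraud_rate, 3)         # fraud share in percent
round(100 * baseline_accuracy, 3)  # baseline accuracy in percent
```

Since a model that never flags fraud would already be right for nearly all rows, plain accuracy says little here; this is one reason to evaluate with ROC curves instead.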

3. Data Manipulation

In this phase of the project, we scale our data with the scale() function. We apply it to the Amount column of creditcard_data.

Scaling here means feature standardization: by default, scale() subtracts the column mean and divides by the column standard deviation, so the result has mean 0 and standard deviation 1.

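Standardizing puts Amount on a scale comparable to the other features, so its large raw values do not dominate model fitting. It does not remove outliers, which remain as large standardized values. We also drop the first column, Time, and store the result in NewData: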

head(creditcard_data)
creditcard_data$Amount=scale(creditcard_data$Amount)
NewData=creditcard_data[,-c(1)]
head(NewData)
 V1          V2        V3         V4          V5          V6
1 -1.3598071 -0.07278117 2.5363467  1.3781552 -0.33832077  0.46238778
2  1.1918571  0.26615071 0.1664801  0.4481541  0.06001765 -0.08236081
3 -1.3583541 -1.34016307 1.7732093  0.3797796 -0.50319813  1.80049938
4 -0.9662717 -0.18522601 1.7929933 -0.8632913 -0.01030888  1.24720317
5 -1.1582331  0.87773676 1.5487178  0.4030339 -0.40719338  0.09592146
6 -0.4259659  0.96052304 1.1411093 -0.1682521  0.42098688 -0.02972755
           V7          V8         V9         V10        V11         V12
1  0.23959855  0.09869790  0.3637870  0.09079417 -0.5515995 -0.61780086
2 -0.07880298  0.08510165 -0.2554251 -0.16697441  1.6127267  1.06523531
3  0.79146096  0.24767579 -1.5146543  0.20764287  0.6245015  0.06608369
4  0.23760894  0.37743587 -1.3870241 -0.05495192 -0.2264873  0.17822823
5  0.59294075 -0.27053268  0.8177393  0.75307443 -0.8228429  0.53819555
6  0.47620095  0.26031433 -0.5686714 -0.37140720  1.3412620  0.35989384
         V13        V14        V15        V16         V17         V18
1 -0.9913898 -0.3111694  1.4681770 -0.4704005  0.20797124  0.02579058
2  0.4890950 -0.1437723  0.6355581  0.4639170 -0.11480466 -0.18336127
3  0.7172927 -0.1659459  2.3458649 -2.8900832  1.10996938 -0.12135931
4  0.5077569 -0.2879237 -0.6314181 -1.0596472 -0.68409279  1.96577500
5  1.3458516 -1.1196698  0.1751211 -0.4514492 -0.23703324 -0.03819479
6 -0.3580907 -0.1371337  0.5176168  0.4017259 -0.05813282  0.06865315
          V19         V20          V21          V22         V23         V24
1  0.40399296  0.25141210 -0.018306778  0.277837576 -0.11047391  0.06692808
2 -0.14578304 -0.06908314 -0.225775248 -0.638671953  0.10128802 -0.33984648
3 -2.26185709  0.52497973  0.247998153  0.771679402  0.90941226 -0.68928096
4 -1.23262197 -0.20803778 -0.108300452  0.005273597 -0.19032052 -1.17557533
5  0.80348692  0.40854236 -0.009430697  0.798278495 -0.13745808  0.14126698
6 -0.03319379  0.08496767 -0.208253515 -0.559824796 -0.02639767 -0.37142658
         V25        V26          V27         V28      Amount Class
1  0.1285394 -0.1891148  0.133558377 -0.02105305  0.24496383     0
2  0.1671704  0.1258945 -0.008983099  0.01472417 -0.34247394     0
3 -0.3276418 -0.1390966 -0.055352794 -0.05975184  1.16068389     0
4  0.6473760 -0.2219288  0.062722849  0.06145763  0.14053401     0
5 -0.2060096  0.5022922  0.219422230  0.21515315 -0.07340321     0
6 -0.2327938  0.1059148  0.253844225  0.08108026 -0.33855582     0
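
To make the standardization concrete, the sketch below checks that scale() matches the manual formula (x - mean) / sd. It uses only the six Amount values from the head() output, so the results differ from the column above, which was standardized over the full dataset.

```r
# The six raw Amount values shown in the earlier head() output
amount <- c(149.62, 2.69, 378.66, 123.50, 69.99, 3.67)

# scale() returns a one-column matrix; as.numeric() flattens it to a vector
scaled_builtin <- as.numeric(scale(amount))
scaled_manual  <- (amount - mean(amount)) / sd(amount)

all.equal(scaled_builtin, scaled_manual)  # the two computations agree
mean(scaled_builtin)                      # numerically zero
sd(scaled_builtin)                        # one
```

Wrapping the result in as.numeric() is optional, but it avoids storing a matrix inside a data frame column.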

4. Data Modeling

With Amount standardized, we divide the dataset into a training set and a test set using a split ratio of 0.80.

As a result, 80% of the rows are assigned to train_data and 20% to test_data. We then use dim() to check the dimensions of each set.

library(caTools)
set.seed(123)
data_sample = sample.split(NewData$Class,SplitRatio=0.80)
train_data = subset(NewData,data_sample==TRUE)
test_data = subset(NewData,data_sample==FALSE)
dim(train_data)
[1] 227846     30
dim(test_data)
[1] 56961    30
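
Passing the label vector to sample.split() keeps the class ratio roughly equal in both sets, which matters with so few fraud cases. The following base-R sketch illustrates the same idea of a stratified split on synthetic data; the data frame toy and its class proportions are made up for illustration.

```r
set.seed(123)
# Synthetic stand-in data: 1000 rows, 2 percent positives (made-up numbers)
toy <- data.frame(x = rnorm(1000), Class = rep(c(0, 1), times = c(980, 20)))

# Sample 80 percent of the row indices within each class separately
train_idx <- unlist(lapply(split(seq_len(nrow(toy)), toy$Class), function(idx) {
  sample(idx, size = round(0.8 * length(idx)))
}))
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Both sets keep the same share of positives
prop.table(table(toy_train$Class))
prop.table(table(toy_test$Class))
```

A purely random split could, by chance, leave very few fraud cases in the test set, which would make any evaluation on it unstable.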

5. Fitting Logistic Regression Model

In this phase of the project, we fit our first fraud detection model. We start with logistic regression.

Logistic regression estimates the probability that an observation belongs to a class, such as pass/fail, positive/negative, or, in our case, fraud/not fraud.
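
As a small illustration of how a fitted logistic regression turns a linear predictor into a probability, the sketch below fits glm() on synthetic data. The data and the names toy_model and new_obs are assumptions made for this example only.

```r
set.seed(42)
# Synthetic data: one feature and a binary outcome (made up for illustration)
x <- rnorm(500)
y <- rbinom(500, size = 1, prob = plogis(-1 + 2 * x))
toy <- data.frame(x = x, y = y)

toy_model <- glm(y ~ x, data = toy, family = binomial())

new_obs  <- data.frame(x = c(-1, 0, 1))
log_odds <- predict(toy_model, new_obs, type = "link")      # linear predictor
prob     <- predict(toy_model, new_obs, type = "response")  # probabilities in (0, 1)

# The probability is the logistic function applied to the log-odds
all.equal(unname(prob), unname(plogis(log_odds)))
```

Note that predict() on a glm object returns log-odds by default; type = "response" is needed to get probabilities.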

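The call below fits the model. Note that it passes test_data, and the summary that follows reflects that fit (the null deviance has 56960 degrees of freedom, matching the test set). To evaluate on held-out data, fit on train_data instead and keep test_data for prediction only: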

Logistic_Model=glm(Class~.,test_data,family=binomial())
summary(Logistic_Model)
Call:
glm(formula = Class ~ ., family = binomial(), data = test_data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-4.9019  -0.0254  -0.0156  -0.0078   4.0877  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)  
(Intercept) -12.52800   10.30537  -1.216   0.2241  
V1           -0.17299    1.27381  -0.136   0.8920  
V2            1.44512    4.23062   0.342   0.7327  
V3            0.17897    0.24058   0.744   0.4569  
V4            3.13593    7.17768   0.437   0.6622  
V5            1.49014    3.80369   0.392   0.6952  
V6           -0.12428    0.22202  -0.560   0.5756  
V7            1.40903    4.22644   0.333   0.7388  
V8           -0.35254    0.17462  -2.019   0.0435 *
V9            3.02176    8.67262   0.348   0.7275  
V10          -2.89571    6.62383  -0.437   0.6620  
V11          -0.09769    0.28270  -0.346   0.7297  
V12           1.97992    6.56699   0.301   0.7630  
V13          -0.71674    1.25649  -0.570   0.5684  
V14           0.19316    3.28868   0.059   0.9532  
V15           1.03868    2.89256   0.359   0.7195  
V16          -2.98194    7.11391  -0.419   0.6751  
V17          -1.81809    4.99764  -0.364   0.7160  
V18           2.74772    8.13188   0.338   0.7354  
V19          -1.63246    4.77228  -0.342   0.7323  
V20          -0.69925    1.15114  -0.607   0.5436  
V21          -0.45082    1.99182  -0.226   0.8209  
V22          -1.40395    5.18980  -0.271   0.7868  
V23           0.19026    0.61195   0.311   0.7559  
V24          -0.12889    0.44701  -0.288   0.7731  
V25          -0.57835    1.94988  -0.297   0.7668  
V26           2.65938    9.34957   0.284   0.7761  
V27          -0.45396    0.81502  -0.557   0.5775  
V28          -0.06639    0.35730  -0.186   0.8526  
Amount        0.22576    0.71892   0.314   0.7535  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1443.40  on 56960  degrees of freedom
Residual deviance:  378.59  on 56931  degrees of freedom
AIC: 438.59

Number of Fisher Scoring iterations: 17

With the summary in hand, we can plot the model's diagnostic graphs:

plot(Logistic_Model)

We will draw the ROC curve in order to evaluate the effectiveness of our model.

ROC stands for Receiver Operating Characteristic. The curve plots the true positive rate against the false positive rate across all classification thresholds.

To do this, we first load the pROC package, then plot the ROC curve and assess the model's performance.

library(pROC)
# Predict fraud probabilities for the same rows whose labels are passed to roc()
lr.predict <- predict(Logistic_Model, test_data, type = "response")
auc.lr = roc(test_data$Class, lr.predict, plot = TRUE, col = "blue")
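
The area under the ROC curve (AUC) equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The base-R sketch below computes it from ranks on made-up labels and scores; the function name auc_by_rank is hypothetical and is not part of pROC.

```r
# Toy labels and scores (made up for illustration); 1 = fraud, 0 = legitimate
labels <- c(0, 0, 1, 0, 1, 1, 0, 1)
scores <- c(0.1, 0.4, 0.35, 0.8, 0.7, 0.9, 0.2, 0.6)

# Hypothetical helper: AUC via the rank-sum (Mann-Whitney) identity
auc_by_rank <- function(labels, scores) {
  r <- rank(scores)                 # average ranks are used for tied scores
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc_by_rank(labels, scores)  # 0.75: 12 of the 16 positive/negative pairs are ordered correctly
```

An AUC of 0.5 corresponds to random ranking and 1 to perfect separation, independent of how rare the positive class is.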

6. Fitting a Decision Tree Model

In this section, we implement a decision tree. A decision tree represents a sequence of decisions as branches, and each leaf gives an outcome.

Following the splits from the root to a leaf tells us which class an observation belongs to.

We fit the tree by recursive partitioning with rpart(), then draw it with rpart.plot().

Note that the code below fits and predicts on the full creditcard_data rather than on the train/test split, so its predictions are in-sample:

library(rpart)
library(rpart.plot)
decisionTree_model <- rpart(Class ~ . , creditcard_data, method = 'class')
predicted_val <- predict(decisionTree_model, creditcard_data, type = 'class')
probability <- predict(decisionTree_model, creditcard_data, type = 'prob')
rpart.plot(decisionTree_model)
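
Class predictions such as predicted_val are usually summarized in a confusion matrix. The sketch below shows the idea on made-up vectors; actual and predicted are illustrative stand-ins, not output from the tree above.

```r
# Toy labels and predictions (made up); 1 = fraud, 0 = legitimate
actual    <- factor(c(0, 0, 0, 0, 0, 0, 1, 1, 1, 0), levels = c(0, 1))
predicted <- factor(c(0, 0, 0, 1, 0, 0, 1, 0, 1, 0), levels = c(0, 1))

confusion <- table(Predicted = predicted, Actual = actual)
confusion

true_pos  <- confusion["1", "1"]
false_pos <- confusion["1", "0"]
false_neg <- confusion["0", "1"]

precision <- true_pos / (true_pos + false_pos)  # share of flagged cases that are fraud
recall    <- true_pos / (true_pos + false_neg)  # share of fraud cases that were flagged
```

With heavily imbalanced classes, precision and recall on the fraud class are more informative than overall accuracy.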

7. Artificial Neural Network

An artificial neural network (ANN) is a machine learning model loosely inspired by the human nervous system.

ANN models learn patterns from historical data and can then classify new inputs. We import the neuralnet package to build our ANN.

We then plot the network with plot(). Because linear.output = FALSE applies the activation function to the output neuron, the network's outputs lie between 0 and 1.

We set a threshold of 0.5: outputs above it are mapped to 1, and all others to 0.

We put this into practice as follows:

library(neuralnet)
ANN_model =neuralnet (Class~.,train_data,linear.output=FALSE)
plot(ANN_model)
predANN=compute(ANN_model,test_data)
resultANN=predANN$net.result
resultANN=ifelse(resultANN>0.5,1,0)
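
To show what happens between the inputs and the thresholded output, here is a hand-written forward pass for a tiny network with one hidden unit. All weights and inputs are hypothetical values chosen for illustration; they are not taken from ANN_model.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Hypothetical weights: 2 inputs, 1 hidden unit, 1 output (illustrative values)
w_hidden <- c(0.5, -1.2)
b_hidden <- 0.1
w_out    <- 2.0
b_out    <- -0.5

# Three made-up observations, one per row
inputs <- rbind(c(1.0, 0.2), c(-0.5, 1.5), c(2.0, -1.0))

hidden <- sigmoid(inputs %*% w_hidden + b_hidden)  # hidden activations
output <- sigmoid(hidden * w_out + b_out)          # outputs lie strictly in (0, 1)

predicted_class <- ifelse(output > 0.5, 1, 0)      # same 0.5 threshold as above
predicted_class
```

The 0.5 cutoff is a convention rather than a requirement; with rare fraud cases, a lower threshold trades more false alarms for fewer missed frauds.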

8. Gradient Boosting (GBM)

Gradient boosting is a popular machine learning technique for classification and regression.

The model is an ensemble of weak learners, typically shallow decision trees. Each new tree is fit to correct the errors of the trees before it, and together they form a strong predictor.

We fit the gradient boosting model as follows:

library(gbm, quietly=TRUE)
system.time(
       model_gbm <- gbm(Class ~ .
               , distribution = "bernoulli"
               , data = rbind(train_data, test_data)
               , n.trees = 500
               , interaction.depth = 3
               , n.minobsinnode = 100
               , shrinkage = 0.01
               , bag.fraction = 0.5
               , train.fraction = nrow(train_data) / (nrow(train_data) + nrow(test_data))
)
)
# Determine best iteration based on test data
gbm.iter = gbm.perf(model_gbm, method = "test")
model.influence = relative.influence(model_gbm, n.trees = gbm.iter, sort. = TRUE)
plot(model_gbm)
gbm_test = predict(model_gbm, newdata = test_data, n.trees = gbm.iter)
gbm_auc = roc(test_data$Class, gbm_test, plot = TRUE, col = "red")
print(gbm_auc)
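
To illustrate how weak trees combine into a stronger model, the sketch below implements a stripped-down boosting loop in base R. It is a simplification under stated assumptions: it uses squared-error loss on a synthetic regression problem rather than the bernoulli loss used above, one-split stumps instead of depth-3 trees, and no subsampling. The helper names fit_stump and predict_stump are hypothetical.

```r
set.seed(1)
# Synthetic regression data (made up for illustration)
x <- seq(0, 1, length.out = 200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.1)

# Fit a one-split regression stump to the current residuals r
fit_stump <- function(x, r) {
  xs <- sort(unique(x))
  candidates <- (head(xs, -1) + tail(xs, -1)) / 2  # midpoints between neighbors
  best <- NULL
  best_sse <- Inf
  for (s in candidates) {
    left <- r[x <= s]
    right <- r[x > s]
    sse <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
    if (sse < best_sse) {
      best_sse <- sse
      best <- list(split = s, left = mean(left), right = mean(right))
    }
  }
  best
}
predict_stump <- function(stump, x) ifelse(x <= stump$split, stump$left, stump$right)

shrinkage <- 0.1   # plays the role of gbm's shrinkage argument
n_trees   <- 100   # plays the role of gbm's n.trees argument
pred <- rep(mean(y), length(y))  # start from a constant prediction
mse  <- numeric(n_trees)
for (m in seq_len(n_trees)) {
  residuals <- y - pred                    # negative gradient of squared-error loss
  stump <- fit_stump(x, residuals)         # weak learner fit to the residuals
  pred  <- pred + shrinkage * predict_stump(stump, x)
  mse[m] <- mean((y - pred)^2)             # training error after m stumps
}
c(first = mse[1], last = mse[n_trees])
```

Each stump alone is a poor model, but the training error falls steadily as stumps are added; a small shrinkage value slows this down, which is why it is usually paired with a larger number of trees.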

Summary

We learned how to create a credit card fraud detection model using machine learning as part of our R Data Science project.

To develop it, we used a range of ML methods and plotted ROC curves for the logistic regression and gradient boosting models.

By analyzing and visualizing the data, we learned how to distinguish fraudulent transactions from legitimate ones.

I hope you liked the aforementioned R project. Use the comments section to share your knowledge and questions.

  1. Jerry Tuttle says:

    Hi. Is the dataset publicly available? We can’t access it with your statement read.csv(“D:/RStudio/creditcard.csv”)

    Thank you.
