Credit Card Fraud Detection in R
In this R project, we will learn how to perform credit card fraud detection.
We’ll go over a variety of methods, including Gradient Boosting Classifiers, Logistic Regression, Decision Trees, and Artificial Neural Networks.
We will use the Card Transactions dataset, which includes both fraudulent and legitimate transactions, to carry out credit card fraud detection.
This R project aims to create a classifier that can recognize fraudulent credit card transactions.
We will employ a range of machine-learning techniques that can distinguish between fraudulent and non-fraudulent transactions.
By the end of this machine-learning project, you will know how to use machine-learning algorithms to perform classification.
1. Bringing in the Datasets
We are importing the datasets that include credit card transaction data.
library(ranger)
library(caret)
library(data.table)
creditcard_data <- read.csv("D:/RStudio/creditcard.csv")
2. Data Exploration
In this section of the fraud detection ML project, we will examine the data in the creditcard_data data frame.
We’ll continue by showing the credit card data using both the head() and tail() functions. The remaining elements of this data frame will then be explored.
dim(creditcard_data)
[1] 284807     31
head(creditcard_data,6)
  Time         V1          V2        V3         V4          V5          V6
1    0 -1.3598071 -0.07278117 2.5363467  1.3781552 -0.33832077  0.46238778
2    0  1.1918571  0.26615071 0.1664801  0.4481541  0.06001765 -0.08236081
3    1 -1.3583541 -1.34016307 1.7732093  0.3797796 -0.50319813  1.80049938
4    1 -0.9662717 -0.18522601 1.7929933 -0.8632913 -0.01030888  1.24720317
5    2 -1.1582331  0.87773676 1.5487178  0.4030339 -0.40719338  0.09592146
6    2 -0.4259659  0.96052304 1.1411093 -0.1682521  0.42098688 -0.02972755
           V7          V8         V9         V10        V11         V12
1  0.23959855  0.09869790  0.3637870  0.09079417 -0.5515995 -0.61780086
2 -0.07880298  0.08510165 -0.2554251 -0.16697441  1.6127267  1.06523531
3  0.79146096  0.24767579 -1.5146543  0.20764287  0.6245015  0.06608369
4  0.23760894  0.37743587 -1.3870241 -0.05495192 -0.2264873  0.17822823
5  0.59294075 -0.27053268  0.8177393  0.75307443 -0.8228429  0.53819555
6  0.47620095  0.26031433 -0.5686714 -0.37140720  1.3412620  0.35989384
         V13        V14        V15        V16         V17         V18
1 -0.9913898 -0.3111694  1.4681770 -0.4704005  0.20797124  0.02579058
2  0.4890950 -0.1437723  0.6355581  0.4639170 -0.11480466 -0.18336127
3  0.7172927 -0.1659459  2.3458649 -2.8900832  1.10996938 -0.12135931
4  0.5077569 -0.2879237 -0.6314181 -1.0596472 -0.68409279  1.96577500
5  1.3458516 -1.1196698  0.1751211 -0.4514492 -0.23703324 -0.03819479
6 -0.3580907 -0.1371337  0.5176168  0.4017259 -0.05813282  0.06865315
          V19         V20          V21          V22         V23         V24
1  0.40399296  0.25141210 -0.018306778  0.277837576 -0.11047391  0.06692808
2 -0.14578304 -0.06908314 -0.225775248 -0.638671953  0.10128802 -0.33984648
3 -2.26185709  0.52497973  0.247998153  0.771679402  0.90941226 -0.68928096
4 -1.23262197 -0.20803778 -0.108300452  0.005273597 -0.19032052 -1.17557533
5  0.80348692  0.40854236 -0.009430697  0.798278495 -0.13745808  0.14126698
6 -0.03319379  0.08496767 -0.208253515 -0.559824796 -0.02639767 -0.37142658
         V25        V26          V27         V28 Amount Class
1  0.1285394 -0.1891148  0.133558377 -0.02105305 149.62     0
2  0.1671704  0.1258945 -0.008983099  0.01472417   2.69     0
3 -0.3276418 -0.1390966 -0.055352794 -0.05975184 378.66     0
4  0.6473760 -0.2219288  0.062722849  0.06145763 123.50     0
5 -0.2060096  0.5022922  0.219422230  0.21515315  69.99     0
6 -0.2327938  0.1059148  0.253844225  0.08108026   3.67     0
tail(creditcard_data,6)
         Time          V1          V2         V3         V4          V5
284802 172785   0.1203164  0.93100513 -0.5460121 -0.7450968  1.13031398
284803 172786 -11.8811179 10.07178497 -9.8347835 -2.0666557 -5.36447278
284804 172787  -0.7327887 -0.05508049  2.0350297 -0.7385886  0.86822940
284805 172788   1.9195650 -0.30125385 -3.2496398 -0.5578281  2.63051512
284806 172788  -0.2404400  0.53048251  0.7025102  0.6897992 -0.37796113
284807 172792  -0.5334125 -0.18973334  0.7033374 -0.5062712 -0.01254568
               V6         V7         V8         V9        V10        V11
284802 -0.2359732  0.8127221  0.1150929 -0.2040635 -0.6574221  0.6448373
284803 -2.6068373 -4.9182154  7.3053340  1.9144283  4.3561704 -1.5931053
284804  1.0584153  0.0243297  0.2948687  0.5848000 -0.9759261 -0.1501888
284805  3.0312601 -0.2968265  0.7084172  0.4324540 -0.4847818  0.4116137
284806  0.6237077 -0.6861800  0.6791455  0.3920867 -0.3991257 -1.9338488
284807 -0.6496167  1.5770063 -0.4146504  0.4861795 -0.9154266 -1.0404583
               V12        V13         V14         V15        V16         V17
284802  0.19091623 -0.5463289 -0.73170658 -0.80803553  0.5996281  0.07044075
284803  2.71194079 -0.6892556  4.62694202 -0.92445871  1.1076406  1.99169111
284804  0.91580191  1.2147558 -0.67514296  1.16493091 -0.7117573 -0.02569286
284805  0.06311886 -0.1836987 -0.51060184  1.32928351  0.1407160  0.31350179
284806 -0.96288614 -1.0420817  0.44962444  1.96256312 -0.6085771  0.50992846
284807 -0.03151305 -0.1880929 -0.08431647  0.04133345 -0.3026201 -0.66037665
              V18        V19         V20        V21        V22         V23
284802  0.3731103  0.1289038 0.000675833 -0.3142046 -0.8085204  0.05034266
284803  0.5106323 -0.6829197 1.475829135  0.2134541  0.1118637  1.01447990
284804 -1.2211789 -1.5455561 0.059615900  0.2142053  0.9243836  0.01246304
284805  0.3956525 -0.5772518 0.001395970  0.2320450  0.5782290 -0.03750085
284806  1.1139806  2.8978488 0.127433516  0.2652449  0.8000487 -0.16329794
284807  0.1674299 -0.2561169 0.382948105  0.2610573  0.6430784  0.37677701
                V24        V25        V26          V27         V28 Amount
284802  0.102799590 -0.4358701  0.1240789  0.217939865  0.06880333   2.69
284803 -0.509348453  1.4368069  0.2500343  0.943651172  0.82373096   0.77
284804 -1.016225669 -0.6066240 -0.3952551  0.068472470 -0.05352739  24.79
284805  0.640133881  0.2657455 -0.0873706  0.004454772 -0.02656083  67.88
284806  0.123205244 -0.5691589  0.5466685  0.108820735  0.10453282  10.00
284807  0.008797379 -0.4736487 -0.8182671 -0.002415309  0.01364891 217.00
       Class
284802     0
284803     0
284804     0
284805     0
284806     0
284807     0
table(creditcard_data$Class)
     0      1
284315    492
summary(creditcard_data$Amount)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    0.00     5.60    22.00    88.35    77.17 25691.16
names(creditcard_data)
 [1] "Time"   "V1"     "V2"     "V3"     "V4"     "V5"     "V6"     "V7"
 [9] "V8"     "V9"     "V10"    "V11"    "V12"    "V13"    "V14"    "V15"
[17] "V16"    "V17"    "V18"    "V19"    "V20"    "V21"    "V22"    "V23"
[25] "V24"    "V25"    "V26"    "V27"    "V28"    "Amount" "Class"
var(creditcard_data$Amount)
[1] 62560.07
sd(creditcard_data$Amount)
[1] 250.1201
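The class counts printed by table() show how imbalanced the data is. As a small illustration, the sketch below recomputes the class shares from those two counts only; the variable names are ours and not part of the original project.

```r
# Class counts copied from the table() output above
class_counts <- c(legitimate = 284315, fraud = 492)

# Share of each class among all transactions
class_share <- class_counts / sum(class_counts)
print(round(100 * class_share, 3))  # shares in percent

# Accuracy of a trivial rule that labels every transaction as legitimate
baseline_accuracy <- class_share[["legitimate"]]
print(baseline_accuracy)
```

Fraud makes up roughly 0.17% of the rows, so a rule that always predicts "legitimate" would already be about 99.8% accurate. This is one reason plain accuracy is a weak metric here and ROC curves are used later in the project.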
3. Data Manipulation
Using the scale() function, we will scale our data in this phase of the R data science project. This will be applied to the Amount column of creditcard_data.
Scaling is also known as feature standardization. With scale(), the values are centered on their mean and divided by their standard deviation, so the standardized column has a mean of 0 and a standard deviation of 1.
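As a result, the large raw values in Amount are brought onto a scale comparable to the other features, so they are less likely to prevent our model from working properly. We also drop the first column (Time) to form NewData. The way we’ll do this is as follows: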
head(creditcard_data)
creditcard_data$Amount=scale(creditcard_data$Amount)
NewData=creditcard_data[,-c(1)]
head(NewData)
          V1          V2        V3         V4          V5          V6
1 -1.3598071 -0.07278117 2.5363467  1.3781552 -0.33832077  0.46238778
2  1.1918571  0.26615071 0.1664801  0.4481541  0.06001765 -0.08236081
3 -1.3583541 -1.34016307 1.7732093  0.3797796 -0.50319813  1.80049938
4 -0.9662717 -0.18522601 1.7929933 -0.8632913 -0.01030888  1.24720317
5 -1.1582331  0.87773676 1.5487178  0.4030339 -0.40719338  0.09592146
6 -0.4259659  0.96052304 1.1411093 -0.1682521  0.42098688 -0.02972755
           V7          V8         V9         V10        V11         V12
1  0.23959855  0.09869790  0.3637870  0.09079417 -0.5515995 -0.61780086
2 -0.07880298  0.08510165 -0.2554251 -0.16697441  1.6127267  1.06523531
3  0.79146096  0.24767579 -1.5146543  0.20764287  0.6245015  0.06608369
4  0.23760894  0.37743587 -1.3870241 -0.05495192 -0.2264873  0.17822823
5  0.59294075 -0.27053268  0.8177393  0.75307443 -0.8228429  0.53819555
6  0.47620095  0.26031433 -0.5686714 -0.37140720  1.3412620  0.35989384
         V13        V14        V15        V16         V17         V18
1 -0.9913898 -0.3111694  1.4681770 -0.4704005  0.20797124  0.02579058
2  0.4890950 -0.1437723  0.6355581  0.4639170 -0.11480466 -0.18336127
3  0.7172927 -0.1659459  2.3458649 -2.8900832  1.10996938 -0.12135931
4  0.5077569 -0.2879237 -0.6314181 -1.0596472 -0.68409279  1.96577500
5  1.3458516 -1.1196698  0.1751211 -0.4514492 -0.23703324 -0.03819479
6 -0.3580907 -0.1371337  0.5176168  0.4017259 -0.05813282  0.06865315
          V19         V20          V21          V22         V23         V24
1  0.40399296  0.25141210 -0.018306778  0.277837576 -0.11047391  0.06692808
2 -0.14578304 -0.06908314 -0.225775248 -0.638671953  0.10128802 -0.33984648
3 -2.26185709  0.52497973  0.247998153  0.771679402  0.90941226 -0.68928096
4 -1.23262197 -0.20803778 -0.108300452  0.005273597 -0.19032052 -1.17557533
5  0.80348692  0.40854236 -0.009430697  0.798278495 -0.13745808  0.14126698
6 -0.03319379  0.08496767 -0.208253515 -0.559824796 -0.02639767 -0.37142658
         V25        V26          V27         V28      Amount Class
1  0.1285394 -0.1891148  0.133558377 -0.02105305  0.24496383     0
2  0.1671704  0.1258945 -0.008983099  0.01472417 -0.34247394     0
3 -0.3276418 -0.1390966 -0.055352794 -0.05975184  1.16068389     0
4  0.6473760 -0.2219288  0.062722849  0.06145763  0.14053401     0
5 -0.2060096  0.5022922  0.219422230  0.21515315 -0.07340321     0
6 -0.2327938  0.1059148  0.253844225  0.08108026 -0.33855582     0
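To make the effect of scale() concrete, here is a minimal sketch on the six Amount values shown by head(). It checks that scale() matches the manual formula (x - mean) / sd, and then reproduces the first scaled Amount from the article using the rounded column mean (88.35) and standard deviation (250.1201) printed earlier. The variable names are illustrative only.

```r
# The six raw Amount values from head(creditcard_data)
amount <- c(149.62, 2.69, 378.66, 123.50, 69.99, 3.67)

# Standardize by hand and with scale(); both use the sample's own mean and sd
manual <- (amount - mean(amount)) / sd(amount)
scaled <- as.numeric(scale(amount))
print(all.equal(manual, scaled))

# After standardization the values have mean 0 and standard deviation 1
print(c(mean = mean(scaled), sd = sd(scaled)))

# The article scaled the full column, so its output uses the full-column
# mean and sd. With the rounded values printed earlier, the first row gives:
first_row <- (149.62 - 88.35) / 250.1201
print(first_row)  # close to the 0.24496383 shown in head(NewData)
```

Note that the six-value sample is standardized with its own mean and standard deviation, so its scaled values differ from the article’s, which were computed over all 284807 rows.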
4. Data Modeling
With the Amount column standardized, we divide the dataset into a training set and a test set using a split ratio of 0.80.
As a result, 80% of our data will be assigned to train_data and 20% to test_data. The dim() function will then be used to check the dimensions of each set.
library(caTools)
set.seed(123)
data_sample = sample.split(NewData$Class,SplitRatio=0.80)
train_data = subset(NewData,data_sample==TRUE)
test_data = subset(NewData,data_sample==FALSE)
dim(train_data)
[1] 227846     30
dim(test_data)
[1] 56961    30
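sample.split() splits on the label vector while preserving the relative ratio of the classes, which matters with so few fraud cases. The base R sketch below illustrates the same idea on made-up data; stratified_split() and the toy data frame are hypothetical stand-ins, not caTools internals.

```r
# Toy data standing in for NewData: 950 legitimate rows and 50 fraud rows
set.seed(123)
toy <- data.frame(x = rnorm(1000), Class = c(rep(0, 950), rep(1, 50)))

# Hypothetical helper: draw split_ratio of the rows within each class
stratified_split <- function(y, split_ratio = 0.80) {
  in_train <- logical(length(y))
  for (cls in unique(y)) {
    idx <- which(y == cls)
    n_train <- round(split_ratio * length(idx))
    in_train[idx[sample.int(length(idx), n_train)]] <- TRUE
  }
  in_train
}

in_train  <- stratified_split(toy$Class, split_ratio = 0.80)
train_toy <- toy[in_train, ]
test_toy  <- toy[!in_train, ]

# Both parts keep the 5% fraud share of the toy data
print(dim(train_toy))
print(dim(test_toy))
print(c(train = mean(train_toy$Class), test = mean(test_toy$Class)))
```

Sampling within each class guarantees that the rare class appears in both parts in the same proportion, which a plain random split does not.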
5. Fitting Logistic Regression Model
In this phase of the project, we will fit our first model to detect credit card fraud. We’ll start with logistic regression.
Logistic regression is used to estimate the probability that an observation belongs to a class, such as pass/fail, positive/negative, or, in our case, fraud/not fraud.
We fit the model as follows. Note that the call below passes test_data to glm(), which is the fit the printed summary reflects; in practice the model would normally be fitted on train_data and only evaluated on test_data:
Logistic_Model=glm(Class~.,test_data,family=binomial())
summary(Logistic_Model)

Call:
glm(formula = Class ~ ., family = binomial(), data = test_data)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-4.9019  -0.0254  -0.0156  -0.0078   4.0877

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -12.52800   10.30537  -1.216   0.2241
V1           -0.17299    1.27381  -0.136   0.8920
V2            1.44512    4.23062   0.342   0.7327
V3            0.17897    0.24058   0.744   0.4569
V4            3.13593    7.17768   0.437   0.6622
V5            1.49014    3.80369   0.392   0.6952
V6           -0.12428    0.22202  -0.560   0.5756
V7            1.40903    4.22644   0.333   0.7388
V8           -0.35254    0.17462  -2.019   0.0435 *
V9            3.02176    8.67262   0.348   0.7275
V10          -2.89571    6.62383  -0.437   0.6620
V11          -0.09769    0.28270  -0.346   0.7297
V12           1.97992    6.56699   0.301   0.7630
V13          -0.71674    1.25649  -0.570   0.5684
V14           0.19316    3.28868   0.059   0.9532
V15           1.03868    2.89256   0.359   0.7195
V16          -2.98194    7.11391  -0.419   0.6751
V17          -1.81809    4.99764  -0.364   0.7160
V18           2.74772    8.13188   0.338   0.7354
V19          -1.63246    4.77228  -0.342   0.7323
V20          -0.69925    1.15114  -0.607   0.5436
V21          -0.45082    1.99182  -0.226   0.8209
V22          -1.40395    5.18980  -0.271   0.7868
V23           0.19026    0.61195   0.311   0.7559
V24          -0.12889    0.44701  -0.288   0.7731
V25          -0.57835    1.94988  -0.297   0.7668
V26           2.65938    9.34957   0.284   0.7761
V27          -0.45396    0.81502  -0.557   0.5775
V28          -0.06639    0.35730  -0.186   0.8526
Amount        0.22576    0.71892   0.314   0.7535
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1443.40  on 56960  degrees of freedom
Residual deviance:  378.59  on 56931  degrees of freedom
AIC: 438.59

Number of Fisher Scoring iterations: 17
After summarizing the model, we can visualize it with the following diagnostic plots:
plot(Logistic_Model)
We will draw the ROC curve in order to evaluate the effectiveness of our model.
ROC stands for Receiver Operating Characteristic.
To do this, we will first load the pROC package, then plot the ROC curve and assess the model’s performance.
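library(pROC)
# type = "response" returns predicted probabilities from a glm.
# Predict on test_data so the predictions line up with test_data$Class.
lr.predict <- predict(Logistic_Model, test_data, type = "response")
auc.lr = roc(test_data$Class, lr.predict, plot = TRUE, col = "blue")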
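To show what the ROC curve and its area (AUC) measure, here is a minimal base R sketch on made-up labels and scores. roc_points() and auc_rank() are hypothetical helpers written for this illustration; they are not part of pROC.

```r
# Toy labels (1 = fraud) and predicted scores; values are made up
labels <- c(0, 0, 0, 0, 1, 0, 1, 1)
scores <- c(0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.80, 0.90)

# True positive rate and false positive rate at each candidate threshold
roc_points <- function(labels, scores) {
  thresholds <- sort(unique(scores), decreasing = TRUE)
  tpr <- sapply(thresholds, function(t) mean(scores[labels == 1] >= t))
  fpr <- sapply(thresholds, function(t) mean(scores[labels == 0] >= t))
  data.frame(threshold = thresholds, fpr = fpr, tpr = tpr)
}

# AUC via the rank-sum formula: the probability that a randomly chosen
# positive gets a higher score than a randomly chosen negative
auc_rank <- function(labels, scores) {
  r <- rank(scores)
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

pts <- roc_points(labels, scores)
print(pts)
print(auc_rank(labels, scores))  # 14/15, about 0.933
```

In this toy set, 14 of the 15 positive/negative pairs are ranked correctly, which is exactly the AUC. An AUC of 0.5 corresponds to random ranking and 1 to perfect separation.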
6. Fitting a Decision Tree Model
In this part, we will implement a decision tree algorithm. Decision trees map out the outcomes of a sequence of decisions.
These outcomes allow us to determine which class an object belongs to.
Once the decision tree model has been fitted, we will plot it with the rpart.plot() function.
To build the tree, we will use recursive partitioning via the rpart package.
library(rpart)
library(rpart.plot)
decisionTree_model <- rpart(Class ~ . , creditcard_data, method = 'class')
predicted_val <- predict(decisionTree_model, creditcard_data, type = 'class')
probability <- predict(decisionTree_model, creditcard_data, type = 'prob')
rpart.plot(decisionTree_model)
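Recursive partitioning repeatedly picks the split that makes the resulting groups as pure as possible. The sketch below shows a single such step on one toy feature using Gini impurity; gini() and best_split() are hypothetical helpers for illustration and do not reproduce rpart’s internals.

```r
# Gini impurity of a 0/1 label vector: 0 when the node is pure
gini <- function(y) {
  if (length(y) == 0) return(0)
  p <- mean(y == 1)
  2 * p * (1 - p)
}

# Try every midpoint between consecutive distinct values of x and return
# the threshold with the lowest weighted impurity of the two child nodes
best_split <- function(x, y) {
  xs <- sort(unique(x))
  candidates <- (head(xs, -1) + tail(xs, -1)) / 2
  impurity <- sapply(candidates, function(t) {
    left  <- y[x < t]
    right <- y[x >= t]
    (length(left) * gini(left) + length(right) * gini(right)) / length(y)
  })
  list(threshold = candidates[which.min(impurity)], impurity = min(impurity))
}

# Toy feature where the classes separate cleanly between 3 and 4
x <- c(1, 2, 3, 4, 5, 6)
y <- c(0, 0, 0, 1, 1, 1)
split <- best_split(x, y)
print(split)  # threshold 3.5 with impurity 0
```

A full tree applies the same search to each child node in turn, over all features, until a stopping rule is met.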
7. Artificial Neural Network
An artificial neural network (ANN) is a type of machine learning algorithm modeled on the human nervous system.
ANN models can learn patterns from historical data and classify new input data. We import the neuralnet package, which lets us fit ANNs.
We then plot the fitted network with the plot() function. The outputs of the network range between 0 and 1.
We set a threshold of 0.5: values above it are mapped to 1, and values at or below it are mapped to 0.
We put this into practice as follows:
library(neuralnet)
ANN_model = neuralnet(Class~.,train_data,linear.output=FALSE)
plot(ANN_model)
predANN = compute(ANN_model,test_data)
resultANN = predANN$net.result
resultANN = ifelse(resultANN>0.5,1,0)
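Once the outputs are thresholded, a natural next step is to compare them with the true labels. The sketch below does this on made-up probabilities, using the same ifelse() rule; the vectors and names are illustrative and are not results from the article’s model.

```r
# Made-up network outputs and true labels (1 = fraud)
probs  <- c(0.02, 0.91, 0.47, 0.50, 0.73, 0.10)
actual <- c(0, 1, 1, 0, 1, 0)

# Same rule as above: strictly greater than 0.5 becomes 1, otherwise 0
predicted <- ifelse(probs > 0.5, 1, 0)

# Confusion matrix; explicit factor levels keep it 2 x 2 even if a class
# is never predicted
confusion <- table(actual    = factor(actual, levels = c(0, 1)),
                   predicted = factor(predicted, levels = c(0, 1)))
print(confusion)

# Precision and recall for the fraud class
precision <- confusion["1", "1"] / sum(confusion[, "1"])
recall    <- confusion["1", "1"] / sum(confusion["1", ])
print(c(precision = precision, recall = recall))
```

Here the value 0.50 is mapped to 0 because the comparison is strict, and the fraud case scored 0.47 is missed, giving a recall of 2/3 with a precision of 1. Lowering the threshold would trade precision for recall.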
8. Gradient Boosting (GBM)
Gradient boosting is a popular machine learning technique used for classification and regression tasks.
The model is an ensemble of weak learners, typically shallow decision trees. Combining these trees yields a strong gradient boosting model.
We fit the gradient boosting model as follows:
library(gbm, quietly=TRUE)
system.time(
  model_gbm <- gbm(Class ~ .
                   , distribution = "bernoulli"
                   , data = rbind(train_data, test_data)
                   , n.trees = 500
                   , interaction.depth = 3
                   , n.minobsinnode = 100
                   , shrinkage = 0.01
                   , bag.fraction = 0.5
                   , train.fraction = nrow(train_data) / (nrow(train_data) + nrow(test_data))
  )
)
# Determine best iteration based on test data
gbm.iter = gbm.perf(model_gbm, method = "test")
model.influence = relative.influence(model_gbm, n.trees = gbm.iter, sort. = TRUE)
plot(model_gbm)
gbm_test = predict(model_gbm, newdata = test_data, n.trees = gbm.iter)
gbm_auc = roc(test_data$Class, gbm_test, plot = TRUE, col = "red")
print(gbm_auc)
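The idea of combining many weak trees can be sketched in a few lines of base R. The example below is a deliberately simplified illustration: it boosts one-split stumps on a single toy feature with squared-error loss, whereas the gbm() call above uses the "bernoulli" distribution. fit_stump(), predict_stump() and boost() are hypothetical helpers, and n_trees and shrinkage only mirror the roles of gbm’s n.trees and shrinkage arguments.

```r
# Fit a one-split regression stump to residuals r on feature x
fit_stump <- function(x, r) {
  xs <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2
  sse <- sapply(cuts, function(t) {
    left  <- r[x < t]
    right <- r[x >= t]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  cut <- cuts[which.min(sse)]
  list(cut = cut, left = mean(r[x < cut]), right = mean(r[x >= cut]))
}

predict_stump <- function(stump, x) {
  ifelse(x < stump$cut, stump$left, stump$right)
}

# Each round fits a stump to the current residuals (the negative gradient
# of squared loss) and adds a shrunken copy of it to the prediction
boost <- function(x, y, n_trees = 50, shrinkage = 0.1) {
  pred <- rep(mean(y), length(y))
  mse <- numeric(n_trees)
  for (m in seq_len(n_trees)) {
    resid <- y - pred
    stump <- fit_stump(x, resid)
    pred <- pred + shrinkage * predict_stump(stump, x)
    mse[m] <- mean((y - pred)^2)
  }
  list(pred = pred, mse = mse)
}

# Toy data: the label switches from 0 to 1 halfway along x
x <- seq(0, 1, length.out = 40)
y <- as.numeric(x > 0.5)
fit <- boost(x, y)
print(fit$mse[c(1, 10, 50)])  # training error shrinks as trees are added
```

A small shrinkage means each tree contributes only a little, so more trees are needed, which is why the article pairs shrinkage = 0.01 with n.trees = 500 and then uses gbm.perf() to pick the best iteration.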
Summary
We learned how to create a credit card fraud detection model using machine learning as part of our R Data Science project.
To develop this model, we used a range of ML methods. We also plotted ROC curves to assess the logistic regression and gradient boosting models.
By analyzing and visualizing the data, we learned how to distinguish fraudulent transactions from legitimate ones.
I hope you liked the aforementioned R project. Use the comments section to share your knowledge and questions.
Hi. Is the dataset publicly available? We can’t access it with your statement read.csv(“D:/RStudio/creditcard.csv”)
Thank you.
https://drive.google.com/file/d/1CTAlmlREFRaEN3NoHHitewpqAtWS5cVQ/view