# How to arrange training and testing datasets in R

To split a data frame into training and testing sets for model building in R, use the createDataPartition() function from the caret package.

The basic syntax used by this function is as follows:

`createDataPartition(y, times = 1, p = 0.5, list = TRUE, …)`

where:

y: vector of outcomes

times: number of partitions to create

p: proportion of the data to use in the training set (a value between 0 and 1, for example 0.8 for 80%)

list: whether to return the results as a list (TRUE) or as a matrix (FALSE)
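To make the `times` and `list` arguments concrete, the following minimal sketch calls createDataPartition() on a made-up outcome vector and inspects the shape of what comes back. The vector `y`, the seed, and the names `idx_list` and `idx_mat` are assumptions introduced for illustration.

```r
library(caret)

set.seed(42)       # arbitrary seed, chosen only for illustration
y <- rnorm(100)    # hypothetical outcome vector with 100 values

# list = TRUE (the default): a list with one vector of row indices per partition
idx_list <- createDataPartition(y, times = 2, p = 0.8, list = TRUE)
length(idx_list)   # number of partitions requested via `times`

# list = FALSE: a matrix with one column of row indices per partition
idx_mat <- createDataPartition(y, times = 2, p = 0.8, list = FALSE)
dim(idx_mat)       # rows = indices per partition, columns = `times`
```

With list = FALSE and times = 1, the result is a single-column matrix of row indices that can be used directly to subset the rows of a data frame.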

The example below demonstrates how to use this function in practice.

## Example: How to arrange training and testing datasets in R

Assume we have a data frame in R with 1,000 rows containing information about students’ study hours and their final exam scores.

First, set a seed to make this example reproducible:

`set.seed(123)`

Then create the data frame:

```
df <- data.frame(hours=runif(1000, min=0, max=10),
                 score=runif(1000, min=40, max=100))
```

Now we can view the first few rows of the data frame:

`head(df)`
```
     hours    score
1 7.8355588 64.49499
2 0.7643654 77.62753
3 8.9624691 51.64345
4 0.1915280 85.47124
5 8.6345563 93.20182
6 8.9055675 44.02384
```

Assume we want to fit a simple linear regression model that predicts final exam score from hours studied. We want to train the model on 80% of the rows in the data frame and then test it on the remaining 20%.

The following code demonstrates how to split the data frame into training and testing sets using the caret package’s createDataPartition() function.

`library(caret)`

Get the row indices that will make up the training set:

`train_indices <- createDataPartition(df$score, times=1, p=.8, list=FALSE)`

Use these indices to create the training set:

`dftrain <- df[train_indices , ]`

Use the remaining rows to create the testing set:

`dftest  <- df[-train_indices, ]`

Let’s view the number of rows in each set:

```
nrow(dftrain)
[1] 800
```
```
nrow(dftest)
[1] 200
```

As we can see, our training dataset has 800 rows, which accounts for 80% of the original dataset.

Similarly, our test dataset has 200 rows, which represents 20% of the original dataset.
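A quick way to sanity-check a split like this is sketched below. It rebuilds the example from scratch, then confirms that the two sets do not overlap and together cover every row. The names `prop_train` and `overlap` are introduced here for illustration. For a numeric outcome, createDataPartition() samples within percentile-based groups of `y`, so the score distributions of the two sets should look broadly similar.

```r
library(caret)

set.seed(123)
df <- data.frame(hours = runif(1000, min = 0, max = 10),
                 score = runif(1000, min = 40, max = 100))

train_indices <- createDataPartition(df$score, times = 1, p = 0.8, list = FALSE)
dftrain <- df[train_indices, ]
dftest  <- df[-train_indices, ]

# Share of rows that ended up in the training set
prop_train <- nrow(dftrain) / nrow(df)
prop_train

# Row names present in both sets (there should be none)
overlap <- intersect(rownames(dftrain), rownames(dftest))
length(overlap)

# Compare the outcome distribution in each set
summary(dftrain$score)
summary(dftest$score)
```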

We can also view the first few rows of each set, starting with the training set:

`head(dftrain)`
```
     hours    score
1 8.966972 55.93220
2 2.655087 71.84853
3 3.721239 81.09165
4 5.728534 62.99700
5 9.082078 97.29928
7 8.983897 42.34600
```

Let’s view the first few rows of the testing set:

`head(dftest)`
```
      hours    score
6  2.016819 47.10139
12 2.059746 96.67170
18 7.176185 92.61150
23 2.121425 89.17611
24 6.516738 50.47970
25 1.255551 90.58483
```

We can then use the training set to train the regression model and the testing set to evaluate its performance.
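As a minimal sketch of that last step, the code below fits a simple linear regression with base R's lm() on the training set and computes the root mean squared error (RMSE) on the testing set. RMSE is one common choice of metric and is an assumption here, as are the names `model`, `pred`, and `rmse`.

```r
library(caret)

set.seed(123)
df <- data.frame(hours = runif(1000, min = 0, max = 10),
                 score = runif(1000, min = 40, max = 100))

train_indices <- createDataPartition(df$score, times = 1, p = 0.8, list = FALSE)
dftrain <- df[train_indices, ]
dftest  <- df[-train_indices, ]

# Fit the regression model on the training set only
model <- lm(score ~ hours, data = dftrain)

# Predict scores for the held-out testing set
pred <- predict(model, newdata = dftest)

# Root mean squared error on the testing set (assumed metric)
rmse <- sqrt(mean((dftest$score - pred)^2))
rmse
```

Because hours and score are generated independently in this simulated data, the fitted slope should be close to zero and the model is not expected to predict well. The sketch is only meant to show the train-then-test workflow.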

