Convert Categorical Variable to Numeric in R

In this tutorial, you’ll learn how to convert categorical values into numeric values to make statistical modeling easier.

Most statistical models can’t take in strings as inputs. We’ll go through the conversion based on the airline data set that we reviewed in the previous post.

The “Reporting Airline” feature in the Airline dataset is a categorical variable with nine character types: “AA”, “AS”, “B6”, “DL”, “HP”, “PA (1)”, “TW”, “UA” or “VX”.

You’ll need to convert these variables into a numeric format for further investigation.

Convert Categorical Variable to Numeric in R

Let’s load the data set into the R console.
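The loading code is not reproduced here, so the following is only a stand-in: a tiny, made-up data frame with the two columns this tutorial uses. The object name `flights`, the `flight_id` column, and every row after the second are assumptions for illustration; only the first two rows (UA with a delay of 2, AS with a delay of -21) mirror values quoted later in the text.

```r
# Hypothetical stand-in for the airline data set (not the real file)
flights <- data.frame(
  flight_id = 1:4,                                # assumed row identifier
  Reporting_Airline = c("UA", "AS", "UA", "DL"),  # carrier codes as strings
  ArrDelay = c(2, -21, 15, 0),                    # arrival delay in minutes
  stringsAsFactors = FALSE
)

# The airline column is stored as character strings, not numbers
str(flights)

# Distinct carrier codes; each one will need its own numeric column
unique(flights$Reporting_Airline)
```

A character column like this is what most modeling functions cannot use directly.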


To work around the fact that models cannot take strings as inputs, add one new feature for each unique value in the original feature you want to encode.

Because the feature “Reporting Airline” includes nine values, you need to construct nine new features, such as “AA,” “AS,” “B6,” and so on.

When a value appears in the original feature, you set the new feature’s matching value to one; the remainder of the features are set to zero.

In the first row of the reporting airline example, the reporting airline is “UA”. As a result, the feature “UA” would be set to one, while the other features would be set to zero.

Similarly, the reporting airline value for the second row is “AS.” As a result, we set “AS” to one and all other features to zero.
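The rule described above can be sketched by hand in base R before reaching for any package. This is a minimal illustration on made-up data, not the tutorial's own code; the `flights` object and all rows after the first two are assumptions.

```r
# Hypothetical sample; the first two rows follow the UA and AS examples above
flights <- data.frame(
  Reporting_Airline = c("UA", "AS", "UA", "DL"),
  ArrDelay = c(2, -21, 15, 0),
  stringsAsFactors = FALSE
)

# One new feature per unique carrier code
carriers <- sort(unique(flights$Reporting_Airline))

# For each carrier, mark matching rows with 1 and all other rows with 0
one_hot <- sapply(carriers, function(carrier) {
  as.integer(flights$Reporting_Airline == carrier)
})

# Attach the new 0/1 columns to the original data
encoded <- cbind(flights, one_hot)
print(encoded)
```

Each row ends up with exactly one column set to one, which is where the name "one-hot" comes from.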

spread() Function

To convert categorical variables to dummy variables in the tidyverse, use the spread() function from the tidyr package.

library(tidyverse)  # loads dplyr (mutate, slice) and tidyr (spread)

data %>% mutate(dummy = 1) %>%
  spread(key = Reporting_Airline, value = dummy, fill = 0) %>% slice(1:5)

The spread() function takes three arguments here:

key, the column whose unique values become the names of the new columns, in this case “Reporting_Airline”;

value, the column whose values fill the new columns, in this case “dummy”;

and fill, the value used for cells that have no match (zero here), which would otherwise be left as NA.
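To see the three arguments at work without the full data set, here is a small sketch on made-up data. The `flights` object and its `flight_id` column are assumptions; the identifier column is there so that every row stays distinct after spreading.

```r
library(dplyr)
library(tidyr)

# Hypothetical sample; flight_id is an assumed row identifier
flights <- data.frame(
  flight_id = 1:4,
  Reporting_Airline = c("UA", "AS", "UA", "DL"),
  ArrDelay = c(2, -21, 15, 0),
  stringsAsFactors = FALSE
)

dummies <- flights %>%
  mutate(dummy = 1) %>%              # helper column holding the "hot" value
  spread(key = Reporting_Airline,    # carrier codes become column names
         value = dummy,              # cells take the value of dummy
         fill = 0)                   # non-matching cells become 0, not NA

print(dummies)
```

Note that tidyr's documentation marks spread() as superseded by pivot_wider(); spread() still works, but new code may prefer the newer function.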

Alternatively, you can assign flight delay, “ArrDelay,” values to each feature instead of the dummy numbers 0 or 1.

In the first row of the reporting airline example, the reporting airline is “UA,” and the arrival delay is 2 minutes. As a result, you can set the feature “UA” to 2 and the remaining features to NA.

Similarly, the reporting airline value for the second row is “AS,” and the arrival delay is -21 minutes. You can set the feature “AS” to -21 and the remaining features to NA.

You’ll also utilize the spread() method to accomplish this. Based on the previous example, the key argument can be set to the “Reporting Airline” column.

The value argument can point either to the 0/1 “dummy” column, as before, or to the “ArrDelay” column.

data %>% spread(key=Reporting_Airline,value=ArrDelay)

Because no parameter fill is declared in this example, the function will return NA for the missing data.

Remember that if you utilize the “Reporting Airline” and “ArrDelay” columns as inputs in the spread() method, the “Reporting Airline” and “ArrDelay” columns will be removed from the output by default.
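The behavior described above (NA for missing cells, and the key and value columns disappearing from the result) can be checked on a small made-up sample. The `flights` object and the `flight_id` column are assumptions for illustration. The second call uses pivot_wider(), the function tidyr recommends in place of spread(), and is shown only as an equivalent sketch.

```r
library(tidyr)

# Hypothetical sample; the first two rows follow the UA and AS examples above
flights <- data.frame(
  flight_id = 1:4,                   # assumed identifier keeping rows distinct
  Reporting_Airline = c("UA", "AS", "UA", "DL"),
  ArrDelay = c(2, -21, 15, 0),
  stringsAsFactors = FALSE
)

# No fill argument, so cells without a matching carrier stay NA
wide <- spread(flights, key = Reporting_Airline, value = ArrDelay)
print(wide)

# The key and value columns are gone; only flight_id and the carriers remain
names(wide)

# Equivalent call with the newer pivot_wider() interface
wide_pivot <- pivot_wider(flights,
                          names_from = Reporting_Airline,
                          values_from = ArrDelay)
print(wide_pivot)
```

If the original columns are still needed afterwards, copy them into differently named columns before spreading.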

In this tutorial, you learned a strategy for converting categorical values to numeric values. This technique is known as “one-hot encoding”.


1 Response

  1. Don MacQueen says:

    Looks like the goal, at least in the first example, is to create a design matrix.

    Since R was built from the beginning for statistical analysis, base R includes a function for this. Here’s an example.

    ## create a character variable
    > chv <- c("a", "b", "c", "d")
    > model.matrix(~chv-1)
      chva chvb chvc chvd
    1    1    0    0    0
    2    0    1    0    0
    3    0    0    1    0
    4    0    0    0    1
    attr(,"assign")
    [1] 1 1 1 1
    attr(,"contrasts")
    attr(,"contrasts")$chv
    [1] "contr.treatment"

    ## or for a model with an intercept
    > model.matrix(~chv)
      (Intercept) chvb chvc chvd
    1           1    0    0    0
    2           1    1    0    0
    3           1    0    1    0
    4           1    0    0    1
    attr(,"assign")
    [1] 0 1 1 1
    attr(,"contrasts")
    attr(,"contrasts")$chv
    [1] "contr.treatment"

    After using model.matrix() it would still be necessary to cbind() the design matrix to the original data. And rename the design matrix columns if desired.

    For the example airline data, the syntax would be just a little different,
    model.matrix( ~Reporting_Airline-1 , data=data )
    (untested, since I don’t have the example data handy)

    It’s certainly useful and informative to be able to construct a design matrix. But many if not most (or all?) R modeling functions construct the design matrix internally as needed, so normally the user need not construct it themselves.

    And just one side comment…the model.matrix() call with the airline data illustrates that “data” is a poor choice of name for a user object, since “data” is both a commonly used argument name within R modeling functions, and also a built-in R function:

    > find('data')
    [1] "package:utils"
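The model.matrix() approach suggested in the comment above can be sketched on made-up data as follows. This is only an illustration under assumptions: the `flights` object and all rows after the first two are invented, and the object is deliberately not called `data`, in line with the comment's advice.

```r
# Hypothetical sample standing in for the airline data
flights <- data.frame(
  Reporting_Airline = c("UA", "AS", "UA", "DL"),
  ArrDelay = c(2, -21, 15, 0),
  stringsAsFactors = FALSE
)

# "- 1" drops the intercept so every carrier gets its own 0/1 column
design <- model.matrix(~ Reporting_Airline - 1, data = flights)

# model.matrix() prefixes each column with the variable name; strip it
colnames(design) <- sub("^Reporting_Airline", "", colnames(design))

# Bind the design matrix back onto the original data
encoded <- cbind(flights, design)
print(encoded)
```

This needs no packages beyond base R, at the cost of the extra cbind() and renaming steps.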
