How to Interpolate Missing Values in R With Example
In today’s world, data comes from a variety of places, is collected through numerous streams, and is then evaluated using a variety of methodologies. Along the way, some values often end up missing.
In this article, we discuss missing values and how to fill them in using the zoo package.
To interpolate missing values in a data frame column in R, use the following basic syntax.
library(dplyr)
library(zoo)

df <- df %>% mutate(column_name = na.approx(column_name))
The example below demonstrates how to utilize this syntax in practice.
Example: Interpolate Missing Values in R
Let’s say we have the following data frame in R that shows a store’s total sales for 15 days in a row:
# create a data frame
df <- data.frame(day=1:15,
                 sales=c(2, 4, 9, 1, 10, 15, 2, NA, NA, 8, NA, 31, 32, 41, 45))
Now we can view the data frame:
df
   day sales
1    1     2
2    2     4
3    3     9
4    4     1
5    5    10
6    6    15
7    7     2
8    8    NA
9    9    NA
10  10     8
11  11    NA
12  12    31
13  13    32
14  14    41
15  15    45
Notice that the data frame is missing sales values for three days (days 8, 9, and 11).
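As a quick check, base R can count and locate the missing entries. The snippet below is a small sketch that rebuilds the same data frame so it runs on its own; the names n_missing and missing_days are chosen here only for illustration.

```r
# rebuild the example data frame so this snippet is self-contained
df <- data.frame(day=1:15,
                 sales=c(2, 4, 9, 1, 10, 15, 2, NA, NA, 8, NA, 31, 32, 41, 45))

# count the missing sales values
n_missing <- sum(is.na(df$sales))

# find the days on which sales are missing
missing_days <- df$day[is.na(df$sales)]

n_missing      # 3
missing_days   # 8 9 11
```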
Here’s what a basic line chart to show sales over time would look like:
# create a line chart to visualize sales
plot(df$sales, type='o', pch=16, col='red', xlab='Day', ylab='Sales')
Interpolate Missing Values in R
We can use the na.approx() function from the zoo package together with the mutate() function from the dplyr package to fill in the missing values.
library(dplyr)
library(zoo)

# interpolate missing values in the sales column
df <- df %>% mutate(sales = na.approx(sales))
Now we can view the updated data frame:
df
   day sales
1    1   2.0
2    2   4.0
3    3   9.0
4    4   1.0
5    5  10.0
6    6  15.0
7    7   2.0
8    8   4.0
9    9   6.0
10  10   8.0
11  11  19.5
12  12  31.0
13  13  32.0
14  14  41.0
15  15  45.0
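Notice that each missing value has been replaced with an interpolated one. The na.approx() function uses linear interpolation, so days 8 and 9 are filled with 4 and 6 (evenly spaced between 2 and 8), and day 11 is filled with 19.5 (midway between 8 and 31).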
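To see where the filled-in numbers come from, the same values can be reproduced with base R's approx() function, which performs linear interpolation between known points. This is only an illustrative sketch: it assumes the day number is the interpolation index (which matches the row position here), and the variable names are made up for the example.

```r
# known (non-missing) observations from the example
day   <- c(1:7, 10, 12:15)
sales <- c(2, 4, 9, 1, 10, 15, 2, 8, 31, 32, 41, 45)

# linearly interpolate sales at the three missing days
filled <- approx(x = day, y = sales, xout = c(8, 9, 11))$y
filled   # 4.0 6.0 19.5

# the same calculation by hand for day 8, which lies between
# day 7 (sales = 2) and day 10 (sales = 8)
by_hand_day8 <- 2 + (8 - 2) * (8 - 7) / (10 - 7)
by_hand_day8   # 4
```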
Here’s what it would look like if we made a new line chart to show the updated data frame:
# create a line chart to visualize sales
plot(df$sales, type='o', pch=16, col='green', xlab='Day', ylab='Sales')
Notice that the values chosen by the na.approx() function seem to fit the trend in the data quite well.
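One caveat: the example above has no missing values at the very start or end of the column. By default, na.approx() drops leading and trailing NA values, which makes the result shorter than the input and can cause mutate() to fail with a length error. The sketch below uses a hypothetical vector x to show the difference when na.rm = FALSE is passed.

```r
library(zoo)

# hypothetical vector with missing values at both ends
x <- c(NA, 2, NA, 6, NA)

# default behaviour: leading and trailing NAs are dropped,
# so the result has fewer elements than the input
dropped <- na.approx(x)
length(dropped)   # 3

# na.rm = FALSE keeps the original length; the interior NA is
# interpolated and the NAs at the ends are left in place
kept <- na.approx(x, na.rm = FALSE)
kept   # NA 2 4 6 NA
```

Inside mutate(), passing na.rm = FALSE keeps the column the same length as the data frame; the values at the ends stay missing and need to be handled separately.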