Hypothesis Testing in R Programming
Hypothesis testing is a statistical method used to determine whether observed data provide enough evidence to reject a specific hypothesis.
In R programming, there are both parametric and non-parametric methods for hypothesis testing.
Parametric methods assume that the data follow a specific distribution, such as the normal distribution, while non-parametric methods do not assume a particular distributional form and typically work with the ranks of the observations instead.
In this tutorial, we will walk through three examples of hypothesis testing in R programming: a parametric one-sample t-test, a non-parametric Wilcoxon rank sum test, and a correlation test.
Example 1: Testing the Mean of a Normal Distribution (Parametric Method)
Let’s say we have a sample of 19 measurements from a manufacturing process that is supposed to have a mean of 100 units and a standard deviation of 5 units.
We want to test whether the true mean is actually 100 units or if it is different from that value.
To perform this test using R, we can use the t.test() function. This function covers one-sample, two-sample, and paired t-tests, depending on the arguments it receives.
In this case, we only have one sample, so we pass a single data vector and set the null hypothesis value with mu = 100, which gives a one-sample t-test. The alternative hypothesis is "two.sided" by default, meaning we are testing whether the mean is different from the hypothesized value in either direction.
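As a rough orientation, the sketch below shows how the same t.test() function handles the one-sample, two-sample, and paired cases depending on the arguments supplied. The vectors a and b are made-up values used only for illustration; they are not part of the example that follows.

```r
# Hypothetical measurements, invented purely for illustration
a <- c(5.1, 4.9, 5.6, 5.8, 6.0, 5.3)
b <- c(5.9, 6.1, 6.4, 5.7, 6.6, 6.2)

# One sample: a single vector plus the hypothesized mean (mu)
one_sample <- t.test(a, mu = 5, alternative = "two.sided")

# Two independent samples: two vectors (Welch's test by default)
two_sample <- t.test(a, b)

# Paired samples: two vectors of equal length, matched element by element
paired <- t.test(a, b, paired = TRUE)

# The 'method' element records which test was actually run
c(one_sample$method, two_sample$method, paired$method)
```

Checking the method element is a quick way to confirm that the call ran the test you intended.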
Here’s how you can perform this test in R:
# No extra packages are needed: mean(), sd() and t.test() are part of base R
# Create a vector with our sample data
data <- c(101, 98, 103, 99, 102, 104, 97, 100, 98, 102, 99, 104, 103, 98, 102, 102, 97, 103, 98)
mean_data <- mean(data)  # Mean of our sample data
mean_data
[1] 100.5263
sd_data <- sd(data)  # Standard deviation of our sample data
sd_data
[1] 2.435123
n <- length(data)  # Number of observations in our sample data
n
[1] 19
Next, we perform a one-sample t-test with the t.test() function to test whether the true mean differs from our hypothesized value of 100.
The alternative hypothesis is "two.sided" (the default) and the null hypothesis value is set with mu = 100.
The confidence level is 95% (also the default). The output includes a p-value and a confidence interval for the true mean.
If the p-value is less than our chosen significance level (usually 0.05), we reject the null hypothesis and conclude that the true mean differs significantly from the hypothesized value.
If not, we fail to reject the null hypothesis: the data do not provide sufficient evidence that the true mean differs from the hypothesized value. This is not the same as proving that the two are equal.
result <- t.test(data, mu = 100)  # One-sample t-test of H0: true mean = 100 (named result because T is shorthand for TRUE in R)
print(result)  # Print the test statistic, p-value, and confidence interval for the true mean

    One Sample t-test

data:  data
t = 0.94211, df = 18, p-value = 0.3586
alternative hypothesis: true mean is not equal to 100
95 percent confidence interval:
  99.35262 101.70001
sample estimates:
mean of x
 100.5263

Because the p-value (0.3586) is greater than 0.05, we fail to reject the null hypothesis. The 95% confidence interval (99.35 to 101.70) also contains the hypothesized value of 100.
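To see where these numbers come from, the following sketch recomputes the test statistic, degrees of freedom, p-value, and confidence interval by hand from the sample mean, standard deviation, and size. The names mu0, se, t_stat, and decision are chosen for this sketch only.

```r
# Sample data from Example 1
data <- c(101, 98, 103, 99, 102, 104, 97, 100, 98, 102,
          99, 104, 103, 98, 102, 102, 97, 103, 98)
mu0 <- 100                      # hypothesized mean

mean_data <- mean(data)
sd_data   <- sd(data)
n         <- length(data)

# Test statistic: distance of the sample mean from mu0 in standard-error units
se     <- sd_data / sqrt(n)
t_stat <- (mean_data - mu0) / se
df     <- n - 1

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

# 95% confidence interval for the true mean
ci <- mean_data + c(-1, 1) * qt(0.975, df = df) * se

# Decision at the 5% significance level
alpha    <- 0.05
decision <- if (p_value < alpha) "reject H0" else "fail to reject H0"

round(c(t = t_stat, df = df, p = p_value), 4)
ci
decision
```

These values should agree with the t.test() output (t = 0.94211, df = 18, p-value = 0.3586), which is a useful sanity check when learning what the function reports.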
Example 2: Testing Median of Two Independent Samples (Non-Parametric Method)
Let’s say we have two independent samples from two different manufacturing processes whose distributions have a similar shape and spread but may differ in location.
We want to test whether there is a significant difference in their medians using a non-parametric method called the Wilcoxon rank sum test (also known as the Mann-Whitney U test). Strictly speaking, the test checks for a location shift between the two distributions; when both distributions have the same shape, this amounts to a difference in medians.
This method does not assume that the data follow a normal distribution. Instead, it ranks the pooled observations from both samples and compares the rank sums of the two samples to determine whether they are statistically different.
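The ranking idea can be sketched in a few lines. The snippet below uses the two samples from this example and reproduces the W statistic by hand; the names pooled_ranks, n_x, and rank_sum_x are chosen for this sketch only.

```r
# Sample data used in this example
x <- c(1.83, 0.50, 1.62, 2.48, 1.68, 1.88, 1.55, 3.06, 1.30)
y <- c(0.878, 0.647, 0.598, 2.05, 1.06, 1.29, 1.06, 3.14, 1.29)

# Rank all observations together; tied values receive the average rank
pooled_ranks <- rank(c(x, y))

# Sum of the ranks that belong to the first sample
n_x        <- length(x)
rank_sum_x <- sum(pooled_ranks[seq_len(n_x)])

# W as reported by wilcox.test(): the rank sum of x minus its smallest possible value
W <- rank_sum_x - n_x * (n_x + 1) / 2
W
```

A W close to its minimum (0) or its maximum (the product of the two sample sizes) suggests that one sample tends to sit below or above the other.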
Here’s how you can perform this test in R:
# No extra packages are needed: wilcox.test() is part of base R
# Create vectors with our sample data for Process x and Process y.
x <- c(1.83, 0.50, 1.62, 2.48, 1.68, 1.88, 1.55, 3.06, 1.30)
y <- c(0.878, 0.647, 0.598, 2.05, 1.06, 1.29, 1.06, 3.14, 1.29)
# Perform the Wilcoxon rank sum test with the wilcox.test() function to compare Process x and Process y.
wilcox.test(x, y, paired = FALSE)

    Wilcoxon rank sum test with continuity correction

data:  x and y
W = 58, p-value = 0.1329
alternative hypothesis: true location shift is not equal to 0

# One-sided alternatives
wilcox.test(x, y, paired = FALSE, alternative = "greater")  # H1: x is shifted above y
wilcox.test(x, y, alternative = "less")  # H1: x is shifted below y
wilcox.test(x, y, alternative = "less", exact = FALSE, correct = FALSE)  # As the previous call, but using the normal approximation without continuity correction
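The output includes a p-value indicating whether there is a statistically significant difference in location between the two processes. Here the two-sided p-value is 0.1329, which is greater than 0.05, so we fail to reject the null hypothesis of no location shift.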
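To compare the alternatives side by side, the sketch below stores each test result and extracts its p-value. It passes exact = FALSE so that the normal approximation is requested explicitly; the names two_sided, greater, less, and p_values are chosen for this sketch only.

```r
# Sample data used in this example
x <- c(1.83, 0.50, 1.62, 2.48, 1.68, 1.88, 1.55, 3.06, 1.30)
y <- c(0.878, 0.647, 0.598, 2.05, 1.06, 1.29, 1.06, 3.14, 1.29)

# Run the test under each alternative hypothesis
two_sided <- wilcox.test(x, y, alternative = "two.sided", exact = FALSE)
greater   <- wilcox.test(x, y, alternative = "greater",   exact = FALSE)
less      <- wilcox.test(x, y, alternative = "less",      exact = FALSE)

# Collect the p-values in one named vector
p_values <- c(two_sided = two_sided$p.value,
              greater   = greater$p.value,
              less      = less$p.value)
round(p_values, 4)

# Decision at the 5% level for each alternative (TRUE means reject H0)
alpha <- 0.05
p_values < alpha
```

For these data the "greater" p-value should be half of the two-sided one, because W lies above its expected value under the null hypothesis, while the "less" p-value is large.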
Example 3: Correlation Test
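This test is used to determine whether there is an association between two paired samples, that is, whether the correlation between the two numeric vectors supplied in the function call differs from zero. By default, cor.test() uses Pearson's product-moment correlation.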
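Syntax: cor.test(x, y)
Parameters: x and y are numeric data vectors of the same length
Here, x and y are the vectors defined in Example 2.
cor.test(x, y)

    Pearson's product-moment correlation

data:  x and y
t = 4.231, df = 7, p-value = 0.003883
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.4205813 0.9673116
sample estimates:
      cor
0.8478765

The p-value (0.003883) is below 0.05, so we reject the null hypothesis of zero correlation; the estimated correlation is about 0.85.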
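As a cross-check, the sketch below derives the same quantities by hand: the t statistic from the sample correlation, and the confidence interval through Fisher's z transformation. The names r, t_stat, z, and ci are chosen for this sketch only.

```r
# Sample data from Example 2
x <- c(1.83, 0.50, 1.62, 2.48, 1.68, 1.88, 1.55, 3.06, 1.30)
y <- c(0.878, 0.647, 0.598, 2.05, 1.06, 1.29, 1.06, 3.14, 1.29)

n <- length(x)
r <- cor(x, y)                  # Pearson correlation coefficient

# Test statistic for H0: true correlation = 0
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
df      <- n - 2
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

# 95% confidence interval via Fisher's z transformation
z  <- atanh(r)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) / sqrt(n - 3))

round(c(r = r, t = t_stat, df = df, p = p_value), 4)
ci
```

The interval is computed on the transformed scale and mapped back with tanh(), which is why it is not symmetric around the estimated correlation.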