How to Calculate Jaccard Similarity in R

by finnstats

The Jaccard similarity index measures how similar two sets of data are. It ranges from 0 to 1: the higher the value, the more similar the two sets.

The Jaccard index is a statistic that is frequently used to compare the similarity of sets of binary variables. It is the size of the intersection divided by the size of the union of the two sets.

The following formula is used to calculate the Jaccard similarity index:

Jaccard Similarity = (number of observations in both sets) / (number in either set)

Or, written in notation form:

J(A, B) = |A∩B| / |A∪B|
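To make the formula concrete, here is a small sketch that evaluates it step by step with base R's intersect() and union(). The vectors A and B are made-up illustrative values, not data from this article.

```r
# Illustrative sets (assumed values for this sketch)
A <- c(1, 2, 3, 4)
B <- c(3, 4, 5)

shared   <- intersect(A, B)   # elements in both sets: 3 4
combined <- union(A, B)       # elements in either set: 1 2 3 4 5

# J(A, B) = |A ∩ B| / |A ∪ B| = 2 / 5
J <- length(shared) / length(combined)
J
# [1] 0.4
```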

This article will show you how to use R to calculate Jaccard Similarity between two sets of data.

Jaccard similarity in R

Assume that we have the following two sets of data.

a <- c(1,5,8,10,22,14,15,16,2,7)
b <- c(10,12,13,2,7,9,2,7,23,15)

To determine the Jaccard Similarity between the two sets, we can use the following function.


Define the Jaccard similarity function

jaccard <- function(a, b) {
    # intersect() and union() drop duplicates, so each value is counted once
    n_shared <- length(intersect(a, b))
    n_total <- length(union(a, b))
    return(n_shared / n_total)
}

Let’s find the Jaccard similarity between the two sets.

jaccard(a, b)

[1] 0.2857143

The Jaccard similarity between the two sets is about 0.29. Note that b contains the values 2 and 7 twice; because each distinct value is counted only once, the two sets share 4 of their 14 distinct values. As mentioned above, the greater the number, the more similar the two sets.

If the two sets share no values, the function returns 0. If the two sets are identical, the function returns 1.

Let’s look at two examples.

a <- c(1,5,8,10)
b <- c(11,6,12,13)
jaccard(a, b)

[1] 0

a <- c(1,5,8,10)
b <- c(1,5,8,10)
jaccard(a, b)

[1] 1

The function also works on sets containing strings.

a <- c('potato', 'tomotto', 'chips', 'baloon')
b <- c('car', 'chips', 'bird', 'salt')
jaccard(a, b)

[1] 0.1428571

You can also use this function to find the Jaccard distance between two sets, which is calculated as 1 - Jaccard similarity and measures how dissimilar the two sets are.

a <- c(1,5,8,10,22,14,15,16,2,7)
b <- c(10,12,13,2,7,9,2,7,23,15)
1 - jaccard(a, b)

[1] 0.7142857
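One situation the examples above do not cover is two empty inputs: the union then has size 0 and the division yields NaN. A minimal sketch of a guarded variant follows. The name jaccard_safe, the removal of NA values, and the choice to return 1 for two empty sets are assumptions made here for illustration (some analyses prefer 0 or NA instead), not part of the function defined earlier.

```r
# Hypothetical guarded variant of a Jaccard similarity function
jaccard_safe <- function(a, b) {
    # Drop missing values and duplicates so the inputs behave like sets
    a <- unique(a[!is.na(a)])
    b <- unique(b[!is.na(b)])

    n_total <- length(union(a, b))
    if (n_total == 0) {
        # Assumed convention: two empty sets are treated as identical
        return(1)
    }
    length(intersect(a, b)) / n_total
}

jaccard_safe(numeric(0), numeric(0))    # 1 under the assumed convention
jaccard_safe(c(1, 2, NA), c(2, 3, 3))   # 1 shared value out of 3 distinct: 1/3
```

Returning a fixed value for the empty case keeps downstream code free of NaN checks, but the right value depends on the analysis, so it is worth stating the convention wherever the function is used.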

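If you’re looking for a way to calculate a Jaccard dissimilarity matrix, the vegan package is a good place to start. Its vegdist() function can calculate many other dissimilarity measures as well. Keep in mind that vegdist() compares the rows of its input and returns dissimilarities (distances) rather than similarities. By default it treats the values as quantities, and the argument binary = TRUE switches to the presence/absence form. The output below therefore contains the pairwise Jaccard dissimilarities between the ten rows of df, not a single similarity between a and b treated as sets.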

install.packages("vegan")
library(vegan)
a <- c(1,5,8,10,22,14,15,16,2,7)
b <- c(10,12,13,2,7,9,2,7,23,15)
df <- data.frame(a, b)
vegdist(df, method = "jaccard")

           1         2         3         4         5         6         7         8         9
2  0.3529412
3  0.4761905 0.1904762
4  0.8500000 0.6818182 0.5652174
5  0.7500000 0.6470588 0.5714286 0.5862069
6  0.5833333 0.4615385 0.3703704 0.4782609 0.3225806
7  0.8800000 0.7407407 0.6428571 0.2941176 0.4137931 0.3333333
8  0.6923077 0.5714286 0.4827586 0.4782609 0.2068966 0.1600000 0.2608696
9  0.5600000 0.5000000 0.5161290 0.8787879 0.8000000 0.7027027 0.8947368 0.7692308
10 0.5000000 0.2272727 0.1304348 0.6400000 0.6216216 0.4482759 0.7000000 0.5483871 0.4333333
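If what you need is a set-based similarity matrix across several sets, one option is to encode each set as a 0/1 presence/absence row and work from there in base R. The sketch below is only an illustration: the list sets and its contents are made-up values, and the variable names are assumptions of this example rather than anything provided by vegan.

```r
# Hypothetical example sets (not the article's data)
sets <- list(
    x = c("apple", "pear", "plum"),
    y = c("pear", "plum", "fig", "kiwi"),
    z = c("fig")
)

# Presence/absence matrix: one row per set, one column per distinct item
items <- unique(unlist(sets))
presence <- t(sapply(sets, function(s) as.integer(items %in% s)))
colnames(presence) <- items

# |A ∩ B| for every pair is the cross-product of the 0/1 rows
shared <- presence %*% t(presence)

# |A ∪ B| = |A| + |B| - |A ∩ B|, valid here because each row has no duplicates
sizes <- rowSums(presence)
combined <- outer(sizes, sizes, "+") - shared

similarity <- shared / combined
round(similarity, 3)
# x vs y: 2/5 = 0.4, x vs z: 0, y vs z: 1/4 = 0.25, diagonal: 1
```

Subtracting the result from 1 (1 - similarity) gives the corresponding matrix of Jaccard distances.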
