Kappa Coefficient Interpretation Guide
This article explains how to interpret the kappa coefficient, which measures inter-rater reliability (agreement).
In most applications, the magnitude of kappa matters more than its statistical significance.
Based on the value of Cohen's kappa, the following classifications have been proposed to interpret the strength of agreement (Altman 1999; Landis and Koch 1977).
| VALUE OF K | STRENGTH OF AGREEMENT |
|---|---|
| < 0 | Poor |
| 0.01 – 0.20 | Slight |
| 0.21 – 0.40 | Fair |
| 0.41 – 0.60 | Moderate |
| 0.61 – 0.80 | Substantial |
| 0.81 – 1.00 | Almost perfect |
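To make this scale concrete, the sketch below computes Cohen's kappa for two hypothetical raters and maps the result to the Landis and Koch labels above. It assumes scikit-learn is available; the `landis_koch_label` helper is our own, not part of any library.

```python
# A minimal sketch: Cohen's kappa for two hypothetical raters, labelled with
# the Landis & Koch (1977) strength-of-agreement scale shown above.
from sklearn.metrics import cohen_kappa_score

def landis_koch_label(kappa):
    """Map a kappa value to the Landis & Koch strength-of-agreement label."""
    if kappa < 0:
        return "Poor"
    if kappa <= 0.20:
        return "Slight"
    if kappa <= 0.40:
        return "Fair"
    if kappa <= 0.60:
        return "Moderate"
    if kappa <= 0.80:
        return "Substantial"
    return "Almost perfect"

# Hypothetical ratings from two raters on the same 12 subjects
rater_1 = ["yes", "yes", "no", "yes", "no", "no", "yes", "yes", "no", "yes", "no", "yes"]
rater_2 = ["yes", "no",  "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "no", "no"]

kappa = cohen_kappa_score(rater_1, rater_2)
print(f"kappa = {kappa:.3f}  ->  {landis_koch_label(kappa)}")
```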
This scheme, however, allows relatively low levels of agreement among raters to be classified as "substantial."
According to the table, agreement of 61 percent is deemed good, but depending on the field this can be problematic: almost 40 percent of the data in the dataset would be erroneous.
In healthcare research, this could lead to recommendations to change practice on the basis of faulty evidence.
For a clinical laboratory, having 40 percent of sample evaluations be incorrect would be a serious quality problem (McHugh 2012).
This is why many texts advocate an inter-rater agreement of 80 percent as the minimum acceptable level.
If the kappa is less than 0.60, there is insufficient agreement among the raters, and the study results should be viewed with caution.
According to Fleiss et al. (2003), for most purposes values greater than or equal to 0.75 may be interpreted as excellent agreement beyond chance, values less than or equal to 0.40 as poor agreement beyond chance, and values between 0.40 and 0.75 as fair to good agreement beyond chance.
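Expressed as code, the Fleiss et al. (2003) cutoffs amount to a simple three-way split (a minimal sketch; the function name `fleiss_label` is our own):

```python
def fleiss_label(kappa):
    """Interpret kappa using the Fleiss et al. (2003) cutoffs quoted above."""
    if kappa >= 0.75:
        return "Excellent agreement beyond chance"
    if kappa <= 0.40:
        return "Poor agreement beyond chance"
    return "Fair to good agreement beyond chance"

print(fleiss_label(0.68))  # Fair to good agreement beyond chance
```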
The table below, from McHugh (2012), suggests another logical interpretation of kappa.
| VALUE OF K | LEVEL OF AGREEMENT | % OF DATA THAT ARE DEPENDABLE |
|---|---|---|
| 0 – 0.20 | None | 0 – 4% |
| 0.21 – 0.39 | Minimal | 4 – 15% |
| 0.40 – 0.59 | Weak | 15 – 35% |
| 0.60 – 0.79 | Moderate | 35 – 63% |
| 0.80 – 0.90 | Strong | 64 – 81% |
| Above 0.90 | Almost Perfect | 82 – 100% |
The "% of data that are dependable" column in the table above corresponds to the squared kappa, a direct counterpart of the squared correlation coefficient.
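As a quick check of that statement, squaring the kappa value at the upper boundary of each row roughly reproduces the percentages in the table (an illustrative snippet, not taken from McHugh 2012):

```python
# Squaring the upper kappa boundary of each row gives approximately the upper
# end of the "% of data that are dependable" column in the table above.
for k in (0.20, 0.39, 0.59, 0.79, 0.90, 1.00):
    print(f"kappa = {k:.2f}  ->  kappa^2 ~ {k * k:.0%} of the data dependable")
```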