Detecting and Dealing with Outliers: First Step

We’re going to look a little more closely at the mammalsleep data.

Let’s start with a summary of it.

library(mice)
summary(mammalsleep)
 species         bw                brw         
 African elephant         : 1   Min.   :   0.005   Min.   :   0.14  
 African giant pouched rat: 1   1st Qu.:   0.600   1st Qu.:   4.25  
 Arctic Fox               : 1   Median :   3.342   Median :  17.25  
 Arctic ground squirrel   : 1   Mean   : 198.790   Mean   : 283.13  
 Asian elephant           : 1   3rd Qu.:  48.202   3rd Qu.: 166.00  
 Baboon                   : 1   Max.   :6654.000   Max.   :5712.00  
 (Other)                  :56                   
                    
      sws               ps              ts             mls         
 Min.   : 2.100   Min.   :0.000   Min.   : 2.60   Min.   :  2.000  
 1st Qu.: 6.250   1st Qu.:0.900   1st Qu.: 8.05   1st Qu.:  6.625  
 Median : 8.350   Median :1.800   Median :10.45   Median : 15.100  
 Mean   : 8.673   Mean   :1.972   Mean   :10.53   Mean   : 19.878  
 3rd Qu.:11.000   3rd Qu.:2.550   3rd Qu.:13.20   3rd Qu.: 27.750  
 Max.   :17.900   Max.   :6.600   Max.   :19.90   Max.   :100.000  
 NA's   :14       NA's   :12      NA's   :4       NA's   :4       
 
       gt               pi             sei             odi       
 Min.   : 12.00   Min.   :1.000   Min.   :1.000   Min.   :1.000  
 1st Qu.: 35.75   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.000  
 Median : 79.00   Median :3.000   Median :2.000   Median :2.000  
 Mean   :142.35   Mean   :2.871   Mean   :2.419   Mean   :2.613  
 3rd Qu.:207.50   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.000  
 Max.   :645.00   Max.   :5.000   Max.   :5.000   Max.   :5.000  
 NA's   :4   

When you pass a data frame to the summary command, it summarises each column independently and picks a summary that suits the column's type: counts for factors, and the quartiles, mean, and number of NA's for numeric columns.


Here we have species, which is the first variable (it’s not the row names, it’s the first variable), and the summary indicates that you have one of each. Another 56 species aren’t listed individually; they are grouped under (Other) because summary only prints the first few levels of a factor.

Nothing is wrong with those rows. They’re all labeled with the correct species in the data.

The first numeric column we notice is bw, body weight (also called body mass), and we can see that the range, from a minimum of 0.005 to a maximum of 6654, is really wide.

Detecting and Dealing with Outliers

So, this is an occasion to point out that you should be on the lookout for values that are significantly different from others, which we refer to as outliers.

So, for example, we can go in and ask, “Which row holds the maximum of mammalsleep$bw?”

which.max(mammalsleep$bw)

It returns 1, meaning that out of all the weights, the largest sits in row 1. Indexing the data frame with that row number shows us the real culprit.
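A minimal sketch of that lookup, assuming the mice package is installed; it mirrors the which.min call used for the minimum in this post, and the variable name heaviest is only illustrative:

```r
library(mice)  # provides the mammalsleep data frame

# Row index of the largest body weight (NA values are skipped)
heaviest <- which.max(mammalsleep$bw)

# Pull the full record for that row to see which animal it is
mammalsleep[heaviest, ]
```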

This tells us that the African elephant is the largest, and these are the values we have for it. Now let’s look at the minimum.


mammalsleep[which.min(mammalsleep$bw),]
species
32 Lesser short-tailed shrew
      bw  brw sws  ps  ts mls
32 0.005 0.14 7.7 1.4 9.1 2.6
     gt pi sei odi
32 21.5  5   2   4

Both of these extremes, the African elephant at the top and the lesser short-tailed shrew at the bottom, are reasonable.

That is, they are not errors. Sometimes, though, you ask for the maximum and the value you get back is, for example, 9,999.

A value like that doesn’t make sense when everything else in the column does. If we discovered a possum with a weight of 9,999, it would be bigger than the elephant.

To us, that would be an encoding of a missing value, not a valid measurement.

So, before doing any imputation, I’d have to replace that numeric code with a genuine missing value (NA).
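Here is a small sketch of that substitution. The animals data frame and the use of 9999 as a missing-value code are hypothetical, made up for illustration; nothing like it appears in mammalsleep.

```r
# Hypothetical toy data: 9999 is assumed to be a code for "missing"
animals <- data.frame(
  species = c("African elephant", "Possum", "Lesser short-tailed shrew"),
  bw      = c(6654, 9999, 0.005)
)

# Before the fix, the sentinel masquerades as the largest animal
which.max(animals$bw)

# Replace the sentinel with a genuine missing value
animals$bw[which(animals$bw == 9999)] <- NA

# which.max skips NA, so the elephant is the maximum again
animals[which.max(animals$bw), ]
```

Once the code is an NA, functions such as summary report it as missing instead of folding it into the mean and maximum.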

As a result, documenting and looking for outliers is useful as a data double-check.

Many problems come from typos, values that were merged together, and missing columns. Any of these can make your data produce absurd results, and it is your job to correct them.

The computer can assist you with this, but you have to stay involved in every part of the process.

So, taking a look at the maximum and minimum is a good way to start.

So here’s what we do: document everything. Write down which value you found to be the largest and which the smallest, both for yourself and for the people who will read your report.

That way everyone knows what the extreme points are and can check them against what they already know.
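One possible way to produce such notes is to tabulate the extremes of every numeric column together with the species they belong to. This is only a sketch, assuming the mice package is installed; the names num_cols and extremes are illustrative.

```r
library(mice)  # provides the mammalsleep data frame

# Names of the numeric columns (everything except species)
num_cols <- names(mammalsleep)[sapply(mammalsleep, is.numeric)]

# One row per variable: smallest and largest value, and who holds them
extremes <- do.call(rbind, lapply(num_cols, function(col) {
  x <- mammalsleep[[col]]
  data.frame(
    variable    = col,
    min         = min(x, na.rm = TRUE),
    min_species = as.character(mammalsleep$species[which.min(x)]),
    max         = max(x, na.rm = TRUE),
    max_species = as.character(mammalsleep$species[which.max(x)])
  )
}))

print(extremes)
```

When several animals tie for an extreme, which.min and which.max report only the first one, so treat the species columns as a pointer for further checking and not as a complete list.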
