Using R for exploratory data analysis (summarising data)
Hi Everyone, welcome to the first blogpost of my data science journey. In this post I am going to show you how to quickly and easily implement a variety of common exploratory data analyses using R Statistical software. Such analyses are commonly used for descriptive studies.
For this blogpost, I assume that you have some basic understanding of the R programming language. If not, no worries: there is a plethora of resources on R programming on the internet. One of the brilliant resources to get you up and running with R and RStudio is the R-Ladies Sydney webpage.
For this post I will be showing you how to obtain basic frequency data, mean, median, mode, range, interquartile range, variance and a test for normality.
There are many ways to obtain such information. You do not need to install any packages to perform many statistical analyses: base R has built-in functions for most of them. Still, for convenience and for its many other uses, it is worth installing the tidyverse package.
## # A tibble: 6 x 14
## ID Age Childnumber childid Gender AutoreractorSE `Myopia Group` SE_P_AVE
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 A000… 10 3 2 1 1.12 0 -0.688
## 2 A000… 7 3 3 1 1.75 0 -0.688
## 3 A000… 11 2 1 1 1.68 0 0.125
## 4 A000… 7 2 2 1 3.30 0 0.125
## 5 A000… 12 3 1 2 1.18 0 -0.0625
## 6 A000… 10 3 2 2 0.75 0 -0.0625
## # … with 6 more variables: NEARWORKtime <dbl>, OUTDOORtime <dbl>,
## # GROUP_NEAR <dbl>, GROUP_OUT <dbl>, EDUM_P_NEW <dbl>, EDUF_P_NEW <dbl>
Here, I have loaded a dataset that I obtained from this paper. The paper examines the relationship between outdoor activities, near work and myopia. The dataset is stored as d1, and its first six rows are printed above. The easiest way to get descriptive statistics for a dataset is to call the function summary(d1).
Let’s see what happens when I type summary(d1):
summary(d1)
## ID Age Childnumber childid
## Length:574 Min. : 6.00 Min. :1.000 Min. :1.000
## Class :character 1st Qu.: 9.00 1st Qu.:1.000 1st Qu.:1.000
## Mode :character Median :11.00 Median :2.000 Median :1.000
## Mean :10.63 Mean :1.962 Mean :1.481
## 3rd Qu.:12.00 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :18.00 Max. :4.000 Max. :4.000
## Gender AutoreractorSE Myopia Group SE_P_AVE
## Min. :1.000 Min. :-6.055000 Min. :0.0000 Min. :-5.5312
## 1st Qu.:1.000 1st Qu.:-0.433750 1st Qu.:0.0000 1st Qu.:-0.7812
## Median :1.000 Median : 0.305000 Median :0.0000 Median :-0.3750
## Mean :1.449 Mean : 0.001577 Mean :0.2422 Mean :-0.5259
## 3rd Qu.:2.000 3rd Qu.: 0.750000 3rd Qu.:0.0000 3rd Qu.:-0.0625
## Max. :2.000 Max. : 3.305000 Max. :1.0000 Max. : 1.0312
## NEARWORKtime OUTDOORtime GROUP_NEAR GROUP_OUT
## Min. : 2.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.: 3.429 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000
## Median : 4.429 Median :2.429 Median :2.000 Median :2.000
## Mean : 4.751 Mean :2.936 Mean :2.012 Mean :2.002
## 3rd Qu.: 5.500 3rd Qu.:3.714 3rd Qu.:3.000 3rd Qu.:3.000
## Max. :14.429 Max. :9.643 Max. :3.000 Max. :3.000
## EDUM_P_NEW EDUF_P_NEW
## Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:2.000
## Median :2.000 Median :3.000
## Mean :2.237 Mean :2.709
## 3rd Qu.:3.000 3rd Qu.:3.000
## Max. :4.000 Max. :4.000
Now you can see the mean, median, range (minimum and maximum) and quartiles of the variables. Remember that summary() reports these values for every numeric column, including categorical variables that are stored as numeric codes, so it is your job to decide which statistics are meaningful for which variable. For instance, the mean of Gender doesn’t make sense, does it?
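For a categorical variable such as Gender, frequency counts are more useful than a mean. Here is a minimal sketch using base R's table() and prop.table(). The vector gender_code and the labels level_1 and level_2 are made up for illustration; I am not assuming which code stands for which gender in d1.

```r
# Hypothetical stand-in for a categorical column coded as 1/2
gender_code <- c(1, 2, 1, 1, 2, 1, 2, 1)

# Turn the numeric codes into a factor so R treats them as categories
# (the labels are placeholders, not the real meaning of the codes)
gender <- factor(gender_code, levels = c(1, 2),
                 labels = c("level_1", "level_2"))

# Absolute frequencies per category
counts <- table(gender)
print(counts)

# Relative frequencies (proportions that sum to 1)
props <- prop.table(counts)
print(round(props, 3))

# summary() on a factor also reports counts instead of a mean
print(summary(gender))
```

With the real data, the same idea would be table(d1$Gender).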
A more detailed descriptive analysis of a dataset can be obtained with the psych package. I install it once by typing the command install.packages("psych") and then load it by calling library(psych).
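If you rerun the same script often, one way to avoid reinstalling a package every time is to check whether it is already available. The sketch below is just one possible pattern; ensure_package is a hypothetical helper name of my own, not part of R or psych. I demonstrate it with the stats package, which ships with R, so that the example runs without downloading anything.

```r
# Hypothetical helper: install a package only if it is not yet installed
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report (invisibly) whether the package can now be loaded
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" comes with every R installation, so nothing is downloaded here
ok <- ensure_package("stats")
print(ok)
```

For this post you would call ensure_package("psych") and then library(psych).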
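To get the descriptive statistics, I type the command describe(d1); remember that d1 is my dataset containing variables such as age and gender. Note that describe() is meant for numeric columns: ID is a character column, so its row below shows NA, NaN and Inf instead of meaningful values. As far as I know, missing values in numeric columns are dropped by default (na.rm = TRUE), so it is worth checking the n column if your data has gaps.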
library(psych)
describe(d1)
## vars n mean sd median trimmed mad min max range skew
## ID* 1 574 NaN NA NA NaN NA Inf -Inf -Inf NA
## Age 2 574 10.63 2.47 11.00 10.55 2.97 6.00 18.00 12.00 0.33
## Childnumber 3 574 1.96 0.78 2.00 1.93 1.48 1.00 4.00 3.00 0.33
## childid 4 574 1.48 0.66 1.00 1.37 0.00 1.00 4.00 3.00 1.15
## Gender 5 574 1.45 0.50 1.00 1.44 0.00 1.00 2.00 1.00 0.20
## AutoreractorSE 6 574 0.00 1.24 0.30 0.14 0.82 -6.05 3.31 9.36 -1.33
## Myopia Group 7 574 0.24 0.43 0.00 0.18 0.00 0.00 1.00 1.00 1.20
## SE_P_AVE 8 574 -0.53 0.76 -0.38 -0.43 0.51 -5.53 1.03 6.56 -1.79
## NEARWORKtime 9 574 4.75 1.62 4.43 4.54 1.48 2.00 14.43 12.43 1.49
## OUTDOORtime 10 574 2.94 1.40 2.43 2.81 1.06 1.00 9.64 8.64 1.06
## GROUP_NEAR 11 574 2.01 0.81 2.00 2.02 1.48 1.00 3.00 2.00 -0.02
## GROUP_OUT 12 574 2.00 0.82 2.00 2.00 1.48 1.00 3.00 2.00 0.00
## EDUM_P_NEW 13 574 2.24 0.74 2.00 2.26 1.48 1.00 4.00 3.00 0.04
## EDUF_P_NEW 14 574 2.71 0.65 3.00 2.72 0.00 1.00 4.00 3.00 -0.41
## kurtosis se
## ID* NA NA
## Age -0.20 0.10
## Childnumber -0.60 0.03
## childid 0.55 0.03
## Gender -1.96 0.02
## AutoreractorSE 2.83 0.05
## Myopia Group -0.56 0.02
## SE_P_AVE 5.90 0.03
## NEARWORKtime 3.70 0.07
## OUTDOORtime 1.31 0.06
## GROUP_NEAR -1.48 0.03
## GROUP_OUT -1.52 0.03
## EDUM_P_NEW -0.45 0.03
## EDUF_P_NEW 0.26 0.03
So, compared with the summary() command, the describe() command additionally gives us sd (standard deviation), trimmed (trimmed mean), mad (median absolute deviation), range, skew (skewness), kurtosis and se (standard error of the mean).
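To make these columns less of a black box, here is a rough sketch of how several of them can be reproduced in base R for a single vector. The vector x is toy data, and the 10% trim and the scaled mad() are my understanding of describe()'s defaults rather than something shown in the output above. I have also added shapiro.test() from base R as one common formal check for normality, since skewness and kurtosis only hint at the shape of the distribution.

```r
# Toy data standing in for one numeric column
x <- c(6, 7, 9, 10, 10, 11, 11, 12, 12, 18)
n <- length(x)

stats <- c(
  mean    = mean(x),
  sd      = sd(x),                # standard deviation
  median  = median(x),
  trimmed = mean(x, trim = 0.1),  # mean after dropping 10% from each tail
  mad     = mad(x),               # median absolute deviation (scaled by 1.4826)
  min     = min(x),
  max     = max(x),
  range   = max(x) - min(x),
  se      = sd(x) / sqrt(n)       # standard error of the mean
)
print(round(stats, 2))

# Shapiro-Wilk test: a small p-value suggests departure from normality
sw <- shapiro.test(x)
print(sw$p.value)
```

With the real data you would replace x with a column such as d1$Age.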
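There are also base R functions such as mean(), median() and sd(), but each of them works on one vector (for example, a single column) at a time, not on the whole dataset at once. Be careful with mode(): in R it returns the storage mode of an object (such as "numeric"), not the most frequent value.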
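One way around the one-vector-at-a-time limitation is to loop over the numeric columns with sapply(). The sketch below uses a small made-up data frame called toy (not the real d1) and also includes stat_mode, a hypothetical helper of my own for the most frequent value, because base R's mode() reports the storage type instead.

```r
# Small made-up data frame; column names loosely mimic d1
toy <- data.frame(
  ID = c("a", "b", "c", "d", "e"),
  Age = c(7, 10, 10, 11, 12),
  OUTDOORtime = c(1.0, 2.5, 2.0, 3.5, 2.5),
  stringsAsFactors = FALSE
)

# Keep only the numeric columns (drops the character ID column)
numeric_cols <- toy[sapply(toy, is.numeric)]

# Apply several base R statistics to every numeric column at once
col_stats <- sapply(numeric_cols, function(x) {
  c(mean = mean(x), median = median(x), sd = sd(x),
    var = var(x), IQR = IQR(x))
})
print(round(col_stats, 2))

# Hypothetical helper: most frequent value(s), returned as character
stat_mode <- function(x) {
  counts <- table(x)
  names(counts)[as.vector(counts) == max(counts)]
}
print(stat_mode(toy$Age))

# Base R's mode() reports the storage mode, not the statistical mode
print(mode(toy$Age))
```

The result col_stats is a matrix with one row per statistic and one column per variable, which is handy for a quick side-by-side comparison.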
OK, to wrap it up, in this tutorial we learned how to obtain the mean, median, standard deviation, standard error, quartiles, range, kurtosis and skewness of the variables in a dataset.
Take home messages are:
- Remember to use install.packages() just once, and library() every time you start R and need to load a specific package.
- summary(dataset) is a base R command, so you do not need to install anything to obtain descriptive statistics.
- If you want to obtain more detailed summary statistics, please install the psych package and use the describe(dataset) command.
If you would like to learn more about different methods of summarising data, see this link: My favourite R package for: summarising data by Adam Medcalf.
Good luck exploring your data. Feedback welcome.