Chapter 2 Data preparation

##Data

We can read CSV data into R using the read.csv function from base R or the read_csv function from the readr package. For now, however, we will use data that already exist in the datasets package.
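As a minimal sketch of the two CSV readers mentioned above: the file here is a hypothetical example written to a temporary location (it is not part of this chapter's data), and the readr branch only runs if readr is installed.

```r
# Write a small, hypothetical CSV file to a temporary location
csv_path <- tempfile(fileext = ".csv")
example_df <- data.frame(weight = c(42, 51, 59), Time = c(0, 2, 4))
write.csv(example_df, csv_path, row.names = FALSE)

# Base R: read.csv() returns a data.frame
chicks_base <- read.csv(csv_path)
str(chicks_base)

# readr: read_csv() returns a tibble (assumes readr >= 2.0 is installed)
if (requireNamespace("readr", quietly = TRUE)) {
  chicks_readr <- readr::read_csv(csv_path, show_col_types = FALSE)
  print(chicks_readr)
}
```

Both functions take the file path as their first argument; the main practical difference is the class of the object they return.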

Let’s load the library ‘datasets’
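A possible setup for this chapter is sketched below. The datasets package ships with R and is normally attached by default, so the first call is mostly for explicitness. Loading dplyr and tidyr here is an assumption based on the functions used later in the chapter (the pipe, count, group_by, mutate, pivot_longer, pivot_wider), and it presumes both packages are already installed.

```r
library(datasets)  # built-in example data such as ChickWeight
library(dplyr)     # %>%, count(), group_by(), mutate()
library(tidyr)     # pivot_longer(), pivot_wider(), and the relig_income data
```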

Let’s look at the ChickWeight data

head(ChickWeight)
##   weight Time Chick Diet
## 1     42    0     1    1
## 2     51    2     1    1
## 3     59    4     1    1
## 4     64    6     1    1
## 5     76    8     1    1
## 6     93   10     1    1
str(ChickWeight)
## Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame':   578 obs. of  4 variables:
##  $ weight: num  42 51 59 64 76 93 106 125 149 171 ...
##  $ Time  : num  0 2 4 6 8 10 12 14 16 18 ...
##  $ Chick : Ord.factor w/ 50 levels "18"<"16"<"15"<..: 15 15 15 15 15 15 15 15 15 15 ...
##  $ Diet  : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "formula")=Class 'formula'  language weight ~ Time | Chick
##   .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
##  - attr(*, "outer")=Class 'formula'  language ~Diet
##   .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
##  - attr(*, "labels")=List of 2
##   ..$ x: chr "Time"
##   ..$ y: chr "Body weight"
##  - attr(*, "units")=List of 2
##   ..$ x: chr "(days)"
##   ..$ y: chr "(gm)"

##Data manipulation with dplyr and tidyverse

Let us create a table that counts the number of observations (rows) recorded for each type of diet

ChickWeight2 <- ChickWeight %>%
  count(Diet)  # count() returns one row per diet with the number of rows in column n
   
head(ChickWeight2)
##   Diet   n
## 1    1 220
## 2    2 120
## 3    3 120
## 4    4 118
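Note that count(Diet) counts rows, and each chick is weighed at several time points, so n is the number of measurements rather than the number of chicks. To count distinct chicks per diet instead, one option is n_distinct(); this is a minimal sketch, assuming dplyr is installed, and the name chicks_per_diet is only illustrative.

```r
library(dplyr)

chicks_per_diet <- ChickWeight %>%
  group_by(Diet) %>%
  summarise(n_chicks = n_distinct(Chick))  # distinct chick IDs within each diet

chicks_per_diet
```

The n_chicks column should add up to the 50 chick levels reported by str(ChickWeight).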

Let’s look at the mean chick weight in each diet category. Because mutate keeps every row, the group mean is repeated on each row of that diet.

ChickWeight3 <- ChickWeight %>%
  group_by(Diet) %>%
  mutate(mean.wt = mean(weight))  # mutate() creates new columns and keeps all rows
   
head(ChickWeight3)
## # A tibble: 6 x 5
## # Groups:   Diet [1]
##   weight  Time Chick Diet  mean.wt
##    <dbl> <dbl> <ord> <fct>   <dbl>
## 1     42     0 1     1        103.
## 2     51     2 1     1        103.
## 3     59     4 1     1        103.
## 4     64     6 1     1        103.
## 5     76     8 1     1        103.
## 6     93    10 1     1        103.
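If a compact table with one row per diet is preferred over a repeated column, summarise() can replace mutate(). The following is a sketch of that alternative, assuming dplyr is installed; the name diet_means is hypothetical.

```r
library(dplyr)

diet_means <- ChickWeight %>%
  group_by(Diet) %>%
  summarise(mean.wt = mean(weight))  # collapses each diet group to a single row

diet_means
```

Use mutate() when the group mean is needed alongside the individual observations (for example, to compute deviations from it) and summarise() when only the per-group summary is needed.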

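Next, let’s reshape data between wide and long form. ggplot often likes the data to be in long form, that is, with values stacked in rows rather than spread across columns. For example, ChickWeight is already in long form: Diet is stored in a single column, with one value per row.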

Let’s look at some data in wide form, the relig_income data that comes with the tidyr package

head(relig_income)
## # A tibble: 6 x 11
##   religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
##   <chr>      <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>      <dbl>
## 1 Agnostic      27        34        60        81        76       137        122
## 2 Atheist       12        27        37        52        35        70         73
## 3 Buddhist      27        21        30        34        33        58         62
## 4 Catholic     418       617       732       670       638      1116        949
## 5 Don’t k…      15        14        15        11        10        35         21
## 6 Evangel…     575       869      1064       982       881      1486        949
## # … with 3 more variables: `$100-150k` <dbl>, `>150k` <dbl>, `Don't
## #   know/refused` <dbl>

Let’s reshape the data so that the income brackets are stacked in rows

relig_income2 <- relig_income %>%
  pivot_longer(!religion, names_to = "income", values_to = "count")
head(relig_income2)
## # A tibble: 6 x 3
##   religion income  count
##   <chr>    <chr>   <dbl>
## 1 Agnostic <$10k      27
## 2 Agnostic $10-20k    34
## 3 Agnostic $20-30k    60
## 4 Agnostic $30-40k    81
## 5 Agnostic $40-50k    76
## 6 Agnostic $50-75k   137

Now let’s convert the data back to wide form

relig_income.back <- relig_income2 %>%
  pivot_wider(names_from = "income", values_from = "count")
head(relig_income.back)
## # A tibble: 6 x 11
##   religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
##   <chr>      <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>      <dbl>
## 1 Agnostic      27        34        60        81        76       137        122
## 2 Atheist       12        27        37        52        35        70         73
## 3 Buddhist      27        21        30        34        33        58         62
## 4 Catholic     418       617       732       670       638      1116        949
## 5 Don’t k…      15        14        15        11        10        35         21
## 6 Evangel…     575       869      1064       982       881      1486        949
## # … with 3 more variables: `$100-150k` <dbl>, `>150k` <dbl>, `Don't
## #   know/refused` <dbl>
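One way to check that the long and wide reshapes undo each other is to compare dimensions and contents. This is a sketch, assuming dplyr and tidyr are installed; the names long and wide_again are illustrative only.

```r
library(dplyr)
library(tidyr)

# Wide to long: one row per religion and income bracket
long <- relig_income %>%
  pivot_longer(!religion, names_to = "income", values_to = "count")

# Long back to wide: one column per income bracket again
wide_again <- long %>%
  pivot_wider(names_from = "income", values_from = "count")

dim(relig_income)  # rows and columns of the original wide table
dim(long)          # rows and columns of the long table
dim(wide_again)    # rows and columns after pivoting back

# Compare contents as plain data frames
all.equal(as.data.frame(relig_income), as.data.frame(wide_again))
```

The long table has one row for every combination of religion and income bracket, so its row count is the number of religions times the number of income columns.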