Chapter 2 Data preparation
##Data
We can read CSV data into R using the read.csv() function from base R or the read_csv() function from the readr package. For now, however, we will use data that already exist in the datasets package, which comes installed with R.
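As a minimal sketch of reading a CSV file, the example below writes a small, made-up table to a temporary file and reads it back. The file and its column names are hypothetical and serve only as illustration; in practice you would pass the path to your own file.

```r
# Create a small CSV file to read (hypothetical data, for illustration only)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,weight", "1,42", "2,51", "3,59"), csv_path)

# Base R: returns a data.frame
chicks_base <- read.csv(csv_path)
str(chicks_base)

# readr: returns a tibble (only run if the readr package is installed)
if (requireNamespace("readr", quietly = TRUE)) {
  chicks_readr <- readr::read_csv(csv_path)
  print(chicks_readr)
}
```

Both functions infer column types from the file contents; read_csv() also prints the column specification it guessed.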
Let’s load the datasets package.
Let’s look at the ChickWeight data.
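The output below is consistent with the following calls. The original code is not shown, so treat this as a likely reconstruction rather than the exact chunk. Note that the dataset is named ChickWeight and that R is case sensitive.

```r
library(datasets)   # part of base R and normally attached at startup

head(ChickWeight)   # first six rows
str(ChickWeight)    # structure: 578 observations of 4 variables
```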
## weight Time Chick Diet
## 1 42 0 1 1
## 2 51 2 1 1
## 3 59 4 1 1
## 4 64 6 1 1
## 5 76 8 1 1
## 6 93 10 1 1
## Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame': 578 obs. of 4 variables:
## $ weight: num 42 51 59 64 76 93 106 125 149 171 ...
## $ Time : num 0 2 4 6 8 10 12 14 16 18 ...
## $ Chick : Ord.factor w/ 50 levels "18"<"16"<"15"<..: 15 15 15 15 15 15 15 15 15 15 ...
## $ Diet : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "formula")=Class 'formula' language weight ~ Time | Chick
## .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv>
## - attr(*, "outer")=Class 'formula' language ~Diet
## .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv>
## - attr(*, "labels")=List of 2
## ..$ x: chr "Time"
## ..$ y: chr "Body weight"
## - attr(*, "units")=List of 2
## ..$ x: chr "(days)"
## ..$ y: chr "(gm)"
##Data manipulation with dplyr and tidyverse
Let us count the number of observations (rows) recorded under each type of diet. Each chick is weighed on several days, so these counts are larger than the number of chicks.
Let’s look at the result
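The code for this step is not shown, but the table below matches what count() from dplyr returns, so here is a sketch under that assumption. The names cw, ChickWeight2 and chicks_per_diet are hypothetical. Since n is the number of rows per diet, the second part uses n_distinct(Chick) to get the number of individual chicks instead.

```r
library(dplyr)

# Work on a plain data.frame copy so the extra groupedData classes do not interfere
cw <- as.data.frame(ChickWeight)

# Number of observations (rows) per diet
ChickWeight2 <- cw %>% count(Diet)
head(ChickWeight2)

# Number of distinct chicks per diet
chicks_per_diet <- cw %>%
  group_by(Diet) %>%
  summarise(chicks = n_distinct(Chick))
chicks_per_diet
```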
## Diet n
## 1 1 220
## 2 2 120
## 3 3 120
## 4 4 118
Let’s look at the mean weight of the chicks in each diet category
library(dplyr)
ChickWeight3 <- ChickWeight %>%
  group_by(Diet) %>%
  mutate(mean.wt = mean(weight)) # mutate() adds new columns; here, the mean weight within each diet
head(ChickWeight3)
## # A tibble: 6 x 5
## # Groups: Diet [1]
## weight Time Chick Diet mean.wt
## <dbl> <dbl> <ord> <fct> <dbl>
## 1 42 0 1 1 103.
## 2 51 2 1 1 103.
## 3 59 4 1 1 103.
## 4 64 6 1 1 103.
## 5 76 8 1 1 103.
## 6 93 10 1 1 103.
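mutate() keeps all 578 rows and repeats each group mean on every row. If one row per diet is what you want, summarise() is a common alternative. The sketch below assumes dplyr is installed, and the name diet_means is introduced here only for illustration.

```r
library(dplyr)

diet_means <- as.data.frame(ChickWeight) %>%
  group_by(Diet) %>%
  summarise(mean.wt = mean(weight))   # collapses each group to a single row

diet_means
```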
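Next, let’s reshape data between wide and long form. ggplot2 usually expects data in long form, that is, with values stacked in rows rather than spread across columns. ChickWeight is already in long form: for example, Diet is stored in a single column, with one row per observation.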
Let’s look at some data in wide form
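The table below appears to be the first rows of relig_income, an example dataset that ships with the tidyr package. Assuming that is the source, it can be displayed with:

```r
library(tidyr)

head(relig_income)   # one row per religion, one column per income bracket
```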
## # A tibble: 6 x 11
## religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Agnostic 27 34 60 81 76 137 122
## 2 Atheist 12 27 37 52 35 70 73
## 3 Buddhist 27 21 30 34 33 58 62
## 4 Catholic 418 617 732 670 638 1116 949
## 5 Don’t k… 15 14 15 11 10 35 21
## 6 Evangel… 575 869 1064 982 881 1486 949
## # … with 3 more variables: `$100-150k` <dbl>, `>150k` <dbl>, `Don't
## # know/refused` <dbl>
Let’s reshape the data so that the income brackets are stacked in rows
relig_income2 <- relig_income %>%
  pivot_longer(!religion, names_to = "income", values_to = "count")
head(relig_income2)
## # A tibble: 6 x 3
## religion income count
## <chr> <chr> <dbl>
## 1 Agnostic <$10k 27
## 2 Agnostic $10-20k 34
## 3 Agnostic $20-30k 60
## 4 Agnostic $30-40k 81
## 5 Agnostic $40-50k 76
## 6 Agnostic $50-75k 137
Now let’s convert the data back to wide form
relig_income.back <- relig_income2 %>%
  pivot_wider(names_from = "income", values_from = "count")
head(relig_income.back)
## # A tibble: 6 x 11
## religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Agnostic 27 34 60 81 76 137 122
## 2 Atheist 12 27 37 52 35 70 73
## 3 Buddhist 27 21 30 34 33 58 62
## 4 Catholic 418 617 732 670 638 1116 949
## 5 Don’t k… 15 14 15 11 10 35 21
## 6 Evangel… 575 869 1064 982 881 1486 949
## # … with 3 more variables: `$100-150k` <dbl>, `>150k` <dbl>, `Don't
## # know/refused` <dbl>
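To see what the two pivots do to the shape of a table, here is a small self-contained sketch on made-up data. The scores data frame and its column names are hypothetical. Pivoting longer turns 2 rows with 3 value columns into 6 rows, and pivoting wider restores the original layout.

```r
library(tidyr)

# Hypothetical wide table: one row per student, one column per test
scores <- data.frame(
  student = c("A", "B"),
  test1 = c(90, 75),
  test2 = c(85, 80),
  test3 = c(70, 95)
)

# Wide to long: every column except student is stacked into name/value pairs
scores_long <- pivot_longer(scores, !student, names_to = "test", values_to = "score")
dim(scores_long)

# Long to wide: the inverse operation
scores_wide <- pivot_wider(scores_long, names_from = "test", values_from = "score")
dim(scores_wide)
```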