-
Notifications
You must be signed in to change notification settings - Fork 0
/
03-tidying.Rmd
executable file
·122 lines (87 loc) · 2.88 KB
/
03-tidying.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
---
title: 'Week 04: Examples for Data Tidying'
---
# Revisit Population data obtained from http://transmonee.org/database/
# Visit at the web site of Monitoring the situation of
# children and women in Europe and Central Asia
# at http://transmonee.org/
# Go to http://transmonee.org/database/ and
# Download Excel file named
# "Population at the beginning of the year by sex and selected age groups"
# into to your working directory and rename it as Population-1989-2015
# since the file name is very long.
```{r}
library(readxl)
pop_red <- read_excel("data/Population-1989-2019.xlsx", range = cell_rows(5:40))
View(pop_red)
```
# Column 1 and 2 does not have names.
# Investigate the data
```{r}
str(pop_red)
```
# Did you notice any problem?
# The variables 1989-2019 are character variables
# What is the reason for it?
# Actually they should have been numeric!..
# There are empty rows in the sheet and they are denoted by NA in data viewer panel
# There are missing values in the sheet and they are denoted by "-" in the sheet as well as in the data
# viewer panel.
# We do not need the second column.
# First drop all the rows which are "completely missing".
# drop_na() drops rows which are missing at all columns--complete case analysis
```{r}
library(tidyr)
pop_red_tidy <- pop_red %>%
drop_na()
View(pop_red_tidy)
```
```{r}
str(pop_red_tidy)
```
# Rename the first column as "Country", drop the second column, and
# sort by Country
```{r}
pop_red_tidy_more <- pop_red_tidy %>%
drop_na() %>%
rename(Country = "...1") %>%
select(-"...2") %>% #unquoted ...2 is also OK, but has low readability.
arrange(Country)
View(pop_red_tidy_more)
```
# Investigate the data
```{r}
str(pop_red_tidy_more)
```
# We need to replace "-" with "NA".
```{r}
# read na_if() as: if "-", then it is na
pop_red_withna <- pop_red_tidy_more %>%
na_if("-") #from dplyr
View(pop_red_withna)
```
```{r}
str(pop_red_withna)
```
# Except the Country column, convert character variables
# into numeric variables with mutate_
```{r}
pop_red_tidy_tidy <- pop_red_withna %>%
#mutate_at(vars(1989:2019), as.numeric) #not works
mutate_at(vars("1989":"2019"), as.numeric) #not as.numeric(), () drops
str(pop_red_tidy_tidy)
```
```{r}
View(pop_red_tidy_tidy)
```
# Lastly, this data is in wide format.
# Turn this data into long format so that
# we can use it later (for visualization etc)
# except country variable, take the columns, add column
# names (values) to under a new column named "Year", and
# assign the values to another column named "Population".
```{r}
pop_red_long <- pop_red_tidy_tidy %>%
pivot_longer(!Country, names_to="Year", values_to="Population")
View(pop_red_long)
```