Skip to content

GuilloteauQ/miller-exercises

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

15 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Miller Exercises

This repository contains exercises to learn the Miller (mlr) tool. It is inspired by Ben Porter's Awk Course.

The exercises are focused for a data analysis workload involving CSVs.

You would like to add new exercises? You think the Miller solutions could be improved? Feel free to contribute!

Getting the data

The data comes from the RGriffin Kaggle dataset: 120 years of Olympic history

You can download the data as such:

wget https://raw.githubusercontent.com/GuilloteauQ/miller-exercises/main/olympics.csv

Exercise 1

Display the first few rows of the CSV in the pretty-print format

Solution
mlr --icsv --opprint head olympics.csv

head

Output:

id name                     sex age height weight team           noc games       year season city        sport         event                              medal
1  A Dijiang                M   24  180    80     China          CHN 1992 Summer 1992 Summer Barcelona   Basketball    Basketball Men's Basketball        NA
2  A Lamusi                 M   23  170    60     China          CHN 2012 Summer 2012 Summer London      Judo          Judo Men's Extra-Lightweight       NA
3  Gunnar Nielsen Aaby      M   24  NA     NA     Denmark        DEN 1920 Summer 1920 Summer Antwerpen   Football      Football Men's Football            NA
4  Edgar Lindenau Aabye     M   34  NA     NA     Denmark/Sweden DEN 1900 Summer 1900 Summer Paris       Tug-Of-War    Tug-Of-War Men's Tug-Of-War        Gold
5  Christine Jacoba Aaftink F   21  185    82     Netherlands    NED 1988 Winter 1988 Winter Calgary     Speed Skating Speed Skating Women's 500 metres   NA
5  Christine Jacoba Aaftink F   21  185    82     Netherlands    NED 1988 Winter 1988 Winter Calgary     Speed Skating Speed Skating Women's 1,000 metres NA
5  Christine Jacoba Aaftink F   25  185    82     Netherlands    NED 1992 Winter 1992 Winter Albertville Speed Skating Speed Skating Women's 500 metres   NA
5  Christine Jacoba Aaftink F   25  185    82     Netherlands    NED 1992 Winter 1992 Winter Albertville Speed Skating Speed Skating Women's 1,000 metres NA
5  Christine Jacoba Aaftink F   27  185    82     Netherlands    NED 1994 Winter 1994 Winter Lillehammer Speed Skating Speed Skating Women's 500 metres   NA
5  Christine Jacoba Aaftink F   27  185    82     Netherlands    NED 1994 Winter 1994 Winter Lillehammer Speed Skating Speed Skating Women's 1,000 metres NA

Exercise 2

Display only the rows for the Summer Olympics.

Solution
mlr --icsv --opprint filter '$season == "Summer"' then head olympics.csv

filter

Output:

id name                               sex age height weight team           noc games       year season city        sport      event                                  medal
1  A Dijiang                          M   24  180    80     China          CHN 1992 Summer 1992 Summer Barcelona   Basketball Basketball Men's Basketball            NA
2  A Lamusi                           M   23  170    60     China          CHN 2012 Summer 2012 Summer London      Judo       Judo Men's Extra-Lightweight           NA
3  Gunnar Nielsen Aaby                M   24  NA     NA     Denmark        DEN 1920 Summer 1920 Summer Antwerpen   Football   Football Men's Football                NA
4  Edgar Lindenau Aabye               M   34  NA     NA     Denmark/Sweden DEN 1900 Summer 1900 Summer Paris       Tug-Of-War Tug-Of-War Men's Tug-Of-War            Gold
8  Cornelia "Cor" Aalten (-Strannood) F   18  168    NA     Netherlands    NED 1932 Summer 1932 Summer Los Angeles Athletics  Athletics Women's 100 metres           NA
8  Cornelia "Cor" Aalten (-Strannood) F   18  168    NA     Netherlands    NED 1932 Summer 1932 Summer Los Angeles Athletics  Athletics Women's 4 x 100 metres Relay NA
10 Einar Ferdinand "Einari" Aalto     M   26  NA     NA     Finland        FIN 1952 Summer 1952 Summer Helsinki    Swimming   Swimming Men's 400 metres Freestyle    NA
12 Jyri Tapani Aalto                  M   31  172    70     Finland        FIN 2000 Summer 2000 Summer Sydney      Badminton  Badminton Men's Singles                NA
13 Minna Maarit Aalto                 F   30  159    55.5   Finland        FIN 1996 Summer 1996 Summer Atlanta     Sailing    Sailing Women's Windsurfer             NA
13 Minna Maarit Aalto                 F   34  159    55.5   Finland        FIN 2000 Summer 2000 Summer Sydney      Sailing    Sailing Women's Windsurfer             NA

Exercise 3

Display only the names of the athletes who won a medal at the Summer Olympics.

Solution
mlr --icsv --opprint filter '$season == "Summer" && $medal != "NA"' then cut -f name then head olympics.csv

Output:

name
Edgar Lindenau Aabye
Arvo Ossian Aaltonen
Arvo Ossian Aaltonen
Paavo Johannes Aaltonen
Paavo Johannes Aaltonen
Paavo Johannes Aaltonen
Paavo Johannes Aaltonen
Paavo Johannes Aaltonen
Ragnhild Margrethe Aamodt
Alf Lied Aanning

Exercise 4

Display only the names of the athletes who won a medal at the Summer Olympics, without any duplicate

Solution
mlr --icsv --opprint filter '$season == "Summer" && $medal != "NA"' then uniq -f name then head olympics.csv

Output:

name
Edgar Lindenau Aabye
Arvo Ossian Aaltonen
Paavo Johannes Aaltonen
Ragnhild Margrethe Aamodt
Alf Lied Aanning
Willemien Aardenburg
Pepijn Aardewijn
Ann Kristin Aarnes
Karl Jan Aas
Thomas Valentin Aas

Exercise 5

What is the number of unique athletes who won at least one medal at the Summer Olympics?

Solution
mlr --icsv --opprint filter '$season == "Summer" && $medal != "NA"' then uniq -f name then count olympics.csv

Output:

count
24545

Exercise 6

Who are the 10 athletes with the most medals at the Summer Olympics?

Solution
mlr --icsv --opprint filter '$season == "Summer" && $medal != "NA"' then most-frequent -f name olympics.csv

Output:

name                                            count
Michael Fred Phelps, II                         28
Larysa Semenivna Latynina (Diriy-)              18
Nikolay Yefimovich Andrianov                    15
Borys Anfiyanovych Shakhlin                     13
Edoardo Mangiarotti                             13
Takashi Ono                                     13
Natalie Anne Coughlin (-Hall)                   12
Birgit Fischer-Schmidt                          12
Jennifer Elisabeth "Jenny" Thompson (-Cumpelik) 12
Paavo Johannes Nurmi                            12

Exercise 7

Who was the oldest and the youngest athlete to compete in any Olympic?

Solution
# oldest
mlr --icsv --opprint filter '$age != "NA"' then top -f age -a olympics.csv

# youngest  
mlr --icsv --opprint filter '$age != "NA"' then top -f age -a --min olympics.csv

Output:

# oldest
id     name                   sex age height weight team          noc games       year season city      sport            event                                       medal
128719 John Quincy Adams Ward M   97  NA     NA     United States USA 1928 Summer 1928 Summer Amsterdam Art Competitions Art Competitions Mixed Sculpturing, Statues NA

# youngest  
id    name               sex age height weight team                          noc games       year season city   sport      event                                 medal
71691 Dimitrios Loundras M   10  NA     NA     Ethnikos Gymnastikos Syllogos GRE 1896 Summer 1896 Summer Athina Gymnastics Gymnastics Men's Parallel Bars, Teams Bronze

Exercise 8

What are the top 10 teams with the most medals in any Olympic?

Solution
mlr --icsv --opprint filter '$medal != "NA"' then most-frequent -f team olympics.csv

Output:

team          count
United States 5219
Soviet Union  2451
Germany       1984
Great Britain 1673
France        1550
Italy         1527
Sweden        1434
Australia     1306
Canada        1243
Hungary       1127

Exercise 9

What is the average number of medals per year for each team?

Solution
mlr --icsv --opprint filter '$medal != "NA"' then stats1 -a count -f medal -g team,year then stats1 -a mean -f medal_count -g team then sort -nr medal_count_mean then head olympics.csv

Output:

team          medal_count_mean
Unified Team  271
Soviet Union  245.1
East Germany  156.83333333333334
United States 149.11428571428573
West Germany  93
Russia        79.28571428571429
Germany       76.3076923076923
China         60.06666666666667
Great Britain 47.8
Italy         46.27272727272727

Exercise 10

In which Summer Olympic edition did France won the most medals?

Solution
mlr --icsv --opprint filter '$medal != "NA" && $season == "Summer" && $team == "France"' then stats1 -a count -f medal -g year then top -f medal_count -a olympics.csv

Output:

year medal_count
1920 134

Exercise 11

For each team, what is the number of Gold, Silver, and Bronze medals?

Solution
mlr --icsv --opprint filter '$medal != "NA"' \
                then cut -f team,medal \
                then stats1 -a count -f medal -g team,medal \
                then reshape -s medal,medal_count \
                then unsparsify \
                then head olympics.csv

Output:

team           Gold Bronze Silver
Denmark/Sweden 6    -      -
Finland        198  415    263
Norway         299  281    330
Netherlands    277  390    321
Taifun         5    -      -
France         455  577    518
Italy          535  484    508
Spain          108  136    239
Azerbaijan     7    25     12
Russia         366  393    351

Exercise 12

How many Judo athletes were present at the 2012 Summer Olympics?

Solution
mlr --c2p filter '$sport == "Judo" && $year == 2012' then uniq -f name -n  olympics.csv

Output:

count
384

Exercise 13

Who are the 10 athletes with the most participations to any Olympic event (there could be several event per Olympics)?

Solution
mlr --c2p stats1 -a count -f name -g name then top -f name_count -n 10 -a olympics.csv

Output:

name                             name_count
Robert Tait McKenzie             58
Heikki Ilmari Savolainen         39
Joseph "Josy" Stoffel            38
Ioannis Theofilakis              36
Takashi Ono                      33
Alexandros Theofilakis           32
Andreas Wecker                   32
Jean Lucien Nicolas Jacoby       32
Alfrd (Arnold-) Hajs (Guttmann-) 32
Johann "Hans" Sauter             31

Exercise 14

Who is the athlete with the most participations in a single Olympic, and in which year?

Solution
mlr --c2p stats1 -a count -f name -g name,year then top -f name_count -a olympics.csv

Output:

name                 year name_count
Robert Tait McKenzie 1932 44

Exercise 15

Who is the athlete who participated to the most Olympic editions, and how many editions did they participate to?

Solution
mlr --c2p uniq -g name,year \
    then stats1 -a count -f name -g name \
    then top -a -f name_count olympics.csv

Output:

name       name_count
Ian Millar 10

Exercise 16

Which cities hosted several times the Olympics?

Solution
mlr --c2p --from olympics.csv uniq -f city,year \
                         then count -g city \
                         then filter '$count > 1'

Output:

city         count
London       3
Paris        2
Los Angeles  2
Lake Placid  2
Stockholm    2
Athina       3
Innsbruck    2
Sankt Moritz 2

Exercise 17

Which Olympics had the highest proportion of female athletes?

Solution
mlr --c2p --from olympics.csv uniq -g id,year,sex \
                       then stats1 -a count -f sex -g year,sex \
                       then reshape -s sex,sex_count \
                       then unsparsify \
                       then put '$percentF = 100 * $F / ($F + $M)' \
                       then cut -f year,percentF\
                       then top -f percentF -n 10 -a

Output:

year percentF
2016 45.03086143662224
2012 44.25216316440049
2008 42.288283328745756
2010 40.73343848580441
2004 40.73126835275173
2014 40.145719489981786
2006 38.291900561347234
2000 38.20794590025359
2002 36.93205502292622
1998 36.209270307480494

Exercise 18

How many people with the first name "Quentin" won an Olympic medal?

Solution
mlr --c2p --from olympics.csv filter "$name =~ "^Quentin" && $medal != "NA"' then count

Output:

count
0

sad...

Exercise 19

Print the cumulative number of medals per year that Michael Phelps (Michael Fred Phelps, II) won during his Olympic carrer.

Solution
mlr --c2p --from olympics.csv filter '$name == "Michael Fred Phelps, II"' then \
                              put '$had_medal = ($medal != "NA") ? 1 : 0' then \
                              stats1 -a sum -f had_medal -g year then \
                              step -a rsum -f had_medal_sum

Output:

year had_medal_sum had_medal_sum_rsum
2000 0             0
2004 8             8
2008 8             16
2012 6             22
2016 6             28

Exercise 20

What is the age, height, and weight of the average Olympian and of the median Olympian?

Solution
mlr --c2p --from olympics.csv cut -f age,height,weight then\
                              filter '$age != "NA" && $height != "NA" && $weight != "NA"' then\
                              summary -a mean,median

Output:

field_name mean               median
age        25.055508937016466 24
height     175.3719496519778  175
weight     70.68833701161691  70

About

Exercises to practice the Miller (`mlr`) tool

Topics

Resources

Stars

Watchers

Forks

Languages