library(tidyverse)
library(lubridate)
df <- read_csv("cricket_data.csv")
This dataset is much bigger and messier that the datasets in previous labs, so the outputs of various dataframe operations will not be displayed in this document. You shoud run the code in RStudio to get an interactive view of all of the tables.
This dataset contains a hoge number of cricketers from all over the world. Our first goal is to find all Australian players who played in the men’s or women’s national team Let’s start simply by finding all players who played in Australia by filtering according to the COUNTRY
variable.
df %>%
filter(COUNTRY == "Australia")
Look at the Major teams
variable! There are many players in this dataset that played in minor or youth teams.
To restrict to international players, we might look at only those players who have batted in “first-class” (that is, top-league) matches:
df %>%
filter(COUNTRY == "Australia") %>%
filter(`BATTING_First-class_Mat` > 0)
This isn’t really what we want either. It turns out that first-class matches count the interstate leagues, so this also gives us players who have never played in an international match.
Another idea is that the Major teams
variable lists all the teams that a player has played for. The ones that we are looking for are Australia
and Australia Women
, so it might make sense to look for rows with a Major teams
variable that contains the string "Australia"
.
For this, we use the grepl()
function, which checks if a string is contained in another string and returns TRUE
or FALSE
. For example, we have the following:
grepl("hello", c("Hello", "hello", "hi", "hellooooo"))
## [1] FALSE TRUE FALSE TRUE
Hence, to pick out the rows with the string "Australia"
contained in the Major teams
variable, we would do the following:
df %>%
filter(COUNTRY == "Australia") %>%
filter(grepl("Australia", `Major teams`))
Ok, so this includes teams like "Western Australia"
and "Australia A"
, so we haven’t filtered the dataset properly.
To be more precise, we want to identify those rows whose Major teams
variable either
"Australia"
or "Australia Women"
(if the player has only played on one team), or"'Australia, '"
or "'Australia Women, '"
(if the player has played on multiple teams).Hence, our filtering operation has to be more complicated, but we can make it work with the |
(or) operator.
df_aus <- df %>%
filter(COUNTRY == "Australia") %>%
filter(
`Major teams` == "Australia" |
`Major teams` == "Australia Women" |
grepl("'Australia,'", `Major teams`) |
grepl("'Australia Women,'", `Major teams`)
)
Since we now have only internation players in the men’s and women’s leagues, let’s label which player playes in which league. One again, grepl()
is our friend here, but we will need another tool.
Namely, the if_else()
functon takes three arguments – a vector of TRUE
/FALSE
values, the value to use in the TRUE
case, and the value to use in the FALSE
case, eg.
v <- c(1, -2, -3, 0, 4)
if_else(v < 0, "Negative", "Non-negative")
## [1] "Non-negative" "Negative" "Negative" "Non-negative" "Non-negative"
In the league case, we instead use
df_aus <- df_aus %>%
mutate(league = if_else(grepl("Women", `Major teams`), "Women's", "Men's"))
Notice that our data frame has a Birthdate
variable, and that it’s a total mess. Some dates are in the dd/mm/yy
format, where others (seemingly the pre 1900) ones are in yyyy-mm-dd
.
Fortunately, lubridate
’s parse_date_time()
function is very good at resolving such issues, even when there are several formats all mixed together.
Observe the following simple code, and examins the Birthdate
variable to convince yourself that it worked.
df_aus %>%
mutate(Birthdate = parse_date_time(Birthdate, c("dmy", "Ymd")))
Hang on, though - maybe it didn’t work! It’s 2021, so it’s reasonable to assume that there may be international cricketers born after 2000, so a birth date of 29/04/01
may be 2001-04-29
as easily as 1901-04-29
, and due to the historic nature of these data, both cases probably occur in the table! What a disaster!
One way of resolving this is to check if the player is alive or not using the Died
variable. If they’re alive, and the birth year is a two-digit number below 10, it’s reasonable to assume that that year is in the 21st century (a comprehensive list of living Australian 111-year-olds is publically available, so you can check manually if you aren’t convinced). On the other hand, if a player is dead, they were probably born in the 20th century.
Look at the help documentation for parse_date_time2()
to understand how the cutoff_2000
argument works, and what it’s doing in the following code:
df_aus <- df_aus %>%
mutate(
Birthdate =
if_else(
Died == "Dead",
parse_date_time2(Birthdate, c("dmy", "Ymd"), cutoff_2000 = -1),
parse_date_time2(Birthdate, c("dmy", "Ymd"), cutoff_2000 = 10)
),
birth_year = year(Birthdate)
)
I’m sure that you’ve noticed that this is all incredibly hacky, but real data is like that, unfortunately, and sometimes we have to hold our noses and do unsavory things.
Age
variable, but it is outdated. Compute a corrected Age
variable.Next, we’ll isolate the state in which the players were born. A look at the Birthplace
variable indicates that it is formatted as <city>, <state>
, except for major cities, in which case the suburb is also included, so the format is <suburb>, <city>, <state>
(in Australia, suburbs are generally considered to be part of their nearest major city).
Using the separate()
function, we can split the Birthplace
variable into three, inserting a split wherever we see a ,
. If there are only two pieces in the string, these are the city and the state, so we use the fill =
argument of separate()
to convey that, if there are fewer than three segment, the left ones should be filled with missing values.
Another issue is that some birth places and in ", Australia"
, which we would like to simply delete. For this, we use the gsub()
function, which replaces a part of one string with something else.
v <- c(
"I have not seen such a long string in a long, long time",
"the long and the short of it")
gsub("long", "short", v)
## [1] "I have not seen such a short string in a short, short time"
## [2] "the short and the short of it"
In our case, we are trying to replace the substring ", Australia"
falling at the end of the full string (this is denoted by putting a $
character at the end of the substring. For more details and fancier tricks, look up regular expressions) with an empty string, thus deleting it.
df_aus <- df_aus %>%
mutate(Birthplace = gsub(", Australia$", "", Birthplace)) %>%
separate(
Birthplace,
c("birth_suburb", "birth_city", "birth_state"),
sep = ", ",
fill = "left",
remove = FALSE)
## Warning: Expected 3 pieces. Additional pieces discarded in 7 rows [99, 129, 145,
## 228, 230, 348, 497].
Let’s have a look at how successful this was by looking as all of the birth states that we found.
df_aus %>%
group_by(birth_state) %>%
summarize(count = n()) %>%
arrange(desc(count)) %>%
print(n = 40)
## # A tibble: 36 x 2
## birth_state count
## <chr> <int>
## 1 New South Wales 217
## 2 Victoria 190
## 3 South Australia 75
## 4 Western Australia 72
## 5 Queensland 62
## 6 <NA> 26
## 7 Tasmania 24
## 8 England 9
## 9 Sydney 5
## 10 Perth 4
## 11 Australian Capital Territory 3
## 12 India 3
## 13 Ireland 2
## 14 New Zealand 2
## 15 Northern Territory 2
## 16 Scotland 2
## 17 South Africa 2
## 18 Yorkshire 2
## 19 Brisbane 1
## 20 Ceylon 1
## 21 circa 1
## 22 Devon 1
## 23 Fitzroy 1
## 24 France 1
## 25 Gloucestershire 1
## 26 Hervey Bay 1
## 27 Malaysia 1
## 28 Melbourne 1
## 29 New South WAles 1
## 30 North-West Frontier Province 1
## 31 Northern Rhodesia 1
## 32 Otago 1
## 33 Pakistan 1
## 34 Portugal 1
## 35 Singapore 1
## 36 Zimbabwe 1
Hmmmmm, not so good! There’s a bunch of missing data, many players who were born outside Australia, and so can’t really be properly assigned a birth state, and worst of all, some errors, where Australian cities were impoperly parsed as states.
Ok, so first, let’s look at the source of the missing data:
df_aus %>%
filter(is.na(birth_state)) %>%
select(NAME, Birthplace, birth_state)
## # A tibble: 26 x 3
## NAME Birthplace birth_state
## <chr> <chr> <chr>
## 1 Cameron Bancroft <NA> <NA>
## 2 Leonie Callaghan <NA> <NA>
## 3 Gary Duperouzel <NA> <NA>
## 4 A Dolin <NA> <NA>
## 5 Molly Flaherty <NA> <NA>
## 6 Patricia Holmes <NA> <NA>
## 7 TR Hogarth <NA> <NA>
## 8 Anne Gordon <NA> <NA>
## 9 Laura Harris <NA> <NA>
## 10 Grace Harris <NA> <NA>
## # ... with 16 more rows
In all of these cases, the Birthplace
variable is missing. If we were really consciencious, we would at this point go and look up the birthplaces of all of these players, but since we can’t fill this in from other parts of the table, let’s leave it as-is.
There remain some errors to be corrected, though. A number of Australian cities appear in the birth_state
field due to improper enccoding of the data. In addition, there is an incorrect capitalization of New South WAles
in one entry.
Both these issues can be solved with the recode()
function, which replaces one character value in a vector with another.
Last, there is an entry titled circa
. This is just an error, and should be a missing value instead. We fix this with the na_if()
function, which designates certain strings as missing values.
df_aus <- df_aus %>%
mutate(
birth_state = recode(
birth_state,
`New South WAles` = "New South Wales",
Sydney = "New South Wales",
Melbourne = "Victoria",
Fitzroy = "Victoria",
Brisbane = "Queensland",
`Hervey Bay` = "Queensland",
Perth = "Western Australia"
),
birth_state = na_if(birth_state, "circa")
)
Now let’s see if this works
df_aus %>%
group_by(birth_state) %>%
summarize(count = n()) %>%
arrange(desc(count)) %>%
print(n = 40)
## # A tibble: 28 x 2
## birth_state count
## <chr> <int>
## 1 New South Wales 223
## 2 Victoria 192
## 3 Western Australia 76
## 4 South Australia 75
## 5 Queensland 64
## 6 <NA> 27
## 7 Tasmania 24
## 8 England 9
## 9 Australian Capital Territory 3
## 10 India 3
## 11 Ireland 2
## 12 New Zealand 2
## 13 Northern Territory 2
## 14 Scotland 2
## 15 South Africa 2
## 16 Yorkshire 2
## 17 Ceylon 1
## 18 Devon 1
## 19 France 1
## 20 Gloucestershire 1
## 21 Malaysia 1
## 22 North-West Frontier Province 1
## 23 Northern Rhodesia 1
## 24 Otago 1
## 25 Pakistan 1
## 26 Portugal 1
## 27 Singapore 1
## 28 Zimbabwe 1
Definitely looks a lot better!
Major teams
variable better? Try to implement a solution.Now that we have a clean-ish dataset, let’s put it to use. One thing that we ight want to look at is the years in which players were born.
Let’s collate this information by counting up the number of players born in each league in each year:
df_births <- df_aus %>%
group_by(league, birth_year) %>%
summarize(count = n())
## `summarise()` has grouped output by 'league'. You can override using the `.groups` argument.
With this summarized table, we can make a corresponding plot. The stat_smooth()
function adds a smooth interpolating curve, which hopefully gives a better idea about how the response variable changes over time.
df_births %>%
ggplot(aes(x = birth_year, y = count, color = league)) +
geom_point() +
stat_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).
We can repeat this plot with all kinds of variables. For example, test batting average denoted the number of runs a player has scored divided by the number of times that they have been out in test matches (the longest format of cricket, with matches lasting five days).
df_aus %>%
ggplot(aes(x = birth_year, y = BATTING_Tests_Ave, color = league)) +
geom_point() +
stat_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 128 rows containing non-finite values (stat_smooth).
## Warning: Removed 128 rows containing missing values (geom_point).
Another example is identifying correlations betweeen vairables. For example, we might correlate test batting averate with test bowling average (a bowler’s bowling average is the number of runs that batsmen have scored from that bowler’s balls divided by the number of times that bowler has got batsmen out, so a low bowling average is good).
In this case, we are looking for correlation, which is a measure of linear association. Hence, it’s better to fit a straight line to the data than a smooth curve. To do this, we specify the smoothing method as "lm"
, which stands for linear model:
df_aus %>%
ggplot(aes(x = BOWLING_Tests_Ave, y = BATTING_Tests_Ave, color = league)) +
geom_point() +
stat_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 366 rows containing non-finite values (stat_smooth).
## Warning: Removed 366 rows containing missing values (geom_point).
sessionInfo()
## R version 4.0.4 (2021-02-15)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19041)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] lubridate_1.7.10 forcats_0.5.1 stringr_1.4.0 dplyr_1.0.5
## [5] purrr_0.3.4 readr_1.4.0 tidyr_1.1.3 tibble_3.1.0
## [9] ggplot2_3.3.3 tidyverse_1.3.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.1.0 xfun_0.22 lattice_0.20-41 splines_4.0.4
## [5] haven_2.3.1 colorspace_2.0-0 vctrs_0.3.6 generics_0.1.0
## [9] htmltools_0.5.1.1 mgcv_1.8-33 yaml_2.2.1 utf8_1.2.1
## [13] rlang_0.4.10 pillar_1.5.1 glue_1.4.2 withr_2.4.1
## [17] DBI_1.1.1 dbplyr_2.1.0 modelr_0.1.8 readxl_1.3.1
## [21] lifecycle_1.0.0 munsell_0.5.0 gtable_0.3.0 cellranger_1.1.0
## [25] rvest_1.0.0 evaluate_0.14 labeling_0.4.2 knitr_1.31
## [29] fansi_0.4.2 highr_0.8 broom_0.7.5 Rcpp_1.0.6
## [33] scales_1.1.1 backports_1.2.1 jsonlite_1.7.2 farver_2.1.0
## [37] fs_1.5.0 hms_1.0.0 digest_0.6.27 stringi_1.5.3
## [41] grid_4.0.4 cli_2.3.1 tools_4.0.4 magrittr_2.0.1
## [45] crayon_1.4.1 pkgconfig_2.0.3 Matrix_1.3-2 ellipsis_0.3.1
## [49] xml2_1.3.2 reprex_1.0.0 assertthat_0.2.1 rmarkdown_2.7
## [53] httr_1.4.2 rstudioapi_0.13 R6_2.5.0 nlme_3.1-152
## [57] compiler_4.0.4