library(tidyverse)
library(lubridate)

Australian cricketing statistics

df <- read_csv("cricket_data.csv")

This dataset is much bigger and messier that the datasets in previous labs, so the outputs of various dataframe operations will not be displayed in this document. You shoud run the code in RStudio to get an interactive view of all of the tables.

Filtering the dataset

This dataset contains a hoge number of cricketers from all over the world. Our first goal is to find all Australian players who played in the men’s or women’s national team Let’s start simply by finding all players who played in Australia by filtering according to the COUNTRY variable.

df %>%
    filter(COUNTRY == "Australia")

Look at the Major teams variable! There are many players in this dataset that played in minor or youth teams.

To restrict to international players, we might look at only those players who have batted in “first-class” (that is, top-league) matches:

df %>%
    filter(COUNTRY == "Australia") %>%
    filter(`BATTING_First-class_Mat` > 0)

This isn’t really what we want either. It turns out that first-class matches count the interstate leagues, so this also gives us players who have never played in an international match.

Another idea is that the Major teams variable lists all the teams that a player has played for. The ones that we are looking for are Australia and Australia Women, so it might make sense to look for rows with a Major teams variable that contains the string "Australia".

For this, we use the grepl() function, which checks if a string is contained in another string and returns TRUE or FALSE. For example, we have the following:

grepl("hello", c("Hello", "hello", "hi", "hellooooo"))

## [1] FALSE  TRUE FALSE  TRUE

Hence, to pick out the rows with the string "Australia" contained in the Major teams variable, we would do the following:

df %>%
    filter(COUNTRY == "Australia") %>%
    filter(grepl("Australia", `Major teams`))

Ok, so this includes teams like "Western Australia" and "Australia A", so we haven’t filtered the dataset properly.

To be more precise, we want to identify those rows whose Major teams variable either

is exactly equal to "Australia" or "Australia Women" (if the player has only played on one team), or
contains the string "'Australia, '" or "'Australia Women, '" (if the player has played on multiple teams).

Hence, our filtering operation has to be more complicated, but we can make it work with the | (or) operator.

df_aus <- df %>%
    filter(COUNTRY == "Australia") %>%
    filter(
        `Major teams` == "Australia" |
            `Major teams` == "Australia Women" |
            grepl("'Australia,'", `Major teams`) |
            grepl("'Australia Women,'", `Major teams`)
        )

Since we now have only internation players in the men’s and women’s leagues, let’s label which player playes in which league. One again, grepl() is our friend here, but we will need another tool.

Namely, the if_else() functon takes three arguments – a vector of TRUE/FALSE values, the value to use in the TRUE case, and the value to use in the FALSE case, eg.

v <- c(1, -2, -3, 0, 4)
if_else(v < 0, "Negative", "Non-negative")

## [1] "Non-negative" "Negative"     "Negative"     "Non-negative" "Non-negative"

In the league case, we instead use

df_aus <- df_aus %>%
    mutate(league = if_else(grepl("Women", `Major teams`), "Women's", "Men's"))

Getting the birthday right

Notice that our data frame has a Birthdate variable, and that it’s a total mess. Some dates are in the dd/mm/yy format, where others (seemingly the pre 1900) ones are in yyyy-mm-dd.

Fortunately, lubridate’s parse_date_time() function is very good at resolving such issues, even when there are several formats all mixed together.

Observe the following simple code, and examins the Birthdate variable to convince yourself that it worked.

df_aus %>%
    mutate(Birthdate = parse_date_time(Birthdate, c("dmy", "Ymd")))

Hang on, though - maybe it didn’t work! It’s 2021, so it’s reasonable to assume that there may be international cricketers born after 2000, so a birth date of 29/04/01 may be 2001-04-29 as easily as 1901-04-29, and due to the historic nature of these data, both cases probably occur in the table! What a disaster!

One way of resolving this is to check if the player is alive or not using the Died variable. If they’re alive, and the birth year is a two-digit number below 10, it’s reasonable to assume that that year is in the 21st century (a comprehensive list of living Australian 111-year-olds is publically available, so you can check manually if you aren’t convinced). On the other hand, if a player is dead, they were probably born in the 20th century.

Look at the help documentation for parse_date_time2() to understand how the cutoff_2000 argument works, and what it’s doing in the following code:

df_aus <- df_aus %>%
    mutate(
        Birthdate = 
            if_else(
                Died == "Dead",
                parse_date_time2(Birthdate, c("dmy", "Ymd"), cutoff_2000 = -1),
                parse_date_time2(Birthdate, c("dmy", "Ymd"), cutoff_2000 = 10)
            ),
        birth_year = year(Birthdate)
    )

I’m sure that you’ve noticed that this is all incredibly hacky, but real data is like that, unfortunately, and sometimes we have to hold our noses and do unsavory things.

Exercises

Do the same thing for death date. What problems arise that were not present in the birth date example?
The dataframe contains an Age variable, but it is outdated. Compute a corrected Age variable.
How would you check the assumption that all the 19th century birth years were correctly categorized?

Birth state

Next, we’ll isolate the state in which the players were born. A look at the Birthplace variable indicates that it is formatted as <city>, <state>, except for major cities, in which case the suburb is also included, so the format is <suburb>, <city>, <state> (in Australia, suburbs are generally considered to be part of their nearest major city).

Using the separate() function, we can split the Birthplace variable into three, inserting a split wherever we see a ,. If there are only two pieces in the string, these are the city and the state, so we use the fill = argument of separate() to convey that, if there are fewer than three segment, the left ones should be filled with missing values.

Another issue is that some birth places and in ", Australia", which we would like to simply delete. For this, we use the gsub() function, which replaces a part of one string with something else.

v <- c(
    "I have not seen such a long string in a long, long time",
    "the long and the short of it")
gsub("long", "short",  v)

## [1] "I have not seen such a short string in a short, short time"
## [2] "the short and the short of it"

In our case, we are trying to replace the substring ", Australia" falling at the end of the full string (this is denoted by putting a $ character at the end of the substring. For more details and fancier tricks, look up regular expressions) with an empty string, thus deleting it.

df_aus <- df_aus %>%
    mutate(Birthplace = gsub(", Australia$", "", Birthplace)) %>%
    separate(
        Birthplace,
        c("birth_suburb", "birth_city", "birth_state"),
        sep = ", ",
        fill = "left",
        remove = FALSE)

## Warning: Expected 3 pieces. Additional pieces discarded in 7 rows [99, 129, 145,
## 228, 230, 348, 497].

Let’s have a look at how successful this was by looking as all of the birth states that we found.

df_aus %>%
    group_by(birth_state) %>%
    summarize(count = n()) %>%
    arrange(desc(count)) %>%
    print(n = 40)

## # A tibble: 36 x 2
##    birth_state                  count
##    <chr>                        <int>
##  1 New South Wales                217
##  2 Victoria                       190
##  3 South Australia                 75
##  4 Western Australia               72
##  5 Queensland                      62
##  6 <NA>                            26
##  7 Tasmania                        24
##  8 England                          9
##  9 Sydney                           5
## 10 Perth                            4
## 11 Australian Capital Territory     3
## 12 India                            3
## 13 Ireland                          2
## 14 New Zealand                      2
## 15 Northern Territory               2
## 16 Scotland                         2
## 17 South Africa                     2
## 18 Yorkshire                        2
## 19 Brisbane                         1
## 20 Ceylon                           1
## 21 circa                            1
## 22 Devon                            1
## 23 Fitzroy                          1
## 24 France                           1
## 25 Gloucestershire                  1
## 26 Hervey Bay                       1
## 27 Malaysia                         1
## 28 Melbourne                        1
## 29 New South WAles                  1
## 30 North-West Frontier Province     1
## 31 Northern Rhodesia                1
## 32 Otago                            1
## 33 Pakistan                         1
## 34 Portugal                         1
## 35 Singapore                        1
## 36 Zimbabwe                         1

Hmmmmm, not so good! There’s a bunch of missing data, many players who were born outside Australia, and so can’t really be properly assigned a birth state, and worst of all, some errors, where Australian cities were impoperly parsed as states.

Ok, so first, let’s look at the source of the missing data:

df_aus %>%
    filter(is.na(birth_state)) %>%
    select(NAME, Birthplace, birth_state)

## # A tibble: 26 x 3
##    NAME             Birthplace birth_state
##    <chr>            <chr>      <chr>      
##  1 Cameron Bancroft <NA>       <NA>       
##  2 Leonie Callaghan <NA>       <NA>       
##  3 Gary Duperouzel  <NA>       <NA>       
##  4 A Dolin          <NA>       <NA>       
##  5 Molly Flaherty   <NA>       <NA>       
##  6 Patricia Holmes  <NA>       <NA>       
##  7 TR Hogarth       <NA>       <NA>       
##  8 Anne Gordon      <NA>       <NA>       
##  9 Laura Harris     <NA>       <NA>       
## 10 Grace Harris     <NA>       <NA>       
## # ... with 16 more rows

In all of these cases, the Birthplace variable is missing. If we were really consciencious, we would at this point go and look up the birthplaces of all of these players, but since we can’t fill this in from other parts of the table, let’s leave it as-is.

There remain some errors to be corrected, though. A number of Australian cities appear in the birth_state field due to improper enccoding of the data. In addition, there is an incorrect capitalization of New South WAles in one entry.

Both these issues can be solved with the recode() function, which replaces one character value in a vector with another.

Last, there is an entry titled circa. This is just an error, and should be a missing value instead. We fix this with the na_if() function, which designates certain strings as missing values.

df_aus <- df_aus %>% 
    mutate(
        birth_state = recode(
            birth_state,
            `New South WAles` = "New South Wales",
            Sydney = "New South Wales",
            Melbourne = "Victoria",
            Fitzroy = "Victoria",
            Brisbane = "Queensland",
            `Hervey Bay` = "Queensland",
            Perth = "Western Australia"
        ),
        birth_state = na_if(birth_state, "circa")
    )

Now let’s see if this works

df_aus %>%
    group_by(birth_state) %>%
    summarize(count = n()) %>%
    arrange(desc(count)) %>%
    print(n = 40)

## # A tibble: 28 x 2
##    birth_state                  count
##    <chr>                        <int>
##  1 New South Wales                223
##  2 Victoria                       192
##  3 Western Australia               76
##  4 South Australia                 75
##  5 Queensland                      64
##  6 <NA>                            27
##  7 Tasmania                        24
##  8 England                          9
##  9 Australian Capital Territory     3
## 10 India                            3
## 11 Ireland                          2
## 12 New Zealand                      2
## 13 Northern Territory               2
## 14 Scotland                         2
## 15 South Africa                     2
## 16 Yorkshire                        2
## 17 Ceylon                           1
## 18 Devon                            1
## 19 France                           1
## 20 Gloucestershire                  1
## 21 Malaysia                         1
## 22 North-West Frontier Province     1
## 23 Northern Rhodesia                1
## 24 Otago                            1
## 25 Pakistan                         1
## 26 Portugal                         1
## 27 Singapore                        1
## 28 Zimbabwe                         1

Definitely looks a lot better!

Eercises

How might you make the Major teams variable better? Try to implement a solution.

Plotting variables

Now that we have a clean-ish dataset, let’s put it to use. One thing that we ight want to look at is the years in which players were born.

Let’s collate this information by counting up the number of players born in each league in each year:

df_births <- df_aus %>%
    group_by(league, birth_year) %>%
    summarize(count = n())

## `summarise()` has grouped output by 'league'. You can override using the `.groups` argument.

With this summarized table, we can make a corresponding plot. The stat_smooth() function adds a smooth interpolating curve, which hopefully gives a better idea about how the response variable changes over time.

df_births %>%
    ggplot(aes(x = birth_year, y = count, color = league)) + 
    geom_point() + 
    stat_smooth()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

## Warning: Removed 1 rows containing non-finite values (stat_smooth).

## Warning: Removed 1 rows containing missing values (geom_point).

Exercises

There’s a big problem in the above plot. Namely, years with in which no players were born are represented not as 0 counts, but as missing data. This has a big impact on the smoothed curves. Implement a solution to this problem.
Create a histogram version of the above visualization.
Repeat this analysis for birth state. Think especially about how to address the problem of the unusual birthplaces.

We can repeat this plot with all kinds of variables. For example, test batting average denoted the number of runs a player has scored divided by the number of times that they have been out in test matches (the longest format of cricket, with matches lasting five days).

df_aus %>%
    ggplot(aes(x = birth_year, y = BATTING_Tests_Ave, color = league)) + 
    geom_point() + 
    stat_smooth()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

## Warning: Removed 128 rows containing non-finite values (stat_smooth).

## Warning: Removed 128 rows containing missing values (geom_point).

Exercises

Investigate the outliers in the above plots. Which are data abberations, and which are Australia’s greatest heroes?
Explain why this plot doesn’t suffer from the “zeroed out NAs” problem discussed in the previous set of exercises.

Another example is identifying correlations betweeen vairables. For example, we might correlate test batting averate with test bowling average (a bowler’s bowling average is the number of runs that batsmen have scored from that bowler’s balls divided by the number of times that bowler has got batsmen out, so a low bowling average is good).

In this case, we are looking for correlation, which is a measure of linear association. Hence, it’s better to fit a straight line to the data than a smooth curve. To do this, we specify the smoothing method as "lm", which stands for linear model:

df_aus %>%
    ggplot(aes(x = BOWLING_Tests_Ave, y = BATTING_Tests_Ave, color = league)) + 
    geom_point() + 
    stat_smooth(method = "lm")

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 366 rows containing non-finite values (stat_smooth).

## Warning: Removed 366 rows containing missing values (geom_point).

Exercises

Is is fair to say that there is a positive correlation between test batting and test bowling averagy in the men’s leage and a negative correlation in the women’s league?
Which points most influence the correlation in the above plot?
Repeat the previous analysis, but restrict your focus to players who have players over a certain number of tests. Do the same with players with players born after a certain date.
Repeat the analysis for bith state
How does this correlation change over time? How would you even compute or visualize that?
Find variables that you expect to have positive or negative correlations. Are you right?

Session info

sessionInfo()

## R version 4.0.4 (2021-02-15)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19041)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] lubridate_1.7.10 forcats_0.5.1    stringr_1.4.0    dplyr_1.0.5     
##  [5] purrr_0.3.4      readr_1.4.0      tidyr_1.1.3      tibble_3.1.0    
##  [9] ggplot2_3.3.3    tidyverse_1.3.0 
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.1.0  xfun_0.22         lattice_0.20-41   splines_4.0.4    
##  [5] haven_2.3.1       colorspace_2.0-0  vctrs_0.3.6       generics_0.1.0   
##  [9] htmltools_0.5.1.1 mgcv_1.8-33       yaml_2.2.1        utf8_1.2.1       
## [13] rlang_0.4.10      pillar_1.5.1      glue_1.4.2        withr_2.4.1      
## [17] DBI_1.1.1         dbplyr_2.1.0      modelr_0.1.8      readxl_1.3.1     
## [21] lifecycle_1.0.0   munsell_0.5.0     gtable_0.3.0      cellranger_1.1.0 
## [25] rvest_1.0.0       evaluate_0.14     labeling_0.4.2    knitr_1.31       
## [29] fansi_0.4.2       highr_0.8         broom_0.7.5       Rcpp_1.0.6       
## [33] scales_1.1.1      backports_1.2.1   jsonlite_1.7.2    farver_2.1.0     
## [37] fs_1.5.0          hms_1.0.0         digest_0.6.27     stringi_1.5.3    
## [41] grid_4.0.4        cli_2.3.1         tools_4.0.4       magrittr_2.0.1   
## [45] crayon_1.4.1      pkgconfig_2.0.3   Matrix_1.3-2      ellipsis_0.3.1   
## [49] xml2_1.3.2        reprex_1.0.0      assertthat_0.2.1  rmarkdown_2.7    
## [53] httr_1.4.2        rstudioapi_0.13   R6_2.5.0          nlme_3.1-152     
## [57] compiler_4.0.4

08-Data parsing and stat graphics

April 29, 2020

Australian cricketing statistics

Filtering the dataset

Getting the birthday right

Exercises

Birth state

Eercises

Plotting variables

Exercises

Exercises

Exercises

Session info