Note: There are often multiple ways to answer each question.
Load the MASS and dplyr packages. Load the nlschools dataset.
# install.packages("MASS")  # uncomment this line to install the package
library(MASS)
library(dplyr)
data(nlschools)
What is in the nlschools dataset? Why is the class column a factor and not a numeric variable? Use some of the functions we learned to get a feel for the data.
?nlschools
str(nlschools)
## 'data.frame': 2287 obs. of 6 variables:
## $ lang : int 46 45 33 46 20 30 30 57 36 36 ...
## $ IQ : num 15 14.5 9.5 11 8 9.5 9.5 13 9.5 11 ...
## $ class: Factor w/ 133 levels "180","280","1082",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ GS : int 29 29 29 29 29 29 29 29 29 29 ...
## $ SES : int 23 10 15 23 10 10 23 10 13 15 ...
## $ COMB : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
summary(nlschools)
##       lang             IQ            class            GS             SES
##  Min.   : 9.00   Min.   : 4.00   15580  :  33   Min.   :10.00   Min.   :10.00
##  1st Qu.:35.00   1st Qu.:10.50   5480   :  31   1st Qu.:23.00   1st Qu.:20.00
##  Median :42.00   Median :12.00   15980  :  31   Median :27.00   Median :27.00
##  Mean   :40.93   Mean   :11.83   16180  :  31   Mean   :26.51   Mean   :27.81
##  3rd Qu.:48.00   3rd Qu.:13.00   18380  :  31   3rd Qu.:31.00   3rd Qu.:35.00
##  Max.   :58.00   Max.   :18.00   5580   :  30   Max.   :39.00   Max.   :50.00
##                                  (Other):2100
##  COMB
##  0:1658
##  1: 629
head(nlschools)
## lang IQ class GS SES COMB
## 1 46 15.0 180 29 23 0
## 2 45 14.5 180 29 10 0
## 3 33 9.5 180 29 15 0
## 4 46 11.0 180 29 23 0
## 5 20 8.0 180 29 10 0
## 6 30 9.5 180 29 10 0
class is not a numeric variable because it represents a class ID. The IDs are labels: they have no meaningful ordering, and arithmetic on them (such as a mean) would not make sense.
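As a small illustration of how a factor stores IDs as labels, here is a minimal sketch using a hypothetical three-element factor (the object names `ids`, `codes`, and `values` are assumptions for illustration; the IDs are the first three levels shown in the `str` output). Calling `as.numeric` directly on a factor returns the internal level codes, not the IDs themselves.

```r
# Hypothetical example: three class IDs stored as a factor
ids <- factor(c("180", "280", "1082"), levels = c("180", "280", "1082"))

# as.numeric() on a factor returns the internal level codes
codes <- as.numeric(ids)

# Going through as.character() first recovers the IDs as numbers
values <- as.numeric(as.character(ids))

print(codes)   # 1 2 3
print(values)  # 180 280 1082
```

This is one reason to keep IDs as factors (or characters): nothing in the analysis should depend on their numeric value.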
# Number of rows (one per student) in the dataset
nrow(nlschools)
## [1] 2287
# Students with an IQ of at least 17.5
nlschools %>% filter(IQ >= 17.5)
## lang IQ class GS SES COMB
## 1 51 18.0 2980 22 45 1
## 2 51 18.0 5480 32 50 0
## 3 51 17.5 5580 32 45 0
## 4 49 17.5 6280 26 40 0
## 5 54 17.5 6280 26 30 0
## 6 53 17.5 6280 26 50 0
## 7 50 17.5 9480 25 33 0
## 8 51 17.5 15980 33 20 0
## 9 50 17.5 16080 26 40 0
## 10 51 18.0 18480 29 23 0
## 11 51 18.0 19780 35 40 1
## 12 54 17.5 21880 27 50 0
## 13 51 17.5 22780 30 27 0
# Students in class 2980 with SES below 37
nlschools %>% filter(class == 2980 & SES < 37)
## lang IQ class GS SES COMB
## 1 44 14.0 2980 22 35 1
## 2 39 6.0 2980 22 35 1
## 3 41 12.5 2980 22 35 1
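Since there are often multiple ways to answer a question, the sketch below shows two equivalent forms of this filter; the names `res_comma` and `res_base` are assumptions for illustration. Conditions separated by commas inside `filter` are combined with AND, and quoting the ID makes the comparison against the factor labels explicit.

```r
library(MASS)   # provides the nlschools dataset
library(dplyr)

# Comma-separated conditions in filter() are combined with AND, like `&`
res_comma <- nlschools %>% filter(class == "2980", SES < 37)

# Base R equivalent using subset()
res_base <- subset(nlschools, class == "2980" & SES < 37)

nrow(res_comma)
nrow(res_base)
```

Both versions should return the same three rows shown in the output above.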
# Number of students with a language test score above 50
nlschools %>%
  filter(lang > 50) %>%
  summarize(count = n())
## count
## 1 360
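The same count can be obtained in other ways. The sketch below, with the illustrative names `n_base` and `n_dplyr`, sums a logical vector in base R and counts the rows of the filtered data frame.

```r
library(MASS)   # provides the nlschools dataset
library(dplyr)

# Base R: TRUE counts as 1 and FALSE as 0 when summed
n_base <- sum(nlschools$lang > 50)

# dplyr: filter the rows, then count them
n_dplyr <- nlschools %>% filter(lang > 50) %>% nrow()

n_base
n_dplyr
```

Both values should match the count of 360 reported above.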
# Number of students per class, largest classes first
nlschools %>%
  group_by(class) %>%
  summarize(count = n()) %>%
  arrange(desc(count))
## # A tibble: 133 x 2
## class count
## <fct> <int>
## 1 15580 33
## 2 5480 31
## 3 15980 31
## 4 16180 31
## 5 18380 31
## 6 5580 30
## 7 11580 30
## 8 19980 30
## 9 14880 29
## 10 3880 28
## # … with 123 more rows
Class 15580 had the most students (33).
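A shorter route to the same table is dplyr's `count` with `sort = TRUE`, which groups, counts, and orders in one step. In this sketch the name `class_sizes` is an assumption for illustration; note that `count` names its result column `n` rather than `count`.

```r
library(MASS)   # provides the nlschools dataset
library(dplyr)

# count() is shorthand for group_by() + summarize(n = n());
# sort = TRUE puts the largest groups first
class_sizes <- nlschools %>% count(class, sort = TRUE)

head(class_sizes, 3)
```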
Create a new column pass, which takes on the value “pass” if lang >= 40, “fail” otherwise. Save the dataset with the new column in a variable nlschools2, then show the first 10 rows of the dataset. (Hint: The ifelse function will be handy.)
nlschools2 <- nlschools %>%
  mutate(pass = ifelse(lang >= 40, "pass", "fail"))
head(nlschools2, n = 10)
## lang IQ class GS SES COMB pass
## 1 46 15.0 180 29 23 0 pass
## 2 45 14.5 180 29 10 0 pass
## 3 33 9.5 180 29 15 0 fail
## 4 46 11.0 180 29 23 0 pass
## 5 20 8.0 180 29 10 0 fail
## 6 30 9.5 180 29 10 0 fail
## 7 30 9.5 180 29 23 0 fail
## 8 57 13.0 180 29 10 0 pass
## 9 36 9.5 180 29 13 0 fail
## 10 36 11.0 180 29 15 0 fail
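As an alternative sketch, dplyr's `if_else` can replace `ifelse`; it is stricter about the two outcomes having compatible types. The name `nlschools2_alt` is an assumption for illustration, and `table` gives a quick check that only the two expected values appear.

```r
library(MASS)   # provides the nlschools dataset
library(dplyr)

# if_else() is dplyr's stricter counterpart of base ifelse()
nlschools2_alt <- nlschools %>%
  mutate(pass = if_else(lang >= 40, "pass", "fail"))

# Tabulate the new column to check that only "fail" and "pass" occur
table(nlschools2_alt$pass)
```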
# Mean IQ and mean language test score for each SES value, highest SES first
nlschools3 <- nlschools %>%
  group_by(SES) %>%
  summarize(mean_IQ = mean(IQ),
            mean_lang = mean(lang)) %>%
  arrange(desc(SES))
head(nlschools3)
## # A tibble: 6 x 3
## SES mean_IQ mean_lang
## <int> <dbl> <dbl>
## 1 50 13.1 46.6
## 2 48 13.3 48.8
## 3 47 12.3 45.3
## 4 45 13.0 44.7
## 5 43 12.5 44.7
## 6 40 12.6 44.4
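When the same summary function is applied to several columns, `across` avoids repeating it. This is a sketch assuming dplyr 1.0.0 or later; the name `nlschools3_alt` is an assumption for illustration, and the `.names` pattern reproduces the column names used above.

```r
library(MASS)   # provides the nlschools dataset
library(dplyr)

# Apply mean() to both IQ and lang within each SES group;
# .names builds the output names mean_IQ and mean_lang
nlschools3_alt <- nlschools %>%
  group_by(SES) %>%
  summarize(across(c(IQ, lang), mean, .names = "mean_{.col}")) %>%
  arrange(desc(SES))

head(nlschools3_alt)
```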
library(ggplot2)
# Scatterplot of mean IQ against SES
ggplot(nlschools3) +
  geom_point(aes(x = SES, y = mean_IQ))
# Scatterplot of mean language test score against SES
ggplot(nlschools3) +
  geom_point(aes(x = SES, y = mean_lang))
(Hint: Use the sample_n function in the dplyr package.)
set.seed(100)
nlschools %>% sample_n(size = 10)
## lang IQ class GS SES COMB
## 1 40 10.0 6180 27 25 0
## 2 51 17.5 22780 30 27 0
## 3 26 8.0 6081 26 33 1
## 4 28 11.5 22280 26 40 0
## 5 51 11.5 17680 24 20 0
## 6 50 12.0 10180 25 33 0
## 7 49 10.0 13780 32 18 1
## 8 20 12.0 3380 14 15 1
## 9 42 11.0 17580 27 23 1
## 10 54 12.0 15580 34 33 0
set.seed sets the state of the random number generator. Running set.seed(100) immediately before the sampling line gives the same random sample every time, which makes results that depend on randomness reproducible.
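The effect of the seed can be checked directly. The sketch below draws two samples after setting the same seed and compares them; the names `sample_a` and `sample_b` are assumptions for illustration. It uses `slice_sample`, which supersedes `sample_n` as of dplyr 1.0.0, so it assumes that version or later.

```r
library(MASS)   # provides the nlschools dataset
library(dplyr)

# Same seed before each draw, so both draws pick the same rows
set.seed(100)
sample_a <- nlschools %>% slice_sample(n = 10)

set.seed(100)
sample_b <- nlschools %>% slice_sample(n = 10)

identical(sample_a, sample_b)  # TRUE
```

Without the second call to `set.seed`, the second draw would continue from the generator's current state and would generally select different rows.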