This is a short data analysis of U.S. education spending and its impact on academic achievement The data was taken from https://www.kaggle.com/noriuk/us-education-datasets-unification-project.
Given that education is a critical factor in meeting a country’s economic and development goals, we aim to gain insight into how a state’s spending on education influences the quality of the state’s education system. In order to answer our main question, we will explore:
Our hypothesis is that higher spending will lead to higher test scores, although we are not sure what this relationship is going to exactly look like (e.g., will 2x spending result in 2x higher test scores? Or, is there a point of diminishing returns where education spending is optimized?).
Package imports:
library(tidyverse)
library(knitr)
Read in dataset:
df <- read.csv("/Users/nicolebirkner/Desktop/states_all.csv")
Here is a glimpse of the dataset, including the first few rows of the dataset, number of rows, number of columns and column names.
nrow(df)
## [1] 1492
ncol(df)
## [1] 25
head(df)
## PRIMARY_KEY STATE YEAR ENROLL TOTAL_REVENUE FEDERAL_REVENUE
## 1 1992_ALABAMA ALABAMA 1992 NA 2678885 304177
## 2 1992_ALASKA ALASKA 1992 NA 1049591 106780
## 3 1992_ARIZONA ARIZONA 1992 NA 3258079 297888
## 4 1992_ARKANSAS ARKANSAS 1992 NA 1711959 178571
## 5 1992_CALIFORNIA CALIFORNIA 1992 NA 26260025 2072470
## 6 1992_COLORADO COLORADO 1992 NA 3185173 163253
## STATE_REVENUE LOCAL_REVENUE TOTAL_EXPENDITURE INSTRUCTION_EXPENDITURE
## 1 1659028 715680 2653798 1481703
## 2 720711 222100 972488 498362
## 3 1369815 1590376 3401580 1435908
## 4 958785 574603 1743022 964323
## 5 16546514 7641041 27138832 14358922
## 6 1307986 1713934 3264826 1642466
## SUPPORT_SERVICES_EXPENDITURE OTHER_EXPENDITURE
## 1 735036 NA
## 2 350902 NA
## 3 1007732 NA
## 4 483488 NA
## 5 8520926 NA
## 6 1035970 NA
## CAPITAL_OUTLAY_EXPENDITURE GRADES_PK_G GRADES_KG_G GRADES_4_G GRADES_8_G
## 1 174053 8224 55460 57948 58025
## 2 37451 2371 10152 9748 8789
## 3 609114 2544 53497 55433 49081
## 4 145212 808 33511 34632 36011
## 5 2044688 59067 431763 418418 363296
## 6 364760 7410 47588 50648 45025
## GRADES_12_G GRADES_1_8_G GRADES_9_12_G GRADES_ALL_G AVG_MATH_4_SCORE
## 1 41167 471564 196386 676174 208.3279
## 2 6714 79117 30847 112335 NA
## 3 37410 437127 175210 614881 215.2539
## 4 27651 281338 123113 405259 210.2060
## 5 270675 3286034 1372011 4717112 208.3990
## 6 34533 394904 160299 562613 221.0234
## AVG_MATH_8_SCORE AVG_READING_4_SCORE AVG_READING_8_SCORE
## 1 252.1875 207.9635 NA
## 2 NA NA 258.8597
## 3 265.3663 206.2127 262.1699
## 4 256.3121 208.6345 264.6197
## 5 260.8922 196.7644 NA
## 6 272.3984 213.4809 266.4197
Here are the column names of the dataset:
names(df)
## [1] "PRIMARY_KEY" "STATE"
## [3] "YEAR" "ENROLL"
## [5] "TOTAL_REVENUE" "FEDERAL_REVENUE"
## [7] "STATE_REVENUE" "LOCAL_REVENUE"
## [9] "TOTAL_EXPENDITURE" "INSTRUCTION_EXPENDITURE"
## [11] "SUPPORT_SERVICES_EXPENDITURE" "OTHER_EXPENDITURE"
## [13] "CAPITAL_OUTLAY_EXPENDITURE" "GRADES_PK_G"
## [15] "GRADES_KG_G" "GRADES_4_G"
## [17] "GRADES_8_G" "GRADES_12_G"
## [19] "GRADES_1_8_G" "GRADES_9_12_G"
## [21] "GRADES_ALL_G" "AVG_MATH_4_SCORE"
## [23] "AVG_MATH_8_SCORE" "AVG_READING_4_SCORE"
## [25] "AVG_READING_8_SCORE"
Each row in this dataset is data for one state over a single year. Therefore, all keys should be unique.
length(unique(df$PRIMARY_KEY))
## [1] 1487
length(df$PRIMARY_KEY) - length(unique(df$PRIMARY_KEY))
## [1] 5
We noticed that there are five duplicated keys in our dataset. Thus, we proceeded to remove the repeated rows using the distinct
function:
df <- distinct(df, PRIMARY_KEY, .keep_all = TRUE)
We decided to take a closer looks at expenditure and enrollment in 2013, 2014 and 2015 because those are the three most recent dates with complete data for both of those variables. To start our analysis, we filtered our data:
df_2013 <- df %>% filter(YEAR == "2013")
df_2014 <- df %>% filter(YEAR == "2014")
df_2015 <- df %>% filter(YEAR == "2015")
Then, we ran a regression for expenditure and number of students enrolled for 2013, 2014 and 2015. As we would expect, there is an upward trend in spending as a state’s enrollment increases. This makes sense because as the number of students goes up, the state requires additonal resources to support the students.
lm.df2013 <- lm(df_2013$TOTAL_EXPENDITURE ~ df_2013$ENROLL)
summary(lm.df2013)
##
## Call:
## lm(formula = df_2013$TOTAL_EXPENDITURE ~ df_2013$ENROLL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8159119 -1851440 -701528 14381 28596690
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.856e+05 9.567e+05 0.717 0.477
## df_2013$ENROLL 1.187e+01 6.514e-01 18.226 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5222000 on 49 degrees of freedom
## (7 observations deleted due to missingness)
## Multiple R-squared: 0.8715, Adjusted R-squared: 0.8688
## F-statistic: 332.2 on 1 and 49 DF, p-value: < 2.2e-16
beta = lm.df2013$coefficients
ggplot(df_2013, aes(x = ENROLL, y = TOTAL_EXPENDITURE)) + geom_point() + geom_abline(intercept = beta[1], slope = beta[2], color = "red") + geom_text(aes(label=ifelse(ENROLL > 1500000, as.character(STATE), '')), hjust =1, vjust =0 ) + labs(title = "2013 Expenditure vs. Total Student Enrollment", x = "Total Enrollment", y = "Expenditure")
lm.df2014 <- lm(df_2014$TOTAL_EXPENDITURE ~ df_2014$ENROLL)
summary(lm.df2014)
##
## Call:
## lm(formula = df_2014$TOTAL_EXPENDITURE ~ df_2014$ENROLL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8240261 -1926568 -623524 56093 29808494
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.107e+05 9.797e+05 0.623 0.536
## df_2014$ENROLL 1.227e+01 6.650e-01 18.450 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5353000 on 49 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.8742, Adjusted R-squared: 0.8716
## F-statistic: 340.4 on 1 and 49 DF, p-value: < 2.2e-16
beta = lm.df2014$coefficients
ggplot(df_2014, aes(x = ENROLL, y = TOTAL_EXPENDITURE)) + geom_point() + geom_abline(intercept = beta[1], slope = beta[2], color = "red") + geom_text(aes(label=ifelse(ENROLL > 1500000, as.character(STATE), '')), hjust =1, vjust =0 ) + labs(title = "2014 Expenditure vs. Total Student Enrollment", x = "Total Enrollment", y = "Expenditure")
lm.df2015 <- lm(df_2015$TOTAL_EXPENDITURE ~ df_2015$ENROLL)
summary(lm.df2015)
##
## Call:
## lm(formula = df_2015$TOTAL_EXPENDITURE ~ df_2015$ENROLL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9122224 -2007825 -577219 280979 30522154
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.155e+05 1.016e+06 0.409 0.684
## df_2015$ENROLL 1.298e+01 6.867e-01 18.901 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5555000 on 49 degrees of freedom
## (6 observations deleted due to missingness)
## Multiple R-squared: 0.8794, Adjusted R-squared: 0.8769
## F-statistic: 357.2 on 1 and 49 DF, p-value: < 2.2e-16
beta = lm.df2015$coefficients
ggplot(df_2015, aes(x = ENROLL, y = TOTAL_EXPENDITURE)) + geom_point() + geom_abline(intercept = beta[1], slope = beta[2], color = "red") + geom_text(aes(label=ifelse(ENROLL > 1500000, as.character(STATE), '')), hjust =1, vjust =0 ) + labs(title = "2015 Expenditure vs. Total Student Enrollment", x = "Total Enrollment", y = "Expenditure")
Therefore, we can conclude that there is a strong positive correlation between number of students enrolled and total expenditure in education.
The next section will analyze how the number of students enrolled in school per state has changed over time. The following graph shows change of enrollment over time per state, for all 50 states.
df_selected <- df %>% select(STATE, YEAR, ENROLL)
selected_complete <- df_selected[complete.cases(df_selected),]
ggplot(data = selected_complete, mapping = aes(x = YEAR, y = ENROLL/1000000, col = STATE)) + geom_point() + labs(title = "School enrollment per state", x = "Year", y = "Number of students enrolled (millions)") + theme(legend.title = element_text(size = 4), legend.text = element_text(size = 4))
Since the graph shown above is a bit cluttered with all 50 states, we decided to create a graph of only the top four states in terms of student enrollment (in this case, California, Texas, New York and Florida).
df_top4 <- df %>% select(STATE, YEAR, ENROLL) %>% filter(STATE == "CALIFORNIA" | STATE == "TEXAS" | STATE == "NEW_YORK" | STATE == "FLORIDA")
top4_complete <- df_top4[complete.cases(df_top4),]
ggplot(data = top4_complete, mapping = aes(x = YEAR, y = ENROLL/1000000, col = STATE)) + geom_point() + labs(title = "School enrollment per state", x = "Year", y = "Number of students enrolled (millions)") + theme(legend.title = element_text(size = 10), legend.text = element_text(size = 10))
It’s interesting to see that while California, Florida and Texas enrollment numbers have increased, New York enrollment numbers have been declining in the past few years.
Our next step is to figure out how much money these top four states are spending on their education systems per year.
df_top <- df %>% select(STATE, YEAR, ENROLL, INSTRUCTION_EXPENDITURE, TOTAL_EXPENDITURE) %>% filter(STATE == "CALIFORNIA" | STATE == "TEXAS" | STATE == "NEW_YORK" | STATE == "FLORIDA")
top_complete <- df_top[complete.cases(df_top),]
ggplot(data = top_complete, mapping = aes(x = YEAR, y = TOTAL_EXPENDITURE/1000000, col = STATE)) + geom_point() + labs(title = "Top 4 State's Total Expenditure Over Time ", x = "Year", y = "Total Expenditure (millions of $'s)")
From the plot we can see that California spends the most on its education system, followed by New York, Texas, then Florida. It is interesting to see that in 2015 New York had the 4th highest student enrollment while spending the second highest amount in education. We decided to dig a little deeper into how education spending affected the state’s quality of education. To do this we added a new column to our dataframe that contained the sum of each state’s average test scores across the four categories that were recorded in the data set. Our aim was to see if there was a correlation between amount spent and test scores.
Now, we also want to analyze how test scores have changed over time. Therefore, we created a new column that adds up all the scores (math 4, math 8, reading 4, reading 8), and graphed those over time.
scores <- df %>% mutate(TOTAL_SCORES = df$AVG_MATH_4_SCORE + df$AVG_MATH_8_SCORE + df$AVG_READING_4_SCORE + df$AVG_READING_8_SCORE) %>% filter(STATE == "CALIFORNIA" | STATE == "TEXAS" | STATE == "NEW_YORK"| STATE == "FLORIDA")
scores_selected <- scores %>% select(STATE, YEAR, ENROLL, TOTAL_EXPENDITURE, INSTRUCTION_EXPENDITURE, TOTAL_SCORES)
scoresselected_complete <- scores_selected[complete.cases(scores_selected),]
ggplot(data = scoresselected_complete, mapping = aes(x = YEAR, y = TOTAL_SCORES, col = STATE)) + geom_line() + labs(title = "Test Scores Over Time", x = "Year", y = "Total Scores")
What is interesting here is that although California is the highest spender, it’s test scores lag behind the other three states considered in this analysis. Another point to note is that Florida has seen dramatic improvements in their test scores since 1992, improving by 10% in this time frame.
We decided to measure the relationship between test scores and expenditure on instruction. To do this, we created a new column that found the percentage of a state’s expenditure that was spent on instruction. Then, we plotted a linear regression with test scores regressed on instruction expenditure.
df_2015new <- df_2015 %>% mutate(PERCENT = df_2015$INSTRUCTION_EXPENDITURE/df_2015$TOTAL_EXPENDITURE) %>% mutate(TOTAL_SCORES = df_2015$AVG_MATH_4_SCORE + df_2015$AVG_MATH_8_SCORE + df_2015$AVG_READING_4_SCORE + df_2015$AVG_READING_8_SCORE)
lm.df2015new <- lm(df_2015new$TOTAL_SCORES ~ df_2015new$PERCENT)
summary(lm.df2015new)
##
## Call:
## lm(formula = df_2015new$TOTAL_SCORES ~ df_2015new$PERCENT)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.151 -14.920 1.611 9.534 34.307
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 941.17 31.00 30.360 <2e-16 ***
## df_2015new$PERCENT 131.37 59.76 2.198 0.0327 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.82 on 49 degrees of freedom
## (6 observations deleted due to missingness)
## Multiple R-squared: 0.08977, Adjusted R-squared: 0.0712
## F-statistic: 4.833 on 1 and 49 DF, p-value: 0.03268
beta = lm.df2015new$coefficients
ggplot(df_2015new, aes(x = PERCENT, y = TOTAL_SCORES)) + geom_point() + geom_abline(intercept = beta[1], slope = beta[2], color = "red") + labs(title = "Total Scores vs. Percent of Instructor Expenditure", x = "% of Instructor Expenditure", y = "Total Scores")
There doesn’t seem to be a very strong correlation between scores and percent of instructor expenditure, as we expected. However, even though it’s a weak correlation it still seems to be positive. That is, as the % spent on instructors increases, the scores increase too.
Through our analysis of US education data we were able to answer each of the questions listed at the beginning of this report.
Summary of results:
Are expenditure and student enrollment correlated? Our regression of 2013, 2014, and 2015 US enrollment vs education expense data illustrated that there does in fact exist a statistically significant relationship between a state’s level of expenditure on education and their student enrollment.
How have the number of students in school per state changed over time? Our analysis of the change of student enrollment over time showed that some states, like California, Florida and Texas, have increased in terms of student enrollment over the past years, but New York has remained more constant, even showing a slight dip.
How has spending in the previously mentioned four states changed over time? Spending in those four states has increased as student enrollment increased. It is interesting to see that in 2015 New York shows the 4th highest student enrollment while spending the second highest amount in education. In addition, we were able to see at which points the national recession affected each of the states, by the dip in education expenditure.
How have test scores changed over time? (For a select number of states) We also analyzed how test scores have changed over time for those four states. It’s interesting to see that although California is the highest spender, it’s test scores lag behind the other three states. Another point to note is that Florida has seen dramatic improvements in their test scores since 1992.
Are expenditure and test scores positively or negatively correlated? Finally, we ran a regression between test scores and percent of instructor expenditure (that is, instructor expenditure divided by total expenditure). We found that altough there is not a very strong correlation, the correlation is still positive and significant.
Next steps:
Scores, enrollment and spending are all trending upward through time. For the next steps in our analysis, we would like to investigate whether this is a result of general population increase, or if children who were previously not schooled are now receiving education.