Introduction

US Education Data Analysis

This is a short data analysis of U.S. education spending and its impact on academic achievement The data was taken from https://www.kaggle.com/noriuk/us-education-datasets-unification-project.

Given that education is a critical factor in meeting a country’s economic and development goals, we aim to gain insight into how a state’s spending on education influences the quality of the state’s education system. In order to answer our main question, we will explore:

  1. Are expenditure and student enrollment correlated?
  2. How have the number of students in school per state changed over time?
  3. How has spending in school changed over time? (For a select number of states)
  4. How have test scores changed over time? (For a select number of states)
  5. Are expenditure and test scores positively or negatively correlated?

Our hypothesis is that higher spending will lead to higher test scores, although we are not sure what this relationship is going to exactly look like (e.g., will 2x spending result in 2x higher test scores? Or, is there a point of diminishing returns where education spending is optimized?).

Package imports:

library(tidyverse)
library(knitr)

Read in dataset:

df <- read.csv("/Users/nicolebirkner/Desktop/states_all.csv")

Getting a feel for the data

Here is a glimpse of the dataset, including the first few rows of the dataset, number of rows, number of columns and column names.

nrow(df)
## [1] 1492
ncol(df)
## [1] 25
head(df)
##       PRIMARY_KEY      STATE YEAR ENROLL TOTAL_REVENUE FEDERAL_REVENUE
## 1    1992_ALABAMA    ALABAMA 1992     NA       2678885          304177
## 2     1992_ALASKA     ALASKA 1992     NA       1049591          106780
## 3    1992_ARIZONA    ARIZONA 1992     NA       3258079          297888
## 4   1992_ARKANSAS   ARKANSAS 1992     NA       1711959          178571
## 5 1992_CALIFORNIA CALIFORNIA 1992     NA      26260025         2072470
## 6   1992_COLORADO   COLORADO 1992     NA       3185173          163253
##   STATE_REVENUE LOCAL_REVENUE TOTAL_EXPENDITURE INSTRUCTION_EXPENDITURE
## 1       1659028        715680           2653798                 1481703
## 2        720711        222100            972488                  498362
## 3       1369815       1590376           3401580                 1435908
## 4        958785        574603           1743022                  964323
## 5      16546514       7641041          27138832                14358922
## 6       1307986       1713934           3264826                 1642466
##   SUPPORT_SERVICES_EXPENDITURE OTHER_EXPENDITURE
## 1                       735036                NA
## 2                       350902                NA
## 3                      1007732                NA
## 4                       483488                NA
## 5                      8520926                NA
## 6                      1035970                NA
##   CAPITAL_OUTLAY_EXPENDITURE GRADES_PK_G GRADES_KG_G GRADES_4_G GRADES_8_G
## 1                     174053        8224       55460      57948      58025
## 2                      37451        2371       10152       9748       8789
## 3                     609114        2544       53497      55433      49081
## 4                     145212         808       33511      34632      36011
## 5                    2044688       59067      431763     418418     363296
## 6                     364760        7410       47588      50648      45025
##   GRADES_12_G GRADES_1_8_G GRADES_9_12_G GRADES_ALL_G AVG_MATH_4_SCORE
## 1       41167       471564        196386       676174         208.3279
## 2        6714        79117         30847       112335               NA
## 3       37410       437127        175210       614881         215.2539
## 4       27651       281338        123113       405259         210.2060
## 5      270675      3286034       1372011      4717112         208.3990
## 6       34533       394904        160299       562613         221.0234
##   AVG_MATH_8_SCORE AVG_READING_4_SCORE AVG_READING_8_SCORE
## 1         252.1875            207.9635                  NA
## 2               NA                  NA            258.8597
## 3         265.3663            206.2127            262.1699
## 4         256.3121            208.6345            264.6197
## 5         260.8922            196.7644                  NA
## 6         272.3984            213.4809            266.4197

Here are the column names of the dataset:

names(df)
##  [1] "PRIMARY_KEY"                  "STATE"                       
##  [3] "YEAR"                         "ENROLL"                      
##  [5] "TOTAL_REVENUE"                "FEDERAL_REVENUE"             
##  [7] "STATE_REVENUE"                "LOCAL_REVENUE"               
##  [9] "TOTAL_EXPENDITURE"            "INSTRUCTION_EXPENDITURE"     
## [11] "SUPPORT_SERVICES_EXPENDITURE" "OTHER_EXPENDITURE"           
## [13] "CAPITAL_OUTLAY_EXPENDITURE"   "GRADES_PK_G"                 
## [15] "GRADES_KG_G"                  "GRADES_4_G"                  
## [17] "GRADES_8_G"                   "GRADES_12_G"                 
## [19] "GRADES_1_8_G"                 "GRADES_9_12_G"               
## [21] "GRADES_ALL_G"                 "AVG_MATH_4_SCORE"            
## [23] "AVG_MATH_8_SCORE"             "AVG_READING_4_SCORE"         
## [25] "AVG_READING_8_SCORE"

Each row in this dataset is data for one state over a single year. Therefore, all keys should be unique.

length(unique(df$PRIMARY_KEY))
## [1] 1487
length(df$PRIMARY_KEY) - length(unique(df$PRIMARY_KEY))
## [1] 5

We noticed that there are five duplicated keys in our dataset. Thus, we proceeded to remove the repeated rows using the distinct function:

df <- distinct(df, PRIMARY_KEY, .keep_all = TRUE)

Correlation between student enrollment and expenditure

We decided to take a closer looks at expenditure and enrollment in 2013, 2014 and 2015 because those are the three most recent dates with complete data for both of those variables. To start our analysis, we filtered our data:

df_2013 <- df %>% filter(YEAR == "2013")
df_2014 <- df %>% filter(YEAR == "2014")
df_2015 <- df %>% filter(YEAR == "2015")

Then, we ran a regression for expenditure and number of students enrolled for 2013, 2014 and 2015. As we would expect, there is an upward trend in spending as a state’s enrollment increases. This makes sense because as the number of students goes up, the state requires additonal resources to support the students.

lm.df2013 <- lm(df_2013$TOTAL_EXPENDITURE ~ df_2013$ENROLL)
summary(lm.df2013)
## 
## Call:
## lm(formula = df_2013$TOTAL_EXPENDITURE ~ df_2013$ENROLL)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -8159119 -1851440  -701528    14381 28596690 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.856e+05  9.567e+05   0.717    0.477    
## df_2013$ENROLL 1.187e+01  6.514e-01  18.226   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5222000 on 49 degrees of freedom
##   (7 observations deleted due to missingness)
## Multiple R-squared:  0.8715, Adjusted R-squared:  0.8688 
## F-statistic: 332.2 on 1 and 49 DF,  p-value: < 2.2e-16
beta = lm.df2013$coefficients

ggplot(df_2013, aes(x = ENROLL, y = TOTAL_EXPENDITURE)) + geom_point() + geom_abline(intercept = beta[1], slope = beta[2], color = "red") + geom_text(aes(label=ifelse(ENROLL > 1500000, as.character(STATE), '')), hjust =1, vjust =0 ) + labs(title = "2013 Expenditure vs. Total Student Enrollment", x = "Total Enrollment", y = "Expenditure")

lm.df2014 <- lm(df_2014$TOTAL_EXPENDITURE ~ df_2014$ENROLL)
summary(lm.df2014)
## 
## Call:
## lm(formula = df_2014$TOTAL_EXPENDITURE ~ df_2014$ENROLL)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -8240261 -1926568  -623524    56093 29808494 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.107e+05  9.797e+05   0.623    0.536    
## df_2014$ENROLL 1.227e+01  6.650e-01  18.450   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5353000 on 49 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.8742, Adjusted R-squared:  0.8716 
## F-statistic: 340.4 on 1 and 49 DF,  p-value: < 2.2e-16
beta = lm.df2014$coefficients

ggplot(df_2014, aes(x = ENROLL, y = TOTAL_EXPENDITURE)) + geom_point() + geom_abline(intercept = beta[1], slope = beta[2], color = "red") + geom_text(aes(label=ifelse(ENROLL > 1500000, as.character(STATE), '')), hjust =1, vjust =0 ) + labs(title = "2014 Expenditure vs. Total Student Enrollment", x = "Total Enrollment", y = "Expenditure")

lm.df2015 <- lm(df_2015$TOTAL_EXPENDITURE ~ df_2015$ENROLL)
summary(lm.df2015)
## 
## Call:
## lm(formula = df_2015$TOTAL_EXPENDITURE ~ df_2015$ENROLL)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -9122224 -2007825  -577219   280979 30522154 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    4.155e+05  1.016e+06   0.409    0.684    
## df_2015$ENROLL 1.298e+01  6.867e-01  18.901   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5555000 on 49 degrees of freedom
##   (6 observations deleted due to missingness)
## Multiple R-squared:  0.8794, Adjusted R-squared:  0.8769 
## F-statistic: 357.2 on 1 and 49 DF,  p-value: < 2.2e-16
beta = lm.df2015$coefficients

ggplot(df_2015, aes(x = ENROLL, y = TOTAL_EXPENDITURE)) + geom_point() + geom_abline(intercept = beta[1], slope = beta[2], color = "red") + geom_text(aes(label=ifelse(ENROLL > 1500000, as.character(STATE), '')), hjust =1, vjust =0 ) + labs(title = "2015 Expenditure vs. Total Student Enrollment", x = "Total Enrollment", y = "Expenditure")

Therefore, we can conclude that there is a strong positive correlation between number of students enrolled and total expenditure in education.

Deep Dive: Enrollment

The next section will analyze how the number of students enrolled in school per state has changed over time. The following graph shows change of enrollment over time per state, for all 50 states.

df_selected <- df %>% select(STATE, YEAR, ENROLL)
selected_complete <- df_selected[complete.cases(df_selected),]
ggplot(data = selected_complete, mapping = aes(x = YEAR, y = ENROLL/1000000, col = STATE)) + geom_point() + labs(title = "School enrollment per state", x = "Year", y = "Number of students enrolled (millions)") + theme(legend.title = element_text(size = 4), legend.text = element_text(size = 4))

Since the graph shown above is a bit cluttered with all 50 states, we decided to create a graph of only the top four states in terms of student enrollment (in this case, California, Texas, New York and Florida).

df_top4 <- df %>% select(STATE, YEAR, ENROLL) %>% filter(STATE == "CALIFORNIA" | STATE == "TEXAS" | STATE == "NEW_YORK" | STATE == "FLORIDA")
top4_complete <- df_top4[complete.cases(df_top4),]
ggplot(data = top4_complete, mapping = aes(x = YEAR, y = ENROLL/1000000, col = STATE)) + geom_point() + labs(title = "School enrollment per state", x = "Year", y = "Number of students enrolled (millions)") + theme(legend.title = element_text(size = 10), legend.text = element_text(size = 10))

It’s interesting to see that while California, Florida and Texas enrollment numbers have increased, New York enrollment numbers have been declining in the past few years.

Deep Dive: Expenditures

Our next step is to figure out how much money these top four states are spending on their education systems per year.

df_top <- df %>% select(STATE, YEAR, ENROLL, INSTRUCTION_EXPENDITURE, TOTAL_EXPENDITURE) %>% filter(STATE == "CALIFORNIA" | STATE == "TEXAS" | STATE == "NEW_YORK" | STATE == "FLORIDA")
top_complete <- df_top[complete.cases(df_top),]
ggplot(data = top_complete, mapping = aes(x = YEAR, y = TOTAL_EXPENDITURE/1000000, col = STATE)) + geom_point() + labs(title = "Top 4 State's Total Expenditure Over Time ", x = "Year", y = "Total Expenditure (millions of $'s)")

From the plot we can see that California spends the most on its education system, followed by New York, Texas, then Florida. It is interesting to see that in 2015 New York had the 4th highest student enrollment while spending the second highest amount in education. We decided to dig a little deeper into how education spending affected the state’s quality of education. To do this we added a new column to our dataframe that contained the sum of each state’s average test scores across the four categories that were recorded in the data set. Our aim was to see if there was a correlation between amount spent and test scores.

Deep Dive: Test Scores

Now, we also want to analyze how test scores have changed over time. Therefore, we created a new column that adds up all the scores (math 4, math 8, reading 4, reading 8), and graphed those over time.

scores <- df %>% mutate(TOTAL_SCORES = df$AVG_MATH_4_SCORE + df$AVG_MATH_8_SCORE + df$AVG_READING_4_SCORE + df$AVG_READING_8_SCORE) %>% filter(STATE == "CALIFORNIA" | STATE == "TEXAS" | STATE == "NEW_YORK"| STATE == "FLORIDA")

scores_selected <- scores %>% select(STATE, YEAR, ENROLL, TOTAL_EXPENDITURE, INSTRUCTION_EXPENDITURE, TOTAL_SCORES) 

scoresselected_complete <- scores_selected[complete.cases(scores_selected),]

ggplot(data = scoresselected_complete, mapping = aes(x = YEAR, y = TOTAL_SCORES, col = STATE)) + geom_line() + labs(title = "Test Scores Over Time", x = "Year", y = "Total Scores")

What is interesting here is that although California is the highest spender, it’s test scores lag behind the other three states considered in this analysis. Another point to note is that Florida has seen dramatic improvements in their test scores since 1992, improving by 10% in this time frame.

Correlation between test scores and expenditure

We decided to measure the relationship between test scores and expenditure on instruction. To do this, we created a new column that found the percentage of a state’s expenditure that was spent on instruction. Then, we plotted a linear regression with test scores regressed on instruction expenditure.

df_2015new <- df_2015 %>% mutate(PERCENT = df_2015$INSTRUCTION_EXPENDITURE/df_2015$TOTAL_EXPENDITURE) %>% mutate(TOTAL_SCORES = df_2015$AVG_MATH_4_SCORE + df_2015$AVG_MATH_8_SCORE + df_2015$AVG_READING_4_SCORE + df_2015$AVG_READING_8_SCORE)

lm.df2015new <- lm(df_2015new$TOTAL_SCORES ~ df_2015new$PERCENT)
summary(lm.df2015new)
## 
## Call:
## lm(formula = df_2015new$TOTAL_SCORES ~ df_2015new$PERCENT)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -28.151 -14.920   1.611   9.534  34.307 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          941.17      31.00  30.360   <2e-16 ***
## df_2015new$PERCENT   131.37      59.76   2.198   0.0327 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.82 on 49 degrees of freedom
##   (6 observations deleted due to missingness)
## Multiple R-squared:  0.08977,    Adjusted R-squared:  0.0712 
## F-statistic: 4.833 on 1 and 49 DF,  p-value: 0.03268
beta = lm.df2015new$coefficients

ggplot(df_2015new, aes(x = PERCENT, y = TOTAL_SCORES)) + geom_point() + geom_abline(intercept = beta[1], slope = beta[2], color = "red") + labs(title = "Total Scores vs. Percent of Instructor Expenditure", x = "% of Instructor Expenditure", y = "Total Scores")

There doesn’t seem to be a very strong correlation between scores and percent of instructor expenditure, as we expected. However, even though it’s a weak correlation it still seems to be positive. That is, as the % spent on instructors increases, the scores increase too.

Conclusion

Through our analysis of US education data we were able to answer each of the questions listed at the beginning of this report.

Summary of results:

  1. Are expenditure and student enrollment correlated? Our regression of 2013, 2014, and 2015 US enrollment vs education expense data illustrated that there does in fact exist a statistically significant relationship between a state’s level of expenditure on education and their student enrollment.

  2. How have the number of students in school per state changed over time? Our analysis of the change of student enrollment over time showed that some states, like California, Florida and Texas, have increased in terms of student enrollment over the past years, but New York has remained more constant, even showing a slight dip.

  3. How has spending in the previously mentioned four states changed over time? Spending in those four states has increased as student enrollment increased. It is interesting to see that in 2015 New York shows the 4th highest student enrollment while spending the second highest amount in education. In addition, we were able to see at which points the national recession affected each of the states, by the dip in education expenditure.

  4. How have test scores changed over time? (For a select number of states) We also analyzed how test scores have changed over time for those four states. It’s interesting to see that although California is the highest spender, it’s test scores lag behind the other three states. Another point to note is that Florida has seen dramatic improvements in their test scores since 1992.

  5. Are expenditure and test scores positively or negatively correlated? Finally, we ran a regression between test scores and percent of instructor expenditure (that is, instructor expenditure divided by total expenditure). We found that altough there is not a very strong correlation, the correlation is still positive and significant.

Next steps:

Scores, enrollment and spending are all trending upward through time. For the next steps in our analysis, we would like to investigate whether this is a result of general population increase, or if children who were previously not schooled are now receiving education.