Chapter 2 Cleaning the Data
2.1 Importing Packages & Data
First, we import a number of packages with convenient functions:
library(tidyverse) # helps with data processing
library(DBI) # allows SQL operations in R
library(readxl) # allows creation of data.frames from xlsx files
library(cli) # colors console output for easy reading
library(lubridate) # helps with date processing
library(stringr) # helps with string/substring processing
library(uuid) # generates unique identifiers
library(ggplot2) # generates visualizations
library(ggrepel) # labelling for ggplots
Then we import the data:
# Library database usage
az_database_list <- read.csv('raw_data/az_database_list.csv')
az_database_usage <- read.csv('raw_data/database_usage_by_month.csv')
# U of I research outputs from VERSO
research_outputs <- read.csv('raw_data/research-outputs-by-ay-and-college.csv')
# Enrollment numbers from RI dashboards
spring_enrollment <- read_xlsx('raw_data/spring-enrollment_by_college-major-and-semester.xlsx')
fall_enrollment <- read_xlsx('raw_data/fall-enrollment_by_college-major-and-semester.xlsx')
# Grant expenditures from research report archives
grant_funding <- read_xlsx('raw_data/funding-reports-combined_by_college-and-fy.xlsx')
2.2 Exploring the Data
Since the data is raw and messy, the first thing we want to do is get a sense of the columns and data types in our new data.frames.
View(az_database_list) # view data.frame
for (i in colnames(az_database_list)){ # list each column and its data type
cat(col_blue(i), ': ', col_red(class(az_database_list[[i]])), "\n") # color-code the output for easy reading
}
View(az_database_usage)
for (i in colnames(az_database_usage)){
cat(col_blue(i), ': ', col_red(class(az_database_usage[[i]])), "\n")
}
View(fall_enrollment)
for (i in colnames(fall_enrollment)){
cat(col_blue(i), ': ', col_red(class(fall_enrollment[[i]])), "\n")
}
View(spring_enrollment)
for (i in colnames(spring_enrollment)){
cat(col_blue(i), ': ', col_red(class(spring_enrollment[[i]])), "\n")
}
View(grant_funding)
for (i in colnames(grant_funding)){
cat(col_blue(i), ': ', col_red(class(grant_funding[[i]])), "\n")
}
View(research_outputs)
for (i in colnames(research_outputs)){
cat(col_blue(i), ': ', col_red(class(research_outputs[[i]])), "\n")
}
Technically, this didn’t need to be a for loop. But, as a Python native, at this point in the project I wasn’t used to R’s innate vectorization or the apply family of functions. Now I leave the loops in to remember my roots and where I come from. :D
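For the record, the vectorized version of one of those loops is a one-liner. A sketch using sapply(), which maps class() over every column and returns a named character vector (purrr’s map functions would work just as well):
sapply(az_database_list, class) # same column/type summary, no explicit loop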
2.3 Data Cleaning
With that done, we can start the data cleaning: first, by removing columns we obviously do not need and renaming the ones we do need to avoid issues with special characters later on:
az_database_list <- select(az_database_list,
id = "ID",
name = "Name",
description = "Description",
date_created = "Created",
vendor = "Vendor",
subjects = "Subjects")
fall_enrollment <- select(fall_enrollment,
college = "College",
term = "Term Description",
enrollment = "Count (Data suppression) 1")
spring_enrollment <- select(spring_enrollment,
college = "College",
term = "Term Description",
enrollment = "Count (Data suppression) 1")
grant_funding <- select(grant_funding,
fiscal_year = "fiscal-year",
college,
grant_proposals = "Number of proposals",
grant_funding = "research expenditures")
From there, we can begin cleaning the data sources one by one. The processing for each of these was done in one big pipe chain, but to make it more readable, I’ve broken it into discrete chunks. The original pipe chains can be found in 03-preprocess-raw-data.R.
2.3.1 The Research Outputs Data.frame.
The ultimate goal for this data.frame is a count of the number of publications each college published in each fiscal year. So, our observational unit is “Fiscal Year + College”. E.g., “The College of Engineering produced X publications in fiscal year 2020”.
This was my first mistake with tidiness! Technically, concatenating fiscal year and college into one column is untidy. However, as we’ll see during the “visualization” section, this error does not actually make processing the data much more difficult.
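For what it’s worth, the concatenated ID is also easy to pull back apart whenever a tidy form is needed. A sketch, assuming an fy_college value that starts with the 4-digit fiscal year, like the ones built later in this chapter (e.g. “2024cals”):
fall_enrollment %>%
mutate(fiscal_year = as.integer(str_sub(fy_college, start = 1, end = 4)), # the leading four digits are the fiscal year
college = str_sub(fy_college, start = 5)) # everything after them is the College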
Within the research outputs data.frame, each research asset is affiliated with one or more colleges based on the affiliation(s) of the author(s). Unfortunately, each affiliation is listed in its own column. Based on looking at the data.frame colnames, we can tell assets have between 1 and 7 columns dedicated to their affiliations. We need to collapse those into one column that contains a vector of all the asset’s affiliations:
print(unique(unlist(research_outputs[9:16]))) # used to determine whether empty cells contained "", NA's, NULLs, or other values
research_outputs <- research_outputs %>%
mutate(unit_affiliations =
lapply(1:nrow(research_outputs), function(i){
unique(research_outputs[i, 9:16][research_outputs[i, 9:16] != ""]) # != condition based on print statement above
})
) %>%
select(asset_id = "Asset.Id",
pub_year = "Asset.Published.Date..String.",
type = "Asset.Type",
unit_affiliations) # remove the 7 columns that were previously dedicated to asset affiliations
###
# debug
print(unique(unlist(research_outputs$unit_affiliations))) # check to make sure all expected Colleges are somewhere in the new unit affiliations column
###
Next, we need to determine which fiscal year each asset was published in, because many University-level variables are reported by fiscal year and we want the asset data to mesh with that.
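Concretely, the code below uses a July - June fiscal year labelled by its starting calendar year, so a paper published in March 2021 belongs to fiscal year 2020. A small helper (hypothetical; the actual chain inlines this logic in case_when()) makes the rule explicit:
fy_from_date <- function(d) if_else(month(d) >= 7, year(d), year(d) - 1) # hypothetical helper, not used in the chain
fy_from_date(mdy("03/15/2021")) # 2020
fy_from_date(mdy("10/01/2021")) # 2021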
To do so, we need to clean the “pub_year” column, creating lubridate-friendly values. Unfortunately, dates are stored in a variety of formats:
- YYYY
- DD/MM/YYYY
- MM/DD/YYYY
- YY-MMM (where MMM is a 3 letter abbreviation such as “Jan”)
- MMM-YY
Huge thanks to https://regex101.com/ and https://regex-generator.olafneumann.org/ for helping me create and debug the regular expressions that uniquely pick out the date formats seen in the data.frame!
%>%
mutate(
pub_year_cleaned = case_when( # use regex to parse different date formats into lubridate friendly strings
str_detect(pub_year, "^\\d{4}$") ~ mdy(paste0('01/01/',pub_year)),
str_detect(pub_year, "^(0[1-9]|[12][0-9]|3[01])/(0[1-9]|1[0-2])/(\\d{4}$)") ~ dmy(pub_year),
str_detect(pub_year, "^([1-9]|1[12])/([1-9]|[12][0-9]|3[01])/(19|20)\\d{2}$") ~ mdy(pub_year),
str_detect(pub_year, "^\\d{1,2}-[a-zA-Z]{3}$") ~ ymd(
paste0(
str_replace(str_extract(pub_year, "\\d{1,2}"),
"\\d{1,2}",
paste0('20',str_extract(pub_year, "\\d{1,2}"),'-',str_extract(pub_year, '[a-zA-Z]{3}'),'-','01')))),
str_detect(pub_year, "^[a-zA-Z]{3}-\\d{1,2}$") ~ ymd(
paste0(
str_replace(str_extract(pub_year, "\\d{1,2}"),
"\\d{1,2}",
paste0('20',str_extract(pub_year, "\\d{1,2}"),'-',str_extract(pub_year, '[a-zA-Z]{3}'),'-','01')))),
TRUE ~ as_date(pub_year)
),
needs_fy_sim = case_when( # dummy column: when only a "YYYY" date is provided there isn't enough info to determine the fiscal year, so it must be simulated
str_detect(pub_year, "^\\d{4}$") ~ TRUE,
str_detect(pub_year, "^(0[1-9]|[12][0-9]|3[01])/(0[1-9]|1[0-2])/(\\d{4}$)") ~ FALSE,
str_detect(pub_year, "^([1-9]|1[12])/([1-9]|[12][0-9]|3[01])/(19|20)\\d{2}$") ~ FALSE,
str_detect(pub_year, "^\\d{1,2}-[a-zA-Z]{3}$") ~ FALSE,
str_detect(pub_year, "^[a-zA-Z]{3}-\\d{1,2}$") ~ FALSE
)
) %>%
mutate(
random_number = runif(n()), # generate a column of random numbers
fiscal_year = case_when(
needs_fy_sim == TRUE & random_number < 0.5 ~ as.integer(year(pub_year_cleaned)), # randomly generate a fiscal year based on whether a randomly generated number is above or below 0.5
needs_fy_sim == TRUE & random_number >= 0.5 ~ as.integer(year(pub_year_cleaned))-1,
needs_fy_sim == FALSE & month(pub_year_cleaned) >= 7 ~ as.integer(year(pub_year_cleaned)), # else, just extract month from lubridate dates, using a July - June fiscal year
needs_fy_sim == FALSE & month(pub_year_cleaned) < 7 ~ as.integer(year(pub_year_cleaned))-1
))
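Before the clean-up step below removes pub_year, a quick check (not part of the original pipe chain) can list any date strings that none of the regexes above managed to parse:
###
# debug - not part of the pipe chain; run while pub_year and pub_year_cleaned both still exist
research_outputs %>%
filter(is.na(pub_year_cleaned)) %>% # rows whose dates no case_when branch could parse
count(pub_year, sort = TRUE)
###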
Now, we run into a data limitation. If an asset is affiliated with multiple colleges within a particular fiscal year, how do we count it? Does each College get the credit for one entire output, or something else?
Recall that the idea behind this project is that - eventually - we may want to investigate how University variables relate to Library database usage. Because of that, we’re not actually interested in the research outputs themselves; they are just an imprecise proxy. Instead, we want some representation of the “effort” a researcher had to go through to publish that asset, “effort” which might have led them to access journal articles, conference papers, etc., through the Library’s databases.
Consequently, it didn’t make much sense to simply assign each College one entire output, because presumably the “effort” of producing the asset is shared across authors. So, rather than assigning one asset per affiliated College, I assigned 1 / (number of units affiliated with the asset). So, if one professor from Engineering and one from Science co-authored a paper, each College would get 0.5 of an asset.
Obviously, this 1 / (number of units affiliated with the asset) is only slightly more nuanced than just assigning 1 to each College. However, implementing it as code at least makes the assumptions and causal story I’m telling more explicit:
%>%
mutate(
count_unique_affiliations = sapply(unit_affiliations, function(i) length(i)),
effort_per_affiliation = 1/count_unique_affiliations)
Afterwards, there’s just some clean-up:
%>%
drop_na(pub_year_cleaned) %>% # remove research outputs without usable dates
select(-random_number, # remove columns that are no longer necessary
-pub_year,
-needs_fy_sim)
Now, we have a data.frame that lists every research asset, its fiscal year, and its associated College(s), where the associated Colleges are stored as a vector of length >= 1. We just need to group by “College + Fiscal Year” and add up the assets.
research_output_counts <- data.frame(
fy_college = # create "College + Fiscal Year" ID values
unlist(
lapply(1:nrow(research_outputs), function(i){
paste0(research_outputs$fiscal_year[[i]],
research_outputs$unit_affiliations[[i]])
})
),
output_effort_units =
unlist(
sapply(1:nrow(research_outputs), function(i){
rep(research_outputs$effort_per_affiliation[[i]], length(research_outputs$unit_affiliations[[i]])) # since an fy_college value is created for each of the units affiliated with a single research output, we have to ascribe the research output's "effort" value as many times as there are academic units
})
),
asset_id =
unlist(
sapply(1:nrow(research_outputs), function(i){
rep(research_outputs$asset_id[[i]], length(research_outputs$unit_affiliations[[i]]))
})
)
)
Before we do the final summation, I also created a debugging data.frame to make sure the sapply(...rep(...)) behavior worked as expected, transferring the proper data in the proper order from research_outputs to research_output_counts:
###
# debug - not part of pipe chain, performed before final summing
research_output_debug <- left_join(research_output_counts, research_outputs, by = "asset_id") %>%
select(asset_id,
fy_college,
pub_year_cleaned,
unit_affiliations,
output_effort_units,
effort_per_affiliation) %>%
mutate(
effort_match = case_when(
output_effort_units == effort_per_affiliation ~ 'GOOD',
output_effort_units != effort_per_affiliation ~ 'BAD',
TRUE ~ NA
),
fy_date_match = case_when(
as.integer(str_sub(fy_college, start = 1, end = 4)) == year(pub_year_cleaned) ~ 'GOOD', # for things published July - December
year(pub_year_cleaned) - as.integer(str_sub(fy_college, start = 1, end = 4)) == 1 ~ 'GOOD', # for things published January - June
!(year(pub_year_cleaned) - as.integer(str_sub(fy_college, start = 1, end = 4)) %in% c(0, 1)) ~ 'BAD', # anything other than a 0 or 1 year difference is wrong
TRUE ~ NA
)
)
View(filter(research_output_debug, effort_match == "BAD" | fy_date_match == "BAD"))
###
Afterward, we can (finally!) group by fiscal year and college, then sum:
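A sketch of that step (the summed column’s name, output_effort, is a placeholder of mine):
research_output_counts <- research_output_counts %>%
group_by(fy_college) %>% # the "Fiscal Year + College" IDs built above
summarize(output_effort = sum(output_effort_units, na.rm = TRUE)) # total publication "effort" credited to each College in each fiscal year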
2.4 The Enrollment Numbers Data.frame
Enrollment is reported by term and department, so the ultimate goal is to sum department numbers into College numbers, convert terms to fiscal years, and create fiscal year + College identifiers.
First, we take a look at the College names in the data.frames to check if they’re consistent:
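Something along these lines (a sketch of the check, using the renamed columns from above):
print(sort(unique(fall_enrollment$college)))
print(sort(unique(spring_enrollment$college)))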
Thankfully, they are, so the rest is quite straightforward:
fall_enrollment <- fall_enrollment %>%
mutate(fy_college = case_when(
str_detect(college, "Agricultural & Life Sciences") ~ paste0(str_sub(term, start = -4, end = -1),'cals'), # "Term" is listed like "Fall 2024" in the document, so we can just take the last four characters
str_detect(college, "Art & Architecture") ~ paste0(str_sub(term, start = -4, end = -1),'caa'),
str_detect(college, "Business & Economics") ~ paste0(str_sub(term, start = -4, end = -1),'cbe'),
str_detect(college, "Education, Health & Human Sci|WWAMI") ~ paste0(str_sub(term, start = -4, end = -1),'cehhs'),
str_detect(college, "Engineering") ~ paste0(str_sub(term, start = -4, end = -1),'coe'),
str_detect(college, "Letters Arts & Social Sciences") ~ paste0(str_sub(term, start = -4, end = -1),'class'),
str_detect(college, "Law") ~ paste0(str_sub(term, start = -4, end = -1),'law'),
str_detect(college, "Natural Resources") ~ paste0(str_sub(term, start = -4, end = -1),'cnr'),
str_detect(college, "^Science$") ~ paste0(str_sub(term, start = -4, end = -1),'cos'),
TRUE ~ NA
)
) %>%
group_by(fy_college) %>%
summarize(enrollment = sum(enrollment, na.rm = TRUE)) # generate data frame that sums enrollment numbers
spring_enrollment <- spring_enrollment %>%
mutate(fy_college = case_when(
str_detect(college, "Agricultural & Life Sciences") ~ paste0(as.integer(str_sub(term, start = -4, end = -1))-1,'cals'), # note the addition of the as.integer(...)-1 here because the "Spring 2024" term is in fiscal year 2023.
str_detect(college, "Art & Architecture") ~ paste0(as.integer(str_sub(term, start = -4, end = -1))-1,'caa'),
str_detect(college, "Business & Economics") ~ paste0(as.integer(str_sub(term, start = -4, end = -1))-1,'cbe'),
str_detect(college, "Education, Health & Human Sci|WWAMI") ~ paste0(as.integer(str_sub(term, start = -4, end = -1))-1,'cehhs'),
str_detect(college, "Engineering") ~ paste0(as.integer(str_sub(term, start = -4, end = -1))-1,'coe'),
str_detect(college, "Letters Arts & Social Sciences") ~ paste0(as.integer(str_sub(term, start = -4, end = -1))-1,'class'),
str_detect(college, "Law") ~ paste0(as.integer(str_sub(term, start = -4, end = -1))-1,'law'),
str_detect(college, "Natural Resources") ~ paste0(as.integer(str_sub(term, start = -4, end = -1))-1,'cnr'),
str_detect(college, "^Science$") ~ paste0(as.integer(str_sub(term, start = -4, end = -1))-1,'cos'),
TRUE ~ NA
)
) %>%
group_by(fy_college) %>%
summarize(enrollment = sum(enrollment, na.rm = TRUE))
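At this point, a quick cross-check (not part of the original script) can flag any “Fiscal Year + College” IDs that show up in one term’s table but not the other’s, which usually points to a missed College pattern or a term that only one table covers:
###
# debug - not part of the original script
print(setdiff(fall_enrollment$fy_college, spring_enrollment$fy_college))
print(setdiff(spring_enrollment$fy_college, fall_enrollment$fy_college))
###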
2.5 The Grant Funding Data.frame
The ultimate goal for grant funding is basically identical to enrollment, so the code looks very similar.
Again, we make sure the College names are consistent:
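A sketch of the same check for this data.frame:
print(sort(unique(grant_funding$college)))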
In this case, they are decidedly not, so our str_detect() patterns have to be more flexible while still uniquely picking out the proper Colleges. Also, because grant expenditure is reported for units like the Provost Office, RCDS, etc., we have to omit those units to keep things consistent with the enrollment data (again, another non-trivial research choice):
grant_funding <- grant_funding %>%
mutate(fy_college = case_when(
str_detect(college, regex("Agricultural &|and Life Sciences", ignore_case = TRUE)) ~ paste0('20',fiscal_year,'cals'),
str_detect(college, regex("Letters, Arts &|and Social Sci", ignore_case = TRUE)) ~ paste0('20',fiscal_year,'class'),
str_detect(college, regex("Art &|and Architecture", ignore_case = TRUE)) ~ paste0('20',fiscal_year,'caa'),
str_detect(college, regex("Business &|and Economics", ignore_case = TRUE)) ~ paste0('20',fiscal_year,'cbe'),
str_detect(college, regex("Education|WWAMI|Health", ignore_case = TRUE)) ~ paste0('20',fiscal_year,'cehhs'),
str_detect(college, regex("Engineering", ignore_case = TRUE)) ~ paste0('20',fiscal_year,'coe'),
str_detect(college, regex("law", ignore_case = TRUE)) ~ paste0('20',fiscal_year,'law'),
str_detect(college, regex("Natural Resources", ignore_case = TRUE)) ~ paste0('20',fiscal_year,'cnr'),
str_detect(college, regex("College of Science", ignore_case = TRUE)) ~ paste0('20',fiscal_year,'cos'),
TRUE ~ NA
)) %>% group_by(fy_college) %>%
summarize(grant_funding = sum(grant_funding, na.rm = TRUE),
grant_proposals = sum(as.integer(grant_proposals), na.rm = TRUE)) %>%
drop_na(fy_college) # drop rows for units (Provost Office, RCDS, etc.) that were not matched to a College
But, that’s it! We’re done with data cleaning and can move on to creating our ‘tidy’ (but actually not so tidy, as it turns out) data.frames for processing and visualization.