Chapter 3 Creating & Populating the SQL Database

3.1 Aggregating University-Level Variables

Since we’ve made sure grants, research outputs, and enrollment numbers all have the fiscal year + College identifier, joining those tables together is easy:

university_variables <- full_join(spring_enrollment, fall_enrollment, by = "fy_college") %>%
  full_join(research_output_counts, by = "fy_college") %>%
  full_join(grant_funding, by = "fy_college")

However, here we run into a problem with the range of the fiscal year + College identifiers. The Library's VERSO system has assets going back to the mid-20th century all the way up to 2024, but the raw data doesn't include enrollment or grant numbers for that entire range.

Turns out, according to…

print(sort(na.omit(spring_enrollment_sums$fy_college))[1]) # sort the fy_college columns and get the first member
print(sort(na.omit(spring_enrollment_sums$fy_college))[length(na.omit(spring_enrollment_sums$fy_college))]) # get the last member
print(sort(na.omit(fall_enrollment_sums$fy_college))[1])
print(sort(na.omit(fall_enrollment_sums$fy_college))[length(na.omit(fall_enrollment_sums$fy_college))])
print(sort(na.omit(research_output_counts$fy_college))[1])
print(sort(na.omit(research_output_counts$fy_college))[length(na.omit(research_output_counts$fy_college))])
print(sort(na.omit(grant_funding$fy_college))[1])
print(sort(na.omit(grant_funding$fy_college))[length(na.omit(grant_funding$fy_college))])

…our raw data only covers all of these variables from FY 2015 to 2022. Rather than deal with incomplete data, I chose to omit all other years.
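
For anyone tidying this up later, the same check can be written more compactly with range(), which returns the first and last value of each fy_college column in one call (a sketch assuming the same four data.frames as above):

# compact equivalent of the print() calls above
lapply(
  list(spring   = spring_enrollment_sums$fy_college,
       fall     = fall_enrollment_sums$fy_college,
       research = research_output_counts$fy_college,
       grants   = grant_funding$fy_college),
  function(col) range(na.omit(col)) # min and max fy_college per table
)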

Building this joined table also forced me to make an interesting decision about enrollment. Spring and fall enrollment came in two different tables, and the numbers in each are slightly different. Given that the observational unit is a fiscal year that spans both terms, how should I handle that?

Again, if we want our assessment to tell a story about how the number of people enrolled at the University affects Library database use, it seems plausible to just add the two numbers. Originally, this felt odd to me: many of the people enrolled in the fall semester will probably also be enrolled in the spring semester, so it seemed like double-counting. However, as with research outputs, we're not interested in individual people per se, but in the effort they might expend at the University that leads them to use Library resources. Someone enrolled for two semesters presumably expends two semesters' worth of effort, so they can plausibly be counted twice.

This is imprecise but, again, at least we’ve codified (code-ified! :D) the assumption:

university_variables <- university_variables %>% # continue from the joined table above
  na.omit() %>% # rows outside FY 2015-2022 are missing at least one variable after the joins, so this drops them
  mutate(total_enrollment = enrollment.x + enrollment.y) %>% # sum spring & fall enrollment
  select(-enrollment.x, # remove columns that are no longer necessary
         -enrollment.y)

3.2 Aggregating Library Database Usage Data

This is where my second, and more serious, error with tidiness occurred.

Essentially, the original spreadsheet contained a Library database in each row, but the view counts for every month from fiscal years 2015-2022 each had their own column. So Web of Science, for example, would have columns for ‘2015.07 total views’, ‘2015.08 total views’, and so on. My mistake was to just sum the appropriate columns for each row:

database_stats <- left_join(az_database_usage, az_database_list, by = join_by(ID == id)) # join data.frames containing database names, id's etc., with dataframe containing usage info
database_stats <- database_stats %>% 
  mutate(total_views_fy15 = rowSums(database_stats[,12:23]), # sum monthly viewership columns into fiscal year viewerships
         total_views_fy16 = rowSums(database_stats[,24:35]),
         total_views_fy17 = rowSums(database_stats[,36:47]),
         total_views_fy18 = rowSums(database_stats[,48:59]),
         total_views_fy19 = rowSums(database_stats[,60:71]),
         total_views_fy20 = rowSums(database_stats[,72:83]),
         total_views_fy21 = rowSums(database_stats[,84:95]),
         total_views_fy22 = rowSums(database_stats[,96:107]),
         total_views_fy23 = rowSums(database_stats[,108:119]),
         remaining_views = rowSums(database_stats[,c(5:11,120:125)]),
  ) %>% 
  select(
    database_id = "ID",
    database_name = "Name",
    database_subjects = "subjects",
    total_database_views = "Total",
    total_views_fy15,
    total_views_fy16,
    total_views_fy17,
    total_views_fy18,
    total_views_fy19,
    total_views_fy20,
    total_views_fy21,
    total_views_fy22,
    # total_views_fy23,
    # remaining_views,
  )

### 
# debug
# database_stats <- database_stats %>% 
#   mutate(math_check = rowSums(database_stats[,5:14]) == total_database_views) %>% 
#   filter(math_check == FALSE)
###

What I should have done was never join the az_database_list and az_database_usage data.frames in the first place. Instead, I should've linked the az_database_list table (which contained things like database names) to the database usage data via some id.
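
For concreteness, a rough sketch of the layout I mean (hypothetical, and assuming az_database_usage holds an ID column plus one column per month): keep az_database_list as the descriptive table, store usage in long form with one row per database per month, and join on the id only when a query actually needs names or subjects.

# hypothetical sketch: usage kept separate from the descriptive table and pivoted long
az_database_usage_long <- az_database_usage %>%
  pivot_longer(cols = -ID, names_to = "month", values_to = "views") %>% # one row per database per month
  rename(database_id = ID)

# name/subject lookups then happen at query time via a join, rather than by copying columns:
# left_join(az_database_usage_long, az_database_list, by = c("database_id" = "id"))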

I didn’t do that because of three limitations in my thinking:

  1. I unquestioningly accepted that all the database info, including views, should remain in the same data.frame (and therefore the same table in the SQL database). It seemed obvious to me that doing so would make processing easier, something I only discovered was wrong late in the project when attempting to visualize interesting relationships.
  2. I knew that having many repetitive entries within a single column can be an indicator of untidy data. Given (1) above as a constraint, the easiest way to aggregate all database view information into a single column was to pivot_longer the database table. E.g., using something like this:
# an unused version of the database_stats data.frame.
transposed_database_stats = pivot_longer(database_stats, cols = total_views_fy15:total_views_fy22) %>%
  select(id,
         database_name,
         relevant_colleges,
         fiscal_year = "name",
         views = "value") %>%
  mutate(fiscal_year = paste0("20",str_sub(fiscal_year, -2, -1)),
         fy_college = lapply(1:length(relevant_colleges), function(x){
           lapply(relevant_colleges[[x]], function(y) paste0(fiscal_year[[x]],y))
         }))

However, the snippet above leads to lots of repetition in columns like database_name, id (i.e., a database id), and relevant_colleges, so it seemed wrong to me.

  3. The final limitation was that, after worrying about repetitive entries but failing to realize splitting the data.frame was a reasonable option, I failed to notice that keeping each FY’s views in its own column ultimately violates the tidy data precept of one variable per column.

Regardless, I continued on my merry way until I started trying to visualize things.

3.3 Connecting Databases to Colleges

In the meantime, I still need to somehow connect databases to our other observational unit of “FY + College”. To do so, I keyword-search the database_stats “subjects” column (“subjects” as in “academic subjects where a researcher might find this database useful”), assuming there is some connection between:

  1. The work researchers do and the Colleges they are enrolled in.
  2. The subject tags and how users actually choose which database to access.
database_stats <- database_stats %>% # continue the chain on the database_stats data.frame from 3.2
  mutate(relevant_colleges = lapply(1:nrow(database_stats), function(i){
    college_list = c( # initialize a vector to contain relevant Colleges for each database
      case_when(str_detect(database_subjects[i], regex("Business|Economics", ignore_case = TRUE)) ~ "cbe"), # make keyword searches to determine relevant Colleges based on database "subjects"
      case_when(str_detect(database_subjects[i], regex("Computing|Electronics|Engineering|Chemistry|environmental", ignore_case = TRUE)) ~ "coe"),
      case_when(str_detect(database_subjects[i], regex("Education|Health|Medicine|curriculum", ignore_case = TRUE)) ~ "cehhs"),
      case_when(str_detect(database_subjects[i], regex("history|philosophy|religious|literature|anthropology|sociology|political|psychology|international|language", ignore_case = TRUE)) ~ "class"),
      case_when(str_detect(database_subjects[i], regex("architecture|music|literature", ignore_case = TRUE)) ~ "caa"),
      case_when(str_detect(database_subjects[i], regex("agricultural|family|geospatial|veterinary", ignore_case = TRUE)) ~ "cals"),
      case_when(str_detect(database_subjects[i], regex("physics|earth|chemistry|biology|geography|geology|geospatial|mathematics", ignore_case = TRUE)) ~ "cos"),
      case_when(str_detect(database_subjects[i], regex("geospatial|forestry|wildlife|fisheries|geography|environmental", ignore_case = TRUE)) ~ "cnr"),
      case_when(str_detect(database_subjects[i], regex("law", ignore_case = TRUE)) ~ "law"))
    
    college_list <- college_list[!is.na(college_list)] # drop the NAs left by non-matching case_when calls; this filtered vector is what lapply returns for the row
  }),
  relevant_fycolls = lapply(1:length(relevant_colleges), function(x){ # iterate through relevant_colleges column
    fiscal_year = c(2015:2022) # this vector is b/c the dataset spans FY's 15-22
    unlist(lapply(relevant_colleges[[x]], function(y) paste0(fiscal_year,y)))})) %>% # since each entry of relevant_colleges is a vector of length n >= 1, we have to iterate through them and apply a paste0 to each entry in order to generate a list of relevant FY-coll ID's
  filter(database_name != "[Deleted]") %>% # databases which the library no longer has access to were recorded as "[Deleted]" in the spreadsheet. Since they had incomplete data, they were removed from the analysis.
  filter(relevant_colleges != "character(0)") # Some databases were not tagged with any subjects (subject tagging is often manually done by librarians), so they were removed from the analysis.

With this information I can now generate a data.frame that records which databases are relevant to which FY’s and Colleges:

database_fycoll_combos <- data.frame(
  database_id = unlist(sapply(1:nrow(database_stats), function(i){
    rep(database_stats$database_id[i], length(database_stats$relevant_fycolls[[i]])) # since databases might be relevant to more than one college and fiscal year we have to repeat their database_id entry.
  })),
  relevant_fycolls = unlist(sapply(1:nrow(database_stats), function(i) database_stats$relevant_fycolls[[i]])))

database_fycoll_combos <- database_fycoll_combos %>% 
  mutate(combo_id = UUIDgenerate(n = nrow(database_fycoll_combos), output = "string")) # generate a universally unique id (via the uuid package) for each database FY-college combo

database_stats <- database_stats %>% # remove the now-unnecessary columns from the database_stats data.frame
  select(-relevant_colleges,
         -relevant_fycolls)

As a complete aside, I still do not understand why I could not declare and process the database_fycoll_combos data.frame within one pipe chain. The error I got when I tried to do so said database_fycoll_combos does not exist. Thoughts would be much appreciated!
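
One pattern that avoids the question entirely is to never reference the data.frame by its own name inside the chain that creates it: dplyr's n() gives the row count of whatever is currently being mutated, so the construction and the UUID step can live in one chain. A sketch (untested against the real data; it would replace the two-step construction above, so it runs before relevant_fycolls is dropped):

# build and finish the combos table in a single chain by using n() instead of
# nrow(database_fycoll_combos)
database_fycoll_combos <- data.frame(
  database_id = rep(database_stats$database_id,
                    lengths(database_stats$relevant_fycolls)), # repeat each id once per relevant FY-college
  relevant_fycolls = unlist(database_stats$relevant_fycolls)
) %>%
  mutate(combo_id = UUIDgenerate(n = n(), output = "string")) # n() = row count of the data.frame being mutated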

3.4 Creating the SQL DB and Writing to it

That does it for our final, processed data.frames! We can now create a SQL database to store them so we don’t have to do all that data processing every time we want to work with them or send them to collaborators/reproducers.

Because none of the columns in our final data.frames contain nested lists, and we didn’t see any special characters that might cause encoding issues in the data, doing so is quite straightforward. We do have some columns (e.g., “database_name”) whose entry lengths aren’t standardized, but since this is a relatively small dataset we can just declare generously sized columns without worrying about finding their exact maximum lengths.
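
If we ever did want the exact maximums rather than a generous guess, they are one-liners over the final data.frame (column names as above):

# longest database name and subject string, in characters, to inform the varchar sizes below
max(nchar(database_stats$database_name), na.rm = TRUE)
max(nchar(database_stats$database_subjects), na.rm = TRUE)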

predicting_resource_usage_db <- dbConnect(RSQLite::SQLite(), "processed_data/predicting_resource_usage.db") # initialize db

# create tables
dbExecute(predicting_resource_usage_db, "CREATE TABLE university_variables (
          fy_college varchar(9) NOT NULL,
          total_outputs float NOT NULL,
          grant_funding float NOT NULL,
          grant_proposals integer NOT NULL,
          total_enrollment integer NOT NULL,
          PRIMARY KEY (fy_college)
          );")

dbExecute(predicting_resource_usage_db, "CREATE TABLE resource_variables (
          database_id varchar(8) NOT NULL,
          database_name varchar(100) NOT NULL,
          database_subjects varchar(100),
          total_database_views integer,
          total_views_fy15 integer,
          total_views_fy16 integer,
          total_views_fy17 integer,
          total_views_fy18 integer,
          total_views_fy19 integer,
          total_views_fy20 integer,
          total_views_fy21 integer,
          total_views_fy22 integer,
          PRIMARY KEY (database_id)
);")

dbExecute(predicting_resource_usage_db, "CREATE TABLE database_fycoll_combos (
          combo_id varchar(50),
          database_id varchar(8) NOT NULL,
          relevant_fycolls varchar(9) NOT NULL,
          PRIMARY KEY (combo_id),
          FOREIGN KEY (relevant_fycolls) REFERENCES university_variables(fy_college),
          FOREIGN KEY (database_id) REFERENCES resource_variables(database_id)
);")

# write data.frames to tables
dbWriteTable(predicting_resource_usage_db, "university_variables", university_variables, overwrite = TRUE) # overwrite = TRUE because, at this early proof-of-concept stage I don't anticipate the dataset getting any bigger and I didn't want to accidentally append a bunch of data to my db while repeatedly running scripts for debugging.
dbWriteTable(predicting_resource_usage_db, "resource_variables", database_stats, overwrite = TRUE)
dbWriteTable(predicting_resource_usage_db, "database_fycoll_combos", database_fycoll_combos, overwrite = TRUE)

On to visualizations!