Chapter 1 Cleaning a Dataset of U of I Library Database Usage and Institutional Variables

1.1 Motivation

The U of I Library is continually searching for ways to assess its services to the University community. However, the library does not have standardized workflows for collecting and synthesizing some University-level variables (e.g., incoming grant funding, department/college-level enrollment, number and character of research outputs produced, etc.). Moving from informal to formal assessment of these types of variables may provide new opportunities for improvement.

This site is a proof-of-concept for developing those workflows in R. Since this initiative is in its early stages, this site does not focus on making actual inferences from any data. Instead, the goal is to demonstrate how the data can be processed and cleaned for potential quasi-experimental or interpretivist inference.

The project was not entirely successful; it has several flaws, which this site points out later. But, in my opinion, it was successful enough to demonstrate that such work is feasible, given a properly formulated research question and realistic expectations about data quality and research-design limitations.

1.2 Sources

The “raw data” for this project includes:

  1. Spring and fall department enrollment numbers pulled from the U of I IR Internal Dashboards,
  2. University research outputs data pulled from the Library’s VERSO platform,
  3. Grant expenditure data from the U of I’s Research Reports Archive, and
  4. U of I Library database usage statistics pulled from our internal LibApps system.

All of these sources are observational and incomplete in a number of ways. A complete evaluation of how that limits useful inference is outside the scope of this site. However, some of those issues will come up in data processing.

1.3 End goals

The goals for this project were to:

  1. Take the raw data (retrieved as xlsx or csv files) and clean it.
  2. From the cleaned raw data, generate a set of tidy data.frames for further analysis.
  3. Store those data.frames in a SQL database.
  4. Visualize data from the SQL database in R in order to suggest interesting assessment options (see the sketch after this list).
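
To make those four steps concrete, here is a minimal sketch of the intended pipeline. It is not the project’s actual code (that lives in the code/ directory), and the file, table, and column names (enrollment.xlsx, department, term, headcount) are hypothetical stand-ins:

```r
library(readxl)    # read the xlsx exports
library(dplyr)     # cleaning verbs
library(DBI)       # generic SQL interface
library(RSQLite)   # SQLite backend
library(ggplot2)   # visualization

# (1) Read and clean one raw export (hypothetical file and columns).
enrollment <- read_excel("raw_data/enrollment.xlsx") |>
  rename_with(tolower) |>
  filter(!is.na(department))

# (2) After cleaning, `enrollment` should be tidy:
#     one row per department-term observation.

# (3) Store the tidy data.frame in a SQL database.
con <- dbConnect(SQLite(), "assessment.sqlite")
dbWriteTable(con, "enrollment", enrollment, overwrite = TRUE)

# (4) Pull from the database and visualize.
ggplot(dbReadTable(con, "enrollment"),
       aes(term, headcount, group = department)) +
  geom_line()

dbDisconnect(con)
```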

The most important shortcoming of this project was an error in (2): two of the three final data.frames created were not tidy, and one of them in a way that can’t easily be fixed and makes (4) significantly more difficult.
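
For context, “tidy” here means one observation per row and one variable per column. A short illustration with made-up values: a wide table with one enrollment column per term violates that rule, and (when the structure allows it) can be repaired with tidyr::pivot_longer().

```r
library(tidyr)

# Made-up wide table: one enrollment column per term (not tidy).
wide <- data.frame(
  department  = c("History", "Physics"),
  spring_2022 = c(310, 280),
  fall_2022   = c(295, 301)
)

# Reshape to one row per department-term observation (tidy).
pivot_longer(
  wide,
  cols      = c(spring_2022, fall_2022),
  names_to  = "term",
  values_to = "enrollment"
)
```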

1.4 Following along

Because some of the data included in this analysis is confidential, this document does not dynamically process any of it.

However, I created an anonymized version of the data so readers can follow along if it’s helpful. To use it:

  1. Clone this repository.
  2. Extract the contents of sample-raw-data.zip (found in the ‘raw_data’ directory) into that same ‘raw_data’ directory.

From there, you can either run the entire analysis all at once using code/00-main.R, or follow along in R scripts 01-06.
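
If it’s easier, both the extraction step and the full run can be done from an R session started at the repository root; unzip() and source() are base R:

```r
# Unpack the anonymized sample data into raw_data/, then run everything.
unzip("raw_data/sample-raw-data.zip", exdir = "raw_data")
source("code/00-main.R")
```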

Note: you may need to install one or more of the required packages listed in the “Cleaning the Data” section of this site.
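
Installation is the usual one-liner; the package names below are illustrative guesses only, since the authoritative list lives in that section.

```r
# Illustrative only -- see the "Cleaning the Data" section for the real list.
install.packages(c("readxl", "dplyr", "tidyr", "DBI", "RSQLite", "ggplot2"))
```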