Lessons Learned from a Close Analysis of EZproxy Log Files

Introduction

There has been a vibrant debate in the professional literature about the role academic libraries might play in understanding and contributing to student success. Many reasons exist for why libraries have not undertaken an evaluation of their impact on student success through learning analytics (for example, concerns about user privacy, lack of internal technical infrastructure, and underdeveloped campus partnerships). In many libraries, these important concerns have led to a retreat from larger campus conversations around learning analytics and student success. 

Like many other academic libraries, the University of Michigan library destroyed identifiable patron data as soon as the immediate business need for it expired. In the mid 2010s, we went through a lengthy process to update our privacy statement. In practice, the updated policy did not lead to immediate changes in retention policies, but it was fundamental to a future opportunity that would arise in 2020. During this time, the library was also dabbling in campus learning analytics conversations and projects.

In 2018, the University of Michigan’s Institute for Social Research, University Library, and School of Information received a multi-year IMLS grant, the “Library Learning Analytics Project” (IMLS grant LG-96-18-0040-18), to pilot technical approaches to including library use data in larger campuswide metrics. This study was reviewed and approved by the Health Sciences and Behavioral Sciences Institutional Review Board (IRB-HSBS) at the University of Michigan (HUM00146232).

This blog post is based on research more thoroughly described in “Longitudinal Associations between Online Usage of Library-Licensed Content and Undergraduate Student Performance” (College & Research Libraries, 85/4, May 2024). In this post, we share insights on the methodology, what we learned through the research project, and thoughts on future learning analytics research for academic libraries.

Methodology

The grant funded several key areas of work. 

  • We built a centralized and secure storage platform for transactional logs from many of the library’s web-based services, making analysis possible. We maintained the existing log rotation schedule, under which server logs were removed after two weeks. 
  • The Institute for Social Research (ISR) leveraged decades of experience with highly secure data from its other research, including social security information, personal health information, and more, to build a highly restricted data enclave accessible to only a handful of specific individuals from specific computers in a locked room. This ensured the highest possible security for the analysis we conducted. 
  • We used campus logins to match library web-based activities with individual students and compared that to information in the campus data warehouses.
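The third step above, joining log identities to warehouse records, can be sketched in Python. This is a minimal illustration only, assuming a salted-hash pseudonymization approach; the salt value, uniqname, and warehouse fields shown are hypothetical, and the actual linkage in our project took place inside ISR’s secure enclave.

```python
import hashlib

# Sketch: link log identities to warehouse records via a salted hash so raw
# uniqnames are dropped before analysis. Salt, names, and fields are hypothetical.
SALT = b"project-specific-secret"

def pseudonymize(uniqname: str) -> str:
    """Deterministic one-way identifier for joining two datasets."""
    return hashlib.sha256(SALT + uniqname.encode()).hexdigest()

log_events = [{"user": "uniqname1", "resource": "jstor"}]
warehouse = {pseudonymize("uniqname1"): {"class_year": 2, "program": "LSA"}}

linked = []
for event in log_events:
    pid = pseudonymize(event["user"])          # hash, then discard the raw name
    record = {"resource": event["resource"], "id": pid}
    record.update(warehouse.get(pid, {}))      # attach demographic attributes
    linked.append(record)

print(linked[0]["program"])
```

Because the hash is deterministic, the same login always maps to the same pseudonym, which is what makes the join possible without carrying raw identifiers into the analysis dataset.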

Once we had organized secure log capture processes and established data flows to store log files for analysis in ISR’s secure data enclave, we selected a first set of logs to analyze and connect to other campus data. Among all the available data sources, we settled on EZproxy transactions. (EZproxy, an OCLC product, is used to provide authenticated access to the library’s licensed content for users not on the Ann Arbor campus.) Even though this was by far the largest set of data we had available (almost 700,000,000 records from fall 2016 through 2019), it did not include all use of this kind of content. For example, it did not include usage by individuals on a campus network (including those in university residences and other buildings, or on the campus VPN from anywhere in the world). Despite these limitations, this was the most comprehensive set of data for our initial study. Combining this number of transactions with student demographic data for Michigan’s 45,000 students provided an opportunity to learn what was possible, and what was needed in terms of expertise and computing capacity.
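To give a sense of what working with these logs entails, here is a minimal Python sketch that parses a single log line, assuming EZproxy’s default Apache-style LogFormat (`%h %l %u %t "%r" %s %b`). The sample line, uniqname, and URL are invented for illustration; real deployments often customize the log format, so the pattern would need to match the local configuration.

```python
import re
from datetime import datetime

# Assumed default EZproxy LogFormat: %h %l %u %t "%r" %s %b
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+)'
)

def parse_line(line):
    """Extract the authenticated user, timestamp, and status from one log line."""
    m = LOG_PATTERN.match(line)
    if not m:
        return None  # malformed or non-matching line
    ts = datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    return {"user": m.group("user"), "time": ts,
            "status": int(m.group("status"))}

# Hypothetical sample line for illustration only
sample = ('141.211.4.224 - uniqname1 [15/Sep/2018:10:12:03 -0400] '
          '"GET https://example-publisher.com/article/123 HTTP/1.1" 200 48213')
print(parse_line(sample)["user"])
```

At the scale of hundreds of millions of records, this kind of per-line parsing is where the computing-capacity questions mentioned above become concrete.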

Keeping concerns about individual privacy in mind, we established at the outset of the analysis process that we would not report on small populations of users so as to preserve individual anonymity.
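A small-population rule like this is straightforward to enforce mechanically at reporting time. The sketch below assumes a hypothetical threshold of 10 users (our actual cutoff is not stated here); counts below the threshold are suppressed rather than reported.

```python
# Minimal sketch of small-cell suppression; the threshold is hypothetical.
MIN_CELL_SIZE = 10

def suppress_small_groups(counts, threshold=MIN_CELL_SIZE):
    """Replace counts for groups below the threshold with None so that
    individuals in small populations cannot be identified in reports."""
    return {group: (n if n >= threshold else None)
            for group, n in counts.items()}

# Invented example counts for two hypothetical programs
report = suppress_small_groups({"Nursing": 412, "Dance": 7})
print(report)
```

Applying the rule at the reporting layer, rather than filtering the underlying data, keeps the analysis dataset complete while still preserving anonymity in anything that leaves the enclave.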

What We Learned

In addition to gaining insights about who was using the library’s licensed resources, we learned about limitations to the data as well. Not all individuals do their work off campus. And, to the extent that libraries and their vendors move away from the admittedly imperfect use of proxy servers to mediate access, libraries will lose this independent view into how the resources they provide their campuses are being used and may need to rely more heavily on more anonymized, vendor-provided statistics.

In thinking about what insights we gleaned from this study and which, if any, of them are actionable, a fair amount of understanding at the academic program level is required. For example, who is using our resources and who is not is fairly easy to see through our data. In looking at gender, we see a surprising difference: 65% of off-campus female students had an EZproxy session, while only 50% of off-campus male students did. Understanding why these use patterns exist is an entirely different matter. For that, we need a much greater understanding of the expectations and requirements of specific degree programs, as well as the demographics of who is enrolled in those programs.

While our study found that an EZproxy session during an academic term was associated with a 0.14-point increase in semester GPA, we have to remember this is correlation, not causation. We controlled for other confounding factors, but we still could not say with certainty that use of the library led to the increase in semester GPA.

Despite collecting significant user data, we acknowledged that important populations were excluded from this study, namely students living on campus and those using the campus VPN from off campus. Because most of our first-year students live on campus and use the campus network, they do not need EZproxy, and their usage is therefore excluded from this study. Many of our health and engineering students use the VPN, so their usage was likewise excluded.

What Could Come Next for Library Learning Analytics?

We could certainly do more analysis of specific academic programs with the data from this study. The data could help us answer questions about students within specific programs who use or do not use library resources and examine measures of student success, including GPA. Would the results of a programmatic analysis look different from the results for all of our undergraduate students overall? Analyzing data within a specific academic program would allow us to share results with the directors of those programs, potentially changing our engagement and partnership.

We have discussed what it would take to conduct a similar large-scale study in the future. Would we do this again, and if so, would we do anything differently? This study was a massive undertaking that drew on resources and expertise both within our library and across campus. We are not sure that we could assemble such a research team again, but if we did, we would definitely want to find a way to capture more complete data on library usage beyond just data from EZproxy. We would also find ways to make the data more meaningful and granular, for example, by analyzing the library usage of students within individual programs.

If we were to undertake a process like this again, we would take advantage of a new learning analytics standard developed in a separate IMLS grant-funded project, Connecting Learning and Library Analytics for Student Success (CLLASS). This new protocol for modeling and storing student interactions with library resources, the Caliper Library Profile, would have greatly simplified and accelerated some of the time- and storage-intensive parts of our project. For more information on this project, see the CLLASS final report.

The overall trend in academic libraries is a move from library-mediated to campus-mediated authentication and access control. While this represents an overall improvement to the user experience, it simultaneously makes it harder for the library to collect the learning analytics data needed to understand usage patterns or explore characteristics of specific student populations in the future.