Practical Relevance Ranking for 11 Million Books, Part 3: Document Length Normalization

This post is the third in a series by Tom Burton-West, one of the HathiTrust developers, who has been working on practical relevance ranking for all the volumes in HathiTrust for a number of years. His previous posts are available via the Large-Scale Search HathiTrust blog from May and June of 2014, and we reposted them on our blog.

Relevance is a complex concept that reflects aspects of the query, the document, and the user, as well as contextual factors: the user's preferences, task, stage in the information-seeking process, domain knowledge, intent, and the context of the particular search.

Read the full post here: Part 3: Document Length Normalization

Excerpts from Tom's post follow:

Document length normalization is related to term frequency. Without length normalization, long documents would tend to get ranked above shorter documents even if the term in question is important in the short document but incidental to the topic of the long document.
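The intuition above can be shown with a toy comparison (the numbers are hypothetical, chosen only to illustrate the effect): raw term frequency favors the long book, while frequency relative to document length favors the short document where the term actually matters.

```python
# Hypothetical counts: a term central to a short document vs. the same
# term appearing incidentally in a very long book.
short_tf, short_len = 3, 200        # 3 occurrences in a 200-term document
long_tf, long_len = 5, 100_000      # 5 occurrences in a 100,000-term book

# Raw term frequency ranks the long book first...
print(long_tf > short_tf)                           # True

# ...but frequency normalized by document length reverses the ranking.
print(short_tf / short_len > long_tf / long_len)    # True
```

This is the behavior length normalization is meant to correct: the long book "wins" on raw counts even though the term is far more characteristic of the short document.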

...most groups using the Vector Space Model have adopted Singhal's pivoted normalization method. We suspect that the default Solr/Lucene ranking algorithm, which is loosely based on the vector space model, suffers from the same problem of ranking short documents too high and long documents too low. Robert Muir contributed a patch that implements "pivoted document length normalization" for Lucene.
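Singhal's idea is to "pivot" the normalization factor around a reference length (typically the collection's average document length), so that average-length documents are unaffected while the penalty on long documents and the boost for short ones are damped by a tunable slope. A minimal sketch of that factor follows; the parameter names and the default slope are illustrative, not Lucene's or HathiTrust's actual configuration:

```python
def pivoted_norm(doc_len, pivot, slope=0.25):
    """Pivoted document length normalization factor (after Singhal et al.).

    Returns the divisor applied to a document's raw similarity score.
    - pivot: the reference length, typically the average document length.
    - slope: in [0, 1]; 0 removes the length effect entirely, 1 reduces
      to plain normalization by document length.
    """
    return (1.0 - slope) * pivot + slope * doc_len
```

At `doc_len == pivot` the factor equals `pivot`, so average-length documents score the same as they would under unpivoted normalization; documents shorter than the pivot get a smaller divisor (a boost) and longer ones a larger divisor (a penalty), with `slope` controlling how steep that trade-off is.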

The empirical approach to tuning length normalization parameters using test collections has proven to be effective in tests on various TREC collections and other IR research collections.  However, there remain many unanswered questions for a practitioner working with production systems...
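The kind of parameter that such empirical tuning targets can be seen in the BM25 scoring function, where a single knob controls the strength of length normalization. The sketch below uses the standard Robertson formulation with common default constants; it is a generic illustration of a tunable length-normalization parameter, not a quote of HathiTrust's production configuration:

```python
def bm25_term_score(tf, doc_len, avg_doc_len, idf, k1=1.2, b=0.75):
    """One term's BM25 contribution to a document's score.

    - b in [0, 1] is the length-normalization knob: b = 0 ignores
      document length entirely; b = 1 normalizes fully by relative length.
    - k1 controls term-frequency saturation.
    Constants 1.2 and 0.75 are widely used defaults, typically re-tuned
    per collection.
    """
    norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return idf * (tf * (k1 + 1.0)) / (tf + k1 * norm)
```

With `b = 0` a document's length has no effect on its score; as `b` grows, documents longer than average are penalized more and shorter ones boosted more, which is exactly the behavior test collections are used to calibrate.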