Author "Significance" From Catalogue Data

NOVEMBER 5, 2009

Continuing the series of posts on analyzing catalogue data, here are some stats on author “significance” as measured by the number of book entries (‘items’) for that author in the Cambridge University Library catalogue from 1400-1960 (there being 1m+ such entries).

I’ve termed this measure “significance” (with intentional quotes) as it co-mingles a variety of factors:

  • Prolificness – how many distinct works an author produced (since usually each work will get an item)
  • Popularity – this influences how many times the same work gets reissued as a new ‘item’, and the library’s decision to keep the item
  • Merit – as for popularity
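As a rough illustration of the measure, here is a minimal sketch of counting items per author and ranking the result. The `(author, title)` record shape, the function names and the toy data are all assumptions made for this example; they are not the catalogue's actual format nor the code used for this post.

```python
from collections import Counter


def author_item_counts(records):
    """Count catalogue items per author name.

    `records` is an iterable of (author, title) pairs. This shape is a
    hypothetical stand-in for whatever the real catalogue export provides.
    """
    return Counter(author for author, _title in records)


def top_authors(records, n=50):
    """Return [(rank, items, author), ...] for the n authors with most items."""
    counts = author_item_counts(records)
    return [
        (rank, items, author)
        for rank, (author, items) in enumerate(counts.most_common(n), start=1)
    ]


# Toy data, invented purely for illustration.
sample = [
    ("Shakespeare, William", "Hamlet"),
    ("Shakespeare, William", "Hamlet (another edition)"),
    ("Defoe, Daniel", "Robinson Crusoe"),
]

for rank, items, author in top_authors(sample, n=2):
    print(rank, items, author)
```

Note that a reissue of the same work counts as a separate item here, which is exactly why the measure mixes prolificness with popularity.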

The following table shows the top 50 authors by “significance”. Some of the authors aren’t real people but entities such as “Great Britain. Parliament” and for our purposes can be ignored. What’s most striking to me is how closely the listing correlates with the standard literary canon. Other features of note:

  • Shakespeare is number 1 (2 in the table; figures in brackets give the rank in the table, which also counts the institutional entries)
  • Classical (Latin/Greek) authors are well represented, with Cicero at number 2 (4) and Horace at 5 (9), followed by Homer, Euripides, Ovid, Plato, Aeschylus, Xenophon, Sophocles, Aristophanes and Euclid.
  • Surprise entries (from a contemporary perspective): Hannah More, Oliver Goldsmith, Gilbert Burnet (perhaps accounted for by his prolificness).
  • Also surprising are the limited entries from the 19th-century UK, with only Scott (26), Dickens (28) and Byron (41)

Rank  No. of Items  Name
1     3112          Great Britain. Parliament.
2     1154          Shakespeare, William
3     1076          Church of England.
4     973           Cicero, Marcus Tullius
5     825           Great Britain.
6     766           Catholic Church.
7     721           Erasmus, Desiderius
8     654           Defoe, Daniel
13    527           Swift, Jonathan
14    520           Goethe, Johann Wolfgang Von
15    486           Rousseau, Jean-Jacques
17    444           Milton, John
18    388           Sterne, Laurence
19    387           England and Wales. Sovereign (1660-1685 : Charles II)
22    358           Goldsmith, Oliver
25    349           Alighieri, Dante
26    338           Scott, Walter (Sir)
27    326           More, Hannah
28    322           Dickens, Charles
30    304           Burnet, Gilbert
31    302           Luther, Martin
32    295           Dryden, John
35    262           Pope, Alexander
36    259           Fielding, Henry
38    250           Calvin, Jean
41    247           Byron, George Gordon Byron (Baron)
42    247           Bacon, Francis
43    247           Chen
46    235           Augustine (Saint, Bishop of Hippo.)
47    232           Burke, Edmund
48    223           Johnson, Samuel
49    222           Bunyan, John
50    222           De la Mare, Walter

Top 50 authors based on CUL Catalogue 1400-1960

The other thing we could look at is the overall distribution of items per author (and how it varies with rank – a classic “is it a power law” question). Below are a histogram of the distribution (NB log scale for counts) and a plot of rank against count (which equates, very crudely, to a transposed plot of the tail of the histogram …). In both cases it looks (!) like a power law is a reasonable fit given the (approximate) linearity, but this should be backed up with a proper Kolmogorov-Smirnov (K-S) test.


Histogram of items-per-author distribution (log-log)


Rank versus no. of items (log-log)
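
To make the “is it a power law” check a little more concrete, here is a minimal sketch of two calculations: the least-squares slope of the rank plot on log-log axes, and a K-S distance between the data and a power law fitted by maximum likelihood. Everything here is an assumption for illustration: the function names are made up, the fit treats the integer counts as continuous (a crude approximation), and `x_min` is simply chosen by hand. It is not the code behind the plots above, and turning the distance into a significance level would need something further, such as a bootstrap.

```python
import math


def loglog_slope(counts):
    """Least-squares slope of log(count) against log(rank)."""
    ranked = sorted(counts, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(c) for c in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def ks_distance_power_law(counts, x_min=1):
    """Fit a continuous power law to counts >= x_min; return (alpha, D).

    alpha is the maximum-likelihood exponent and D is the largest gap
    between the empirical CDF and the fitted CDF. Assumes at least one
    count is strictly greater than x_min (otherwise the fit is undefined).
    """
    data = sorted(c for c in counts if c >= x_min)
    n = len(data)
    log_sum = sum(math.log(x / x_min) for x in data)
    alpha = 1.0 + n / log_sum
    d = 0.0
    for i, x in enumerate(data):
        model_cdf = 1.0 - (x / x_min) ** (1.0 - alpha)
        # Compare the model with the empirical CDF just before and just after x.
        d = max(d, abs((i + 1) / n - model_cdf), abs(i / n - model_cdf))
    return alpha, d


# Toy counts (1200 / rank), invented for illustration only.
toy = [1200, 600, 400, 300, 240, 200]
slope = loglog_slope(toy)
alpha, d = ks_distance_power_law(toy, x_min=200)
print(slope, alpha, d)
```

A slope that stays roughly constant across the rank plot and a small D are both consistent with a power law, but neither is proof of one.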


Things to do next:

  • K-S tests
  • Extend data to present day
  • Check against other catalogue data
  • Look at occurrence of people in title names
  • Look at when items appear over time


The code to generate the table and graphs is in the open Public Domain Works repository, specifically the method ‘person_work_and_item_counts’ in this file: