Author "Significance" From Catalogue Data

NOVEMBER 5, 2009

Continuing the series of posts related to analyzing catalogue data, here are some stats on author “significance” as measured by the number of book entries (‘items’) for that author in the Cambridge University Library catalogue from 1400-1960 (there being 1m+ such entries).

I’ve termed this measure “significance” (with intentional quotes) as it co-mingles a variety of factors:

  • Prolificness – how many distinct works an author produced (since usually each work will get an item)
  • Popularity – this influences how many times the same work gets reissued as a new ‘item’ and the library’s decision to keep the item
  • Merit – as for popularity

The following table shows the top 50 authors by “significance”. Some of the authors aren’t real people but entities such as “Great Britain. Parliament” and for our purposes can be ignored. What’s most striking to me is how closely the listing correlates with the standard literary canon. Other features of note:

  • Shakespeare is number 1 (2). Here the first figure is the rank among individual authors and the figure in brackets is the rank in the full table, which includes the non-person entities.
  • Classical (Latin/Greek) authors are well represented, with Cicero at number 2 (4) and Horace at 5 (9), followed by Homer, Euripides, Ovid, Plato, Aeschylus, Xenophon, Sophocles, Aristophanes and Euclid.
  • Surprise entries (from a contemporary perspective): Hannah More, Oliver Goldsmith and Gilbert Burnet (perhaps accounted for by his prolificness).
  • Also surprising is how few entries come from the 19th-century UK: only Scott (26), Dickens (28) and Byron (41), quoting their ranks in the full table.

Rank  No. of Items  Name
   1          3112  Great Britain. Parliament.
   2          1154  Shakespeare, William
   3          1076  Church of England.
   4           973  Cicero, Marcus Tullius
   5           825  Great Britain.
   6           766  Catholic Church.
   7           721  Erasmus, Desiderius
   8           654  Defoe, Daniel
   9           620  Horace
  10           599  Aristotle
  11           547  Voltaire
  12           539  Virgil
  13           527  Swift, Jonathan
  14           520  Goethe, Johann Wolfgang Von
  15           486  Rousseau, Jean-Jacques
  16           479  Homer
  17           444  Milton, John
  18           388  Sterne, Laurence
  19           387  England and Wales. Sovereign (1660-1685 : Charles II)
  20           386  Euripides
  21           372  Ovid
  22           358  Goldsmith, Oliver
  23           358  Plato
  24           351  Wang
  25           349  Alighieri, Dante
  26           338  Scott, Walter (Sir)
  27           326  More, Hannah
  28           322  Dickens, Charles
  29           315  Aeschylus
  30           304  Burnet, Gilbert
  31           302  Luther, Martin
  32           295  Dryden, John
  33           290  Xenophon
  34           280  Sophocles
  35           262  Pope, Alexander
  36           259  Fielding, Henry
  37           258  Li
  38           250  Calvin, Jean
  39           248  Zhang
  40           247  Aristophanes
  41           247  Byron, George Gordon Byron (Baron)
  42           247  Bacon, Francis
  43           247  Chen
  44           245  Terence
  45           241  Euclid
  46           235  Augustine (Saint, Bishop of Hippo.)
  47           232  Burke, Edmund
  48           223  Johnson, Samuel
  49           222  Bunyan, John
  50           222  De la Mare, Walter

Top 50 authors based on CUL Catalogue 1400-1960
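
For anyone wanting to reproduce a ranking like this on their own data, here is a minimal sketch of the counting step. It is not the code from the repository: the function name rank_authors, the exclude parameter and the toy input are all made up for illustration, and it assumes the catalogue has already been reduced to one author heading per item. The exclude argument gives the "rank among individual authors" view by dropping the corporate entities by hand.

```python
from collections import Counter


def rank_authors(item_authors, exclude=(), top=50):
    """Rank author headings by how many items carry them.

    item_authors: iterable with one author heading per catalogue item
                  (an assumed input format, not the real CUL data layout).
    exclude:      headings to drop before ranking, e.g. corporate entities.
    Returns a list of (rank, number_of_items, name) tuples.
    """
    skip = set(exclude)
    counts = Counter(name for name in item_authors if name not in skip)
    # most_common sorts by count, highest first; equal counts keep first-seen order.
    return [
        (rank, n_items, name)
        for rank, (name, n_items) in enumerate(counts.most_common(top), start=1)
    ]


# Toy input only: the counts here are invented, not the catalogue figures.
toy_items = (
    ["Great Britain. Parliament."] * 3
    + ["Shakespeare, William"] * 2
    + ["Cicero, Marcus Tullius"]
)

for row in rank_authors(toy_items):
    print(row)

# Same data with the non-person entity removed by hand.
for row in rank_authors(toy_items, exclude={"Great Britain. Parliament."}):
    print(row)
```

On the toy input Shakespeare is second in the first listing and first in the second, which is the same kind of double ranking quoted in the bullets above.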

The other thing we could look at is the overall distribution of titles per author (and how it varies with rank – a classic “is it a power law” question). Below are the histogram (NB log scale for counts) together with a plot of rank against count (which equates, very crudely, to a transposed plot of the tail of the histogram …). In both cases it looks (!) like a power law is a reasonable fit given the (approximate) linearity, but this should be backed up with a proper Kolmogorov-Smirnov (K-S) test.

culbooks_person-item-hist-logxlogy.png

Histogram of items-per-author distribution (log-log)

culbooks_person-item-by-rank-logxlogy.png

Rank versus no. of items (log-log)
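
The two series behind plots like these can be derived from the per-author item counts in a few lines. This is only a sketch under assumptions: the function name histogram_and_rank_series is hypothetical, the input is assumed to be a plain list with one item count per author, and the toy numbers are invented. Plotting is left out; either series could be drawn on log-log axes with whatever plotting library is to hand.

```python
from collections import Counter


def histogram_and_rank_series(item_counts):
    """Derive the two series plotted above from per-author item counts.

    item_counts: list with one entry per author, giving that author's
                 number of items (assumed input format).
    Returns (histogram, rank_series) where
      histogram   = [(n_items, number_of_authors_with_n_items), ...]
      rank_series = [(rank, n_items), ...] with rank 1 the largest count.
    """
    # How many authors have exactly n items, sorted by n.
    histogram = sorted(Counter(item_counts).items())
    # Authors ordered from most to fewest items, numbered from 1.
    ranked = sorted(item_counts, reverse=True)
    rank_series = list(enumerate(ranked, start=1))
    return histogram, rank_series


# Toy counts only, not the catalogue figures.
toy_counts = [5, 1, 1, 2, 1, 2]
hist, by_rank = histogram_and_rank_series(toy_counts)
print(hist)     # [(1, 3), (2, 2), (5, 1)]
print(by_rank)  # [(1, 5), (2, 2), (3, 2), (4, 1), (5, 1), (6, 1)]
```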

TODO

  • K-S tests
  • Extend data to present day
  • Check against other catalogue data
  • Look at occurrence of people in title names
  • Look at when items appear over time
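
As a starting point for the K-S item, here is a rough sketch of what the check might look like: fit a power-law exponent to the tail of the counts by maximum likelihood, then measure the K-S distance between the empirical and fitted distributions. Several things here are assumptions rather than anything from the repository: the function names, the choice of xmin (which is simply passed in), and above all the use of the continuous power-law formulas, which are only an approximation for integer item counts. The sample is synthetic, drawn from a known power law, purely to show the functions running.

```python
import math
import random


def fit_power_law(values, xmin):
    """Continuous-approximation MLE for alpha in p(x) ~ x**(-alpha), x >= xmin.

    Returns (alpha, sorted_tail). xmin is supplied by the caller, not estimated.
    """
    tail = sorted(v for v in values if v >= xmin)
    log_sum = sum(math.log(v / xmin) for v in tail)
    if log_sum == 0:
        raise ValueError("need at least one value strictly above xmin")
    alpha = 1.0 + len(tail) / log_sum
    return alpha, tail


def ks_distance(tail, xmin, alpha):
    """Largest gap between the empirical CDF of the tail and the fitted CDF."""
    n = len(tail)
    worst = 0.0
    for i, v in enumerate(tail):
        model_cdf = 1.0 - (v / xmin) ** (1.0 - alpha)
        # The empirical CDF steps from i/n up to (i+1)/n at v; check both sides.
        worst = max(worst, abs(model_cdf - i / n), abs(model_cdf - (i + 1) / n))
    return worst


# Synthetic data via inverse-transform sampling from a power law with a
# known exponent; this stands in for the real items-per-author counts.
rng = random.Random(42)
true_alpha, xmin = 2.5, 1.0
sample = [xmin * (1.0 - rng.random()) ** (-1.0 / (true_alpha - 1.0)) for _ in range(5000)]

alpha_hat, tail = fit_power_law(sample, xmin)
print("estimated alpha:", round(alpha_hat, 3))
print("K-S distance:", round(ks_distance(tail, xmin, alpha_hat), 4))
```

The distance on its own is not a p-value. Because the exponent is estimated from the same data, judging the fit properly would need something like a bootstrap over synthetic samples, which is not shown here.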

Colophon

The code to generate the table and graphs is in the open Public Domain Works repository, specifically the method ‘person_work_and_item_counts’ in this file: http://knowledgeforge.net/pdw/hg/file/tip/contrib/stats.py