Size of the Public Domain II

JULY 16, 2009

This follows up my previous post. Here we are going to calculate public domain numbers based directly on authorial birth/death date information rather than on guesstimated weightings. We’re going to focus on the Cambridge University Library (CUL) data we used previously.

| Pub. Date | Total | No Author | Any Date | Death Date |
|-----------|-------|-----------|----------|------------|
| 1870-1880 | 50564 | 6634 (13%) | 23016 (45%) | 21876 (43%) |
| 1880-1890 | 66857 | 8225 (12%) | 31135 (46%) | 28570 (42%) |
| 1890-1900 | 66883 | 8733 (13%) | 32169 (48%) | 28971 (43%) |
| 1900-1910 | 70360 | 8594 (12%) | 35401 (50%) | 29922 (42%) |
| 1910-1920 | 60489 | 7722 (12%) | 31336 (51%) | 24608 (40%) |
| 1920-1930 | 78670 | 9023 (11%) | 44219 (56%) | 32658 (41%) |
| 1930-1940 | 90576 | 11004 (12%) | 46849 (51%) | 29372 (32%) |
| 1940-1950 | 72692 | 7638 (10%) | 36495 (50%) | 22155 (30%) |

Table 1: PD Relevant Information Availability

Table 1 presents a summary of how much relevant information is available for items (books) of particular vintages in the CUL catalogue. We only show data from 1870 to 1950, on the presumption that (almost) all pre-1870 publications are PD (their authors would have had to live for more than 70 years post-publication for this not to be the case) and that almost all post-1950 publications are in copyright today (their authors would have to have died before 1940 for this not to be the case).

As the table shows, at best only just over 40% of items have a recorded authorial death date, and extending this to include birth dates only raises the proportion to, at best, the mid-to-low fifties. Taking account of items which lack any associated author raises these figures somewhat further, to around 60%, though we should note that the reason for the lack of an associated author is not clear: is it because the items are genuinely anonymous or simply because the information has not been recorded? Thus, even for the earliest items listed, a large proportion (50% or more) lack the necessary information for direct computation of public domain status.

At the same time, we can take some heart, and some interesting facts, from this table. First, a reasonable proportion, amounting to many thousands of items, do have associated death dates. Second, at least for older items, the great majority of those with any date have a death date (95% for 1870-1880 and still over 70% for 1920-1930). Third, and this is a more general observation, the proportions are surprisingly constant over time. For example, the proportion of ‘anonymous’ items lies in a narrow band between 10% and 13% across all the periods. Similarly, the proportion of items with any date information ranges only from 45% to 56%. At the same time, and reassuringly, though the proportion with death dates is relatively constant for the oldest periods, in the more recent ones it falls substantially, as one would expect given that some of the authors from those more recent eras are still alive.
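The "death date as a share of any date" figures quoted above can be rechecked directly from the Table 1 counts. The following is a minimal sketch; the variable names are illustrative and not part of the original analysis.

```python
# Table 1 counts: decade -> (items with any author date, items with a death date)
TABLE_1 = {
    "1870-1880": (23016, 21876),
    "1880-1890": (31135, 28570),
    "1890-1900": (32169, 28971),
    "1900-1910": (35401, 29922),
    "1910-1920": (31336, 24608),
    "1920-1930": (44219, 32658),
    "1930-1940": (46849, 29372),
    "1940-1950": (36495, 22155),
}

# Share of dated items that carry a death date (rather than only a birth date)
death_share = {
    decade: death / any_date
    for decade, (any_date, death) in TABLE_1.items()
}

for decade, share in death_share.items():
    print(f"{decade}: {share:.1%}")
```

The share falls steadily from the earliest decade to the latest, which is the pattern the paragraph describes.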

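| Pub. Date | Total | PD | Not PD | ? | Prop 1 | Prop 2 |
|-----------|-------|----|--------|---|--------|--------|
| 1870-1880 | 50565 | 22157 (43%) | 68 (0%) | 28340 (56%) | 99% | 96% |
| 1880-1890 | 66858 | 28325 (42%) | 649 (0%) | 37884 (56%) | 97% | 90% |
| 1890-1900 | 66884 | 26723 (39%) | 2418 (3%) | 37743 (56%) | 91% | 83% |
| 1900-1910 | 70362 | 24032 (34%) | 5838 (8%) | 40492 (57%) | 80% | 67% |
| 1910-1920 | 60491 | 16200 (26%) | 8306 (13%) | 35985 (59%) | 66% | 51% |
| 1920-1930 | 78671 | 16127 (20%) | 16351 (20%) | 46193 (58%) | 49% | 36% |
| 1930-1940 | 90583 | 8973 (9%) | 20835 (23%) | 60775 (67%) | 30% | 19% |
| 1940-1950 | 72696 | 5000 (6%) | 19316 (26%) | 48380 (66%) | 20% | 13% |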

Table 2: PD Status by Decade. '?' indicates items where PD status could not be computed. Prop(ortion) 1 equals total PD divided by total for which status could be computed (sum of total PD and Not PD). Prop(ortion) 2 equals total PD divided by number of items for which any author date was known ('Any Date' in previous table).

Table 2 reports the results of direct computation of PD status based on the information available. Note that, in doing these computations, we have augmented the basic life plus 70 rule with the additional assumptions that a) all items published in 1870 or before are PD; b) no author lives beyond 100 (so if a birth date is more than 170 years ago the item is PD); c) every author lives until at least 30 (so that any work by an author born less than 100 years ago is automatically not PD).
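The rules just described can be sketched as a small function. This is an illustration under stated assumptions, not the code actually used for Table 2: the function name, its parameters, the reference year of 2009 (the date of this post) and the strict "more than 70 years since death" comparison are all assumptions made for the example.

```python
from typing import Optional


def pd_status(pub_year: int,
              birth_year: Optional[int] = None,
              death_year: Optional[int] = None,
              ref_year: int = 2009,      # assumed: year of this post
              term: int = 70,            # life plus 70
              max_age: int = 100,        # assumption (b)
              min_age: int = 30) -> Optional[bool]:
    """Return True (PD), False (not PD) or None (status cannot be computed)."""
    # (a) everything published in 1870 or before is treated as PD
    if pub_year <= 1870:
        return True
    # Basic life plus 70 rule when a death date is known
    # (strict comparison is an assumption of this sketch)
    if death_year is not None:
        return ref_year - death_year > term
    if birth_year is not None:
        years_since_birth = ref_year - birth_year
        # (b) born more than 100 + 70 years ago: author must have died 70+ years ago
        if years_since_birth > max_age + term:
            return True
        # (c) born less than 30 + 70 years ago: author cannot have died 70+ years ago
        if years_since_birth < min_age + term:
            return False
    # No usable date information
    return None
```

For example, `pd_status(1900, birth_year=1830)` is decided by rule (b), while `pd_status(1900, birth_year=1870)` returns `None` because the birth date alone settles nothing.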

As is to be expected, for the majority of the periods, the availability of PD status (either PD or Not PD) closely tracks the availability of death date information: the total for which PD status can be determined (the sum of PD and Not PD) almost exactly equals the total for which death date information is available. It is only in the last period, 1940-1950, that the birth date appears to make any contribution. More interesting is how the numbers PD and Not PD vary over time, especially relative to each other (and as a proportion of the records for which any date is available).

These two proportions are recorded in the last two columns, which give, respectively: 1) the PD total relative to the number of items for which a status could be computed (i.e. the sum of PD and Not PD); 2) the PD total relative to the total number of items for which any date information is available. These ratios change dramatically over the periods shown: starting in the high 90%s in the 1870-1880 period, they are down to 20% or below by the 1940s.
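As a worked check of the two definitions, the sketch below recomputes both ratios for the first and last decades using the counts from Tables 1 and 2. The helper names are illustrative only. The published percentages appear to be truncated rather than rounded, so the example truncates as well.

```python
def prop1(pd_count: int, not_pd_count: int) -> float:
    """PD total over items whose status could be computed (PD + Not PD)."""
    return pd_count / (pd_count + not_pd_count)


def prop2(pd_count: int, any_date_count: int) -> float:
    """PD total over items with any author date ('Any Date' in Table 1)."""
    return pd_count / any_date_count


# decade -> (PD, Not PD, Any Date), taken from Tables 1 and 2
rows = {
    "1870-1880": (22157, 68, 23016),
    "1940-1950": (5000, 19316, 36495),
}

for decade, (pd_n, not_pd_n, any_date_n) in rows.items():
    p1 = int(prop1(pd_n, not_pd_n) * 100)   # truncate to whole percent
    p2 = int(prop2(pd_n, any_date_n) * 100)
    print(f"{decade}: Prop 1 = {p1}%, Prop 2 = {p2}%")
```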

Pub. Date | % PD

Table 3: Suggested PD Proportions

The key question for us is how to extrapolate these PD proportions to the full set of records – i.e. from the set of records for which there is the necessary birth/death date information to that where there is not. The simplest, and most obvious, approach is to assume that the proportions are identical and therefore that the PD proportions calculated on the partial dataset apply to the whole. However, there are some obvious deficiencies in this approach.

In particular, our ability to compute a PD status is largely linked to the existence of a death date, and it is likely that the presence of this information is itself correlated with authorial age: after all, a death date can only exist once that person has died! This correlation, and the bias it gives rise to, is probably small in the early periods, since the authors of any pre-1930 work are almost certainly no longer alive today. However, for the later periods the bias may be more substantial: it is in these last two periods (1930-1940 and 1940-1950) that there is a significant reduction in the number of records with a death date and (relatedly) a significant increase in the number of records for which the PD status is unknown.

Thus, in converting the partial PD proportions to full PD proportions it seems sensible to revise the partial figures down somewhat, with the revision being greater in later periods. Moreover, we have a lower bound for any downwards revision, provided by the total PD as a proportion of all records, which even in the 1940-1950 period stood at 6%. In light of these considerations Table 3 gives fairly conservative figures for PD proportions that can be used when estimating PD size based on publication dates. Interestingly, even with our conservative assumptions, these proportions are rather higher than those used in our previous analysis.
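To make the range available for such a revision concrete, the sketch below computes two figures per decade from the Table 2 counts: the lower bound described above (PD as a share of all records) and the partial proportion (Prop 1) that would be revised down. Treating Prop 1 as the upper end of the range is an interpretation made for this example. The sketch does not reproduce the Table 3 figures, which are judgement-based.

```python
# Table 2 counts: decade -> (total records, PD, Not PD)
TABLE_2 = {
    "1870-1880": (50565, 22157, 68),
    "1880-1890": (66858, 28325, 649),
    "1890-1900": (66884, 26723, 2418),
    "1900-1910": (70362, 24032, 5838),
    "1910-1920": (60491, 16200, 8306),
    "1920-1930": (78671, 16127, 16351),
    "1930-1940": (90583, 8973, 20835),
    "1940-1950": (72696, 5000, 19316),
}

# decade -> (lower bound, partial proportion)
bounds = {}
for decade, (total, pd_n, not_pd_n) in TABLE_2.items():
    lower = pd_n / total                 # every record of unknown status assumed not PD
    partial = pd_n / (pd_n + not_pd_n)   # Prop 1: unknowns assumed to mirror the knowns
    bounds[decade] = (lower, partial)
    print(f"{decade}: between {lower:.1%} and {partial:.1%}")
```

The gap between the two figures is the room within which any adjusted proportion has to sit.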