Friday, January 21, 2011

Digital history and the copyright black hole

In writing about openness and the ngrams database, I found it hard not to reflect a little bit about the role of copyright in all this. I've previously called 1922 the year digital history ends; for the kind of work I want to see, it's a nearly insuperable barrier, and it's one I think not enough non-tech-savvy humanists think about. So let me dig in a little.

The Sonny Bono Copyright Term Extension Act is a black hole. It has trapped 95% of the books ever written, and 1922 lies just outside its event horizon. Small amounts of energy can leak out past that barrier, but the information they convey (or don't) is minuscule compared to what's locked away inside. We can dive headlong inside the horizon and risk our work never getting out; we can play with the scraps of radiation that seep out and hope they adequately characterize what's been lost inside; or we can figure out how to work with the material that isn't trapped to see just what we want. I'm in favor of the last: let me give a bit of my reasoning why.

My favorite individual ngram is for the zip code 02138. It is steadily persistent from 1800 to 1922, and then disappears completely until the invention of the zip code in the 1960s. Can you tell what's going on?

The answer: we're looking at Harvard library checkout slips. There are a fair number of Harvard library bookplates in books published before 1922, and then they disappear completely because Harvard didn't want any in-copyright books included in the Googleset. That's not true of all libraries; in fact, you can see that University of Michigan book stamps actually spike up to make up for the Widener shortfall from 1922. Type in some more library names, and you can see a big realignment right at the barrier year:

So what? There are probably some differences in library collections, scanning conditions, book quality, etc., that might induce a little actual error into ngrams. But it's not enough to worry about, I'm willing to bet. My point is more general; the jump highlights the fact that we basically have two separate archives for digital history. You can stitch them together, but the seams remain if you know where to look for them; and we need to think about how to use those two archives differently. Because one of them is a heck of a lot better than the other.

That's not obvious from many of the digital tools built for humanists so far. A lot of our academic databases pretend these disjunctures don't exist, as if the information that seeps out past the collapsed star of the American publishing industry is all scholars need. Ngrams treats all of history equally. Jstor doesn't give you any more access to pre-1922 journals than it does to ones after. With institutions like worldcat, we're letting some of our old data accrue a trail of metadata inside the event horizon that starts to drag them down, away from true open accessibility.

But if we don't think about this line and build our plans around it, we'll miss out on the most exciting possibilities for digital humanities work. I am convinced that the best digital history--not the most exciting, not the most viewed, but the most methodologically sophisticated and the most important for determining what's possible--is going to be done on works outside the copyright event horizon. Everything exciting about large-scale textual analysis--topic modelling, natural language processing, and nearly anything I've been fiddling with over the last couple months--requires its own special way of breaking down texts. For the next few years, we're going to see some real progress on a variety of fronts. But we need to figure out just what we can get out of complete texts before we start chopping them up into ngrams or page snippets or digital books you can check out a chapter at a time. And since we have the complete texts only outside the black hole, that's where we're going to figure it out. We'll certainly keep trying things out on the scant information that escapes, but our sense of how well that works will be determined by what we can do with the books we can actually read.

That's already happening. The Victorian Books project has picked just about exactly the years you would want to use, which is one of the reasons it's potentially exciting for history. (Although the Google metadata, like worldcat, seems to have a trailing edge inside the event horizon.) The Stanford folks seem to stand a little more surely outside on the 19th century side of the digital divide, though I admit I still don't have a good idea of other history (that's my main interest, of course) being done on the largest scales. The MONK datasets seem to have restrictions of some type? (I should probably do my research a little better, but this is a polemic, not a research paper.) In American history, the period with the richest textual lode is the Gilded Age-Progressive Era, which has been searching for a reformation of the historiography for years. If I were to stay in the profession, there's no question that's where I'd be most inclined to plant my flag. In any case, I think of this blog in large part as a place to figure out roughly what I'd like to make sure someone gets to do; what tools and techniques seem promising or interesting for studying changes in language over time that certain restrictions on texts might preclude.

But as services spring up, like ngrams or the Jstor data for research (I think--I haven't quite figured that one out yet), all this diversity tends to get collapsed into just one type of measure. Convenient web interfaces giveth, convenient web interfaces taketh away. One of the more obscure things that troubles me about ngrams, as I said, is that it papers over the digital divide by enforcing the copyright rules even on the out-of-copyright material. In a way, I think of the kind of tokenized data that the Culturomists managed to cajole out of Google as not just a database but also an anonymizing scramble, like on a real-crime show. The Genomics parallel works well here--except that while individual privacy needs to be protected in genetic databases, there is no compelling reason to hide away information about the lexical makeup of books. I heard that at the AHA, Culturomics said they had trouble getting anyone at Harvard to host even the ngrams datasets for fear of copyright infringement. That is a) completely insane, b) sadly believable, and c) a sign that the hugely aggregated Google ngrams data is about as far in the direction of openness as we can get along the many-word-token line.

Maybe, though, the approach the Culturomists take to dealing with the copyright period isn't the best one. At the least, it isn't the only possible one. Since not everything is subject to the crazy distortions of reality that apply inside the black hole, we can find that out. That's why investigation into the pre-copyright texts is the most important task facing us over the next few years. Before the service providers decide what sort of access we get to books for the next decade, as they did with journal articles around 1995 or whenever, digital humanists themselves need to decide what we want. One of the things I think we can learn by looking at the pre-copyright texts is just what sort of data is most useful. For example: I can derive sentence-level collocations data in my database for 30,000 books, but for the last month or so I haven't found myself needing to use it much. Instead, I just use correlations for word use across the books and multi-word search. Maybe when I actually try to write a paper, I'll find that I need the sentence data again. But if not, maybe just the metadata-linked 1-grams would be enough. Could that slip out of the black hole? I don't know.
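To make "metadata-linked 1-grams" a little more concrete, here is a minimal sketch in Python of what such a record might hold and how a cross-book word correlation could be computed from it. It is not the code behind the database described above: the function names, the record layout, and the three-book toy corpus are all invented for illustration.

```python
import math
import re
from collections import Counter


def unigram_counts(text):
    """Lowercase the text and count its word tokens (1-grams)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def build_records(books):
    """Attach 1-gram counts to each book's metadata; the full text is not kept."""
    records = []
    for book in books:
        counts = unigram_counts(book["text"])
        records.append({
            "title": book["title"],
            "year": book["year"],
            "total": sum(counts.values()),
            "counts": counts,
        })
    return records


def relative_frequency(record, word):
    """Share of a book's tokens that are `word` (0.0 for an empty book)."""
    if record["total"] == 0:
        return 0.0
    return record["counts"][word] / record["total"]


def word_correlation(records, word_a, word_b):
    """Pearson correlation of two words' relative frequencies across books."""
    xs = [relative_frequency(r, word_a) for r in records]
    ys = [relative_frequency(r, word_b) for r in records]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when a word's frequency never varies
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy corpus; titles, years, and texts are made up.
BOOKS = [
    {"title": "Book A", "year": 1885, "text": "railroad telegraph railroad telegraph"},
    {"title": "Book B", "year": 1898, "text": "railroad telegraph farm farm"},
    {"title": "Book C", "year": 1910, "text": "farm farm farm farm"},
]

if __name__ == "__main__":
    records = build_records(BOOKS)
    print(word_correlation(records, "railroad", "telegraph"))
    print(word_correlation(records, "railroad", "farm"))
```

Each record here carries only a title, a year, and a bag of word counts, with word order thrown away. That is roughly the kind of object the question above is asking about: whether something this scrambled could be released for in-copyright books.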

(In practice, it seems insane to me that the wordcount data for a book would be copyright protected. In my fondest dreams, I'd like to see the AHA push the boundaries on some of these copyright issues a few years down the road. It could post a number of post-1922 tokenized texts on its website to provoke a lawsuit and possibly clear out just a few types of lexical data as fair-use. But then again, most things about copyright law seem insane to me, and we're a long ways off from any organization caring enough about the digital humanities to go to court to defend its building blocks.)

So long as we fully appreciate what we have in the public domain, the overall situation looks pretty good. The black hole isn't expanding; and we may be able to get quite a bit of information over that event horizon yet. Plus, the possibilities for all kinds of interesting work exist given the amount of data and metadata that's in the true public domain. As our hard-drives grow and processors speed up, it gets increasingly feasible to deal with massively large bodies of text on small platforms. There just aren't that many books published before 1922; a laptop hard drive could almost certainly now fit a compressed Library of Congress, which wasn't true even five years ago. I'm trying to adapt my platform to use more of the open library metadata I just linked to, and it's an embarrassment of riches.

Further to the good is that nearly all the active players are on the side of the angels. Most digital humanists themselves, from the Zotero commons to the Culturomics datasets, are eagerly promoting a Stallman-esque freedom. Even the corporation involved is Google, almost certainly the multinational with the best record on issues of access. We'd all be using a few Project Gutenberg texts if Hathi, Internet Archive, and everyone else didn't have their scanned PDFs and those of the projects that tried to catch them.

I have two fears, though. The first, which I've already talked about a bit, comes from the technical side. I'm afraid we might let the perfect be the enemy of the good on issues like OCR quality and metadata. The best metadata and the best OCR are probably going to be the ones with the heaviest restrictions on their use. If the requirement for scholarship becomes access to them, we either tie our hands before we get started, restrict work to only labs that can get access to various walled gardens, or commit ourselves to waiting until teams of mostly engineers have completely designed the infrastructure we'll work with. Things are going to get a lot better than they are now for working with Google or Jstor data. We want to make sure they get better in the ways that best suit humanistic research in particular. And we want to make sure text analysis is a live possibility on messier archives--digitized archival scans from the Zotero Commons, exported OCR from newspaper scanning projects, and so on.

My second worry is that most historians, in particular, just aren't going to get on board the Digital Humanities train until all the resources are more fully formed. As a result, we might not get historian-tailored digital resources until the basic frameworks (technical, legal, etc.) are already fixed. Historians, I do believe, read more of the long tail of publications than anyone. But there's an ingrained suspicion of digital methods that makes historians confess only in hushed whispers that they use even basic tools like Google books; and at the same time, a lot of historians, particularly non-computer-friendly ones, carry an intrinsic sympathy for the makers of books that keeps them from regarding pushing the envelope of copyright law as a fully noble endeavor. (Coursepacks excepted.) With physical books, the 1922 seam isn't nearly so obvious as with digital texts; as a result, the importance of pushing the copyright envelope isn't always clear.

So part of the solution for making the archives safe for digital history is getting the profession a bit more on board with digital history, particularly old-fashioned humanities computing, in general. That's doable. Even our senior faculty are not eternally trapped in the icy ninth circle of the Sonny Bono black hole, where the middle head of Satan eternally gnaws on a cryogenically frozen Walt Disney. They just need some persuading to climb out. I talked to Tony Grafton for a while this week about his plans to bring the AHA into the digital age--I think, after all the reports about the reluctance of historians to use anything more than basic tools, we're actually on the verge of getting somewhere. But just how he and the rest of the vanguard will pull them along is one of the trickiest and most interesting questions in the digital humanities today. That's one of the things I want to start to think about a little more next.


  1. I feel honored to be even indirectly included in the same paragraph as a cryopreserved Walt Disney.