Etext Center vs. Google Books

This post will be much briefer than my previous reflections on the state of digital editing. During the course of that post, I made a casual reference to the Google Books initiative that clearly expressed some dissatisfaction, and I now feel that comment needs further elaboration. While others have offered more substantial critiques of this (admittedly useful) service, my perspective on the situation is perhaps worth a few lines.

From 2003 through 2007, I was a graduate research assistant at the University of Virginia’s Electronic Text Center (Etext Center) and worked extensively on the digitization of our library’s (out-of-copyright) volumes using TEI SGML (and, eventually, XML). Almost all my coworkers, the largest percentage of whom were graduate students in the English department, had learned the art of textual markup on the job, and this was by design rather than by accident. David Seaman, the original director of the Etext Center and also a one-time graduate student of English, strongly believed that it was easier to teach someone with a strong humanities background the necessary technical skills to do these tasks than to teach someone from a computer science background how to understand humanities research practices.

My experience since then has largely validated David’s theory, although there are doubtless numerous computer science experts who would be more than capable of understanding what the humanities want from digital resources. Whether there are an equal number who can be bothered to acquire that understanding, though, is an entirely different matter. More often than not, programmers have trouble viewing what humanities scholars do as ‘real’ research because it resists the easy quantification found throughout most hard sciences. Personally, I cannot help but wonder why this isn’t seen more often as a challenge to the logical underpinnings of computer science and an opportunity for innovation rather than an indictment of the validity of humanities research.

In the end, this same lack of respect and understanding for the concerns of humanities research is the flaw behind many of the shortcomings in Google Books. Since Google’s digitization and metadata management decisions have been made primarily by those with a computer science background, they have often failed to anticipate obvious problems in their methodology. No doubt the company’s defenders will point out that librarians and those with similar backgrounds have been consulted extensively during Google Books’ development, but there’s a difference between a project managed by humanists and one that only solicits the advice of humanists.

Consider the issue of older works existing in multiple editions (or, at the risk of causing some engineer’s head to explode, multiple printings within a single edition created prior to electronic type-setting, each with minor corrections that have not been systematically recorded). Faced with these circumstances, even the most inexperienced student of textual criticism would realize the importance of carefully recording detailed bibliographic information for each digitized text and the desirability of having access to alternative versions of the work in question. Indeed, they would recognize the distinction between “text” and “work” that is at play in my last sentence. Those in charge of Google’s project, however, clearly don’t appreciate such issues – either being completely unaware of them, or calculating that the cost of such diligence would outweigh the benefit given their intended audience.

I promised to be brief, so let me finish here by admitting again that Google is creating a very useful resource – however, it is not a scholarly resource in its current incarnation (just as the recent interest in digital distribution among publishers mentioned in my last posting has nothing to do with digital editing). Without an appreciation of those issues that matter to humanities researchers, especially textual critics given the digital content in question, it can really only be useful for the most basic queries for which “any old copy” of a work will do. This situation was no doubt inevitable given the scale of Google’s ambitions, but that doesn’t make it less unfortunate. Google Books, in my opinion, proves David Seaman’s theory about who should be the driving force in the digital humanities on a grand scale.

N.B. Just a final note about Google Books only tangentially related to my main point here: While browsing texts referencing Sir Gawain and the Green Knight in their collection, I came across A. C. Spearing’s Criticism and Medieval Poetry. Google claims the book was published in 1873, despite the fact that Prof. Spearing seemed rather too spry when I sat in his class a decade ago to have published anything in the nineteenth century. What was the impetus for this obvious error? Barnes & Noble’s declaration of their founding date on the title page of Spearing’s book (check the image here, assuming it has not yet been removed). It makes you wonder how many other “1873” B&N publications are knocking about on Google Books.

The Editing Tool Fallacy

Wed, Jun 2009 medievalProf

[Some drafty thoughts on digital editing; comments welcome.]

Much like Peter Robinson in his 2005 article for the online journal Digital Medievalist, I’ve been contemplating recently the future prospects for digital editions of medieval texts. For me, it is a subject of some urgency and professional import: at this moment, I am editor of one such edition (the alliterative Morte Arthure), co-editor of two more (the Siege of Jerusalem Archive and one manuscript of the Piers Plowman Electronic Archive), as well as production editor on several other electronic initiatives for Yale University Press. If digital editing fails, as Robinson suggests it might, I’m not going to have much of a future in the academic or publishing world.

As Robinson correctly notes, despite the initial excitement that greeted digital editions over ten years ago and the completion of several fine examples that far surpass their print counterparts, the future of digital editing was not very bright in 2005. This has remained true even though there is increased interest in digital publication among both academic and commercial presses – an interest only further piqued by the financial stresses of the last year. Digital publication, however, is very different from the more scholarly enterprise of digital editing (at least as Robinson understands it). The former is strictly concerned with the medium of dissemination while the underlying structures used to achieve that (XML, PDF, HTML) are largely irrelevant; the latter presumes that the underlying markup will follow disciplinary standards and represent a scholar’s critical engagement with the edited work. In other words, the new interest among publishers in churning out eBooks for the Amazon Kindle created using PDF and dodgy OCR programs bodes just about as well for serious digital editing as Google Books does for serious efforts at mass digitization.

Looking at the disconnect between the clear superiority of existing digital editions and the interest of other scholars in continuing this work, Robinson hits upon the idea that the problem is a lack of intuitive tools for producing digital editions – something that would eliminate the need for textual scholars to get their hands dirty with markup or (heaven forefend!) basic scripting. This is clearly an extension or repurposing of observations that John Unsworth has been making for years (see here and here). Unsworth recognized that electronic content, whether facsimile images of texts or XML transcriptions, would not be enough to spark interest in the digital humanities without software exploiting that information either more efficiently or in ways impractical with print resources.

Unsworth, of course, is absolutely correct. Innovative tools for data mining, image and text manipulation, statistical analysis, and much else are necessary to drive interest in the digital humanities as a whole. There has been progress on this front, though, and tools for many of these tasks are available now for those willing to invest the time and effort needed to master them. Tools for digital editing are perhaps less polished than some of the linguistic analysis and statistical software developed over the years, but they do exist and seem to require no more (or less) effort to master. I take this to be the first clue that Robinson may be incorrect to ascribe the lack of general interest among scholarly editors in the digital medium to a dearth of tools.

Another reason to doubt Robinson’s contention is the equally depressing lack of interest among scholars (both textual and literary) in using digital editions, something he himself comments on in his article. However, he fails to provide an argument explaining why a lack of tools for creating digital editions would lead to this broader resistance. After all, does anyone believe that only scholars capable of producing a print edition are willing to make use of such publications? One could conceivably argue, on the other hand, that scholars interested in producing digital editions first need to be willing to make use of them. So, rather than asking why no one wants to make digital editions anymore, we should be asking why no one has ever wanted to use them.

Those involved in creating digital editions have often blamed this resistance on entrenched prejudices in the (procedurally) conservative world of academics. To a certain extent, this is likely true. Some scholars, especially those who have spent a long career using print resources, would undoubtedly be hesitant to adopt digital editions. On the other hand, the popularity of other new technologies means this cannot be the only reason for their lack of interest. Digital editions have been falling out of favor at the same time that everything from blogs to Twitter has managed to capture the interest of those in academic circles. The exploitation of these technologies (especially those with some kind of ‘social’ aspect) seems to disprove the notion that university departments are full of unreformed Luddites.

One could perhaps argue, though, that scholars adopt these other technologies more readily because less intellectual rigor is expected of them. That is to say, a scholar could make use of Twitter or blogs without making much of a value judgment – if he finds these things convenient and useful there’s no reason to hesitate. Editions, on the other hand, must be trusted as accurate (at least in terms of their stated rationale) and intellectually rigorous before scholars will cite them in their research. For instance, no serious scholar should cite something found in Google Books given its inconsistent quality and lack of bibliographic metadata, even if they consulted this resource as a first step in their work. Perhaps digital editions are seen in similar terms and simply haven’t achieved the necessary level of trust to be widely adopted as legitimate sources for textual citations. Arguing against this notion, though, is the obvious effort among those who have produced serious digital editions in the past to have their work vetted and distributed by traditional publishers with solid reputations. In addition, assuming Robinson is correct in his positive judgments about the quality of currently available digital editions (and I believe he is), this objection should be weakened over time rather than becoming ever more persuasive.

If the lack of tools for creating such publications and general resistance to technology cannot explain the lack of interest in digital editions, though, what can? After giving this question quite a bit of thought, I have now settled on two possible explanations that conform better to the facts than Robinson’s argument.

  • Lack of Function

While I agree with Robinson that digital editions are superior in many ways to their print counterparts, from the display of variant readings to representations of codicological detail, this doesn’t really represent new functionality. G. Thomas Tanselle has made this point repeatedly in his refutations of the hyperbole that surrounded the advent of digital editing – regardless of medium, the questions that scholars ask when creating editions and the resulting choices they make have not really changed much. Furthermore, many of the practical improvements offered to date by digital editions are of use to only a small fraction of their intended academic audience. How many literary scholars, after all, need access to facsimile images and full transcriptions of multiple witnesses? How many textual scholars need such information? Of those textual scholars who do care about these things, how many would be satisfied with even the most brilliant digital facsimile when it’s still possible to spend a few hours with the original in a library’s reading room?

Digital editions need better functions – tools like those Unsworth had in mind for the manipulation and analysis of digital resources and not tools for the creation of digital editions as Robinson would have it. We need to ask ourselves what people really do with scholarly editions and what they would like to do that they cannot within the confines of print. I don’t think we’ve spent much time asking this question and we certainly haven’t figured out an answer yet – when we do, the benefits of adopting digital editions should overwhelm any lingering Luddite tendencies or fears about the medium’s authority. In short, digital editions must be made to function in ways that will make them indispensable to scholars.

  • Lack of Form

The form to which I refer here is not the underlying markup structure of digital editions – indeed, one of the biggest successes of the past few decades in the digital humanities has been the Text Encoding Initiative’s guide to text markup and its widespread adoption. Instead, I’m referring to the form in which end users interact with the digital editions. In remarking on the less than enthusiastic reaction of traditional publishers to digital editions, Robinson comes close to the truth – without the support of those publishers, who have always been responsible for the packaging of research in the codex format, digital editions have lacked consistent presentation. Even when publishers have been willing to attach their name to a project, as with OUP or Michigan, they’ve lacked the technological expertise to impose any kind of ‘house-style’ on these publications. This has left presentational issues to the individual scholars editing the texts, each of whom has arrived at very different solutions that have left their potential audience with only a fuzzy notion of what even qualifies as a ‘digital edition.’

While I firmly believe, contrary to Robinson, that any serious scholar embarking on a digital edition needs to familiarize himself with standards like the TEI guidelines (the application of which is an act of textual criticism), decisions about presentation/dissemination need the standardization traditionally imposed by a separate publishing agent. If decisions about presentation fall to scholars, or the creators of eccentric editing tools, there will never be an opportunity for clear publication standards to emerge. Do we have publishers capable of developing these standards and a delivery system that can impose them? No. The combination of technological experience, sensitivity to scholarly priorities (including those related to analytical markup), and commitment to open source software is just not there among publishers right now. It may be that traditional publishers, with their focus on monetization, will never be prepared to fulfill this role and it will be left to libraries and learned societies (perhaps in partnership with those traditional publishers) to act as de facto publishing agents by developing those specifications and dissemination platforms.

In a sense, Robinson seems to have the answer completely backwards. While he believes there should be self-publication tools to distance editors from the underlying markup of their editions, it seems clear to me that serious digital editors will want to be fully in control of every aspect of mark-up but should not need to worry about publication solutions at all. In other words, scholars should be able to provide publishers with files with valid markup – obviously TEI XML reflecting their critical judgments for text – and let publishers worry about disseminating those files along with standard sets of applications for their manipulation and analysis (see my comments above about the importance of improving functionality).

Some conformity of presentational design, even if it is not absolute, should have a profound effect on the acceptance of digital editions by wider academic audiences. Consider the following contrast: when you pick up a text published by the Early English Text Society, you know that the object you hold will fulfill certain expectations in terms of form and scope (if not always quality). Can we honestly say the same thing about digital editions? Again, this is not a demand that there be absolute conformity in function/format – some applications or presentation choices will make more or less sense depending on the texts being edited – but there should be a minimal set of shared conventions associated with the term ‘digital edition.’

In summary, function and form from the perspective of the user of a digital edition are what matter, not tools for creating digital editions. Editing has always been a task requiring an obscene investment of time and energy and I simply don’t buy Robinson’s argument that being required to learn markup is enough to drive off potential editors (even print editions, it should be noted, have what can be considered an arcane markup system). In order to become a potential editor, though, a scholar needs to have a clear sense of what a digital edition is and why it is useful. Until digital editions come packaged in a standardized format that allows for functionality beyond what is available in print editions, this will not be the case.

Final Note: It occurs to me that one might object (to both Robinson and myself) that, “to a hammer, every problem looks like a nail.” After all, Robinson has spent years building tools for editors (including Anastasia and Collate) and naturally sees yet more tools as the most sensible solution to problems with digital editing. Likewise, I’ve been working for the past two years in academic publishing and it’s inevitable that I would focus more on how that industry could help alleviate those problems. Despite that awareness of prejudice, though, I still believe Robinson (and others, including Kevin Kiernan) is wrong to think scholars would produce more digital editions if only the tools to do so were easier to use. Go ahead and build these tools, but let’s not fool ourselves about how this will affect the wider acceptance of digital editions. Scholars will gladly do difficult things if they’re convinced the result is worthwhile – that’s the real battle and the one that should be our primary focus.