Wednesday, May 09, 2007

On Google Books

Several posts in the last few weeks on the subject of Google Books (as well as some personal experiences with the site) have prompted this post, in which I will consider what I see as the major pros and cons of the Books Project as it's currently manifested and/or as it may manifest itself in the future.

Michael Lieberman wrote recently at Book Patrol that the buzz at the London Book Fair was all about two future Google initiatives (more on both here): "a book rental program that will let you rent the content of a book on a weekly basis" and a "book retail program that will allow users lifetime access to the texts they purchase." These potential programs (which apparently may appear before the end of 2007) will apply to books still in copyright; those out of copyright will remain freely downloadable and accessible to the extent that they already are/will be.

Michael's very concerned about the impact of these projects, and as a bookseller I suspect he's right to be; in many cases, people will be more likely to 'rent' a book (practically the equivalent of borrowing it from a library, but with what I presume would be full-text searchability - always handy) than go out and buy it from a bookseller. Putting on my researcher hat, I might be tempted in some limited circumstances to do that myself (but since I'm afflicted with the rather odd, antiquated notion that prompts me to buy any of the books I'm using for research, it probably wouldn't have a tremendous impact on me).

By making some sort of revenue-sharing deal with publishers that will allow copyrighted titles to be part of the "rent or buy" program, Google will probably quiet much if not all of the publisher hubbub surrounding the project (that whole "what's in it for me?" question being resolved to the publishers' advantage, at least partially). True, these programs are not 'altruistic' in that they'd ask searchers to pay for content, but on copyrighted materials, paid access would - in a pinch - be better than the non-access some people have at present. Not altruistic, but still a useful service all the same. Michael's right, however, that the impact on booksellers may be unpleasant.

Another critique of Google Books in its current form comes from Robert Townsend at the AHA Today blog: "Google Books: What's Not to Like?" He's got several major criticisms, all of which are quite valid: poor scanning quality, faulty metadata, and truncated public domain.

It's quite common to find several pages within a scanned item which are either wholly or partially unreadable. Common, and annoying. It remains to be seen how Google is responding to reports of unreadable text and duplicate or missing pages; I have not seen any reports of 'repairs' being made yet (please correct me if I'm wrong).

Faulty metadata is another reasonably common and very frustrating problem. Particularly for publications which comprise parts of a series or were issued periodically, the metadata on the particular copy scanned and available is often just plain wrong. We've seen this very often with copies of the Proceedings and Collections of the Massachusetts Historical Society, where just one metadata record was created for the many different volumes that have been scanned (entirely unsystematically of course). This is, theoretically, a fairly simple matter of connecting correct metadata records to each title, but it will still be a huge project and would be better done from the front end rather than retroactively.

As Townsend notes, this often bleeds into the third problem - faulty metadata often means that items which are in the public domain aren't being made fully available because the date attached to the record is incorrect (again I can use the Proceedings example - though the vast majority of those volumes scanned are in the public domain, they are only available in snippet view at present because the date in the single metadata record is from a copyright-protected year). So when researchers have found a snippet, we have to reverse-engineer the search and then try to determine from the snippet which issue the text actually comes from so that we can get the information to the researcher.

In his post, Townsend points out the difficulties of going back and fixing things once they've been done, and he's quite correct to ask "what's the rush?" Isn't it better to take more time and do it right than to tromp ahead willy-nilly and then have to either go back and fix things on a huge scale or leave incorrect and useless materials in place?

Commenters to Townsend's post have also noted the annoying nature of the current "snippet" view, which often doesn't even show you the search terms it says are there (I've found it tends to display the next sentence, or an adjacent line).

All these criticisms made, however, I think Google Books is fast becoming one of the most useful tools on the internet, both in terms of preservation and access. I see it as a sort of ever-expanding character-string index, and have used it myself to great effect on several research projects ... finding instances where my subjects were mentioned that don't appear in indexes, OCLC subject headings or other traditional sources. Sometimes they weren't full text, and sometimes they weren't even visible in snippet view and prompted a trip to the library or an ILL request - but in all cases they were leads to things I might not have found otherwise.

For reference, I've found in several cases that I'm able to answer an off-site researcher's question by pointing them to a scanned Google Books page or full-text source when they aren't able to come into the library and use our copies of the material (many of our research requests at MHS are from non-local researchers). And for preservation, I have in at least three cases recently been able to save photocopying fragile materials by pointing the researcher to Google Books (first making sure, of course, that the entire text they needed was readable).

On the whole, I think Google Books is and will continue to be an invaluable resource. No, it's not perfect, and yes, clearly they've made some serious and regrettable mistakes along the way (which hopefully can be mended). All concerned should certainly continue to push for improved quality control, enhanced metadata and better responsiveness from the Google folks, and as Michael notes we should beware of future actions which may sacrifice usefulness to profit-making. But as far as I'm concerned, the benefits far outweigh the deficiencies and, with any luck at all, things can only improve from here.