October 02, 2005

Lost and Found in Cyberspace

[The following is the text of my remarks from the University of Maryland Library’s Bits and Bytes symposium last Thursday. This is a draft version of the talk, which I’m currently polishing up to publish—comments and challenges appreciated.]

Good afternoon. I’m neither a librarian nor an archivist, but as someone who makes the study of texts and their attendant technologies their professional business I want to make a few points about the true nature of electronic documents; and thereby contribute my perspective to today’s discussions. I said a few points, but really it’s just variations on a theme which I’ll give to you up front: for every problem electronic documents create—problems for preservation, problems for access, problems for cataloging and classification and discovery and delivery—there are equal—and potentially enormous—opportunities.

That’s a point that may seem breathtakingly obvious to those of us in this room today, but it’s still not a point, I think, grasped by the public at large, or even sub-sets of the public whose interests overlap with our own. “Literary Letters Lost in Cyberspace,” laments the New York Times Book Review in an essay a few weeks ago by Rachel Donadio, one of the paper’s writers and editors (September 4, 2005). The gist of the piece is that as more and more correspondence between authors and their publishers shifts to email, an important body of scholarly primary source material—crucial to literary criticism and biography, textual editing, and historical study—is jeopardized.

The article gets some important things right. It points out that with the rise of email the volume of correspondence between authors and their editors has not diminished but rather accelerated dramatically. This presents more, not fewer, opportunities for scholars and biographers, though the essay does assert that most email is written in a careless, ephemeral style, a point which I’d dispute—because I write email too, a lot of it in fact, and while some of it is careless and ephemeral much of it is carefully thought out and edited, particularly when I send it to my publishers. More to the point, the article repeatedly frames the issue in terms of a dearth of personal or organizational protocol for archiving email, rather than technical obstinacy and hardcore digital preservation issues. Donadio mentions the case of Deborah Treisman, the New Yorker’s fiction editor: “‘Unfortunately, since I haven’t discovered any convenient way to electronically archive e-mail correspondence, I don’t usually save it, and it gets erased from our server after a few months,’ Treisman said. ‘If there’s a particularly entertaining or illuminating back-and-forth with a writer over the editing process, though, I do sometimes print and file the e-mails.’ The fiction department files eventually go to the New York Public Library, she said, ‘so conceivably someone could, in the distant future, dig all of this up.’”

Here we see preservation framed as a fundamentally social rather than a technological challenge. This is right I think. But the article fails to draw the obvious conclusion, which in this case is simply that Ms. Treisman should start systematically saving her email and that her systems people should start backing it up every night rather than wiping it every few months. Likewise, when novelist Zadie Smith laments, “I have a normal Yahoo account that saves e-mails instantly, but not to the hard drive. I’ve e-mailed Yahoo and asked how you can save all your own e-mails onto disk or whatever, but I get no reply,” the solution is to switch to a different service provider, one who will allow her to download all of her 12,000 messages to and from agents, editors, and other literati to her own personal files, which she can take ownership of. Smith also points out that she no longer has any drafts or early versions of her works in progress since she simply executes a save command that overwrites any earlier copy of a file. The truth is those earlier versions no doubt do exist somewhere in the flotsam and jetsam of her file system, but she need not resort to high-end recovery tools to get at them. She just needs to start saving her versions. In fact, as storage costs continue to plummet and personal hard drives begin to edge up to the terabyte threshold, saving every state of every file is likely to become routine, the default. It will cost more to take the time to search a file system, locate a file, and then overwrite it than it will to simply keep it somewhere on a hard disk whose areal density is at something like 10,000 tracks and sectors per square inch.
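
As a rough illustration of what "saving her versions" could look like in practice, here is a minimal Python sketch that writes each save to a new numbered file rather than overwriting the old one. The function name save_version and the draft.v0001.txt naming scheme are assumptions made up for this example, not features of any particular word processor.

```python
from pathlib import Path


def save_version(path, text):
    """Write text to a new numbered file beside `path` instead of overwriting it.

    Hypothetical naming scheme: draft.txt -> draft.v0001.txt, draft.v0002.txt, ...
    """
    path = Path(path)
    # Count the numbered versions that already exist for this file name.
    pattern = f"{path.stem}.v[0-9][0-9][0-9][0-9]{path.suffix}"
    existing = sorted(path.parent.glob(pattern))
    target = path.with_name(f"{path.stem}.v{len(existing) + 1:04d}{path.suffix}")
    # Mode "x" refuses to open a file that already exists,
    # so an earlier state can never be clobbered by accident.
    with open(target, "x", encoding="utf-8") as handle:
        handle.write(text)
    return target
```

Calling save_version("draft.txt", text) twice leaves two files on disk, one per state of the draft. The sketch trades disk space for history, which is the trade the paragraph above argues is becoming cheap.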

There’s no question that certain things are lost when documents are prepared and transmitted in electronic formats. The texture, heft, even smell of the paper, the coffee cup’s stain, the crinkled edges and dog-eared pages, the physical abrasions of marks and erasures. Let’s think for a moment about what’s gained though. By opening my word processor’s Properties window I can ascertain, to the date- and millisecond, when the file was first created and when it was last edited. I can count the number of words and characters, but more interestingly the number of minutes spent editing the document. This is the kind of information scholars and editors of the literary classics would weep to have. How long was Coleridge really at work on “Kubla Khan” before he was interrupted by the man from Porlock? The point here is that electronic objects are self-documenting to a remarkable degree, and this is a phenomenon that can and should be exploited as new social and technological practices evolve to preserve them.
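
Some of this self-documentation can be read straight from the file system. The Python sketch below is only an illustration under stated assumptions: the function name describe_file is invented here, it assumes a plain UTF-8 text file, and it reports only what the operating system and the text itself expose. A figure such as total editing time is, as far as I know, kept inside the word processor's own document format and would not show up this way.

```python
import os
from datetime import datetime, timezone
from pathlib import Path


def describe_file(path):
    """Return a few facts the file system and the text itself can report."""
    info = os.stat(path)
    text = Path(path).read_text(encoding="utf-8")
    return {
        "bytes": info.st_size,
        # Last time the file's contents were written.
        "modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
        # Caution: on Unix-like systems st_ctime is the last metadata change,
        # not the creation time; on Windows it has traditionally meant creation.
        "changed_or_created": datetime.fromtimestamp(info.st_ctime, tz=timezone.utc),
        "words": len(text.split()),
        "characters": len(text),
    }
```

The timestamps describe the file, not the text inside it: copying or migrating a document can reset them, so they are best treated as evidence to be weighed rather than as a definitive record.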

Since the debut of Word 97, users have had the ability to “track changes” in their documents, the software automatically logging each and every addition and deletion, as well as changes in formatting—all date- and time-stamped. In fact, one common workplace gaffe is to send a client or correspondent an electronic copy of a document with the view of the changes turned off, but the changes themselves not yet accepted or rejected—thereby displaying the word-by-word processing of the text to its intended recipient as soon as the view of the changes is turned back on in their own copy of Word. Wikis, notably the Wikipedia, offer similar systems, where the version and revision history of the document is transparent to all who access it. I can find out much more about who contributed what to a Wikipedia article when than I ever could with a printed reference work often authored anonymously or by committee. Fixing such version histories is a trivial computational undertaking, not least because at the center of a CPU you’ll find not teeming rows of ones and zeros but a crystal clock.

In terms of challenges to future historians, Donadio cites Steven Kellman who has just written a new biography of Henry Roth; he suggests, rather indisputably, that “Our understanding of the Constitution . . . would be quite different if the thoughts about it exchanged by Jefferson, Madison and Monroe had vanished into the electronic ether.” True enough. But there’s nothing inherent in the technology that makes email especially susceptible to vanishing into the electronic ether. On the contrary, as Oliver North and other malefactors have found out, the stuff is remarkably pesky and hard to expunge. A single email message may leave traces of itself inscribed on a dozen different servers as it makes its way across the network, a potential for proliferation that is further exacerbated by backup services at each site. While I don’t mean to minimize the very real technical challenges in the realm of digital preservation, it’s worth remembering that email and other textual forms have it easier than other media since often we’re dealing with ASCII and XML rather than binaries and proprietary formats.

Consider what a treasure trove we would have if Jefferson, Madison and Monroe had been corresponding via email and their messages had been systematically archived. This is a feature routinely associated with listserv technology and other electronic mail systems. In the case of the William Blake Archive, one of the scholarly text and image encoding projects begun in the mid-90s at the Institute for Advanced Technology in the Humanities at the University of Virginia, we have ten years and around 10,000 messages’ worth of correspondence between the editors and project staff on every facet of the project’s development, large and small, all date- and time-stamped, threaded, hyperlinked, and network accessible to the project participants.

The plight of novelists like Zadie Smith who fear that computer use is keeping them from saving earlier versions of their work is particularly poignant. The irony is that versioning is a hallmark of electronic textual culture. Not only in a principled or potential way, but as a thriving industry of document management systems, file comparison utilities, and so-called version control or concurrent versions systems, the archetype of which, CVS, originated in the mid-1980s when a Dutch computer scientist crafted a repository structure that would allow him and his students to work on a large distributed programming project without overwriting one another’s code. While most prevalent in software development, especially open source projects with their frequently globally dispersed contributors, a version control system like CVS (or its heir apparent, Subversion) is capable of managing any kind of data, textual or otherwise. A version control system retains every state of every file checked in or out of the repository, and is capable of stepping back through the decision trees that result to access the data at any point in its revision history. In essence, what these systems represent are temporal extrusions of the immediate documentary event on a user’s screen.
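
To make the idea of retaining every state concrete, here is a toy Python sketch of a repository that keeps each checked-in revision and can step back to any of them. It is an illustration only, not how CVS is implemented: the Repository class and its method names are invented for this example, everything lives in memory, and each revision is stored whole, whereas real systems such as CVS record the differences between revisions.

```python
import difflib


class Repository:
    """Toy in-memory store that keeps every checked-in state of each file."""

    def __init__(self):
        self._history = {}  # file name -> list of full texts, oldest first

    def check_in(self, name, text):
        """Record a new revision and return its 1-based revision number."""
        revisions = self._history.setdefault(name, [])
        revisions.append(text)
        return len(revisions)

    def check_out(self, name, revision=None):
        """Return the latest text, or the text as of an earlier revision."""
        revisions = self._history[name]
        if revision is None:
            return revisions[-1]
        if not 1 <= revision <= len(revisions):
            raise IndexError(f"no revision {revision} of {name}")
        return revisions[revision - 1]

    def diff(self, name, old, new):
        """Return a unified diff between two revisions as a list of lines."""
        before = self.check_out(name, old).splitlines()
        after = self.check_out(name, new).splitlines()
        return list(difflib.unified_diff(
            before, after,
            fromfile=f"{name} r{old}", tofile=f"{name} r{new}",
            lineterm="",
        ))
```

A writer who checks in a poem after every sitting could then ask for revision 1, for the latest revision, or for the line-by-line differences between any two, which is the stepping back through revision history described above.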

I’ve just embarked on a new project that asks poets and fiction writers to contribute original material to just such a CVS repository, checking their work in and out of the repository structure each and every time they wish to edit or compose. The result will be a Web-accessible archive of the creative process, with the full text of each and every version of a writer’s text available for reading, much as though one were looking over the writer’s shoulder as he or she worked. I’ve sent invitations to a number of established contemporary writers in an effort to get this off the ground. Want to see how Robert Pinsky writes a poem? You’ll be able to. Assuming, that is, he answers my email.

Posted by mgk at October 2, 2005 04:16 PM
Comments

I had the rare experience (for me) of reading that NYT Book Review piece on paper, when it came out, and this is just the thoughtful response that I'd hoped someone (probably you, I thought) would provide.

I bet Robert will answer your email, but I bet he'll be reluctant to exchange his current version control system for CVS! The versioning project is a great idea, and it's good to start the experiment with some existing system. But there are a lot of issues that relate to the specific software used, not just its overall purpose, as you're well aware. What if I sent out a call to get a bunch of literary writers, poets, novelists, etc. together to write literary COMPUTER PROGRAMS, a new and exciting opportunity, and then I said that everyone would be learning to program in PowerBuilder?

...for those lucky enough to be ignorant of this system:

http://www.sybase.com/products/developmentintegration/powerbuilder

Posted by: nick at October 8, 2005 07:12 PM

Matt, I suppose one of the purposes of the remarks and the draft versions of the talk being polished for publication is to reassure folks concerned with such matters that electronic files can be preserved. It is the very same capabilities that enable filtering of the material destined for the archive, and thereafter the electronic medium can be used to offer choices in what archived material can be accessed and how. I'm intrigued by the shoulds. We can save; should we? We can delete; should we? I see the remarks as calling for a more conscious use of the computing machine -- both as storage device and as a means to access records. I wonder how the imperatives would be marshalled if the remarks began with the statement about the crystal clock at the center of the CPU. Is this not the heart of the matter: time? Actually it is perhaps more a question of time investments. Your remarks make me realize that forgetting (an important stage in synthesizing information into knowledge) may require its own investments of time. A quibble... the number of minutes a file is open is not necessarily the number of temporal units spent editing. Attention can be on other open windows or even away from the keyboard or input device. That particular bit of metadata is a machine view of the file. The human view may be different. Still it is significant to determine the time between opening and closing the file: it indicates its availability in some sort of staging area. And a question: are electronic objects self-documenting? Might it be more accurate to describe them as open to linking to metadata? For me, it is not the object that is self-documenting. It doesn't have that much autonomy. It is the relations between an object, other objects and an environment that lead to documentation about the states of those relations. What is being preserved is not just the object itself but the object as a trace of relations, social and technological.

The remarks make a somewhat seamless transition between the records management needs of individuals and institutions. The clock as timestamper and the clock as timer might provide a way of tracking the shift from individual to institution (from personal to historical) and align the question of preservation with one of transition. For in some ways the clock as timer helps manage the movement. Zadie Smith with an eye on the clock just might be a Zadie Smith able to manipulate her browser to have those emails from the free account provider saved to another place. In a sense I believe what you are calling for, Matt, is the time investment in learning how the machines work so that the records of interest may have a greater chance of being preserved, that is making the passage from personal and commercial routes to the institutions that tend to the archives and history making.

Posted by: Francois Lachance at October 10, 2005 04:53 PM

Francois, that's really fabulous and generous feedback. Thank you. I will put it to good use.

Posted by: MGK at October 10, 2005 08:21 PM

This probably won't be received due to a glitch in the preferences of my system. Too many of us do not take the time to compose our thoughts in our processors before placing them in e-mail. My preference when I am taking my correspondence seriously is writing in a word processing program and, due to my system glitch, copying into the e-mail composer. I then go back to the WP and edit, add to, or incorporate material into other manuscripts, in the same way perhaps that I maintain several pulp notebooks for long-hand jots.
I think that we take ourselves too lightly when responding directly in e-mail. I've read messages that were no better than machine-generated spam.

Posted by: Cece Fran at October 11, 2005 11:18 AM

Matt -- Well, it's taken me almost an entire month to comment on your thoughtful remarks at the Bits & Bytes symposium -- thank you for inspiring deeper thinking about records management. Though I see that Francois has already raised a quibble along the lines of mine, I will simply say that the information gained from the "Properties" window may or may not be as accurate as your remarks claim: "By opening my word processor’s Properties window I can ascertain, to the date- and millisecond, when the file was first created and when it was last edited. I can count the number of words and characters, but more interestingly the number of minutes spent editing the document." Francois has already commented on the fact that you can't really know the number of minutes spent editing, and I will add that you may not know when the text in question was in fact "first created." Sure, you can know when the file you are reading was first created, but that may or may not have a correlation to the creation of the text. For example, a text contained in a file that I have just renamed would appear, if someone looked at the "Properties" file, as if its date of creation is today. In fact, the file was simply renamed today. Nothing else in the file has changed for the past three years, though the "Properties" window tells you, in two different places, that the file was created and modified today (and that's only true for the file name, not for anything else in the document).

One might ask why I, a writer, critic, and editor with passionate interests in creative process, would not go to great lengths to document file migrations, renamings, etc. The fact is (and this speaks to something also noted above) that when allocating my time, I'd much rather devote it to the acts of creation themselves and what organizational strategies I need to maintain in order to advance the work -- if that means the "Properties" window does not render reliable information, then so be it. Perhaps I'm not being clear, and the following example from today's work will help: I just regularized the file names for different versions of a document I've been working on for the past nine months. If one looks only at the information in the "Properties" window, it appears that version 3 was created and modified AFTER the final (11th) version. The dates on the file names recount the order of revisions of that particular document.

Oh well, this is an overly long way of saying not all data is as objective and precise as those milliseconds make it appear.

Posted by: Martha Nell Smith at October 30, 2005 06:25 PM

You know, it's funny--I'd been tempted to close down comments on this entry for a while because of blog spam but have been resisting in hopes that more valuable feedback would come in. I'm glad I did.

Martha, you are of course correct in all your specifics about how Properties works. But rather than trumpet the infallibility of any specific technology (and maybe I *did* do a bit too much of that) I think that in writing this short piece my objectives were twofold: first, to stress that preservation is ultimately social rather than technological--while recognizing the very real technical hurdles we nonetheless need to start making smart decisions about how we work with our data and not simply write its longevity off to the conventional wisdom that digital data is ephemeral; and second, to make the point that we can *model* electronic documents any way we choose, and to the extent that current electronic records are fragile, deceptive, etc. this is at least partly the result of implicit decisions made in the modeling. We see this if we look at various content management systems and the way they enforce stricter editing regimes. I think the kind of data captured in Properties is in many ways just a glimpse of the potential, for better *and* worse, that electronic texts have for a revolution in textual practice.

Posted by: MGK at October 30, 2005 06:47 PM
Due to the proliferation of comment spam, I've had to close comments on this entry. If you would like to leave a comment, please send email to me at mgk =at= umd =dot= edu. Thank you.