Finding and Archiving the Internet Footprint
With the introduction of home computers and electronic typewriters in the late 1970s, archivists were forced to confront the fact that a person's "papers" might, in fact, no longer be on paper. The power of word processing made writers among the first to embrace information technology outside of government and the financial sector. And because writers often made small purchases and were not constrained by prior investment, they frequently purchased equipment from small niche manufacturers whose technology did not become dominant.
As a result, preserving and cataloging the earliest electronic records consisted of two intertwined problems: the task of finding and copying the data off magnetic media before the media deteriorates, and the challenge of reading older and sometimes obscure formats that are no longer in widespread use.
Archivists are now on the brink of a far more disruptive change than the transition from paper to electronic media: the transition from personal computing to "cloud computing." In the very near future an archivist might enter the office of a deceased writer and find no electronic files of personal significance: the author's appointment calendar might be split between her organization's Microsoft Exchange server and Yahoo Calendar; her unfinished and unpublished documents stored on Google Docs; her diary stored at the online LiveJournal service; her correspondence archived on the Facebook "walls" of her close friends; and her most revealing, insightful and critical remarks scattered as anonymous and pseudonymous comments on the blogs of her friends, collaborators, and rivals.
Although there are numerous public and commercial projects underway to find and preserve public web-based content, these projects will not be useful to future historians if there is no way to readily find the information that is of interest. And of course, none of the archiving projects are able to archive content that is private or otherwise restricted -- as will increasingly be the case of personal information that is stored in the "cloud."
This paper introduces and explores the problem of finding and archiving a person's Internet footprint. In Section 2 we define the term Internet footprint and provide numerous examples of the footprint's extent. In Section 3 we present a variety of approaches for finding the footprint. In Section 4 we discuss technical concerns for archiving the footprint.
Web archiving has received significant exploration in recent years, including the use of proxies to collect data, the need for proper record management, and the difficulty of reconstructing lost websites from the web infrastructure. Researchers have also characterized the Web's "decay". Jatowt et al. have developed techniques for automatically detecting the age of a web page.
Juola provides a review of current authorship determination techniques.
There are numerous open source and commercially available face recognition products, including FaceIt by Visionics, FaceVACS by Plettac, and ImageWare Software. Zhao et al. and Datta et al. have both published comprehensive surveys of current research and technology.
Viegas et al. examined cooperation and conflict between authors by analyzing Wikipedia logs. Other relevant work on Wikipedia includes analysis of participation and statistical models that can predict future administrators.
Consider the staggering range of Internet services that a person uses during the course of a year. Some of these are public publication services like BBC or CNN News -- services that are little more than traditional television, radio or newspaper repurposed to the Internet, and that most Internet users access anonymously. Other services are public and highly personalized -- blogs and home pages, for example. Still other services are private and personal, like an online calendar or diary. These services can be operated by an organization for its employees, such as a company running a Microsoft Exchange server, or they can be operated on a global scale for millions of users, such as Google Calendar.
This section considers the wide range of information that an originator may create on other computers on the Internet through their own actions -- the originator's Internet Footprint.
A person's public identified footprint is any information that they created which is online, widely available, and specifically linked to the author's real name.
For originators that are authors, their public footprint almost certainly includes articles that have been published under the originator's own name in web-only publications such as Slate Magazine or Salon.com. The public footprint may also include letters to the editor. (John Updike once wrote a letter to the editor of the Boston Globe advocating that the comics page retain "Spiderman.") Individuals may also publish their own writing on personal web sites ("home pages" and "blogs").
Websites cannot be relied upon to archive their own material, because the websites may not exist in the future. For example, in the late 1990s thousands of articles and columns by leading writers were published at HotWired, a web property operated by Wired News. Wired News was eventually sold to Lycos, then to Conde Nast. Numerous articles were lost during these transfers; those that are still available online are not at their original Internet location (http://www.hotwired.com), but are now housed underneath the http://www.wired.com domain. Many links to, between and even within the articles have been broken as a result.
One way to retrieve web pages that are no longer extant is through the Internet "Wayback Machine," operated by the Internet Archive. But here there are several problems:
The Internet Archive is itself another organization (in this case a non-profit) which may cease operation at some point in the future.
The Archive's coverage is necessarily incomplete.
The Internet Archive may not be accurate. (Fred Cohen has demonstrated that the content of "past" pages on the Wayback Machine can be manipulated from the future -- a disturbing fact when one considers that reports from the Wayback Machine have been entered into evidence in legal cases without challenge from opposing counsel.)
The Wayback Machine will not archive websites that block it with an appropriate robots exclusion file (robots.txt). This was especially a problem for the "Journalspace" online journal, which was wiped out on January 2, 2009 due to an operator error and the lack of backups. As it turns out, Journalspace had a robots.txt file that prohibited archiving by services such as the Internet Archive and Google.
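The robots exclusion check described in the last item above can be sketched with Python's standard urllib.robotparser module. This is a minimal illustration, not a description of how any particular archive implements the check: the two robots.txt texts, the crawler names ExampleArchiveBot and OtherBot, and the example.com URLs are all hypothetical.

```python
from urllib.robotparser import RobotFileParser


def crawler_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt that excludes every crawler from the whole site.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

# Hypothetical robots.txt that excludes only one named crawler.
BLOCK_ONE = """\
User-agent: ExampleArchiveBot
Disallow: /
"""

if __name__ == "__main__":
    page = "http://example.com/journal/entry-1"
    for label, text in (("block all", BLOCK_ALL), ("block one", BLOCK_ONE)):
        for agent in ("ExampleArchiveBot", "OtherBot"):
            print(label, agent, crawler_allowed(text, agent, page))
```

An archivist could run a check of this kind against an originator's known sites to learn in advance which of them are unlikely to appear in any third-party archive.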
Rather than hoping that another organization has managed to sweep up an individual's relevant web pages in a global cataloging of the Internet, it almost certainly makes more sense for archivists to go out and get the material themselves.
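As a rough sketch of what "getting the material themselves" might involve, the fragment below fetches a page and records when it was captured together with a SHA-256 digest of the bytes, so that later alteration of the stored copy can be detected. The function names, the record layout, and the 30-second timeout are assumptions made for illustration; a production archive would more likely use an established container format and crawler.

```python
import hashlib
from datetime import datetime, timezone
from urllib.request import urlopen


def make_capture_record(url: str, body: bytes, fetched_at: datetime) -> dict:
    """Describe one captured page: where it came from, when, and a content digest."""
    return {
        "url": url,
        # Normalize to UTC so records from different machines are comparable.
        "fetched_at": fetched_at.astimezone(timezone.utc).isoformat(),
        "length": len(body),
        "sha256": hashlib.sha256(body).hexdigest(),
    }


def capture(url: str, timeout: float = 30.0) -> tuple[dict, bytes]:
    """Fetch url over the network and return (record, raw bytes)."""
    with urlopen(url, timeout=timeout) as response:
        body = response.read()
    return make_capture_record(url, body, datetime.now(timezone.utc)), body


def verify(record: dict, body: bytes) -> bool:
    """Check that stored bytes still match the digest taken at capture time."""
    return hashlib.sha256(body).hexdigest() == record["sha256"]
```

Keeping the digest alongside the capture time speaks to the accuracy concern raised above: the archivist can show that a stored page is byte-for-byte what was retrieved on a given date.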
The Public Footprint may also contain information at social networking websites such as Facebook, MySpace and LinkedIn. These websites contain not just information that a person posted, but documentation of a person's social network -- their "friends" and associates -- as well as documentation of a person's preferences in the form of "recommendations" messages. Websites such as Flickr and Picasa hold photographs that a person may have uploaded. What a treasure for future historians trying to understand the life of an individual! What a quandary for an archivist, for these websites actively encourage originators to intermix the personal and the professional. Only through consultation with families and other interested parties will archivists be able to determine which "personal" information should be made immediately available, which information should be kept in closed collections until a suitable amount of time has passed, and what should be destroyed.
Finally, a person's public footprint might contain information that the person thinks is private but which is, in fact, public. It is notoriously difficult to audit security settings because they are complex and not generally apparent within today's user interfaces. As a result, it is common for computer users to make information publicly available when they do not intend to do so. Good and Krekelberg explored the Kazaa user interface and discovered that it was relatively easy for individuals to "share" their entire hard drive to a file sharing network when they intended to just share a few documents or folders. Sometimes such inadvertent public sharing can have important political, social, or historical dimensions: in June 2008, Judge Alex Kozinski of the 9th US Circuit Court of Appeals was found to have sexually explicit photos and videos on his own personal website -- relevant, as the Judge was himself overseeing an obscenity trial.
Although not strictly part of the "Internet" footprint, many organizations operate their own data services on which an originator could easily store information. For example, many businesses and organizations run their own web-based calendar and email services. These services may also cause problems for archivists because they can be hard to find, and the organizations that run them may be reluctant to share their information -- even when the originator or the originator's family strongly favors information sharing.
Beyond the information that a person published under their own name, there is potentially a wealth of information that is publicly available but published under a different name or a non-standard email address -- an electronic pseudonym.
There are many reasons why an individual might publish information to the public using a pseudonym:
Information might be published under a different name in an attempt to preserve privacy.
The individual might have a well-established pen name (for example, Charles Lutwidge Dodgson blogging as Lewis Carroll).
The individual might be a fiction writer and be publishing the information online using the persona of a fictional character (for example, Dodgson blogging as the Queen of Hearts).
The information might appear in an online forum where there is a community norm that prohibits publishing information under a "real name," or the online forum might assign pseudonyms as a matter of course.
Another person might already be using the individual's name, forcing the originator to pick a different name.
The individual might be a government or corporate official and be prohibited from posting under their own name for policy reasons. (For example, Whole Foods President John P. Mackey blogged under the pseudonym Rahodeb, a play on his wife's name Deborah.)
Another way to locate the originator's Internet footprint is by searching for it. Two kinds of search are possible. First, the archivist could simply search for the originator's name (or aliases) on Internet search systems such as Google and Yahoo. Second, the archivist could go specifically to websites such as Facebook, MySpace and Flickr, and conduct searches there.
Search is complicated by the fact that many people share the same name. Bekkerman and McCallum note that a search for the name "David Mulford" on Google correctly retrieves information about a US Ambassador to India, "two business managers, a musician, a student, a scientist, and a few others" -- all people who share the same name. Which David Mulford is the "right" David Mulford depends on the context of the search.
Sometimes it is difficult to determine if two seemingly different individuals are in fact the same person. Consider again the search for "David Mulford":
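To make the disambiguation problem concrete, the sketch below groups search-result snippets for a shared name by how many context words they have in common. It is a deliberately naive word-overlap heuristic chosen for illustration and is not the method of Bekkerman and McCallum; the snippets, the 0.2 threshold, and the rule of ignoring words of three letters or fewer are all assumptions.

```python
import re


def context_words(snippet: str, name: str) -> set[str]:
    """Lowercased words of the snippet, minus the searched name and very short words."""
    name_words = set(re.findall(r"[a-z]+", name.lower()))
    words = set(re.findall(r"[a-z]+", snippet.lower()))
    return {w for w in words if len(w) > 3 and w not in name_words}


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union (0.0 if either is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def group_results(snippets: list[str], name: str, threshold: float = 0.2) -> list[list[str]]:
    """Greedily place each snippet in the first group whose vocabulary it overlaps."""
    groups: list[tuple[set[str], list[str]]] = []
    for snippet in snippets:
        words = context_words(snippet, name)
        for vocab, members in groups:
            if jaccard(words, vocab) >= threshold:
                vocab |= words  # in-place update: the group's vocabulary grows
                members.append(snippet)
                break
        else:
            # No existing group was similar enough, so start a new one.
            groups.append((words, [snippet]))
    return [members for _, members in groups]


if __name__ == "__main__":
    # Hypothetical snippets, written for this example only.
    results = [
        "David Mulford, US Ambassador to India, met trade officials in New Delhi",
        "Ambassador David Mulford spoke in New Delhi about India trade policy",
        "David Mulford plays guitar with his jazz quartet on Friday",
    ]
    for group in group_results(results, "David Mulford"):
        print(group)
```

Each resulting group is only a candidate identity. A heuristic like this can split one person across several groups or merge two people who work in the same field, so the archivist still has to decide which group matches the originator.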
There is an old story of an assistant at MIT who worked for a famous professor in one of the physical science departments. One day the professor died after a long illness. Shortly thereafter, the assistant received a phone call from the Institute Archivist who wanted to stop by and evaluate the professor's papers. The assistant said that she had been expecting the archivist and had already "cleaned them up" in anticipation of the visit. When the archivist arrived the extent of the cleaning became evident: the assistant had thrown out the professor's scratch pads, his doodles, a box of business receipts, and so on, and prepared for the archivist a neat folder showing all of the professor's speeches, published articles, and honors. The archivist was devastated.
Although many archivists know that they may need to act with haste in order to preserve the physical papers of the deceased, this story of the archivist and the assistant is in danger of playing out with great frequency in tomorrow's cloud-based world of electronic records.
For example, photo sharing websites such as AOL Pictures have deleted uploaded pictures that were not viewed for 60 days, or whose owner failed to log in for 90 days. Some services delete photos when monthly fees are no longer paid. Archivists would need to move fast to rescue an originator's photos stored on such a service.
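How fast the archivist must move can be worked out with simple date arithmetic. The sketch below uses the 60- and 90-day figures quoted above, read as "deleted 60 days after the last view or 90 days after the owner's last login, whichever comes first"; that reading, the function names, and the example dates are assumptions, and a real service's policy would need to be checked directly.

```python
from datetime import date, timedelta


def earliest_deletion(last_viewed: date, last_login: date,
                      view_limit_days: int = 60,
                      login_limit_days: int = 90) -> date:
    """Earliest date on which either inactivity rule would allow deletion."""
    view_deadline = last_viewed + timedelta(days=view_limit_days)
    login_deadline = last_login + timedelta(days=login_limit_days)
    return min(view_deadline, login_deadline)


def days_remaining(today: date, last_viewed: date, last_login: date) -> int:
    """Days left before the earliest deletion date (negative if already past)."""
    return (earliest_deletion(last_viewed, last_login) - today).days


if __name__ == "__main__":
    # Hypothetical case: photos last viewed 1 March 2009, owner last logged in 1 January 2009.
    deadline = earliest_deletion(date(2009, 3, 1), date(2009, 1, 1))
    print("earliest deletion:", deadline.isoformat())
    print("days left on 25 March 2009:",
          days_remaining(date(2009, 3, 25), date(2009, 3, 1), date(2009, 1, 1)))
```

The point of taking the minimum is that a recent view does not help if the login rule expires first: once the originator has died, the login clock in particular cannot be reset.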