Dissertation
Lazy Preservation: Reconstructing Websites from the Web Infrastructure.
(2007)
  • Frank McCown, Ph.D., Harding University
Abstract
Backup or preservation of websites is often not considered until after a catastrophic event has occurred. In the face of complete website loss, webmasters or concerned third parties have attempted to
recover some of their websites from the Internet Archive. Still others have sought to retrieve missing
resources from the caches of commercial search engines. Inspired by these post hoc reconstruction
attempts, this dissertation introduces the concept of lazy preservation: digital preservation performed as a result of the normal operations of the Web Infrastructure (web archives, search engines,
and caches). First, the Web Infrastructure (WI) is characterized by its preservation capacity and
behavior. Methods for reconstructing websites from the WI are then investigated, and a new type
of crawler is introduced: the web-repository crawler. Several experiments are used to measure and
evaluate the effectiveness of lazy preservation for a variety of websites, and various web-repository
crawler strategies are introduced and evaluated. The implementation of the web-repository crawler
Warrick is presented, and real usage data from the public is analyzed. Finally, a novel technique for
recovering the generative functionality (e.g., CGI programs and databases) of websites is presented,
and its effectiveness is demonstrated by recovering an entire Eprints digital library from the WI.
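The web-repository crawler summarized above can be sketched in outline: starting from a root URL, it asks each web repository in turn for a cached copy of the page, records the first hit, extracts links from the recovered HTML, and enqueues unseen URLs. The sketch below is a minimal illustration with stubbed repository lookups; it is not Warrick's actual implementation, and the function names and repository interface are assumptions for the example.

```python
from collections import deque
import re

def reconstruct_site(root_url, repositories):
    """Breadth-first web-repository crawl: recover each page from the
    first repository that holds a copy, then follow its links.

    repositories is an ordered list of (name, lookup) pairs, where
    lookup(url) returns cached HTML or None."""
    recovered = {}                 # url -> (repository name, html)
    frontier = deque([root_url])
    seen = {root_url}
    while frontier:
        url = frontier.popleft()
        for repo_name, lookup in repositories:
            html = lookup(url)
            if html is not None:
                recovered[url] = (repo_name, html)
                # Naive link extraction; a real crawler would resolve
                # relative URLs and stay within the lost site's domain.
                for link in re.findall(r'href="([^"]+)"', html):
                    if link not in seen:
                        seen.add(link)
                        frontier.append(link)
                break              # stop at the first repository hit
    return recovered
```

In practice the lookups would query a web archive or a search engine cache over HTTP, and the crawler would prefer repositories by the quality of the copy they return; the stub above only captures the frontier-and-fallback structure.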
Publication Date
December 2007
Degree
Ph.D.
Field of study
Computer Science
Department
Computer Science
Advisors
Michael L. Nelson
Citation Information
Frank McCown. "Lazy Preservation: Reconstructing Websites from the Web Infrastructure." (2007)
Available at: http://works.bepress.com/fmccown/14/