Dissertation
Lazy Preservation: Reconstructing Websites from the Web Infrastructure.
(2007)
  • Frank McCown, Ph.D., Harding University
Abstract
Backup or preservation of websites is often not considered until after a catastrophic event has occurred. In the face of complete website loss, webmasters or concerned third parties have attempted to
recover some of their websites from the Internet Archive. Still others have sought to retrieve missing
resources from the caches of commercial search engines. Inspired by these post hoc reconstruction
attempts, this dissertation introduces the concept of lazy preservation: digital preservation performed as a result of the normal operations of the Web Infrastructure (web archives, search engines,
and caches). First, the Web Infrastructure (WI) is characterized by its preservation capacity and
behavior. Methods for reconstructing websites from the WI are then investigated, and a new type
of crawler is introduced: the web-repository crawler. Several experiments are used to measure and
evaluate the effectiveness of lazy preservation for a variety of websites, and various web-repository
crawler strategies are introduced and evaluated. The implementation of the web-repository crawler
Warrick is presented, and real usage data from the public is analyzed. Finally, a novel technique for
recovering the generative functionality (e.g., CGI programs and databases) of websites is presented,
and its effectiveness is demonstrated by recovering an entire Eprints digital library from the WI.
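The web-repository crawler summarized above can be sketched in outline: starting from a root URL, it asks each web repository in turn for a cached copy of the page, records the first hit, extracts links from the recovered HTML, and enqueues unseen URLs. The sketch below is a minimal illustration with stubbed repository lookups; it is not Warrick's actual implementation, and the function names and repository interface are assumptions for the example.

```python
from collections import deque
import re

def reconstruct_site(root_url, repositories):
    """Breadth-first web-repository crawl: recover each page from the
    first repository that holds a copy, then follow its links.

    repositories is an ordered list of (name, lookup) pairs, where
    lookup(url) returns cached HTML or None."""
    recovered = {}                 # url -> (repository name, html)
    frontier = deque([root_url])
    seen = {root_url}
    while frontier:
        url = frontier.popleft()
        for repo_name, lookup in repositories:
            html = lookup(url)
            if html is not None:
                recovered[url] = (repo_name, html)
                # Naive link extraction; a real crawler would resolve
                # relative URLs and stay within the lost site's domain.
                for link in re.findall(r'href="([^"]+)"', html):
                    if link not in seen:
                        seen.add(link)
                        frontier.append(link)
                break              # stop at the first repository hit
    return recovered
```

In practice the lookups would query a web archive or a search engine cache over HTTP, and the crawler would prefer repositories by the quality of the copy they return; the stub above only captures the frontier-and-fallback structure.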
Publication Date
December 2007
Degree
Ph.D.
Field of study
Computer Science
Department
Computer Science
Advisors
Michael L. Nelson
Citation Information
Frank McCown. "Lazy Preservation: Reconstructing Websites from the Web Infrastructure." (2007)
Available at: http://works.bepress.com/fmccown/14/