At the Core
This article:
* Discusses current Web preservation efforts
* Defines a risk-based preservation management program
* Introduces Cornell University's Project Prism
Actuaries spend their careers figuring out what benefits a company should
Archivists and research librarians interested in preserving Web resources face a similar challenge. Libraries increasingly depend on digital assets they neither own nor manage. Academic libraries have dramatically increased their offerings of online resources. A 2001 survey of the 21 members of the Digital Library Federation revealed that 40 percent of their costs for digital libraries in 2000 went for commercial content. The big-ticket items were electronic scholarly journals that libraries license rather than own. Yet little direct evidence shows that publishers have developed full-scale digital preservation capabilities to protect this material, and research libraries continue to purchase the print versions for preservation purposes. However, none appears ready to forgo access to the licensed content just because its long-term accessibility might be in question.
Research libraries also are including in their catalogs and gateways more open-access Web resources that are not covered by licenses or other formal arrangements. A spring 2001 survey of Cornell University's and the University of Michigan's Making of America (MOA) collections revealed that nearly 250 academic institutions link directly to the MOA collections, although neither university has committed to provide other entities with long-term access. Similarly, a review of the holdings of several research library gateways over the past few years indicates growth in the number of links to open-access Web resources that are managed with varying degrees of control. Approximately 65 percent of the electronic resources on Cornell's gateway are unrestricted, and additional open resources are included in aggregated sets that are available only to the campus community. In contrast, only six percent of Michigan's electronic resources are open-access materials.
Current Web Preservation Efforts
Estimates put the average life expectancy of a Web page between 44 days and two years, and a significant proportion of those that survive undergo some change in content within a year. Since 1998, the Online Computer Library Center's (OCLC) Web Characterization Project has tracked trends in growth and content of the publicly available Web space. One of the more revealing statistics, IP address volatility, measures the percentage of IP [Internet Protocol] addresses that persist from one year to the next. In a fairly consistent trend since 1998, slightly over half (55-56 percent) of the IP addresses identified in one year are still available the next. Within two years, a little over a third (35-37 percent) remain. Four years later, only 25 percent of the sample 1998 IP addresses could be located, according to OCLC.
OCLC's annual review points to the instability of Web resources; it doesn't indicate whether those resources still exist elsewhere on the Web or whether the content has changed. While some resources disappear, others become unfindable due to the well-known problem that URLs change. A recent preservation review of the 75 Smithsonian Institution Web sites noted that an exhaustive search could not locate a copy of the first Smithsonian Web site, created in 1995. A URL may persist while content changes wildly: the editors of RLG DigiNews discovered that links in several past issues pointed to lapsed domain names that had been converted by others into pornography sites.
Much attention has been paid to unstable URLs and to creating administrative/preservation metadata, but to date no evidence suggests that research libraries are privileging open access sites that utilize some form of URN [Uniform Resource Name] or that document content change.
With the growing dependence on external digital assets, libraries and archives are undertaking some measures to protect their continued use of these resources. Efforts can be grouped into three areas: collaborating with publishers to preserve licensed content, developing policies and guidelines for creating and maintaining Web sites, and assuming archival custody for Web resources of interest.
Licensed Content
Publishers are developing their own preservation strategies as they realize the commercial benefits of creating deep content databases. Several are working with third parties to back up, store, and refresh digital content. OCLC recently announced the formation of the Digital and Preservation Resources Division to provide integrated solutions for creating, accessing, and preserving digital collections. With planning grants received in 2001 from The Andrew W. Mellon Foundation, seven research libraries and key commercial and scholarly publishers began exploring formal archiving arrangements for e-journals and developing plans for moving toward implementation.
Creating and Maintaining Sites
The World Wide Web Consortium's (W3C) "Web Content Accessibility Guidelines, Techniques, and Checklist" provides some recommendations for good resource management (e.g., use of standard formats and backward-compatible software) and has had a major impact on the development of Web materials worldwide. However, the W3C guidelines do not expressly address content stability, documentation of change, or good database management. In fact, preservation and records management issues are noticeably absent.
In the United States, Web preservation is more directly supported through government policies and guidelines to promote accountability, spurred in part by such legislation as the Paperwork Reduction Act. Governments also are promulgating specific policies and recommendations for preserving government-supported Web content. In January 2001, the U.S. National Commission on Libraries and Information Science published "A Comprehensive Assessment of Public Information Dissemination," which recommended legislation that would "formally recognize and affirm the concept that public information is a strategic national resource." Another recommendation is to "partner broadly, in and outside of government, to ensure permanent public availability of public information resources."
The archivist's perspective has been quite influential, as arguments are advanced to treat Web sites as important records in their own right. National archives in many countries are developing policies and guidelines. The U.S. Federal Records Act, as amended, requires that agencies identify and transfer Web site records to agency recordkeeping systems, including the National Archives and Records Administration (NARA), for permanent retention. NARA has issued several bulletins on the disposition of electronic records that include Web sites. It has also slowly begun to respond to this new form of recordkeeping and has appraised at least one federal Web site as a permanent record. In late 2000, NARA established an initiative to capture a snapshot of all federal Web sites at the end of the Clinton Administration. NARA also has contracted with the San Diego Supercomputer Center for a project to investigate the preservation of presidential Web sites.
The National Library of Australia (NLA) has been a world leader in promulgating guidelines for preservation. In December 2000 the NLA issued "Safeguarding Australia's Web Resources," which provides advice on creating, describing, naming, and managing Web resources. The Council on Library and Information Resources funded NLA's Safekeeping Project, which targets 170 key items accessible through Preserving Access to Digital Information (PADI). NLA staff wrote to the resource managers, encouraging them to voluntarily preserve these materials and outlining nine strategies for long-term access. According to Susan Thomas, PADI administrator, 116 resource owners responded and safekeeping arrangements have been made for 77 items to date. Negotiations are in progress for an additional 33 resources. Eight resource owners lacked the appropriate infrastructures to comply with the recommendations. Alternative "safekeepers" have been approached for four of these. By the end of 2001, 54 resource owners had not responded.
Assuming Archival Custody
The third major focus of Web preservation has been to identify and ingest Web content into digital repositories. The best-known example is the Internet Archive, a not-for-profit organization associated with Alexa Internet, which has been automatically collecting all open access HTML [hypertext markup language] pages since 1996. Also in 1996, the NLA's Pandora adapted Web crawling to archive selected Australian online publications. That same year, the Royal Library of Sweden launched Kulturarw3 to collect, preserve, and make accessible Swedish electronic documents published online. For Pandora, ingest includes manual creation and/or clean up of metadata and the establishment of content boundaries. This approach may be cost effective for a few highly valuable documents but may be prohibitively expensive for large collections.
In 2001, the Internet Archive released the Wayback Machine, which lets users view snapshots of Web sites as they appeared at various points in the past. With more than 10 billion Web pages exceeding 100 terabytes of data and growing at a rate of 12 terabytes a month, the Internet Archive provides the best view of the early Web as well as a panoramic record of its rapid evolution over the past five years. It provides an invaluable tool for documenting change and filling some of the void in recordkeeping in the Web's early days.
However, this approach to Web preservation is only part of the solution to a much larger problem. The Internet Archive and similar efforts to preserve the Web by copying suffer from common weaknesses. Snapshots may or may not capture important changes in content and structure. Technical features such as robot exclusions, password protection, JavaScript, and server-side image maps inhibit full capture. A Web page may serve as the front end to a database, image repository, or a library management system, and Web crawlers capture none of the material contained in these so-called "deep" Web resources.
The sheer volume of material on the Web is staggering. The high-speed crawlers used by the Internet Archive traverse the entire Web every two months--even more time would be needed to treat anomalies associated with downloading. Not all sites merit the same level of attention, especially given limited resources, and means must be devised for honing selection and treating materials according to their needs.
Automated approaches to collecting Web data tend to stop short of incorporating the means to manage the risks of content loss to valuable Web documents. File copying by itself fails to meet the criteria RLG and OCLC have identified. For example, the Internet Archive has not overtly committed to continued access through changing file formats, encoding standards, and software technologies. In addition, legal constraints limit the ability of crawlers to copy and preserve the Web.
Project Prism
Current Web preservation efforts fail to consider the challenge of preserving content that an institution does not control or for which it cannot negotiate formal archiving arrangements or assume direct custody. Over time, preserving Web content will require substantial resource commitments, as well as flexible and innovative approaches to changes in technologies, organizational missions, and user expectations.
The National Science Foundation (NSF) has funded Cornell University's Project Prism, which is a joint research effort by the Computer Science Department and the University Library to support libraries and archives as they extend their role from custodians of physical artifacts to managers of selected digital objects distributed over the network. Digital curatorial responsibilities will need to be reconsidered and undertaken in light of cost, level of participation by cooperative or uncooperative partners, and technical feasibility. At the same time, the project aims to design archiving tools and services that will enable non-librarians to raise the information integrity of research collections that are now managed haphazardly, if at all. Ultimately, the goal is to create an approach to archiving distributed Web content that takes custody of digital files as a last resort, though the methodology also could be used for pre-ingest management.
Project Prism is producing a framework for developing an ongoing comprehensive monitoring program that is scalable, extensible, and cost effective. Its approach begins with characterizing the nature of preservation risks in the Web environment, develops a risk management methodology for establishing a preservation monitoring and evaluation program, and leads to the creation of management tools and policies for virtual remote control. The approach will demonstrate how Web crawlers and other automated tools and utilities can be used to identify and quantify risks; to implement appropriate and effective measures to prevent, mitigate, and recover from damage to and loss of Web-based assets; and to support post-event remediation.
The project is exploring a noncustodial, distributed model for archiving, in which resources are managed along a spectrum from, at the highest level, a formal repository to, at the lowest level, the unmanaged Web. One of the goals is to show how the integrity of unmanaged resources can be raised at minimal cost, using automated routines for monitoring and validating files according to policies established by organizations that value the longevity of those resources. The overall goal is to create archiving tools that will enable libraries, archives, commercial database providers, scholarly organizations, and individual authors to manage different sets of risks affecting the same resources remotely.
Risk Management
A risk-based preservation management program begins with two key questions: What assets may be at risk and should be included in the program, and what constitutes risks to those assets? Risk management programs should be developed and implemented within an organizational context: Each institution will need to define its own "worry radius"--the context that provides definitions of perceived risk and acceptable loss. Effective risk management also requires determining the scope and value of assets. The cost of implementing the program should be appropriate to the estimated value of the assets and the impact of their loss on operations and services.
Risk management implementation defines policies, procedures, and mechanisms to manage and respond to identifiable risks. The implemented program should balance the value of assets and the direct and indirect costs of preventing or recovering from damage or loss. The program should be known and understood both within the organization and by relevant stakeholders. An effective program includes comprehensive scope, regular audits, tested responses and strategies, built-in redundancies, and openly available, assigned responsibilities.
Automated Support Strategies
Project Prism is exploring technologies that will form the basis for a suite of tools to support risk-based preservation monitoring and evaluation of Web resources. From a technical perspective, its goal is to design feasible and appropriate mechanisms for off-site monitoring. Assuming that over time libraries and other information intermediaries will extend their collecting scope over greatly increasing amounts of distributed content and that the longevity of these resources will be a primary concern, automatic methods will be needed to handle such volume cost effectively and to produce consistent results that are less prone to human error. The methods will need to accommodate both content providers who cooperate in the effort, for example by contributing metadata, and content providers who, while not hostile to the idea of monitoring, are not collaborating. The methods also will need to be flexible enough to suit the variety of management requirements of diverse institutions.
These monitoring mechanisms should be deployable in a range of systems contexts. For a university research library, that context might be a management system used to collect lists of URLs that faculty and librarians have deemed important through some rating scale. The library might then employ the monitoring schemes outlined in the rest of this section as it assumes a role of "managing agent" for those external resources. At the other end of the spectrum, a preservation service might be a program that users could install on their own workstations to monitor Web resources of their own choice. This tool could be launched like other utility tools such as a disk defragmenter or an anti-virus scanner.
The Web resources within an organization's worry radius might be a Web site, a subset of resources in a Web site, or a single Web page or document. Furthermore, a Web resource might live in an individual's informally managed Web page or in an organization's highly controlled Web site. Defining the boundaries of a Web resource for preservation monitoring is not easy. Mechanisms for preservation risk management must address four levels of context:
* A Web page as a stand-alone object, ignoring its hyperlinks
* A Web page in local context, considering the links into it and out from it
* A Web site as a semantically coherent set of linked Web pages
* A Web site as an entity in a broader technical and organizational context
For risk analysis, some threats can be detected from the examination of a single static snapshot of a resource, while other threats become visible through analysis of how the resource changes over time. Project Prism is concerned with both the snapshot view and the time-elapsed view. For each of the four contexts, the team hypothesizes appropriate technical approaches for risk detection. By testing these hypotheses, the team can transform the results into the suite of tools it needs.
Monitoring a Web Page as a Stand-alone Object
As a stand-alone object, a Web page must be considered without regard to its hyperlinked context. What risk attributes are visible by looking at a single Web resource minus its link structure? Given a one-time snapshot of a single Web page, automated tools can observe these significant features:
* Tidiness of HTML formatting: Just as sloppy work habits reflect badly on an employee, untidy HTML is a reason for some unease about the management of a Web resource. While early versions of HTML had poorly defined structure, the recent redefinition of HTML in the context of XML [extensible markup language] has now formally defined HTML structure. The TIDY tool makes it possible to determine how well an HTML document conforms to this structure, revealing the sophistication and care of the page's manager.
* Standards conformance: Data format standards change over time, sometimes making previous versions unreadable. A monitoring mechanism could automatically determine whether a Web resource conformed to current standards. Conformance to open standards also could be considered. Arguably, Web resources formatted according to a nonpublic standard--for example, Microsoft Word documents--may be a greater longevity risk than those formatted to public standards.
* Document structure: As with HTML formatting, a document that manifests good structure may be more dependable than one that consists of text with no apparent order. Automated digital libraries such as ResearchIndex have had success with heuristics for deriving structure from PDF [portable document format], HTML, and other documents. These techniques could be used to measure the level of structure in a Web resource.
* Metadata: The presence or absence of metadata tags conforming to standards such as Dublin Core may indicate the level of management.
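Of the snapshot features listed above, the metadata check is the simplest to illustrate. The following is a minimal sketch, not Project Prism's tooling: it assumes the common convention of embedding Dublin Core elements in HTML as meta tags whose name begins with "DC.", and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect the name attribute of every <meta> tag in a page."""

    def __init__(self):
        super().__init__()
        self.meta_names = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "meta":
            name = dict(attrs).get("name")
            if name:
                self.meta_names.append(name.lower())


def dublin_core_elements(html_text):
    """Return the sorted Dublin Core element names found in the page.

    Assumption: elements are embedded as <meta name="DC.title" ...> and so on.
    """
    collector = MetaTagCollector()
    collector.feed(html_text)
    return sorted({n[3:] for n in collector.meta_names if n.startswith("dc.")})


if __name__ == "__main__":
    page = (
        '<html><head><meta name="DC.title" content="Annual Report">'
        '<meta name="DC.creator" content="Library"></head></html>'
    )
    print(dublin_core_elements(page))
```

An empty result does not prove neglect, since a page may be described by metadata held elsewhere; it is one signal to weigh alongside the others.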
Automatic mechanisms could track the following characteristics over time:
* HTTP [hypertext transfer protocol] response code: HTTP defines response codes that indicate transfer error or success. An off-site monitor could record the incidence of HTTP response codes over time, and certain patterns of codes, such as a high frequency of 404 ("Not Found") codes, could be used to measure risk.
* Response time: A server with widely fluctuating response times or consistently slow response time indicates a higher level of risk than one that is responsive.
* Page changes: For certain types of pages, no changes at all might indicate complete lack of management or maintenance. On the other hand, unpredictable and large changes might indicate chaotic management. Pages that change on some predictable schedule with some predictable delta might indicate high-integrity management. Monitoring mechanisms that employ copy detection methods or page-similarity metrics would be useful for developing a measurement for page changes over time.
* Page relocation: The lack of persistence of URLs is a well-known problem. Certainly, the disappearance of a selected resource, evidenced by consistent "page-not-found" errors, should be a cause for alarm. Techniques such as "robust hyperlinks" might make it possible to track the movement of a resource across the Web and use that movement and/or replication to determine risk.
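Two of the time-based characteristics above reduce to simple measurements once a monitor has stored its observations. The sketch below assumes the monitor already holds a list of HTTP status codes from repeated visits and the text of two successive snapshots; both function names are hypothetical. Python's difflib stands in here for the copy-detection and page-similarity methods the text mentions, which are more sophisticated.

```python
from collections import Counter
from difflib import SequenceMatcher


def not_found_rate(status_codes):
    """Fraction of recorded HTTP responses that were 404 (Not Found)."""
    if not status_codes:
        return 0.0
    return Counter(status_codes)[404] / len(status_codes)


def change_ratio(old_text, new_text):
    """Return 0.0 for identical snapshots, rising toward 1.0 as they diverge.

    Uses difflib's similarity ratio as a simple stand-in metric.
    """
    return 1.0 - SequenceMatcher(None, old_text, new_text).ratio()


if __name__ == "__main__":
    history = [200, 200, 404, 200]  # illustrative observations, not real data
    print(not_found_rate(history))
    print(change_ratio("Opening hours: 9-5", "Opening hours: 9-6"))
```

What counts as an alarming rate or an alarming delta is a policy decision for the monitoring institution; the code only supplies the numbers.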
Monitoring a Web Page in a Hyperlinked Context
The hyperlinked structure of a Web page, its in-links and out-links, has been exploited successfully in the development of better Web search engines. Similarly, such "link context," the links out from a page and the links from other pages to that page, may prove useful in deducing longevity risks.
Using a page snapshot, risks can be detected by analyzing:
* Out-link structure: Consider a page that links to a number of pages on the same server, in contrast to another page that either has no out-links or only links to pages on other servers. Intuitively, the "intralinked page" may be more integrated into a site and at lower risk. Pages with no links at all might be considered highly suspicious, having the appearance of "one-offs" rather than long-term Web resources.
* In-link structure: An equal if not greater indicator of longevity risk is the number of links from other pages to a page and the nature of those links. Ascertaining the absence of in-links in the Web context is hard because it requires crawling the entire Web.
* Page provenance: The URL of a Web page can itself provide metadata about the page's provenance and management structure. The host name often provides useful information on the identity (the "address") of the Web server hosting a page and, less reliably, the name of the institution responsible for publishing the page. A top-level domain name can help classify a publishing organization by type (.edu, .gov, .com). Also, the path name may provide clues about organizational subunits that may be responsible for managing a Web page or site. Project Prism will investigate the correlation between top-level domain name and preservation risks.
* Link volatility: Once the nature of the links to and from a page is determined, it is useful to compare changes in those links over time. If out-links are added or updated, a page is evidently being maintained and is at reduced risk. A decrease in in-links may indicate approaching isolation and should cause concern.
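The out-link and provenance checks above can be approximated from a single page snapshot. This is a minimal sketch under stated assumptions: the page's URL and HTML are already in hand, only anchor tags are counted as links, and the function names and the example.edu address are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def out_link_profile(page_url, html_text):
    """Count out-links that stay on the page's host versus those that leave it."""
    collector = LinkCollector()
    collector.feed(html_text)
    page_host = urlsplit(page_url).hostname
    internal = external = 0
    for href in collector.hrefs:
        # Resolve relative links against the page's own URL.
        target = urlsplit(urljoin(page_url, href))
        if target.scheme not in ("http", "https"):
            continue  # skip mailto:, ftp:, and similar non-Web links
        if target.hostname == page_host:
            internal += 1
        else:
            external += 1
    return {"internal": internal, "external": external}


def provenance(page_url):
    """Split a URL into the provenance clues discussed above."""
    parts = urlsplit(page_url)
    host = parts.hostname or ""
    return {
        "host": host,
        "top_level_domain": host.rsplit(".", 1)[-1] if "." in host else "",
        "path_segments": [s for s in parts.path.split("/") if s],
    }


if __name__ == "__main__":
    url = "http://www.example.edu/library/index.html"
    html = '<a href="about.html">About</a> <a href="http://other.example.com/">Other</a>'
    print(out_link_profile(url, html))
    print(provenance(url))
```

Running the same profile on successive snapshots and comparing the counts gives a rough measure of out-link volatility. In-links cannot be obtained this way, since they require data from a crawl of other sites.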
Assessing the Risk
Assessing the longevity risk of a Web site will require algorithms for aggregating the risk metrics of its individual pages. Additionally, the structure of the site might serve as an indicator of risk. To analyze this structure we can exploit the wealth of work and algorithms on graphs and the characterization of the Web as a directed graph. In this characterization, resources (documents) at URLs are nodes and the hyperlinks from documents at URLs to documents at other URLs are directed edges in the graph. The organization of a site's internal structure might be appropriate for risk analysis, just as for an individual page. Using graph analysis methods to derive cliques or strongly connected components from graph representations of site structure may make it possible to develop a set of patterns that reflect good site management.
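To make the graph characterization concrete, the sketch below represents a site as a dictionary mapping each URL to the URLs it links to and extracts its strongly connected components. Kosaraju's algorithm is used here as one standard choice; the text does not say which method Project Prism applies, and the function name and sample site are hypothetical.

```python
def strongly_connected_components(graph):
    """Kosaraju's algorithm; graph maps each URL to the list of URLs it links to."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)

    # Pass 1: depth-first search on the original graph, recording finish order.
    visited, finish_order = set(), []
    for start in sorted(nodes):
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph.get(start, ())))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph.get(nxt, ()))))
                    break
            else:
                # All out-links explored: this node is finished.
                finish_order.append(node)
                stack.pop()

    # Reverse every edge.
    reversed_graph = {n: [] for n in nodes}
    for source, targets in graph.items():
        for target in targets:
            reversed_graph[target].append(source)

    # Pass 2: search the reversed graph in decreasing finish order.
    assigned, components = set(), []
    for start in reversed(finish_order):
        if start in assigned:
            continue
        assigned.add(start)
        component, todo = [], [start]
        while todo:
            node = todo.pop()
            component.append(node)
            for prev in reversed_graph[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    todo.append(prev)
        components.append(sorted(component))
    return components


if __name__ == "__main__":
    site = {"/": ["/a", "/b"], "/a": ["/"], "/b": ["/c"], "/c": []}
    print(sorted(strongly_connected_components(site)))
```

Pages that share a component can all reach one another by following links. Whether a large component, or many single-page components, actually signals good or poor management is the kind of hypothesis the project would need to test.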
Based on the static analysis of a site's structure, it would then be possible to analyze changes to it over time. How the Web site evolves should be considered another indicator of risk. A site where links are added or modified regularly and which conforms to a discernable structure exemplifies good management practices and, thus, lower risk.
A Web site is a collection of Web pages, but it also resides on a server within an administrative context, all of which may be affected by the external technical, economic, legal, organizational, and cultural environment. Identifying, monitoring, and managing the ecology of a Web site involves the individual and collective analysis of a number of factors at these different levels--more than just checking for HTTP codes that indicate a page is unavailable or has moved. Problems can be caused by server software misconfiguration, bad cables and router failure, denial-of-service attacks, and many other factors. It is entirely possible that the biggest threat to the continued health of a Web site has nothing to do with how well the site is maintained or even how often it is backed up but whether the backup tapes are stored in the same room as the server--increasing the chance that a single catastrophic event (fire, flood, earthquake) could destroy them both.
Some environmental factors can be monitored remotely, in tandem with direct monitoring of the Web site itself. Slowness or unresponsiveness could indicate hardware failure or power interruption, excessive load on the server from legitimate use, Web crawling, hacker attack, or a network problem. Network utilities such as Ping and Traceroute can help determine whether the problem is confined to Web services, the particular machine, or the larger network. Specialized software for the Web can reveal internal security hazards such as viruses, Trojan horses, outdated software, missing patches, and incorrect configurations. Adapting these tools and utilities will add to Project Prism's preservation risk management toolkit.
Assessing Technological Watersheds
Through the longevity study (www.library.cornell.edu/preservation/prism.html) and future crawls of the Internet Archive, Project Prism is identifying significant technology watersheds that may put Web sites at risk. The Web crawler and other tools can be used to analyze the use of markup languages, MIME [Multipurpose Internet Mail Extensions] types, and other attributes of Web pages that reflect evolving standards and practice. Certain periods may merit closer scrutiny than others. Times of intense and rapid growth generally coincide with greater competition and the need to be more agile and flexible to survive. Periods when many new standards and features are introduced also would present greater risk to content. The Web sites that have been captured in the Internet Archive provide an ideal set of materials for testing these hypotheses by allowing characterization of the introduction and domination of markup languages and formats, the introduction of various types of dynamic behavior, and changes in the use of header fields and tags.
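One way to look for such watersheds is to tally how the mix of MIME types shifts from year to year. The sketch below assumes crawl records have already been reduced to pairs of capture year and Content-Type header value; that record shape, the function name, and the sample values are hypothetical and purely illustrative.

```python
from collections import Counter, defaultdict


def mime_share_by_year(records):
    """Return, for each year, the share of captures held by each media type.

    records: iterable of (year, Content-Type header value) pairs.
    """
    counts = defaultdict(Counter)
    for year, content_type in records:
        # Drop parameters such as "; charset=ISO-8859-1" and normalize case.
        media_type = content_type.split(";", 1)[0].strip().lower()
        counts[year][media_type] += 1
    shares = {}
    for year, counter in counts.items():
        total = sum(counter.values())
        shares[year] = {mt: n / total for mt, n in counter.items()}
    return shares


if __name__ == "__main__":
    sample = [  # made-up records for illustration only
        (1997, "text/html"),
        (1997, "text/html; charset=ISO-8859-1"),
        (2001, "text/html"),
        (2001, "application/pdf"),
    ]
    print(mime_share_by_year(sample))
```

A sharp rise or fall in one type's share between adjacent years would mark a period worth closer study.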
Risk Research
Project Prism is using the Web crawler to study risk factors for Web pages and Web sites. At the server level, it is reviewing the kinds of tools that can be developed or adapted to analyze and mitigate potential risks. While an organization may take on the preservation management of its own Web sites, the project is interested in scenarios that must consider two organizational players: the entities that control the Web sites and the entities that are interested in the longevity of those Web sites. In the first round, significant factors in the administrative context and external environment are being identified, but in-depth work in these areas will be part of the team's follow-up research.
References
"A Comprehensive Assessment of Public Information Dissemination." Available at www.nclis.gov/govt/assess/assess.vol1.pdf (accessed 29 July 2002).
Arms, William Y., Roger Adkins, Cassy Ammen, and Arlene Haynes. "Collecting and Preserving the Web: The Minerva Prototype." RLG DigiNews. April 15, 2001. Available at www.rlg.org/preserv/diginews/diginews5-2.html#feature1 (accessed 29 July 2002).
Bergman, Michael. "The Deep Web: Surfacing Hidden Value." The Journal of Electronic Publishing. August 2001. Available at www.press.umich.edu/jep/07-01/bergman.html (accessed 11 July 2002).
Brin, S., and L. Page. "Anatomy of a Large-Scale Hypertextual Web Search Engine." Computer Networks and ISDN Systems. 1998.
Byrnes, Christian. "Information Risk Management: Why Now?" www.trusecure.com/html/tspub/whitepapers/irm.pdf
Fielding, R., J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, and T. Berners-Lee. "Hypertext Transfer Protocol--HTTP/1.1." The Internet Society. RFC 2616, June 1999. Available at www.ietf.org/rfc/rfc2616.txt (accessed 11 July 2002).
Flecker, Dale. "Preserving Scholarly E-Journals." D-Lib Magazine. September 2001. Available at www.dlib.org/dlib/september01/flecker/09flecker.html (accessed 11 July 2002).
Global Association of Risk Professionals (GARP). Available at www.garp.com/index-b.htm (accessed 29 July 2002).
Greenstein, D., S. Thorin, and D. Mckinney. "Draft report of a meeting held on 10 April in Washington, D.C., to discuss preliminary results of a survey issued by the DLF to its members." April 23, 2001. Available at www.diglib.org/roles/prelim.htm (accessed 11 July 2002).
Information Management Forum Internet and Intranet Working Group (Government of Canada). "An Approach to Managing Internet and Intranet Information for Long Term Access and Accountability." Available at www.imforumgi.gc.ca/iapproach_e.html (accessed 29 July 2002).
Kleinberg, J. M. "Authoritative Sources in a Hyperlinked Environment." Journal of the ACM. 1999.
Kleindorfer, Paul R. "Industrial Ecology and Risk Analysis." Available at http://grace.wharton.upenn.edu/risk/downloads/01-23-PK.pdf (accessed 29 July 2002).
Kumar, S. R., P. Raghavan, S. Rajagopalan, D. Sivakumar, A. S. Tomkins, and E. Upfal. "The Web as a Graph." Presented at Nineteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. Dallas, 2000.
Kunreuther, Howard, Patricia Grossi, Nano Seeber, and Andrew Smyth. "A Framework for Evaluating the Cost-Effectiveness of Mitigation Measures." Presented at the Bogazici University/Columbia University Workshop. Available at http://grace.wharton.upenn.edu/risk/downloads/01-18-HK.pdf (accessed 11 July 2002).
Lawrence, Gregory W., William R. Kehoe, Oya Y. Rieger, William H. Walters, and Anne R. Kenney. Risk Management of Digital Information: A File Format Investigation. Available at www.Clir.org/pubs/abstract/pub93abst.html (accessed 11 July 2002).
Lawrence, S., K. Bollacker, and C. L. Giles. "Digital Libraries and Autonomous Citation Indexing." IEEE Computer 32, No. 6, 1999.
National Archives & Records Administration. "Records Management Requirements." Available at www.archives.gov/records_management/policy_and_guidance (accessed 29 July 2002).
National Archives of Australia. "Web Policy and Guidelines." Available at www.naa.gov.au/recordkeeping/er/summary.html (accessed 29 July 2002).
National Library of Australia. "Safekeeping Strategies." Available at www.nla.gov.au/padi/safekeeping/safekeeping.html#ss (accessed 29 July 2002).
The NEDLIB Harvester Project. Available at www.csc.fi/sovellus/nedlib (accessed 29 July 2002).
Nonprofit Risk Management Center. "Making Net Gains: Staying Safe While Making a Name for Your Nonprofit on the Internet." Available at www.nonprofitrisk.org/nwsltr/current/nl901_3.htm (accessed 11 July 2002).
OCLC Web Characterization Web Site. Available at http://wcp.oclc.org (accessed 29 July 2002).
Phelps, T. A., and R. Wilensky. "Robust Hyperlinks: Cheap, Everywhere, Now." Presented at Digital Documents and Electronic Publishing (DDEP00), Munich, 2000.
Raggett, D. "Clean up your Web pages with HTML TIDY." W3C. 2000. Available at www.w3.org/People/Raggett/tidy/ (accessed 11 July 2002).
RLG-OCLC. "Attributes of a Trusted Digital Repository: Meeting the Needs of Research Resources." 2001. Available at www.rlg.org/longterm/attributes01.pdf (accessed 11 July 2002).
Rivard, Catherine L. and Michael A. Rossi. "Is Computer Data 'Tangible Property' or Subject to 'Physical Loss or Damage'?--Part 1 and Part 2." Insurance Law Group Inc., August 2001 and November 2001. Available at www.irmi.com/expert/articles/rossi008.asp (accessed 11 July 2002).
Shivakumar, N., and H. Garcia-Molina. "Finding Near-Replicas of Documents on the Web." Presented at WebDB'98, 1998.
Smithsonian Institution. "Archival Preservation of Smithsonian Web Resources: Strategies, Principles, and Best Practices." Available at www.si.edu/archives/archives/dollar%20report.html (accessed 29 July 2002).
Wood, Angus. "Integrating Risk Assessment into the Enterprise Information Management Strategy," presented at the Sixth International Pipeline Reliability Conference, November 19-22, 1996, Houston, Texas. Available at www.itpapers.com/cgi/PSummaryIT.pl?paperid=8433&scid=88 (accessed 11 July 2002).
World Wide Web Consortium, XHTML 1.0: The Extensible HyperText Markup Language, 2nd ed., 2001. Available at www.w3.org/TR/2001/WD-xhtml1-20011004/ (accessed 11 July 2002).
Zhang, K., J. T. L. Wang, and D. Shasha, "On the Editing Distance between Undirected Acyclic Graphs and Related Problems." Presented at CPM Combinatorial Pattern Matching, 1995.
READ MORE ABOUT IT
Additional Risk Management Resources. www.library.cornell.edu/iris/research/prism/rm-resources.html (accessed 29 July 2002).
Dublin Core Metadata Initiative. Available at http://dublincore.org (accessed 1 August 2002).
The International Risk Management Benchmarking Association (IRMBA). Available at www.irmba.com (accessed 1 August 2002); Risk Management Reports. Available at www.riskreports.com (accessed 1 August 2002).
The Internet Archive. Available at www.archive.org (accessed 1 August 2002).
Kleindorfer, Paul R. "Industrial Ecology and Risk Analysis" in Handbook of Industrial Ecology. L. Ayres and R. Ayres, eds. United Kingdom: Edward Elgar, 2001.
Kunreuther, Howard, and Patricia Grossi. "The Role of Uncertainty on Alternative Disaster Management Strategies." April 2001. Available at http://grace.wharton.upenn.edu/risk/downloads/01-15-HK.pdf (accessed 11 July 2002).
Library Project Prism. Available at www.library.cornell.edu/preservation/prism.html (accessed 1 August 2002).
McClure, Charles R., J. Timothy Sprehe, and Kristen Eschenfelder. Performance Measures for Federal Agency Websites. Available at www.defenselink.mil/webmasters/technical/measures/measures.pdf (accessed 11 July 2002).
McNamee, David. "Assessing Risk Assessment," in New Perspectives on Healthcare Internal Auditing. Available at www.mc2consulting.com/riskart2.htm (accessed 11 July 2002).
Mercator Web Crawler. Available at www.research.compaq.com/SRC/mercator/ (accessed 1 August 2002).
Miller, Jean C. "Risk Management for Your Web Site." IRMI.com. 2000. Available at www.irmi.com/expert/articles/schoenfeld003.asp (accessed 11 July 2002).
The National Risk Management Research Laboratories. Available at www.epa.gov/ordntrnt/ORD/NRMRL/ (accessed 1 August 2002).
Paperwork Reduction Act. Available at http://frwebgate.access.gpo.gov (accessed 11 July 2002).
PANDORA Archive. Available at http://pandora.nla.gov.au/index.html (accessed 1 August 2002).
Preserving Presidential Library Websites. San Diego Supercomputer Center. Available at www.sdsc.edu/TR/TR-2001-03.pdf (accessed 1 August 2002).
Project Prism. Available at www.prism.cornell.edu/ (accessed 1 August 2002).
The Royal Swedish Web Archive. Available at www.ifla.org/IV/ifla66/papers/154-157e.htm (accessed 1 August 2002).
The Four Phases of Project Prism
Project Prism's four main phases map well to the typical stages of risk management programs.
Phase 1: Risk Identification--the process of detecting potential risks or hazards through data collection. The team is using both automated and manual techniques to collect data and characterize potential risks to Web resources. Web crawling is one effective way to collect information about the state of Web pages and sites. The Prism team employs the Mercator Web crawler to collect and analyze data to test hypotheses about the relationship between observable characteristics of Web resources and threats to longevity.
Phase 2: Risk Classification--the process of developing a structured model to categorize risk and fitting observable risk attributes and events into the model. The Prism team combines quantitative and qualitative methods to characterize and classify the risks to Web pages, Web sites, and the hosting servers.
Phase 3: Risk Assessment--the process of evaluating the identified risks. Variables to consider include the value of assets, possible threats, known vulnerabilities, likelihood of loss, and potential safeguards. The team is defining a data model for storing risk-significant information. This model reflects key attributes about Web assets, observed events in the life of these resources, and information about the resources' environment. A key aspect of risk assessment in Prism is defining and detecting significant patterns that may exist in this data.
Phase 4: Risk Analysis--the process of determining the potential impact of risk patterns or scenarios, the possible extent of loss, and the direct and indirect costs of recovery. This step identifies vulnerabilities, considers the willingness of the organization to accept risk given potential consequences, and develops mitigation responses. Artificial intelligence methods, decision support systems, and profiles of organizations all support risk analysis. The resulting knowledge and exposure databases provide evolving sources of information for analyzing potential risks. Project Prism is developing a knowledge base that could be characterized as a risk analysis engine.
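To make the Phase 3 data model more concrete, the following Python sketch shows one way a record might hold the three kinds of information described above: key attributes of a Web asset, observed events in its life, and details of its environment. It is an illustration only, not Project Prism's actual schema. The class names, field names, event kinds, weights, and scoring rule are all hypothetical assumptions chosen for the example.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List


@dataclass
class Event:
    """An observed event in the life of a Web resource (hypothetical schema)."""
    when: date
    kind: str  # assumed labels, e.g. "http_404", "server_timeout", "content_changed"


@dataclass
class WebAsset:
    """Key attributes of a Web asset, its environment, and its observed events."""
    url: str
    value: int  # assumed scale: 1 (low) to 5 (high) importance to the collection
    environment: Dict[str, str] = field(default_factory=dict)
    events: List[Event] = field(default_factory=list)


# Assumed weights: how strongly each event kind signals a threat to longevity.
THREAT_WEIGHTS = {"http_404": 3, "server_timeout": 2, "content_changed": 1}


def risk_score(asset: WebAsset) -> int:
    """Toy score: asset value times the summed weights of its observed events."""
    threat = sum(THREAT_WEIGHTS.get(e.kind, 0) for e in asset.events)
    return asset.value * threat


def rank_by_risk(assets: List[WebAsset]) -> List[WebAsset]:
    """Return the assets ordered from highest to lowest toy risk score."""
    return sorted(assets, key=risk_score, reverse=True)


# Usage with made-up example data.
journal = WebAsset(
    url="http://example.org/journal",
    value=5,
    environment={"server": "unknown"},
    events=[Event(date(2002, 7, 1), "http_404")],
)
news = WebAsset(
    url="http://example.org/news",
    value=2,
    events=[
        Event(date(2002, 7, 2), "content_changed"),
        Event(date(2002, 7, 9), "server_timeout"),
    ],
)
ranked = rank_by_risk([news, journal])
for asset in ranked:
    print(asset.url, risk_score(asset))
```

Multiplying value by observed threat is only one simple way to combine the variables. A real assessment would also weigh known vulnerabilities, likelihood of loss, and the safeguards already in place.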
Web Site Care
Comprehensive care of a Web site must include:
* Hardware and software environment, including any upgrades to the operating system and Web server, the installation of security patches, the removal of insecure services, use of firewalls, etc.
* Administrative procedures, such as contracting with reputable service providers, renewing domain name registration, etc.
* Network configuration and maintenance, including load balancing, traffic management, and usage monitoring
* Backup and archiving policies and procedures, including the choice of backup media, media replacement interval, number of backups made, and storage location
* Physical location of the server and its vulnerability to fire, flood, earthquake, electric power anomalies, power interruption, temperature fluctuations, theft, and vandalism
Anne R. Kenney is assistant university librarian at Cornell University Library. She can be reached at ark3@cornell.edu. Nancy Y. McGovern is the digital preservation officer and heads the Research Department in Instruction, Research, and Information Sciences (IRIS) at Cornell University Library. She also is the co-editor of RLGDigiNews. She can be reached at nm84@cornell.edu. Peter Botticelli, Richard Entlich, Carl Lagoze, and Sandra Payette also co-wrote this article.