Ellis Weinberger [1]
Article based on 29 January 2003 presentation at INSEAD, Fontainebleau
My name is Ellis Weinberger. I am a Research Associate at Cambridge University Library. I developed collection management policy, intellectual property rights policy, and security policy in digital object preservation for the CEDARS Project [2], and developed a migration strategy for the BBC Domesday digital object on behalf of the CAMiLEON Project [3]. The CEDARS Project and the CAMiLEON Project produced working examples of digital object preservation technology, and developed policy guidance based upon knowledge of digital object preservation technology.
I now supervise the digitising project, supervise catalogue typesetting, and develop the web site for the Taylor-Schechter Genizah Research Unit [5] at Cambridge University Library, and preserve and develop the editing and typesetting system for the Darwin Correspondence Project [4] at Cambridge University Library.
Today I will explain what digital object preservation policy is for, and what questions policy should address.
There is one simple goal for any policy for digital object preservation. This goal is the long term preservation of the ability to use the intellectual content of the preserved digital objects.
We should use policy to help us make decisions in fields influenced by digital object preservation issues. Digital object preservation policy should help us decide both how to create digital objects and how to collect digital objects, and should help us judge plans for repositories to preserve digital objects.
The only way to ensure the long term preservation of the ability to use digital objects is to convince governments to support repositories for the preservation of digital objects. Long term is a period of time long enough for there to be concern about the impacts of changing technologies, and of a changing user community, on the preservation of information being held in a repository. Changing technology includes support for new media and data formats. This period extends into the indefinite future. Long term preservation is the provision of adequate facilities to protect or maintain the ability to use digital objects over the long term.
The repositories must be supported by appropriate legislation and money. The legislation must enable repositories to take all necessary steps to preserve the ability to use digital objects. The legislation must also encourage publishers to provide repositories with digital objects the repositories can preserve. The money must support the long term, continuing process of preservation.
A repository which is merely a random collection of objects will be more likely to lose financial support than one that has structure and purpose evident in the collection itself. University libraries and legal deposit libraries have shown themselves to be able to preserve print and manuscript collections down through the generations. These organisations are suitable candidates to preserve the intellectual content of digital objects for future generations.
Whether we acquire or create a digital object, we need to make sure that preservation of the digital object will be as cheap as possible. Acquiring or creating a digital object, and preparing it for use, will be expensive, and preserving it will involve additional costs. In order to protect the investment of our institutions in digital objects, we will need to make decisions, and take actions, before we start to create or acquire a digital object.
Why do we need to start so early? Good vegetable ink, on good acid-free paper, stored in a cool, dry, dark room, will last a thousand years. Vellum will last longer. Ink on paper is very stable. But a digital object, stored on any kind of electronic or magnetic medium, and left in a cool, dry, dark room, will probably not be usable after five years. Either the medium will have deteriorated, or the hardware to read the medium will have disappeared, or the software to interpret the information on the medium will have become unavailable. Therefore we will have to transfer digital objects regularly to a new platform in order to ensure preservation of the ability to use the intellectual content of the digital objects. A platform runs the software necessary to provide the ability to use the preserved digital object.
Since we will need to move digital objects from one platform to another in order to preserve them, we should use open and standard data formats. Well specified, open, and standard data formats for digital objects help increase the likelihood that the digital objects will be portable across platforms. The specification of a format needs to be good enough to enable anyone to write software for it; open, so that everyone can use the specification; and maintained as a standard by a reputable agency, so that it can be relied upon to be current and accepted.
We should use Free or Open Source software in digital objects we create or acquire. Software described as Free, or as Open Source, is Free or Open in the sense that we can make copies of the software, distribute those copies, have access to the software's source code, and make improvements to the software. An example of this kind of software is GNU/Linux. The use of Free Software or Open Source software in our digital objects will help us to preserve our digital objects because:
The digital objects we create or acquire must have good documentation. Many people down the generations will be involved in the preservation of digital objects. We can help them by giving them as much information as we can about the digital objects.
We should document the structure of the digital objects, the data formats of the digital objects, the purpose of the digital objects, and the software used in the digital objects. The documentation needs to be produced in a clear and well structured manner. The structure of the documentation, and the terms used in the documentation, need to be explicitly explained in the documentation.
Acquiring and preserving digital objects will cost more than acquiring and preparing books for use in a library. Therefore, acquisition and preservation costs will restrict the number of items which an institution will be able to process. To help select the items, we must have clear collection management policies.
An institution must determine which digital objects it should preserve. An institution will probably have to take responsibility for the preservation of material it produces itself. This may be the product of a project funded by, or housed in, the institution. The subject holdings of the institution will depend on its responsibilities, which in turn will depend on its position within administrative structures. An institution may have links to local and regional organisations which depend on it to provide subject holdings. An institution may be responsible for material about its region, or for material of interest to its parent body.
Selection of digital objects for preservation will depend on the responsibilities, and on the financial and technical resources of the institution. An institution will tend to select an item based on the content of the item. Selecting the right items for a collection may also mean selecting items which the institution has the capability to preserve, and which can be preserved at reasonable cost. The responsibilities and resources of the institution will help determine, for example:
Institutions acquiring digital objects should consider whether they require the right to preserve the digital objects and the right to preserve access to their intellectual content, or whether the right to provide users with access to the acquired digital object is sufficient.
An institution which requires the right to preserve the digital objects and the right to preserve access to their intellectual content should ensure that the purchase agreement or the licence for a digital object includes the right to preserve the digital object, to preserve access to it, and to use its intellectual content after the licence has expired.
The purchaser or licensee should decide whether to preserve the object themselves, or whether preservation should be carried out by the publisher, the body paying for the object, or a third party. Preservation should be organised on a collaborative basis.
If access to the digital object is protected by a hardware device or a remote authorisation site, then the digital repository will need to preserve the ability to convince the digital object that the hardware device or the remote authorising site exists. Another method of dealing with technical content protection could be to negotiate with the publishers in order to obtain a copy of the object which is not protected by technical means, but which is protected by an access agreement fulfilling the needs of the publisher and the digital repository. The digital repository would need to preserve the metadata defining the access agreement.
Any implementation of access and usage control will depend on the accurate maintenance of metadata about staff roles, access restrictions, and access privileges. This metadata is of some complexity, as this section explains.
In order to maintain the integrity of the usage and access restrictions, it is necessary to create metadata that sets out the restrictions that apply to a digital object. This metadata must itself be preserved and held securely. This involves preserving not only the raw information but also a change history that indicates the person who changed it, their role and their reason. It is then necessary to record information about the people involved, what roles they have, what access privileges attach to that role and when they were allocated the role. The complexity of the metadata continues recursively with records being required to indicate who allocates roles and privileges to roles. The security of the metadata might be enhanced by insisting that any changes to the metadata must be appended to existing metadata, and that any changes can only be done by two people working together.
Lest this be thought excessive, it should be remembered that insiders perform the majority of security compromises, and that the ability to change metadata can completely compromise the preservation of a digital object. Where the preserved records are digital, such changes can become impossible to detect unless the metadata is capable of demonstrating that changes have occurred.
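One way to picture metadata that can demonstrate that changes have occurred is an append-only log in which each entry records who made the change, in what role, and why, and in which each entry also carries a digest of the entry before it. The following is a minimal sketch only. The class names, the field names, and the choice of a SHA-256 hash chain are assumptions made for illustration, and are not taken from the CEDARS work.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Change:
    """One appended change to the access restrictions of a digital object."""
    object_id: str
    restriction: str
    changed_by: Tuple[str, str]   # the two people who made the change together
    role: str
    reason: str
    prev_hash: str                # digest of the previous entry in the log

    def digest(self) -> str:
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


class AccessLog:
    """Hypothetical append-only log; entries are added, never edited."""
    GENESIS = "0" * 64

    def __init__(self) -> None:
        self._entries: List[Change] = []

    def head(self) -> str:
        """Digest of the newest entry, to be held somewhere separate."""
        return self._entries[-1].digest() if self._entries else self.GENESIS

    def append(self, object_id, restriction, changed_by, role, reason) -> Change:
        # Two-person rule: a change needs two distinct people.
        if len(changed_by) != 2 or len(set(changed_by)) != 2:
            raise PermissionError("a change must be made by two distinct people")
        entry = Change(object_id, restriction, tuple(changed_by),
                       role, reason, self.head())
        self._entries.append(entry)
        return entry

    def current_restriction(self, object_id: str) -> Optional[str]:
        """The most recently appended restriction for an object, if any."""
        for entry in reversed(self._entries):
            if entry.object_id == object_id:
                return entry.restriction
        return None

    def verify(self, expected_head: str) -> bool:
        """Recompute the chain and compare it with a separately held head."""
        prev = self.GENESIS
        for entry in self._entries:
            if entry.prev_hash != prev:
                return False
            prev = entry.digest()
        return prev == expected_head
```

In this sketch an altered earlier entry no longer matches the digest held by the entry after it, so verification fails. The head digest would need to be held by someone other than the people able to change the log, otherwise the whole chain could simply be rewritten.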
We need to guarantee that the repository implementation we are analysing provides long term preservation of the ability to use the intellectual content of the preserved digital object. Storage and retrieval of digital objects is not enough to provide digital object preservation. The only repository which provides long term preservation is one which preserves the ability to use the preserved digital object. Digital object preservation policy must be inspired by certain basic concepts.
The significant properties of a digital object are the most valuable properties of the intellectual content of the digital object. They may include the layout of the page, the pagination of the text, or the division of the text into chapters. These properties can be determined by deciding who the users of the preserved digital object will be, and what the reason is for preserving the digital object. Some users may need the ability to interact with the digital object, and some institutions may have a mandate to preserve the ability to interact with the digital object. Other users may only need to view the ASCII of the source code, and the institution concerned may have a mandate only to preserve source code.
The underlying abstract form relates to the digital object in much the same way that a work relates to the book in which it is printed. It is the essential structure of a digital object. Choosing the correct underlying abstract form to preserve the object allows all of the significant properties of the object to be preserved correctly. The underlying abstract form may be a text file, a directory of text files, or a network of various types of files and the software necessary to use the files.
As Holdsworth and Sergeant wrote: [6]
"Many a CD actually contains a file system, and successful operation only relies on that file system. Copying such a file system onto a partition on a hard disk delivers an equivalent working representation. File placement is unimportant. Thus the file system is a viable underlying abstract form."
If the significant properties include the text, the pagination, and the fonts, the most appropriate underlying abstract form might be a PDF (Adobe® Portable Document Format) file.
Emulation is the use of a software program, an emulator, that mimics the performance of a computer system, so that digital objects designed to run on the original system will run on the emulator. Preservation by emulation is needed to preserve access to the intellectual content of preserved digital objects which are interactive or complex. A digital object like a dictionary on CD-ROM is composed of digital objects such as a database application, a user interface, format viewer applications, and various file formats of digital content. A digital object of this nature can only be preserved in a cost effective manner by emulating the platform for which it was designed.
Migration on request stores the digital object in its original file format, and uses a migration on request tool to transform it into a file format usable by viewer software at the time of transformation. The original file is preserved, to be transformed by future migration on request tools for future viewer software. By using migration on request we eliminate the propagation of errors and the loss of information caused by multiple migrations.
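The idea can be sketched in a few lines of code. This is an illustration under stated assumptions, not the CAMiLEON implementation: the class, its method names, and the format labels are all hypothetical. What it shows is that the original bytes are deposited once and never overwritten, and that every request transforms the original directly, rather than transforming the result of an earlier migration.

```python
from typing import Callable, Dict, Tuple

# A migration on request tool: takes the original bytes, returns usable bytes.
Tool = Callable[[bytes], bytes]


class MigrationOnRequest:
    """Hypothetical store that keeps originals and migrates only when asked."""

    def __init__(self) -> None:
        self._originals: Dict[str, Tuple[str, bytes]] = {}
        self._tools: Dict[Tuple[str, str], Tool] = {}

    def deposit(self, object_id: str, fmt: str, data: bytes) -> None:
        # The original is stored once and is never replaced.
        if object_id in self._originals:
            raise ValueError(f"{object_id} has already been deposited")
        self._originals[object_id] = (fmt, bytes(data))

    def register_tool(self, source_fmt: str, target_fmt: str, tool: Tool) -> None:
        # Newer tools for newer viewer formats can be added at any time.
        self._tools[(source_fmt, target_fmt)] = tool

    def request(self, object_id: str, target_fmt: str) -> bytes:
        fmt, data = self._originals[object_id]
        if fmt == target_fmt:
            return data
        tool = self._tools.get((fmt, target_fmt))
        if tool is None:
            raise LookupError(f"no tool to migrate {fmt} to {target_fmt}")
        # Always one step from the original, so errors do not accumulate.
        return tool(data)


# Example: an old Latin-1 text file requested in UTF-8 for a current viewer.
store = MigrationOnRequest()
store.deposit("letter-1", "text/latin-1", b"caf\xe9")
store.register_tool("text/latin-1", "text/utf-8",
                    lambda raw: raw.decode("latin-1").encode("utf-8"))
usable = store.request("letter-1", "text/utf-8")
```

When viewer software changes, only a new tool needs to be written and registered. The deposited original is untouched, and remains available to whatever tools are written in the future.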
In order to preserve the ability to use the intellectual content of digital objects, the repository will have to preserve the object itself, the software intermediaries which provide the ability to use the intellectual content of the object, and all the metadata for the object.
We must ensure that appropriate software intermediaries are available which can transform the preserved digital object into a usable form. We must also preserve a platform which can run the software intermediaries. By providing these elements it is possible to preserve the ability to use the intellectual content of the preserved digital objects.
For example, in order to preserve the ability to use a TIFF file, it is necessary to preserve a TIFF viewer for a specific file, depending on how the file was created; an operating system which can run the TIFF viewer; a platform which can run the operating system; and a specification for the TIFF format, in case we need to write viewer software.
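The TIFF example is a chain of dependencies, and a repository can work out everything it has to keep by following that chain. The sketch below assumes a simple hand-written table of what depends on what. The table, the item names, and the function name are hypothetical and are there only to illustrate the reasoning.

```python
from typing import Dict, List

# Hypothetical table: each item lists what it needs in order to be usable.
DEPENDS_ON: Dict[str, List[str]] = {
    "scan.tiff": ["TIFF viewer", "TIFF format specification"],
    "TIFF viewer": ["operating system"],
    "operating system": ["platform"],
    "platform": [],
    "TIFF format specification": [],
}


def preservation_set(item: str, depends_on: Dict[str, List[str]]) -> List[str]:
    """Return the item and everything it depends on, directly or indirectly."""
    found: List[str] = []

    def visit(name: str) -> None:
        if name in found:
            return  # already recorded; also guards against circular entries
        found.append(name)
        for dependency in depends_on.get(name, []):
            visit(dependency)

    visit(item)
    return found


to_preserve = preservation_set("scan.tiff", DEPENDS_ON)
```

Here `to_preserve` contains the file itself, the viewer, the operating system, the platform, and the format specification. If any one of these is missing from the repository, the ability to use the file is at risk.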
In order to provide long term preservation of the ability to use the intellectual content of the preserved digital objects, we must preserve all the digital objects in a well structured and well audited way.
The number of copies, the number of sites, and the varieties of software and hardware should be based on an examination of:
Any digital repository implementation must preserve the significant properties of the digital object using the underlying abstract form of the digital object. It should store the software intermediaries needed to preserve the significant properties in a representation network. The ability to use the preserved digital object can be preserved by the use of emulation tools or by the use of migration on request tools. A representation network manages the information about the tools which render or transform the preserved digital object into a usable form. The preserved digital object needs to be rendered or transformed into a form which provides the ability to use the intellectual content of the digital object. The tools managed by the representation network include utilities for compressing and expanding files, emulation tools, migration on request tools, and manuals for using the digital objects.
The goal of digital object preservation policy is the long term preservation of the ability to use the intellectual content of preserved digital objects. Policy should help us decide how digital objects should be created, how digital objects should be collected, and which digital repository implementation should be chosen. Long term preservation of the ability to use digital objects depends on government funding and appropriate legislation. Good policy should enable us to examine proposed solutions before we make expensive mistakes.
If you have any questions, please e-mail ew206@cam.ac.uk.
© Ellis Weinberger; last revised December 2003