28 Archival and dissemination

An important aspect of electronic publishing and workflow is the need to allow for re-use and archiving. With traditional, analogue publishing, the publication is created in its physical printed form and can be stored in a variety of ways. With digital publications, there is only the soft copy, which requires a program to interpret it. This makes digital publications vulnerable to software and hardware obsolescence, a problem that is already common.[1] There are several archive-friendly formats available at present, and the assumption is that these will continue to be supported.

Archiving formats

  1. TIFF: Tagged Image File Format
    This has become the primary photographic format, as it stores high-quality images in RGB, CMYK and LAB colour modes, offers good compression relative to image quality, is interoperable between various programs, and can store OCR information, allowing the textual content within images to be searched.
  2. PDF/A: Portable Document Format / Archive
    This stems from the widespread use of PDF for print publications. PDFs are versatile as digital documents, with in-built allowances for interactivity, multimedia and security. These features, however, require that the file be created and read with Adobe-made software. Because there are numerous other ways to create a PDF (such as Ghostscript), PDF/A was created as a format for archiving print publications digitally. Unlike a TIFF, it can store a combination of raster and vector imagery, thus improving quality. PDF/A is designed so that no special features need to be added: it is used solely to display material intended to be printed, and it can be read and interpreted by the most basic of PDF readers.
  3. XML: eXtensible Markup Language
    XML has the distinct advantage of being based in plain text. Whether it uses ASCII, Unicode or any other form of encoding, it is inherently a plain text document. This gives it the very specific advantage of being readable by the most basic of programs. XML is designed as a storage medium and programs are created to work with it, not the other way around. As such, even proprietary XML created for a specific program is readable through a simple text editor. It therefore makes sense as a storage medium, as the content is interpretable as long as the (human) language is understood.
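The point about XML being readable both as plain text and as structured data can be illustrated with a minimal sketch. The record below and its element names (`book`, `title`, `chapter`) are invented for illustration and do not represent any particular archive's schema; only Python's standard library is assumed.

```python
import xml.etree.ElementTree as ET

# A hypothetical archived record; the element names are invented for illustration.
ARCHIVED_XML = """<?xml version="1.0" encoding="UTF-8"?>
<book>
  <title>Publishing in the Digital Environment</title>
  <chapter number="28">Archival and dissemination</chapter>
</book>
"""


def read_as_plain_text(document: str) -> list[str]:
    """Treat the XML as ordinary text, much as a basic text editor would."""
    return [line.strip() for line in document.splitlines() if line.strip()]


def read_as_structure(document: str) -> dict[str, str]:
    """Parse the same text with a standard XML parser to recover its structure."""
    root = ET.fromstring(document.encode("utf-8"))
    # Map each direct child element's tag to its text content.
    return {child.tag: (child.text or "") for child in root}
```

The first function needs no knowledge of XML at all, which is the sense in which the content stays interpretable even if the program that created it disappears. The second shows that a generic parser can still recover the structure without the original program.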

Archive projects and initiatives

Internet Archive

This project began as an attempt to store the web as it developed. Overall, though, it functions as a donor-driven repository for digital information. Its functions range from running the Wayback Machine (which archives different versions of websites) to storing music, video, texts and imagery in a variety of formats, as well as emulators for old technological devices and programs.[2] See the video here.

The Open Library

This project, also founded by the Internet Archive, seeks to create a web page for every book. One could argue that this approach is similar to that of Google Books, but there is a difference. Google Books' intention is to digitise every book ever written and make it searchable, thus allowing any interested reader to find an odd phrase, recall a passage, trace a reference, extract metadata, and find the book in a library or bookstore. The Open Library focuses instead on the last aspect: how to track down every edition of a specific title. While Google Books is organised around individual editions, The Open Library treats the book as an encompassing context that holds all of its editions, and provides links for readers to access these editions in their various formats, either in a public domain form or through the WorldCat library database.

Project Gutenberg

Project Gutenberg was founded by Michael Hart on the idea that “anything that can be entered into a computer can be reproduced indefinitely”. As such, while the Internet Archive stores information in its original form, Project Gutenberg seeks to make books more available by re-typing them and creating a pure digital base version. This text file can then be reworked into HTML and e-book formats for more tailored consumption.

Rather than preserving the container, PG preserves the content for digital consumption.
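The step from a plain text base version to HTML can be sketched roughly as follows. This is only an illustration of the idea, not Project Gutenberg's actual tooling: the function name `text_to_html` is hypothetical, and the sketch assumes that paragraphs are separated by blank lines and that lines end with `\n`.

```python
from html import escape


def text_to_html(plain_text: str, title: str) -> str:
    """Wrap blank-line-separated paragraphs of plain text in minimal HTML."""
    # Split on blank lines, then collapse each paragraph's internal line breaks.
    paragraphs = [" ".join(block.split()) for block in plain_text.split("\n\n")]
    # Escape characters such as & and < so the text displays literally.
    body = "\n".join(f"<p>{escape(p)}</p>" for p in paragraphs if p)
    return (
        "<!DOCTYPE html>\n"
        f'<html><head><meta charset="utf-8"><title>{escape(title)}</title></head>\n'
        f"<body>\n{body}\n</body></html>\n"
    )
```

Because the base version carries only the content, the same text could be passed through a different function to produce another format, leaving the source file untouched.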

An accurate history of the project, written by Marie Lebert, can be found on PG.

University of Pennsylvania

The UPenn Library hosts an incredible archive of texts and imagery, including, interestingly, the Penn in Hand: Manuscript Collection, which hosts high-quality scans of manuscripts that can be viewed on-line. Libraries have an on-going effort to scan older material in order to preserve it, often because the original printed substrate has begun to decay.

Projects such as these highlight the importance of metadata in fully describing the information so that we can go and search for it.
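A minimal sketch of why metadata matters for retrieval: the scans themselves cannot be searched by title or date, but the records that describe them can. The field names and catalogue entries below are invented for this example and are not drawn from any of the archives mentioned.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Descriptive metadata for one archived item (field names are illustrative)."""
    title: str
    creator: str
    year: int
    file_format: str


# Hypothetical catalogue entries, invented for this example.
CATALOGUE = [
    Record("Letters from the Cape", "A. Example", 1850, "TIFF"),
    Record("A Treatise on Printing", "B. Sample", 1902, "PDF/A"),
    Record("Printing House Ledger", "Unknown", 1850, "TIFF"),
]


def search(records, **criteria):
    """Return the records whose metadata matches every given field.

    Text fields match on a case-insensitive substring; other fields must be equal.
    """
    def matches(record):
        for field_name, wanted in criteria.items():
            actual = getattr(record, field_name)
            if isinstance(wanted, str) and isinstance(actual, str):
                if wanted.lower() not in actual.lower():
                    return False
            elif actual != wanted:
                return False
        return True

    return [record for record in records if matches(record)]
```

For instance, `search(CATALOGUE, year=1850, file_format="tiff")` narrows the catalogue by date and format without opening a single image file. An item with no metadata record would simply never be found this way.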


  1. This refers to the situation in which hardware and software are no longer used and subsequently become hard to access. Consider the material lost on Betamax video tapes, Sega game cartridges and Microsoft Reader files. As production technologies change and are surpassed, the old programs we used are sometimes lost and no longer supported. So we create material, but lose the means to access it.
  2. See the software archive.

License

Publishing in the Digital Environment Copyright © 2013 by University of Pretoria. All Rights Reserved.