Encoding full-text links in the eprint jump-off page

by Greg Tourte and Andy Powell
UKOLN, University of Bath
DRAFT Version 1.0


Introduction

The document 'Using simple Dublin Core to describe eprints' [1] provides a set of guidelines for using simple Dublin Core [2] metadata to describe eprints. The intention is to encourage consistency in the metadata that is exposed by eprint archives using the 'oai_dc' format within the OAI Protocol for Metadata Harvesting (OAI-PMH) [3]. That document recommends using dc:identifier to encode the URL of the 'jump-off' page for the eprint (as served by the eprint archive from which the eprint is available) and dc:relation to encode the URLs for each manifestation of the eprint (PDF, HTML, RTF, etc.).

However, there is a problem. Because dc:relation can also be used to encode other URLs (for example, the URLs of other documents that are cited by the eprint being described), it is not possible to write software that can easily harvest the full-text of an eprint knowing only the simple DC metadata about the eprint. Similarly, there is no reliable way of parsing the jump-off page in order to determine the URL for each manifestation of an eprint.

This document attempts to solve this problem by recommending a mechanism for unambiguously encoding the URLs for each manifestation of an eprint within the (X)HTML [4] jump-off page for the eprint.

Providing links to the full-text of an eprint requires two pieces of information for each manifestation: its URL and its format (MIME type).

A commonly used technique for providing these two bits of information is to encode them both within dc:format, for example:

  <dc:format>text/html http://eprints.bath.ac.uk/12345/</dc:format>

Unfortunately, this approach breaks the intended semantics of the DC Format element [5] and, for reasons of interoperability, should not be used.

In summary, simple Dublin Core does not contain enough structure to allow the full-text URLs to be reliably extracted from the oai_dc record.
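The ambiguity can be illustrated with a minimal Python sketch. The oai_dc record below is invented for illustration (the title and the www.example.org URL are assumptions, not taken from any real archive), and the helper name relation_urls is hypothetical. It simply shows that a harvester reading dc:relation receives a flat list of URLs with nothing to distinguish a manifestation of the eprint from a cited document.

```python
import xml.etree.ElementTree as ET

# Hypothetical oai_dc record, invented for illustration only.
RECORD = """\
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An example eprint</dc:title>
  <dc:identifier>http://eprints.bath.ac.uk/12345/</dc:identifier>
  <dc:relation>http://eprints.bath.ac.uk/12345.pdf</dc:relation>
  <dc:relation>http://www.example.org/a-cited-paper.html</dc:relation>
</oai_dc:dc>
"""

# Namespace of the simple Dublin Core elements, in ElementTree's {uri}name form.
DC = "{http://purl.org/dc/elements/1.1/}"


def relation_urls(record_xml):
    """Return the text of every dc:relation element in the record."""
    root = ET.fromstring(record_xml)
    return [el.text.strip() for el in root.iter(DC + "relation") if el.text]


urls = relation_urls(RECORD)
for url in urls:
    # Both values are plain URLs; the record does not say which one
    # is the full-text and which one is a citation.
    print(url)
```

A harvester could guess from the host name or the file extension, but any such heuristic is an assumption about the archive rather than something the metadata states.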

Using the (X)HTML <link> element in the jump-off page

Recommendation: Use one or more <link> elements in the <head> section of the (X)HTML 'jump-off' page, to provide the URL of each manifestation of the eprint. The <link> element should be used as follows (in XHTML):

  <link rel="alternate" class="fulltext" type="[MIME type]" href="[URL]" title="Full Text ([mime type])" />

The values enclosed in '[]' depend on the file to which the link is being made. Note that the 'title' attribute is optional but, because some browsers display it as link information, it helps to differentiate between the manifestations of the eprint. For example:

  <link rel="alternate" class="fulltext" type="text/html" href="http://eprints.bath.ac.uk/12345.html" title="Full Text (text/html)" />
  <link rel="alternate" class="fulltext" type="application/pdf" href="http://eprints.bath.ac.uk/12345.pdf" title="Full Text (application/pdf)" />

The usage in HTML is the same but without the '/' before the closing '>'.
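On the harvesting side, links encoded this way could be extracted along the following lines. This is a sketch only, using Python's standard html.parser module; the class name FullTextLinkParser, the function fulltext_links and the sample page (including its stylesheet link and title) are assumptions made for the example. It also assumes that the href values are absolute URLs, as in the examples above.

```python
from html.parser import HTMLParser


class FullTextLinkParser(HTMLParser):
    """Collect (MIME type, URL) pairs from <link> elements that carry
    rel="alternate" and class="fulltext"."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Also called for the XHTML form <link ... />, because the default
        # handle_startendtag() delegates to handle_starttag().
        if tag != "link":
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        class_values = (attributes.get("class") or "").split()
        href = attributes.get("href")
        if "alternate" in rel_values and "fulltext" in class_values and href:
            self.links.append((attributes.get("type"), href))


def fulltext_links(page_html):
    """Return the full-text links found in a jump-off page."""
    parser = FullTextLinkParser()
    parser.feed(page_html)
    parser.close()
    return parser.links


# Hypothetical jump-off page; only the two 'fulltext' links follow the
# recommendation, the stylesheet link is there to show it is ignored.
PAGE = """<html><head>
<title>Example eprint</title>
<link rel="stylesheet" type="text/css" href="/style.css" />
<link rel="alternate" class="fulltext" type="text/html" href="http://eprints.bath.ac.uk/12345.html" title="Full Text (text/html)" />
<link rel="alternate" class="fulltext" type="application/pdf" href="http://eprints.bath.ac.uk/12345.pdf" title="Full Text (application/pdf)" />
</head><body></body></html>"""

for mime_type, url in fulltext_links(PAGE):
    print(mime_type, url)
```

Because the parser keys on both rel="alternate" and class="fulltext", other <link> elements in the page (stylesheets, alternate language versions and so on) are left out of the result.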

Using qualified Dublin Core

An alternative solution to the one proposed here would be to encourage eprint archives to expose a richer metadata format than simple DC ('oai_dc'). Such a format could be based on qualified Dublin Core and make use of the existing refinements of dc:relation to encode a URL for each of the manifestations of the eprint. For example:

  <dcterms:hasFormat xsi:type="dcterms:URI">http://eprints.bath.ac.uk/12345.html</dcterms:hasFormat>
  <dcterms:hasFormat xsi:type="dcterms:URI">http://eprints.bath.ac.uk/12345.pdf</dcterms:hasFormat>

The proposal in this document should be seen as complementary to (rather than in competition with) alternative approaches based on richer metadata formats.
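For comparison, a service provider receiving the qualified Dublin Core fragment shown above could read the manifestation URLs directly from the record. The following Python sketch assumes a hypothetical wrapper element named 'record' (the real container would depend on the metadata format the archive exposes) and a hypothetical helper named has_format_urls.

```python
import xml.etree.ElementTree as ET

DCTERMS = "http://purl.org/dc/terms/"
XSI = "http://www.w3.org/2001/XMLSchema-instance"

# The <record> wrapper is an assumption made so that the fragment is
# well-formed XML with its namespace prefixes declared.
QDC_RECORD = f"""\
<record xmlns:dcterms="{DCTERMS}" xmlns:xsi="{XSI}">
  <dcterms:hasFormat xsi:type="dcterms:URI">http://eprints.bath.ac.uk/12345.html</dcterms:hasFormat>
  <dcterms:hasFormat xsi:type="dcterms:URI">http://eprints.bath.ac.uk/12345.pdf</dcterms:hasFormat>
</record>
"""


def has_format_urls(record_xml):
    """Return the URLs given in dcterms:hasFormat elements typed as dcterms:URI."""
    root = ET.fromstring(record_xml)
    urls = []
    for el in root.iter(f"{{{DCTERMS}}}hasFormat"):
        # Keep only values explicitly typed as URIs.
        if el.get(f"{{{XSI}}}type") == "dcterms:URI" and el.text:
            urls.append(el.text.strip())
    return urls


for url in has_format_urls(QDC_RECORD):
    print(url)
```

In this fragment each element carries only a URL, so a harvester would still need another way to learn the format of each manifestation.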

Conclusion

If the mechanism proposed here to use the (X)HTML <link> element is widely adopted, OAI service providers will be able to:

References

  1. Using simple Dublin Core to describe eprints
    <http://www.rdn.ac.uk/projects/eprints-uk/docs/simpledc-guidelines/>
  2. Dublin Core Metadata Initiative
    <http://dublincore.org/>
  3. Lagoze, C., Van de Sompel, H., Nelson, M., Warner, S. (eds.), 2002, The Open Archives Initiative Protocol for Metadata Harvesting
    <http://www.openarchives.org/OAI/openarchivesprotocol.html>
  4. HyperText Markup Language (HTML)
    <http://www.w3.org/MarkUp/>
  5. Dublin Core Metadata Element Set, Version 1.1: Reference Description
    <http://dublincore.org/documents/dces/>