Geneva protocol

  1. This document was initially drafted by Thomas Krichel. It was discussed by José Manuel Barrueco Cruz, Antonella De Robbio, Thomas Krichel and Imma Subirats Coll in Geneva, Switzerland on 2002-10-19. It is not yet an official document of rclis and may change at any moment. This version has benefited from comments by Angela Cornwell and Ivan Kurmanov.
  2. The Geneva protocol is preceded by two distinct but interoperable efforts. These are
    • the proposal for an Academic Metadata Format (AMF),
    • the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH).
    The Geneva protocol also draws upon the experience with the Guildford Protocol as used by the RePEc project.
  3. This document assumes that the reader has basic familiarity with AMF, especially the concepts of "AMF noun instance", and "AMF record".
  4. The Geneva protocol combines an identification strategy with a repository strategy for the rclis collection. The identification strategy sets out the syntax of identifiers and the process of identification. The repository strategy sets out how to store the data in a repository so that it can be harvested by interested parties. In principle, these two issues are different. However, it is convenient to have a single document that covers both.
  5. rclis is a dataset put together through a decentralized effort. An archive is an individual dataset that is contributed to this collective effort. Usually an archive is stored on the hard disk of a public-access computer system. Usually all the data in an archive are maintained by the same person or organization.
  6. Each archive must provide an AMF record about itself that states, most importantly, where it is located. Optionally, an archive may provide AMF records on series, channels, groups, documents, persons, institutions and citations. We define these terms in the following paragraphs.
  7. A document is an object that is described with the AMF text noun.
  8. A channel is a grouping of documents by the way that they appear in the world. An academic journal or a collection of monographs are examples of channels. In an ideal world, rclis would obtain AMF data directly from those people who control channels, i.e. directly from the provider of the document.
  9. A series is a collection of AMF text nouns that describes documents within an archive. We assume that all documents described by an archive are grouped into series, and that all records in a series belong to the same channel.
  10. A group is any collection of documents that is not a channel. For example, all documents that have the same classification code, or all documents that have been announced in the same current awareness mailing list, can be considered a group.
  11. This version of this document does not deal with citations. How to deal with citations will have to be added later.
  12. rclis contents will be encoded in AMF. The relevant nouns are

    rclis entity                         AMF noun
    archives, series, channels, groups   collection
    documents, citations                 text
    persons                              person
    institutions                         organization

  13. Archives will either be file archives or OAI archives.
    • An OAI archive is an archive that has the OAI-PMH interface, version 2.
    • The structure of a file archive is set out in this document. rclis will develop an external gateway that will export all its contents using the OAI protocol.
  14. The identifier of an archive shall match the Perl regular expression /^rclis:[a-z]{3}$/. For reference below, let $archive_handle be the handle of an archive. Within the archive handle, the component "[a-z]{3}" will be referred to as $archive_code. Thus each archive has a code and a handle, and the handle is formed by prefixing the code with the string "rclis:".
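
The handle and code rules above can be checked mechanically. The following is a minimal Python sketch; the function names are illustrative assumptions and are not part of the protocol.

```python
import re

# Pattern from the text: "rclis:" followed by exactly three lowercase letters.
ARCHIVE_HANDLE_RE = re.compile(r"^rclis:([a-z]{3})$")


def code_from_handle(archive_handle):
    """Return $archive_code for a valid $archive_handle, else raise ValueError."""
    match = ARCHIVE_HANDLE_RE.match(archive_handle)
    if match is None:
        raise ValueError("not an archive handle: %r" % archive_handle)
    return match.group(1)


def handle_from_code(archive_code):
    """Form the handle by prefixing the code with "rclis:"."""
    handle = "rclis:" + archive_code
    code_from_handle(handle)  # raises ValueError if the code is malformed
    return handle
```

For instance, `code_from_handle("rclis:per")` yields the code `per`, and `handle_from_code("org")` yields the handle `rclis:org`.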
  15. A file archive lives on a single web-accessible directory. Some files within this directory contain embedded AMF data. The names of these files must match the Perl regular expression /\.amf\.xml$/.
  16. The "id" attribute of all AMF nouns in the same archive must start with $archive_handle.
  17. Each file archive must contain a file $archive_code.amf.xml. This file is known as the archive file. It contains a single AMF collection noun, the id attribute of which is $archive_handle. This collection noun instance is called the archive element. The "accesspoint" child element of the archive element must contain the URL at which the archive can be accessed. This must be an http or ftp URL. Therefore, the archive must be accessible via http or anonymous ftp. (Note: AMF will have to be augmented with such an element to make this really work.)
  18. The archive element may have one or more collection nouns as children, linked by the "haspart" verb. The "id" attribute of these collection nouns must match the regular expression /^$archive_handle:[a-z]{6}$/. Such an id attribute is called a series handle. Each of these collection nouns describes a series. Let $series_handle be a series handle.
  19. The id attribute of every text noun must start with the $series_handle of the series it belongs to, followed by a colon. Recall that all documents must belong to a series.
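
As a rough illustration of the file naming rules, the sketch below filters a directory listing for AMF files and checks that the mandatory archive file is present. It assumes the listing is already available as a list of file names. The function names and the sample names are hypothetical.

```python
import re

# File-name pattern from the text: names ending in ".amf.xml".
AMF_FILE_RE = re.compile(r"\.amf\.xml$")


def amf_files(file_names):
    """Return the names that may carry embedded AMF data."""
    return [name for name in file_names if AMF_FILE_RE.search(name)]


def archive_file_name(archive_code):
    """Name of the mandatory archive file, $archive_code.amf.xml."""
    return archive_code + ".amf.xml"


def has_archive_file(file_names, archive_code):
    """True if the listing contains the archive file for this code."""
    return archive_file_name(archive_code) in file_names
```

With a hypothetical listing `["abc.amf.xml", "papers.amf.xml", "readme.txt"]`, `amf_files` keeps the first two names, and `has_archive_file(listing, "abc")` is true.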
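
The two constraints on series handles and text noun identifiers might be checked as follows. This is a sketch only: the function names and the sample identifiers in the usage note are invented for illustration.

```python
import re


def is_series_handle(handle, archive_handle):
    """Check a handle against /^$archive_handle:[a-z]{6}$/."""
    # re.escape keeps the archive handle literal inside the pattern.
    pattern = "^" + re.escape(archive_handle) + ":[a-z]{6}$"
    return re.match(pattern, handle) is not None


def belongs_to_series(text_id, series_handle):
    """Check that a text noun id starts with the series handle and a colon."""
    return text_id.startswith(series_handle + ":")
```

For a hypothetical archive `rclis:abc`, `is_series_handle("rclis:abc:wpaper", "rclis:abc")` is true, and `belongs_to_series("rclis:abc:wpaper:0001", "rclis:abc:wpaper")` is true as well.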
  20. rclis uses a document identifier scheme derived from the SICI. This is the rclci, pronounced "wrecky". rclis cannot use the ISSN component of the SICI, because ISSNs are not available for free public lookup. In addition, the rclis management sees no need for the control segment of the SICI. However, the values of all "id" attributes of all nouns must comply with the URN naming conventions. Therefore rclci identifiers replace the ISSN with the $series_handle, followed by a colon, and encode any special characters so that the result follows the URN syntax. Every id attribute of a text noun should follow the rclci syntax.
  21. For OAI archives, none of the file naming conventions apply, but the constraints on and the recommendations for the "id" attributes apply in exactly the same way.
  22. A single master file archive will be found at the URL http://rclis.org/all. It has a different structure from any other file archive. It contains a mirrored copy of $archive_code.amf.xml for every $archive_handle known to rclis. For file archives, it will mirror the archive file. For OAI archives, it will take the AMF element returned by verb=GetRecord&metadataPrefix=amf&identifier=$archive_handle and store it in $archive_code.amf.xml.
  23. A special file archive will be set up with the handle rclis:org, for the collection of organizational data. This archive must contain only organizational data in the form of AMF organization nouns. All organization nouns will have an id attribute of the form /^$archive_handle:orga:[a-z]{7}$/. This handle should bear little resemblance to the actual name of the institution.
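
To illustrate the OAI case, the sketch below builds the GetRecord request for an archive's own record and derives the name of the mirror file. The base URL in the usage note is a placeholder assumption, since the text does not specify where OAI archives live. The function names are likewise illustrative.

```python
from urllib.parse import urlencode


def get_record_url(base_url, archive_handle):
    """Build the OAI-PMH GetRecord request for the archive's AMF record."""
    query = urlencode({
        "verb": "GetRecord",
        "metadataPrefix": "amf",
        "identifier": archive_handle,  # the colon is percent-encoded as %3A
    })
    return base_url + "?" + query


def mirror_file_name(archive_handle):
    """Name under which the master archive stores the record."""
    archive_code = archive_handle.split(":", 1)[1]
    return archive_code + ".amf.xml"
```

For a hypothetical base URL `http://archive.example.org/oai` and the handle `rclis:abc`, the request ends in `identifier=rclis%3Aabc` and the record would be stored as `abc.amf.xml`.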
  24. A special personal archive will be set up with the handle rclis:per. This archive must only contain personal data in the form of AMF person nouns. All person nouns will have an id attribute of the form /^$archive_handle:pers:\d{4}-\d{2}-\d{2}:[a-z_]+/, where the numeric expression is a valid date. This date should be a date in the person's lifetime.
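
A pattern match alone does not guarantee that the numeric part is a valid date, so a checker needs a second step. The sketch below is one possible approach, with an illustrative function name. It cannot verify that the date falls within the person's lifetime.

```python
import re
from datetime import date

# Pattern from the text, with $archive_handle expanded to "rclis:per".
PERSON_ID_RE = re.compile(r"^rclis:per:pers:(\d{4})-(\d{2})-(\d{2}):[a-z_]+")


def is_person_id(identifier):
    """True if the id matches the pattern and carries a valid calendar date."""
    match = PERSON_ID_RE.match(identifier)
    if match is None:
        return False
    year, month, day = (int(part) for part in match.groups())
    try:
        date(year, month, day)  # raises ValueError for dates such as 02-30
    except ValueError:
        return False
    return True
```

A made-up identifier such as `rclis:per:pers:1965-01-01:john_smith` passes, while `rclis:per:pers:1965-02-30:john_smith` fails because 30 February does not exist.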
  25. A special channel archive with the handle rclis:cha will build channel data. Channel data will associate one or more series with a channel. Each channel will be identified by an id that matches the regular expression /^$archive_handle:chan:[0-9]{6}$/.
  26. The channel archive will keep authoritative data for a channel and link it to the series that describe it. It may link the descriptive record of a channel to zero or more series. Here is a fictitious example that shows data from two series combined to form the data for a channel.
  27. The keeper of the rclis:cha archive serves as a general authority over all document data. While archives are free to propose any document data that they wish to, it is likely that only the series that are linked to a channel will be used by user services. Thus, rclis:cha serves as a clearing house. Deduplication will have to be handled at the level of the series, with rclis:cha playing a crucial intermediating role.
  28. A group will be identified by /^$archive_handle:grou:[a-z0-9]{8}$/. Several archives may create and manage group descriptions. There is no central registry of group descriptions.
  29. In any handle that matches /^rclis:[a-z]{3}:[a-z]{4}:/, the /[a-z]{3}/ part is known as the archival component of the handle and the /[a-z]{4}/ part is known as the natural component of the handle.
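
The two components can be extracted with the pattern given above. This is a small sketch with an illustrative function name. It returns None for handles that do not have this shape, such as series handles.

```python
import re

# Pattern from the text, with capture groups added around both components.
COMPONENTS_RE = re.compile(r"^rclis:([a-z]{3}):([a-z]{4}):")


def handle_components(handle):
    """Return (archival component, natural component), or None if no match."""
    match = COMPONENTS_RE.match(handle)
    if match is None:
        return None
    return match.group(1), match.group(2)
```

For example, `handle_components("rclis:cha:chan:000123")` gives the pair `("cha", "chan")`.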
  30. Channels, persons, documents and institutions may be subject to blessed handling.
  31. For channels, persons and institutions, a handle is blessed when its archival component and its natural component are removed. The blessed handle is independent of the archive that provided the information and of the nature of the identified object. A handle that is not blessed is called an unblessed handle.
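
One possible reading of this rule is sketched below. It assumes that the "rclis:" prefix is kept in the blessed handle, which the text does not state explicitly. It also rejects group handles, since this document says group identifiers are not subject to blessing. The function name and sample identifiers are illustrative.

```python
import re

# Archival component, natural component, then the rest of the handle.
HANDLE_RE = re.compile(r"^rclis:[a-z]{3}:([a-z]{4}):(.+)$")

# Natural components of channels, persons and institutions.
BLESSABLE = {"chan", "pers", "orga"}


def bless(handle):
    """Drop the archival and natural components from an unblessed handle.

    Assumption: the "rclis:" prefix is retained in the blessed form.
    """
    match = HANDLE_RE.match(handle)
    if match is None or match.group(1) not in BLESSABLE:
        raise ValueError("handle cannot be blessed: %r" % handle)
    return "rclis:" + match.group(2)
```

Under this assumption, the channel handle `rclis:cha:chan:000123` would be blessed to `rclis:000123`.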
  32. For documents, the handle is blessed if the $series_handle component of the rclci is replaced with the channel handle.
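
For documents, blessing amounts to swapping one prefix for another. The sketch below assumes the caller already knows which channel the series is linked to. Whether the channel handle should itself be in blessed form is not settled by the text, so the function uses whatever channel handle it is given. The function name and the identifiers in the usage note are placeholders.

```python
def bless_document_id(document_id, series_handle, channel_handle):
    """Replace the $series_handle prefix of a document id with a channel handle."""
    prefix = series_handle + ":"
    if not document_id.startswith(prefix):
        raise ValueError("id does not belong to series %r" % series_handle)
    # Keep everything after the series handle and its colon unchanged.
    return channel_handle + ":" + document_id[len(prefix):]
```

For example, the placeholder id `rclis:abc:wpaper:0001` in series `rclis:abc:wpaper`, linked to channel `rclis:cha:chan:000123`, becomes `rclis:cha:chan:000123:0001`.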
  33. Services for end-users are encouraged to use blessed handles if they can trace back the blessed handle to an unblessed one. Blessed handles are more compact than unblessed handles. The passage from blessed to unblessed handles may be re-engineered at a later stage without affecting the blessed handles. Group identifiers will not be subject to blessing.