
OAI-PMH Harvesting


The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a protocol for pulling metadata records from repositories that have enabled it. At a minimum, the records are transferred in a simple (unqualified) Dublin Core XML format. The protocol is open and well documented, and many tools are available that can perform metadata harvests.

Records are requested through a URL that has been set up by the repository. The request can be refined by “Set” (a sub-collection) and by “From” and “Until” dates, which limit the harvest to records added or modified within a period of time. Some repositories publish their OAI-PMH URLs, but others may only release them upon request.
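As a rough illustration of how such a request is put together, the sketch below builds a ListRecords URL in Python. The `verb`, `metadataPrefix`, `set`, `from`, and `until` parameters come from the OAI-PMH specification; the function name, the base URL `https://example.org/oai`, and the set name `maps` are made up for this example and do not refer to a real repository.

```python
from urllib.parse import urlencode


def build_list_records_url(base_url, metadata_prefix="oai_dc",
                           set_spec=None, from_date=None, until_date=None):
    """Build an OAI-PMH ListRecords request URL.

    base_url is the repository's OAI-PMH URL (hypothetical in this example).
    The optional arguments narrow the harvest to a set and/or a date range.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec        # harvest one sub-collection only
    if from_date:
        params["from"] = from_date      # lower bound, e.g. "2015-01-01"
    if until_date:
        params["until"] = until_date    # upper bound, e.g. "2015-12-31"
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    # Placeholder repository URL and set name, for illustration only.
    print(build_list_records_url(
        "https://example.org/oai",
        set_spec="maps",
        from_date="2015-01-01",
        until_date="2015-12-31",
    ))
```

Pasting the resulting URL into a browser is often enough to see whether a repository responds. The harvesting tools described below build similar requests on the user's behalf.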

The records are harvested as Dublin Core XML, but they use a dedicated schema (http://www.openarchives.org/OAI/2.0/oai_dc.xsd), prefix (oai_dc), and namespace (http://www.openarchives.org/OAI/2.0/oai_dc/). The records may need to be transformed before they can be imported into a metadata editor.
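To show what the dedicated namespace means in practice, here is a minimal sketch that reads the Dublin Core fields out of one oai_dc record using Python's standard library. The sample record is invented for the example. The `dc` element namespace (http://purl.org/dc/elements/1.1/) is not mentioned above, but it is the standard Dublin Core elements namespace that oai_dc records use for their fields. The function name `dc_fields` is likewise just an illustration.

```python
import xml.etree.ElementTree as ET

OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Invented sample record, shaped like the metadata part of a harvested record.
SAMPLE = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Sample Map</dc:title>
  <dc:creator>Example Author</dc:creator>
  <dc:subject>Geology</dc:subject>
  <dc:subject>Maps</dc:subject>
</oai_dc:dc>"""


def dc_fields(xml_text):
    """Return {element name: [values]} for the first oai_dc record found."""
    root = ET.fromstring(xml_text)
    # The oai_dc:dc container may be the root or nested inside a full response.
    if root.tag == f"{{{OAI_DC_NS}}}dc":
        container = root
    else:
        container = root.find(f".//{{{OAI_DC_NS}}}dc")
    fields = {}
    if container is None:
        return fields
    for child in container:
        # Keep only elements in the Dublin Core elements namespace.
        if child.tag.startswith(f"{{{DC_NS}}}"):
            name = child.tag.split("}", 1)[1]
            fields.setdefault(name, []).append((child.text or "").strip())
    return fields


if __name__ == "__main__":
    for name, values in dc_fields(SAMPLE).items():
        print(name, values)
```

Flattening a record into a simple dictionary like this is one possible first step before mapping the fields to another standard. An XSLT, as used by some of the tools below, is the more common route.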

Here are brief descriptions of some harvesting tools:

OAI PMH Validator: This is a website that allows the user to validate an OAI-PMH URL, shows the available sets, allows a full download of the records, and more. This site is a very easy way to check the repository, and a quick way to download the raw files.

MarcEdit: This desktop application can harvest OAI-PMH records as raw files or apply a transformation to a chosen standard. The user will need to enter the OAI-PMH URL and the name of the set. When importing records, the default transformation options are MARC or MODS, but the user can add XSLTs for other formats. This editor offers many batch editing functions for MARC records, so the tool is most useful when MARC is being used as a transitional or output metadata standard.

Omeka Plugin: OAI-PMH Harvester: This is an add-on for the Omeka platform. Omeka can be installed on a server or run as a hosted version, and the hosted version also offers the harvesting plugin. After the user enters the URL, the application displays the list of sets. The user selects a set, and the records are pulled into the database and assigned to a new collection. Omeka is built on a MySQL database: the records are imported using PHP, and the metadata is inserted into database fields. This tool works well for editing Dublin Core Extended records, which can be exported as XML or JSON files. There are also a few scripts available for exporting records as CSV. Finally, a second plugin, OAI-PMH Repository, can be added to the application to expose the records for harvesting by another application.


GeoNetwork: This application offers several methods of harvesting records, including OAI-PMH, Catalogue Service for the Web (CSW), THREDDS, Z39.50, and others. Although GeoNetwork is an editor centered on ISO standards, it has incorporated templates for Dublin Core. As with Omeka, the user enters the OAI-PMH URL, and the application displays the list of sets. The records are imported as XML and can be assigned to a category. GeoNetwork will store the raw XML records, but it will only display Simple Dublin Core. This means that any fields in the transferred records that fall outside the Simple schema can only be accessed by downloading the record. (Note: As of this writing, we encountered several problems with using GeoNetwork for harvested Dublin Core records. This will be addressed in the next post.)