Skip to main content

Harvesting from CKAN and Sorting Adjacent Key Value Columns


Many of the open data portals are built on the open source application CKAN.  Metadata can be harvested from these portals using the CKAN API.  Many CKAN items include numerous resource URLs, including download links of varying formats, landing page links, web services, and applications. Sorting through myriad of links can be challenging. This post describes how to:


  1. Harvest metadata with the ckan-exporter script
  2. Use OpenRefine to sort the resource URLs.

Part 1: Harvest metadata with the ckan-exporter script

The CKAN API has a number of calls that will return information such as a list of items, tags, or organizations.  It will also return the item's metadata, also described as a package in the API calls.  The ckan-exporter script allows the user to define a list of desired metadata elements that can be harvested via the command line. The readme file includes documentation for how to set up the harvest files and examples are included.
The BTAA GDP fork of the ckan-exporter can be found here.  This fork includes two custom json files that specify which metadata elements will be returned.  Each CKAN site uses the metadata schema slightly differently and the json file should be edited to match the terminology.  To inspect the schema, append this string to the end of the CKAN instance's base URL: /api/3/action/current_package_list_with_resources


To run the script, open the command line and change the directory into the ckan-exporter folder. Then, at the prompt, type the word bash and hit return.,
Next, insert the following phrase, but with the text in red replaced for your project:
ckanapi-exporter --url 'baseurlofsite' --columns nameofcustom.json > nameofcsvfiletocreate.csv


example:

bash-3.2$ ckanapi-exporter --url 'https://www.opendataphilly.org' --columns columns_philly.json > 08c-01.csv

This will produce a CSV of metadata.  However, the formats and the resource URLs will be multivalued fields that need to be sorted. The name of each format will correspond in order to a URL in the next column.

Part 2: Use OpenRefine to sort the resource URLs


  • Start OpenRefine and upload the CSV to create a project.
  • Click the arrow next to Formats and select Edit Cells > Split multivalued cells.
  • Split into multivalued cells again for the Links column.



  • For each record, there will now be nested lists of formats and links that correspond on a one to one order.

  • Click the arrow next to Formats and select Transpose > Columnize by Key/Value columns

  • On the popup screen, make sure Formats is selected for the Key and Links for the Values.


  • Click OK.  The links will now be sorted under columns by format. Export the project and finish editing in a spreadsheet editor.