
Using sentence-case for keywords in OpenRefine

Issue

Capitalization and pluralization of ingested keywords vary. As a result, our keyword list in GeoBlacklight is somewhat messy and contains near duplicates.

Challenge

Our instance of Solr for GeoBlacklight indexes Dog, dog, and dogs as three separate keywords.

Solution

Use OpenRefine to normalize keywords before importing to Solr.

Description

As we aggregate metadata records from multiple sources, we found that the keywords need attention. The GIS records have keyword groups that may or may not come from a thesaurus, but frequently come from the TAGS field in ArcGIS Open Data Portals. As a result, the keywords are often just regional acronyms or abbreviations and have many spelling variants.

We also anticipate combining our metadata records with those made at other institutions outside of the Big Ten Academic Alliance Geospatial Data Project.  After reviewing records from other universities and consulting the RDA rules on capitalization, we decided to convert theme keywords to a sentence-type case, where only the first letter of the first word is upper case.

OpenRefine can remedy these problems quickly with just a few commands.


Process:


1. First, export all of our ISO 19139 XML records from GeoNetwork into a spreadsheet and save it as a CSV.

2. Create a project in OpenRefine by uploading the CSV.

3. Select the arrow next to the header for the theme keywords and choose Edit cells - Split multi-valued cells.


4. Enter the separator used within the cell. In our case, this is the three-character string ###.

5. Select the arrow next to the header again and select Edit cells - Transform.


6. In the box that appears, enter the following GREL (General Refine Expression Language) expression:


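toUppercase(substring(value,0,1))+toLowercase(substring(value,1))

This upper-cases the first character of each keyword and lower-cases the rest. Note: the expression also lower-cases any proper nouns and acronyms after the first character, so if the theme keywords contain any whose capitalization you want to retain, you will need to correct those separately.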

(Above code is thanks to http://keithmaguire.net/articles/open-refine-short-recipes.html#firstletter)

7. Next, we can clean up some of the keywords by clustering. Select Edit cells - Cluster and edit.


8. This groups similar keywords together and gives you the option to merge them into a selected value. There are a few built-in clustering options to experiment with to find more clusters. Clustering will help you find spelling mistakes, pluralization variations, and similar phrases.



9. For each cluster you want to merge, manually select the box under Merge? and then accept or type in a value in the right-hand column to change the keywords.

10. When completed, select Edit cells - Join multi-valued cells and enter the same separator used in step 4.


11.  The last step is to export your project back to a CSV and reload the new keywords into GeoNetwork.
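
For readers who want to see the logic of the steps above outside OpenRefine, here is a minimal Python sketch. It is not part of our workflow and does not reproduce OpenRefine's exact behavior. The function names (sentence_case, normalize_cell, fingerprint, cluster) are hypothetical, and fingerprint is only a rough approximation of a key-collision style of clustering, not OpenRefine's own algorithm. The ### separator is the one described in step 4.

```python
import string
import unicodedata
from collections import defaultdict

SEPARATOR = "###"  # the separator our keyword cells use (step 4)


def sentence_case(keyword: str) -> str:
    """Mirror the GREL expression: upper-case the first character, lower-case the rest."""
    return keyword[:1].upper() + keyword[1:].lower()


def normalize_cell(cell: str) -> str:
    """Split a multi-valued cell, sentence-case each keyword, and join it again."""
    keywords = [k.strip() for k in cell.split(SEPARATOR) if k.strip()]
    return SEPARATOR.join(sentence_case(k) for k in keywords)


def fingerprint(keyword: str) -> str:
    """Hypothetical clustering key: ignores case, accents, punctuation, and word order."""
    text = unicodedata.normalize("NFKD", keyword.strip().lower())
    text = text.encode("ascii", "ignore").decode("ascii")
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(text.split())))


def cluster(keywords):
    """Group keywords that share a fingerprint; return only groups with variants."""
    groups = defaultdict(set)
    for keyword in keywords:
        groups[fingerprint(keyword)].add(keyword)
    return [sorted(group) for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    print(normalize_cell("DOG###land Use### parcels"))
    print(cluster(["Land use", "land use", "Use, land", "Dogs"]))
```

A key of this kind only catches variants in case, punctuation, and word order. Plural forms such as Dog and Dogs produce different keys, so they would still need one of the other clustering options or a manual review, as in steps 8 and 9.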