Deleting a node in a batch of XML files using Oxygen


Issue: Keywords in a group of records within GeoNetwork were too specific and not normalized.

Challenge: Editing each record individually would be time-consuming, since many keywords would have to be deleted one by one.  Additionally, the keywords were grouped in sections, and some cannot be deleted easily from the interface.  Our project's CSW update process was adding new keywords but not deleting existing ones, nor could it address the messy structure of the keyword sections.

Solution: Download the records, delete the entire keyword node using Oxygen, re-upload, and perform a CSW update to insert new keywords.

-----
Overview

This post focuses on the utility in Oxygen for deleting a node in an XML file.  I decided to use this technique because I had a group of ISO 19139 files where the keyword section was corrupted with multiple levels of nesting, and the keywords themselves were not aligned with our preferred metadata vocabulary.  Deleting the sections manually in GeoNetwork did not always work, so I wanted to "start fresh."  I already had a spreadsheet ready with the replacement keywords.

Our online metadata editor, GeoNetwork, exports records as MEF files.  This is a zipped format that includes subfolders.  Each subfolder is named by the record's UUID.  A sub-subfolder named metadata contains a file called metadata.xml.  This means that all of the ISO 19139 metadata files have the same name.  Therefore, working with a batch of exported records requires a process that is recursive, that is, one that looks through every subfolder in a directory.
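To make the folder layout concrete, here is a minimal Python sketch that walks an unzipped export and pairs each record's UUID with its metadata.xml.  It assumes the layout described above (UUID folder, then metadata, then metadata.xml); the function name find_metadata_files is made up for illustration and is not part of GeoNetwork or Oxygen.

```python
from pathlib import Path


def find_metadata_files(export_root):
    """Map each record's UUID (its folder name) to the path of its metadata.xml.

    Assumes the layout <export_root>/.../<UUID>/metadata/metadata.xml.
    """
    records = {}
    # rglob searches every subfolder, so the nesting depth does not matter.
    for xml_path in sorted(Path(export_root).rglob("metadata.xml")):
        # The file sits in "metadata", whose parent folder is named by the UUID.
        uuid = xml_path.parent.parent.name
        records[uuid] = xml_path
    return records
```

Calling find_metadata_files on the unzipped download folder returns one entry per record, which is a quick way to confirm that the export contains the number of records you expected.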

We want to delete an entire node, so we need to indicate its XPath.  For an ISO 19139 file, the keywords section is the gmd:descriptiveKeywords element.
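As an illustration of what that XPath selects, the sketch below uses Python's standard xml.etree.ElementTree on a small, made-up ISO 19139 fragment.  The SAMPLE record and the two helper functions are hypothetical; the gmd and gco namespace URIs are the standard ISO 19139 ones.  ElementTree supports only a subset of XPath, so the document-wide // is written as .// relative to the root element.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map for the ISO 19139 namespaces used in the queries below.
NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}

# A tiny, invented record with two keyword sections (for illustration only).
SAMPLE = """<gmd:MD_Metadata xmlns:gmd="http://www.isotc211.org/2005/gmd"
                 xmlns:gco="http://www.isotc211.org/2005/gco">
  <gmd:identificationInfo>
    <gmd:MD_DataIdentification>
      <gmd:descriptiveKeywords>
        <gmd:MD_Keywords>
          <gmd:keyword><gco:CharacterString>lakes</gco:CharacterString></gmd:keyword>
        </gmd:MD_Keywords>
      </gmd:descriptiveKeywords>
      <gmd:descriptiveKeywords>
        <gmd:MD_Keywords>
          <gmd:keyword><gco:CharacterString>rivers</gco:CharacterString></gmd:keyword>
        </gmd:MD_Keywords>
      </gmd:descriptiveKeywords>
    </gmd:MD_DataIdentification>
  </gmd:identificationInfo>
</gmd:MD_Metadata>"""


def keyword_sections(root):
    """Return every gmd:descriptiveKeywords element below root."""
    return root.findall(".//gmd:descriptiveKeywords", NS)


def keyword_values(root):
    """Return the text of every keyword found inside a keyword section."""
    values = []
    for section in keyword_sections(root):
        for node in section.findall(".//gmd:keyword/gco:CharacterString", NS):
            values.append(node.text)
    return values


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    print(len(keyword_sections(root)), keyword_values(root))
```

Listing the keyword values before deleting anything is a cheap way to check that the XPath matches the sections you intend to remove.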


Steps
1. In GeoNetwork, select records for export, choose export Zip, and unzip the download.
2. Open Oxygen and select "Find" from the top menu.
3. From the dropdown, pick Find/Replace in Files.
4. Fill in these values:

  • Text to find: .*
  • Check the box next to Regular expression
  • Restrict to XPath: //gmd:descriptiveKeywords
  • Under Scope, select Specified path: and navigate to the folder
  • Include files: *
  • Check the box next to Recurse subdirectories
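For readers without Oxygen, the same batch deletion can be approximated with a short script.  The following is a sketch in Python's standard library, not the method I used; the function names are invented for illustration.  It assumes every file is named metadata.xml, as in the MEF export.  Note that ElementTree drops XML comments and may rewrite namespace prefixes it has not been told about, so run it on a copy of the export first.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

GMD = "http://www.isotc211.org/2005/gmd"
GCO = "http://www.isotc211.org/2005/gco"
KEYWORDS_TAG = "{%s}descriptiveKeywords" % GMD

# Keep the familiar prefixes on output. Other namespaces in your records
# (an assumption: gml, xlink, and so on) would need registering the same way,
# otherwise ElementTree writes generated prefixes such as ns0.
ET.register_namespace("gmd", GMD)
ET.register_namespace("gco", GCO)


def strip_keywords(element):
    """Remove every gmd:descriptiveKeywords below element; return the count."""
    removed = 0
    for child in list(element):  # copy the child list, since we modify it
        if child.tag == KEYWORDS_TAG:
            element.remove(child)  # removes the section and everything in it
            removed += 1
        else:
            removed += strip_keywords(child)
    return removed


def strip_keywords_in_folder(export_root):
    """Process every metadata.xml under export_root, rewriting files in place."""
    results = {}
    for xml_path in sorted(Path(export_root).rglob("metadata.xml")):
        tree = ET.parse(str(xml_path))
        removed = strip_keywords(tree.getroot())
        if removed:  # leave files without keyword sections untouched
            tree.write(str(xml_path), encoding="utf-8", xml_declaration=True)
        results[str(xml_path)] = removed
    return results
```

The returned dictionary gives the number of sections removed per file, which makes it easy to spot records that had no keyword section at all.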

---
This effectively and quickly deleted all of the keyword sections in ~250 files.  I could then re-upload the files to GeoNetwork and use a CSW Update process to insert the new keywords.  (Note: more information about the CSW Update can be found here.)