Issue: A group of FGDC metadata files was submitted in text format (Word documents).
Challenge: We need the metadata to be in XML format.
Solution: Use the geospatial metadata tools found on the USGS website: http://geology.usgs.gov/tools/metadata/
-----------------------------------------------------
Today's post is about how to convert FGDC metadata stored as a text file into an XML file. I expect this to be a common issue for our project, because older metadata files were often stored as text documents. Thankfully, Peter Schweitzer from the USGS has provided us with easy-to-use tools that will turn them into XML files.
The following steps show the transformation process. Note that this is only for metadata that is already in the FGDC format. The final result is an FGDC XML metadata file that can be transformed to ISO using methods described in earlier posts.
1. Review the raw metadata files
The submitted metadata files were Word documents in FGDC format from the 1990s. The metadata was organized using the section numbers for the component elements. FGDC uses a multi-level numbering system (1; 1.1, 1.2, 1.3; 1.1.1; etc.). These numbers are not usually referenced directly in the metadata anymore, but some older text files will include them.
original submitted metadata file
2. Cleanup with a text editor
Before the files could be transformed, they needed to be cleaned up in a text editor. For this process, it is helpful to consult the FGDC graphical representation of the standard. I used this reference for fixing element names and compound structures.
This was an iterative process, whereby I uploaded test files to the USGS editing and transformation tools (see next steps), looked for errors, edited the file, and tried it again. For this particular batch of metadata files, the following issues needed to be fixed:
- Change the extension from .doc to .txt
- Remove parentheses around section numbers.
- Verify the names of section headers: Change the name of section 7 from “METADATA REFERENCE SECTION” to “METADATA REFERENCE INFORMATION.”
- Manually edit certain compound elements:
- Original:
- (10.2) CONTACT ORGANIZATION PRIMARY: Illinois State Water Survey, Office of Surface Water Resources: Systems, Information & GIS
- Changed to:
- 10.2 CONTACT ORGANIZATION PRIMARY
- 10.1.2 CONTACT ORGANIZATION: Illinois State Water Survey, Office of Surface Water Resources: Systems, Information & GIS
- Fix spelling mistakes and mismatched naming conventions for organizations.
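The two mechanical fixes in this list (parentheses around section numbers and the section 7 header) can also be scripted. Below is a minimal Python sketch, assuming the files are already plain text; the function name and the `HEADER_FIXES` table are my own illustration, not part of the USGS tools. Compound elements and spelling still need manual review.

```python
import re

# Matches a parenthesized multi-level section number at the start of a line,
# e.g. "(10.2) CONTACT ..." or "(7) METADATA ...".
PAREN_NUMBER = re.compile(r"^([ \t]*)\((\d+(?:\.\d+)*)\)", re.MULTILINE)

# Header renames found while testing this batch (hypothetical table; extend
# it with whatever the error messages reveal for other batches).
HEADER_FIXES = {
    "METADATA REFERENCE SECTION": "METADATA REFERENCE INFORMATION",
}


def clean_metadata_text(text):
    """Drop parentheses around section numbers and apply header renames."""
    text = PAREN_NUMBER.sub(r"\1\2", text)
    for wrong, right in HEADER_FIXES.items():
        text = text.replace(wrong, right)
    return text
```

For example, `clean_metadata_text("(7) METADATA REFERENCE SECTION")` returns `"7 METADATA REFERENCE INFORMATION"`.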
metadata after cleanup in a text editor
3. The CNS tool
The next step was to use a service from the USGS that takes text files of FGDC metadata and reformats them as well-structured, indented text. This service is called cns (chew and spit).
Links
- Info
- Online cns 1
- Online cns 2 (in case the other link is overloaded)
For these metadata files, cns was especially useful because it detected and deleted the multi-level numbers and replaced them with indentation. It also renamed the elements, using underscores in place of spaces and converting the capitalization to title case.
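To make that transformation concrete, here is a small Python illustration of the kind of rewrite described above. It is not how cns works internally: it is a simplified sketch that assumes the indentation depth can be read from the number of dots in the section number, whereas cns relies on its knowledge of the standard. The function name is hypothetical.

```python
import re

# A numbered line: section number, element name, optional ": value".
LINE = re.compile(r"^\s*(\d+(?:\.\d+)*)\s+([^:]+?)\s*(?::\s*(.*))?$")


def normalize_line(line, indent="  "):
    """Rewrite '10.1.2 CONTACT ORGANIZATION: x' as an indented element line."""
    match = LINE.match(line)
    if match is None:
        return line  # leave unnumbered lines untouched
    number, name, value = match.groups()
    depth = number.count(".")  # simplification: one indent level per dot
    element = "_".join(word.capitalize() for word in name.split())
    suffix = f": {value}" if value else ":"
    return f"{indent * depth}{element}{suffix}"
```

With this sketch, `normalize_line("7 METADATA REFERENCE INFORMATION")` gives `"Metadata_Reference_Information:"`. Small words such as "of" or "and" would be capitalized here, which may not match the official element names, so treat the output as a rough preview only.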
file after the cns "pre-parser"
4. The MP tool
The final step was to upload the indented text files to the metadata parser (mp) tool. This tool can output the files in several formats, but XML is all we need. As with the cns tool, it is important to check the error messages to see what metadata you might be losing in the process.
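Alongside the error messages, a quick look at the XML itself can show whether a whole section went missing. The Python sketch below assumes the output uses the standard FGDC short tag names (`idinfo`, `metainfo`, and so on) directly under the root element; the function name is my own. Only the identification and metadata reference sections are mandatory in every record, so a missing section is a prompt to look closer, not necessarily an error.

```python
import xml.etree.ElementTree as ET

# The seven main sections of the FGDC standard, keyed by XML short name.
FGDC_SECTIONS = {
    "idinfo": "Identification_Information",
    "dataqual": "Data_Quality_Information",
    "spdoinfo": "Spatial_Data_Organization_Information",
    "spref": "Spatial_Reference_Information",
    "eainfo": "Entity_and_Attribute_Information",
    "distinfo": "Distribution_Information",
    "metainfo": "Metadata_Reference_Information",
}


def missing_sections(xml_text):
    """Return the long names of main sections absent from the XML record."""
    root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
    present = {child.tag for child in root}
    return [name for tag, name in FGDC_SECTIONS.items() if tag not in present]
```

Running this on each output file and comparing the result with the sections visible in the cleaned-up text is a cheap sanity check before moving on to the ISO transformation.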
Links:
FGDC XML file after using the mp tool
Comments
I used the online versions of these tools for this test. However, it would be more efficient to download the utilities and run them in batches from the command line. They are available at http://geology.usgs.gov/tools/metadata/
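A batch run could look something like the Python sketch below. I have not run this against the batch in this post. It assumes `cns` and `mp` are installed and on the PATH, and that they accept `-o` (cns output file), `-x` (mp XML output file), and `-e` (error file) as I understand their documented usage; please verify the flags against the tool documentation before relying on it. The function names and the `.met` extension for the intermediate file are my own choices.

```python
import subprocess
from pathlib import Path


def build_commands(txt_file):
    """Return the cns and mp command lines for one cleaned-up text file."""
    txt_file = Path(txt_file)
    indented = txt_file.with_suffix(".met")  # intermediate cns output
    xml_out = txt_file.with_suffix(".xml")
    cns_err = txt_file.with_name(txt_file.stem + "_cns.err")
    mp_err = txt_file.with_name(txt_file.stem + "_mp.err")
    # Flags are assumptions based on the tools' usage notes; verify them.
    cns_cmd = ["cns", str(txt_file), "-o", str(indented), "-e", str(cns_err)]
    mp_cmd = ["mp", str(indented), "-x", str(xml_out), "-e", str(mp_err)]
    return [cns_cmd, mp_cmd]


def convert_folder(folder):
    """Run cns and then mp on every .txt file in a folder."""
    for txt_file in sorted(Path(folder).glob("*.txt")):
        for cmd in build_commands(txt_file):
            # check=False: review the .err files instead of stopping the batch
            subprocess.run(cmd, check=False)
```

Calling `convert_folder("metadata")` (a hypothetical folder name) would leave an `.xml` file and two `.err` files next to each input, and the error files still need to be read by hand.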
The metadata parser tool is also available in ArcGIS. However, the cns tool is not.
A few elements were lost during the process, which I deemed acceptable. They included axis elements in the reference system information section and some minor elements in the attribute section. The ISO format does not necessarily include such detailed information about either of these, so they would not be included in our final output anyway.