Summary
This post describes a technique for scraping Portal Discovery Metadata from a custom site and merging it with Standards Documentation Metadata in accompanying XML files. The example portal used is PASDA, but this could be modified for other repositories.

Background
The BTAA GDP aggregates metadata to provide a catalog of geospatial resources from public data providers. There are generally two types of sources for the metadata:

- Portal Discovery Metadata: This is found within the data provider's portal application and may include minimal elements, such as title, date, description, and links. Several structured data portal applications, such as ArcGIS Hub and Socrata, provide this through their API as DCAT. Other portals, such as CKAN, have APIs that expose the Portal Discovery Metadata in a custom schema. (A minimal example of retrieving a DCAT feed appears after this list.)
- Standards Documentation Metadata: This is a file that accompanies the dataset and includes much more detail, such as spatial reference systems and accuracy reports. For geospatial resources, this is usually in the FGDC or ISO XML schema.
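As a quick illustration of that first source type, here is a short Python sketch for pulling a DCAT feed. The URL is a placeholder, and the /data.json path is only a common convention for such portals, not something every site provides:

```python
import requests

# Hypothetical portal URL -- many ArcGIS Hub and Socrata sites publish a
# DCAT catalog at /data.json; substitute the real portal address here.
catalog = requests.get('https://data.example.gov/data.json').json()

for dataset in catalog.get('dataset', []):
    # DCAT datasets carry at least a title, a description, and distributions.
    print(dataset.get('title'))
    print((dataset.get('description') or '')[:80])
    for distribution in dataset.get('distribution', []):
        print('  ', distribution.get('downloadURL') or distribution.get('accessURL'))
```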
The Challenge
Some portals do not offer an API, and the Portal Discovery Metadata may not be sufficient to meet the schema for our GeoBlacklight application. In this case, it did not include bounding boxes. Further, even if a collection of Standards Documentation Metadata is available, it can be difficult to extract the required discovery metadata, because the files are frequently inconsistently formatted and are almost always missing access and download links.

The Solution
Use a custom Python script to scrape online metadata and WGET to download standards metadata. Convert the standards metadata to a spreadsheet and merge it with the scraped metadata.

1. Scrape the Portal Discovery Metadata
Although the Portal Discovery Metadata is minimal, it usually has the advantage of being normalized, and each dataset landing page should include the same information elements using the same structure. To see the structure, view the Page Source of a page and inspect the HTML. The metadata should be reliably provided within classes or IDs that have distinct labels. A Python script using Beautiful Soup can extract these and write them to a CSV file.
This script requires a list of all of the pages to be scraped as input. There are several ways of obtaining this list: it can be derived from web browser plugins that export all links on a page to a file, or by manually selecting all the links at once from search results or browse lists and copying & pasting them into a spreadsheet.
Once a CSV containing the list of links has been made, it can be added to the Python script. This script will visit each page and extract the Title, Date, Publisher, Description, and Standards Documentation Metadata Link to a spreadsheet. Since it processes the pages in the order they appear in the CSV, the output can be merged back with the original list of landing page links. A sketch of what such a script might look like is shown below.
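The scraping script itself is specific to the portal being harvested, so the following is only a rough sketch: the class names (dataset-title, metadata-link, etc.) and file names are hypothetical and would need to be replaced with whatever the landing pages actually use.

```python
import csv

import requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4 requests


def text_of(soup, css_class):
    """Return the stripped text of the first element with this class, or ''."""
    element = soup.find(class_=css_class)
    return element.get_text(strip=True) if element else ''


# The class names below are hypothetical -- inspect the portal's page source
# and replace them with the classes or IDs the landing pages actually use.
with open('landing_pages.csv', newline='') as links, \
     open('scraped_metadata.csv', 'w', newline='', encoding='utf-8') as out:
    writer = csv.writer(out)
    writer.writerow(['landing_page', 'title', 'date', 'publisher',
                     'description', 'metadata_link'])
    for row in csv.reader(links):  # assumes one landing-page URL per row
        url = row[0]
        soup = BeautifulSoup(requests.get(url).text, 'html.parser')
        metadata_anchor = soup.find('a', class_='metadata-link')
        writer.writerow([
            url,
            text_of(soup, 'dataset-title'),
            text_of(soup, 'dataset-date'),
            text_of(soup, 'dataset-publisher'),
            text_of(soup, 'dataset-description'),
            metadata_anchor.get('href', '') if metadata_anchor else '',
        ])
```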
2. Harvest the Standards Documentation Metadata
Now that you have a list of all of the Standards Documentation Metadata XML files, you can download them using WGET. Make a separate CSV file of just the links to the XML files. (Note: to get the correct links, this might require adjusting the base URL or changing the extension if the links point to HTML pages.) A small sketch of that kind of link cleanup follows.
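This sketch assumes hypothetical file names and that the scraped links end in .html while the raw metadata uses an .xml extension; adjust the rewrite to match the portal you are working with.

```python
import csv

# Hypothetical rewrite: the scraped links point to an HTML view of the
# metadata, but the raw XML lives at the same path with an .xml extension.
with open('scraped_metadata_links.csv', newline='') as src, \
     open('xml_links.csv', 'w', newline='') as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        link = row[0].replace('.html', '.xml')  # adjust extension or base URL
        writer.writerow([link])
```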
Save the CSV of metadata links to your WGET folder. Open the terminal and change directories into the WGET folder. Run this command:
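The exact command depends on your file names; assuming the list is saved as xml_links.csv with one URL per line, a minimal version would be:

```
wget -i xml_links.csv --wait=1
```

The -i flag tells WGET to read the URLs from a file, and --wait=1 pauses one second between requests to be considerate of the server.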
3. Extract Values from the Standards Documentation Metadata to a Spreadsheet
You now have a group of XML files. Another Python script could be created to programmatically extract certain values. However, both FGDC and ISO are highly nested XML files that can require fairly complex queries, so I use software applications for this. GeoNetwork will export ISO files to CSV. For FGDC, MarcEdit is my choice.
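For completeness, here is a rough sketch of what a scripted approach could look like for FGDC records, using only the Python standard library. The folder and output names are placeholders, and the element paths follow the CSDGM structure (idinfo/citation/citeinfo/title for the title, idinfo/spdom/bounding for the extent):

```python
import csv
import pathlib
import xml.etree.ElementTree as ET

with open('fgdc_values.csv', 'w', newline='', encoding='utf-8') as out:
    writer = csv.writer(out)
    writer.writerow(['filename', 'title', 'west', 'east', 'north', 'south'])
    for xml_path in pathlib.Path('metadata_xml').glob('*.xml'):
        try:
            root = ET.parse(xml_path).getroot()
        except ET.ParseError:
            continue  # skip files that are not well-formed XML
        # CSDGM paths: the dataset title and the geographic bounding coordinates.
        writer.writerow([
            xml_path.name,
            root.findtext('idinfo/citation/citeinfo/title', default=''),
            root.findtext('idinfo/spdom/bounding/westbc', default=''),
            root.findtext('idinfo/spdom/bounding/eastbc', default=''),
            root.findtext('idinfo/spdom/bounding/northbc', default=''),
            root.findtext('idinfo/spdom/bounding/southbc', default=''),
        ])
```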
Pre-step: Insert the metadata filename into the metadata document itself
One of the unfortunate characteristics of Standards Documentation Metadata files is that they often do not include a direct reference to the dataset in the form of a URI or linkage. The title in the metadata file often differs slightly from the title of the dataset. This makes it difficult to programmatically pair up the Standards Documentation Metadata with the Portal Discovery Metadata. The name of the metadata file is often the only value that can be matched.
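Once values have been extracted from both sources (steps 1 and 3), the join itself is straightforward. A minimal pandas sketch, with hypothetical file names and assuming each spreadsheet can supply a filename column to match on:

```python
import pandas as pd

# Hypothetical file names: the scraped portal metadata and the values
# extracted from the XML files, joined on the metadata filename.
portal = pd.read_csv('scraped_metadata.csv')
standards = pd.read_csv('fgdc_values.csv')

# Derive the filename from the scraped metadata link so the two line up
# (adjust the extension if the link points to an HTML page).
portal['filename'] = portal['metadata_link'].str.split('/').str[-1]

merged = portal.merge(standards, on='filename', how='left')
merged.to_csv('merged_metadata.csv', index=False)
```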
The filename can be inserted into the metadata file using the Batch Metadata Modifier Tool (BMM). The first step would be to batch insert the string UNIQUEFILENAME into an unused field. This can be done with either a template in the BMM or a script. The BMM has a function for finding this string and replacing it with the dataset name. For this batch of records, I replaced the Title field with the filename.
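If you would rather script the insertion than use the BMM, a minimal sketch that writes each file's own name into the FGDC Title element could look like this (the folder name is a placeholder):

```python
import pathlib
import xml.etree.ElementTree as ET

for xml_path in pathlib.Path('metadata_xml').glob('*.xml'):
    tree = ET.parse(xml_path)
    title = tree.getroot().find('idinfo/citation/citeinfo/title')
    if title is not None:
        # Overwrite the Title with the metadata filename so it can be
        # matched against the Portal Discovery Metadata later.
        title.text = xml_path.name
        tree.write(xml_path, encoding='utf-8')
```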
The following text was copied from the instructions at https://insideidaho.org/helpdocs/batch_metadata_modifier_tool.html#rdsn
The batch metadata modifier tool can be used to replace XML metadata file names through the "Replace with Dataset Name" checkbox. From the menu bar, select Edit → Replace. In the pop-up window, check the box next to "Replace with Dataset Name". By default, this replaces the string "UNIQUEFILENAME" with the name of a user-specified file (in order to establish a valid URL).
a. Convert to MARC format
Click on Tools and select Batch Process.
For Source Directory, navigate to your folder of XML files. Type "xml" in the File Types box, and select FGDC=>MARC from the list of functions.
(Note: as of the current MarcEdit for Mac version (2.5.26), this step wasn't working on my MacBook. However, the Windows version of MarcEdit is usually updated first, and I found that it worked perfectly in Windows).
b. Join the records into a single file
Find the MARCJoin function (it may be on the main page or under Tools)
For Save File, create a new name with the .mrc extension. For files to join, navigate to the folder of newly created MARC files and select all of them. Check the box for Join Individual Files and select Process.
c. Export selected fields to CSV
Click on Tools → Export Records → Export Tab Delimited Records
For MARC File, navigate to the joined file made in step b. For delimiters, I use a comma, with pipes as the in-field delimiters.
The MARC file should be inspected to determine which fields would be useful to export. If you want the contents of subfields in separate columns (such as the bounding box coordinates), list them individually. For this case, I chose the following (a scripted alternative is sketched after this list):
034$d (west extent)
034$e (east extent)
034$f (north extent)
034$g (south extent)
650 (theme keywords)
651 (place name keywords)
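As an alternative to the MarcEdit export dialog, the same fields could be pulled from the joined .mrc file with a short pymarc script. A minimal sketch, assuming the joined file is named joined.mrc:

```python
import csv
from pymarc import MARCReader  # pip install pymarc


def first_subfield(record, tag, code):
    """Return the first $code subfield of the first `tag` field, or ''."""
    for field in record.get_fields(tag):
        values = field.get_subfields(code)
        if values:
            return values[0]
    return ''


with open('joined.mrc', 'rb') as marc_file, \
     open('marc_export.csv', 'w', newline='', encoding='utf-8') as out:
    writer = csv.writer(out)
    writer.writerow(['title', 'west', 'east', 'north', 'south', 'themes', 'places'])
    for record in MARCReader(marc_file):
        if record is None:
            continue  # skip records pymarc could not parse
        themes = '|'.join(f.format_field() for f in record.get_fields('650'))
        places = '|'.join(f.format_field() for f in record.get_fields('651'))
        writer.writerow([
            first_subfield(record, '245', 'a'),  # title
            first_subfield(record, '034', 'd'),  # west extent
            first_subfield(record, '034', 'e'),  # east extent
            first_subfield(record, '034', 'f'),  # north extent
            first_subfield(record, '034', 'g'),  # south extent
            themes,
            places,
        ])
```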