Texas State Library and Archives Commission - Preservica APIs in practice
Tagged in: Texas Digital Archive, Postman, Python, Pandas, MatPlotLib, Universal Access, Entity API, Content API
Brian Thomas, Electronic Records Specialist at Texas State Library and Archives Commission (TSLAC), considers why they started using Preservica APIs, the most common ways these are used and the steps they use to work out new processes.
When did you start using the APIs and why?
It all started with usage statistics for the Texas Digital Archive (TDA). In 2016 a state agency was hesitant to transfer electronic records to the State Archives because it might affect their online usage statistics, a key metric for them when requesting funds from the legislature. Initially this seemed like an impossible task given Preservica’s UUID system for calling up items. When the API was released for Preservica v5, the documentation showed it was possible to gather all the data/metadata for items in our repository efficiently, getting us one step closer to the usage statistics goal. After unsuccessful experimentation on my own, I attended the 2018 Preservica Global Usergroup meeting, where I watched a demonstration of the Postman Google Chrome plug-in interacting with the Preservica API. Postman lets you try different techniques to interact with an API and, when something works the way you want it to, download the code for recreating that action in many different scripting languages; in my case the language of choice was Python (note that the Chrome plug-in has since been deprecated and you need to download the Postman app to use the tool). I was further assisted by several bash script examples from Preservica staff which modeled how to interact with the API at the command line. I cannot thank those staff enough for their help. Reverse engineering the examples into Python took care of the rest. Beginning in May or June 2018, API work began in earnest.
What are the most common ways you use the API?
- Data harvesting for manipulation and collection-level usage statistics:
This is by far the most common API task I do for the TDA, and it enables most of the other tasks. Early on I created a harvester that takes a system UUID, then harvests and saves the data as an XML file to a local folder. Initially this was done one level of depth at a time (i.e. one folder down across a collection, two folders down, etc.), using XSL transforms to get information for the next level of depth within the hierarchy. Later, as my knowledge of Python and data manipulation increased, the script was upgraded to parse folder-level data to get the UUIDs for child folders/items and harvest iteratively until an entire collection is captured. Basically, set it and forget it.
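A minimal sketch of that recursive approach, with the actual Entity API call abstracted behind a `fetch_xml` function and an illustrative `<Child ref="..."/>` shape standing in for Preservica's real response schema:

```python
import os
from xml.etree import ElementTree as ET

def harvest(uuid, fetch_xml, out_dir):
    """Save the XML for one entity, then recurse into its children.

    `fetch_xml(uuid)` stands in for the real Entity API request; the
    <Child ref="..."/> elements below are illustrative, not Preservica's
    actual schema.
    """
    xml_text = fetch_xml(uuid)
    with open(os.path.join(out_dir, uuid + ".xml"), "w", encoding="utf-8") as f:
        f.write(xml_text)
    root = ET.fromstring(xml_text)
    for child in root.iter("Child"):
        harvest(child.get("ref"), fetch_xml, out_dir)
```

Starting from a collection's top-level UUID, this walks the whole hierarchy and leaves one XML file per folder/item on disk: set it and forget it.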
Once the data is harvested I need to create a spreadsheet with UUIDs for every folder/item in a collection. To do this, I wrote a separate script that crawls through the harvested data one file at a time, extracts the UUID and item/folder name, and saves them to an Excel spreadsheet. From there, every month I download usage statistics for the public access portal from Google Analytics and use another script, leveraging the Python Pandas and MatPlotLib modules, to pair analytics information about what was viewed with every possible item in a collection and generate usage numbers and graphs. The pairing process takes about 10 to 15 minutes for 100 or so collections (roughly 2.5 million items).
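The pairing step might look something like this in Pandas, with invented sample data and column names standing in for the real collection spreadsheet and Google Analytics export:

```python
import pandas as pd

# Illustrative stand-ins: the real inputs are the UUID spreadsheet built
# from harvested data and a monthly Google Analytics export.
inventory = pd.DataFrame({
    "uuid": ["1111", "2222", "3333"],
    "name": ["Folder A", "Item B", "Item C"],
})
analytics = pd.DataFrame({
    "page": ["/file/1111", "/file/3333"],
    "pageviews": [12, 5],
})
# Pull the UUID out of the page path so the two tables share a key.
analytics["uuid"] = analytics["page"].str.rsplit("/", n=1).str[-1]

# A left merge keeps every item in the collection, viewed or not.
usage = inventory.merge(analytics[["uuid", "pageviews"]], on="uuid", how="left")
usage["pageviews"] = usage["pageviews"].fillna(0).astype(int)
print(usage["pageviews"].sum())  # collection-level total for the month
```

The resulting frame feeds straight into MatPlotLib for the monthly graphs.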
Data harvest and manipulation has also led to the ability to create tables of information about collection assets for customized search pages. You can see an example of that here: https://www.tsl.texas.gov/arc/tda/search/supremeCourt.
- Metadata remediation: Although slightly less common than data harvesting in terms of volume, this has been the most important use of the Entity API. As humans, it is easy to make transcription errors when creating metadata: inserting an unnecessary space, forgetting that a data point has changed, and so on. Or a standard changes and we now need to push an update across the entire repository. For a handful of items these problems are easy to fix manually, but at scale this becomes impossible without programming. This is where the harvested data comes back into play. If a problem is identified, either through data manipulation or more general review (like finding two versions of a term in Universal Access facets), I use text editors or programming to fix the problem in the data already harvested, then use an upload script to post the corrections through the Entity API.
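A sketch of the in-place fix over harvested files, using a hypothetical pair of facet variants; the returned paths would then feed the separate Entity API upload script:

```python
from pathlib import Path

def fix_term(harvest_dir, old, new):
    """Replace a stray variant of a term across harvested XML files.

    Returns the paths that changed, so only those corrections need to be
    posted back through the Entity API by the upload script.
    """
    changed = []
    for xml_path in Path(harvest_dir).rglob("*.xml"):
        text = xml_path.read_text(encoding="utf-8")
        if old in text:
            xml_path.write_text(text.replace(old, new), encoding="utf-8")
            changed.append(xml_path)
    return changed

# e.g. reconciling two variants of a term spotted in the facets:
# fix_term("harvest", "Dept.", "Department")
```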
Several massive projects have involved data remediation at-scale. For example, there was a collection with over 300,000 born-digital images with rich embedded metadata that had not been extracted into sidecar .metadata files before ingest. This metadata was perfect for faceting, rendering and clickable links. We still had local copies of the images, but the images had been public for years so re-uploading the images was not considered an option. Using the API and harvested data we were able to merge metadata extracted from the images (exiftool) with the generic metadata assigned at the item level in the TDA. This let us make implicit data explicit and add facets like creator. A more recent large-scale project has been undertaken to use the API and various pieces of generic metadata to embed a date of creation in the Qualified Dublin Core metadata for every born-digital item in our repository.
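The merge of embedded and generic metadata can be sketched as a simple field mapping, with hypothetical field names on both sides; the real mapping depends on what exiftool finds in each image and on the local Qualified Dublin Core profile:

```python
# Sample values standing in for exiftool -json output for one image.
exif = {"Artist": "J. Smith", "CreateDate": "1998:05:14", "Model": "DCS 460"}
# Generic item-level metadata already assigned in the TDA.
qdc = {"dcterms:title": "Photo 001"}

# Which embedded fields become which QDC elements (illustrative).
mapping = {"Artist": "dcterms:creator", "CreateDate": "dcterms:created"}
for src, dest in mapping.items():
    # Make implicit data explicit, but never clobber an existing value.
    if src in exif and dest not in qdc:
        qdc[dest] = exif[src]
```

Run over a whole harvest, this is what turned embedded values into facetable elements like creator.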
Another remediation task has been removing old generic metadata. Before facets were introduced, TDA generic metadata was in EAD for collections/series/born-digital items and simple Dublin Core for digitized items. After facets became available, to create a more streamlined approach to search, Qualified Dublin Core was adopted because it could accommodate standard Dublin Core and crosswalk EAD elements. Before the API was available there were too many items with EAD in their metadata and no way to remove the old EAD metadata at scale. Using the API, we were able to strip off the old EAD and put standardized Qualified Dublin Core in its place.
- Large-scale generic metadata assignment: The TDA has an internal standard that, regardless of whether a folder/item is rendered to the general public, it is assigned core Qualified Dublin Core metadata (collection name and preferred citation). This ensures external and internal users can always facet on which collection an item came from and know what the citation should be if the need arises. For large-scale ingests (10,000 items or more, for example) this standard can be problematic for the processing power of the machine I use for the SIP Creator tool. In addition, the SIP Creator only lets users assign one piece of generic metadata per item during ingest, and occasionally other, more detailed generic metadata types such as special correspondence or email metadata take precedence. A script was developed to assign metadata en masse using the API. When this script was first developed, bulk metadata edit was not part of Preservica v5, so a graphical option was not available (one was added later, but it is less applicable when only a smattering of items within a set of several hundred or thousand are missing the core metadata). This process has usually been run after a harvest completes, during a retrospective evaluation. Assignment during data harvest was recently developed to handle several hundred thousand scans of digitized reel-to-reel films, taking care of the two tasks at the same time.
- Multipart asset upload: This is the newest use of the API. In Preservica v5 it was possible to emulate a multi-page digitized item by uploading master scans as preservation files and a single item as a presentation file. This many-to-one upload function was lost in the upgrade to v6. This type of item is very common in our digitization program, and many thousands of items were put on hold as a result. In early 2020 James Carr of Preservica sent out Python code examples for creating and ingesting a multipart asset using the Upload Content API.
The original script occasionally suffered from connectivity issues with Amazon AWS, as well as scale issues for large sets of items (the initial version required targeting one item at a time). After some experimentation I adapted James’ script to crawl through a directory structure of presentation and preservation files (one presentation item per lowest-level folder and a mirrored preservation file structure holding master scans), create multipart assets from the items, attempt to upload them using the Upload Content API, and refuse to move to the next item until the current one is confirmed as completely uploaded. The script is self-contained, so no external configuration file is needed. Using this new tool we have uploaded about 1,000 items thus far, with several thousand more to come in the near future.
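The refuse-to-advance behaviour can be sketched as a retry loop; `upload` and `confirm` here are hypothetical stand-ins for the Upload Content API call and the completeness check in the real script:

```python
import time

def upload_until_confirmed(asset, upload, confirm, max_attempts=5, delay=2.0):
    """Keep retrying until the upload is confirmed complete.

    `upload(asset)` and `confirm(receipt)` are placeholders for the actual
    Upload Content API call and its completeness check.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            receipt = upload(asset)
            if confirm(receipt):
                return receipt  # confirmed; safe to move to the next item
        except ConnectionError:
            pass  # transient AWS connectivity failure; try again
        time.sleep(delay * attempt)  # simple linear back-off between tries
    raise RuntimeError(f"{asset!r} not confirmed after {max_attempts} attempts")
```

Only a confirmed upload lets the crawl advance, which is what keeps a flaky connection from silently skipping items.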
What steps do you use to work out a new process?
It starts with a sketch pad and a big question of what I want to accomplish. Working with the API is all about programming, so from the initial question I work out the individual tasks needed to get to the results using Boolean-type logic as much as possible. For example, if I want to add a title to the metadata for items missing that data point where I have already used the API to harvest data and metadata as separate files, the sketch might look like this:
Result: Assets without titles in rendered metadata get titles assigned.
1. Test for the title in the metadata file
a. Crawl the directory of harvested files and isolate the metadata files
b. Read the metadata files one at a time into the script
c. Check whether the title tag text exists in the file text
2. If no title is present, get title data from another metadata element
a. Parse the metadata file to get individual tag content
b. Find the tag to pull the title out of, such as filename
c. Save that tag text to memory and manipulate it if needed, like dropping a file extension
3. Add the title
a. Use a replace function to add the opening tag, title text, and closing tag alongside whatever is being replaced, e.g. replace “</dcterms:filename>” with “</dcterms:filename><dcterms:title>some title</dcterms:title>”
4. Upload the metadata changes
a. Use the Entity API post function to update the metadata (usually I make many changes to metadata across the system at once, so this is done as a separate process in bulk).
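The first three major steps of that sketch might translate into Python along these lines (the `.metadata.xml` naming convention and tag handling are illustrative assumptions; the final Entity API post happens later as a separate bulk upload):

```python
from pathlib import Path

def add_missing_titles(harvest_dir):
    """Derive missing titles from the filename element in harvested metadata.

    The upload of the corrected files is a separate bulk step.
    """
    for path in Path(harvest_dir).rglob("*.metadata.xml"):  # isolate metadata files
        text = path.read_text(encoding="utf-8")             # read one at a time
        if "<dcterms:title>" in text:                       # title already present
            continue
        start = text.find("<dcterms:filename>")             # locate the source tag
        end = text.find("</dcterms:filename>")
        if start == -1 or end == -1:
            continue
        filename = text[start + len("<dcterms:filename>"):end]
        title = filename.rsplit(".", 1)[0]                  # drop the file extension
        text = text.replace(                                # replace-based insertion
            "</dcterms:filename>",
            f"</dcterms:filename><dcterms:title>{title}</dcterms:title>",
        )
        path.write_text(text, encoding="utf-8")
```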
Once the logic has been worked out, I figure out how to test whether each major step (the top-level items in the sketch above) has been successful. If the test succeeds, I add the next major step, until the process is done and ready for production use. If the script can affect many items in a way that cannot easily be undone, I first upload a set of test files to the TDA and harvest data for them, using them as a guinea pig for the new process until it is worked out. The files are then removed from the system.
Any other thoughts?
Once you get the hang of using the Entity API in coordination with programming, it is very surprising how much can get done at a scale not possible by other means. Even at an overly optimistic rate of one item updated by hand per minute, working 40 hours per week and 52 weeks per year, the API work completed in the two years since the 2018 Global Usergroup meeting amounts to decades’ worth of manual effort, more often than not done in the background alongside other digital preservation tasks.