Using OPEX and PAX for Ingesting Content
Preservica has developed the concept of an OPEX (Open Preservation Exchange) package: a collection of files and folders, with optional metadata, that organises content into an easy-to-understand format for transfer into or out of a digital preservation system. Although we created it, we hope that suppliers of digital content to be preserved, and other digital preservation systems, will adopt it because of its simplicity.
The core idea of OPEX is that a collection of files within a directory structure is already a valid OPEX. That means that any content you already have, or are generating (e.g. a digitisation project), will be valid OPEX and you can ingest it.
Next to any file, or within any directory in the OPEX, you may optionally put an XML metadata file (with the .opex extension) with more information about the file or directory, like descriptive metadata or a manifest to say what is in the OPEX.
You can find more information on the OPEX format in this article. There is also supporting information with more details about OPEX, OPEX metadata and how to use our OPEX ingest processes available through our support desk or from your CX contact.
This article is not a reference for everything you can do with OPEX, or a complete guide.
There are two ways you can get content in an OPEX into Preservica.
If you only have a small amount of content, you can zip up the contents of your OPEX directory (the top level directory containing the files and directories you want to be ingested), and submit that zip file to Preservica as a package, via any of the normal ingest mechanisms, e.g. by running the standard ingest workflow against it via Explorer (this is only supported from Preservica 6.2 onwards).
For larger amounts of content, or where you have content being generated on an ongoing basis, we've created a new workflow to ingest content directly from an ingest source (e.g. S3 bucket), rather than ingesting packages of content. You can copy your content up to the ingest source, without zipping it, and ask Preservica to ingest it – the workflow will manage batching up the content and retrieving new content as you write it. This workflow is called "Ingest OPEX (Incremental)" and you can find more details in the Standard Workflow document (this has been available since 6.1 on our cloud systems for EPC and EPCP customers).
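For the first, zip-based route, the packaging step can be scripted. The following is a minimal sketch using only the Python standard library; the function name and the example paths are illustrative assumptions, not part of any Preservica tooling. It zips the contents of the OPEX directory (not the directory itself), so entry names are relative to the top level directory.

```python
import zipfile
from pathlib import Path


def zip_opex_directory(opex_dir: Path, zip_path: Path) -> list[str]:
    """Zip the contents of opex_dir into zip_path and return the entry names.

    zip_path should live outside opex_dir, otherwise the archive would
    try to include itself.
    """
    opex_dir = Path(opex_dir)
    names = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        # rglob("*") yields both files and directories; sorting gives a stable order.
        for path in sorted(opex_dir.rglob("*")):
            # Entry names are relative to the OPEX directory, with "/" separators.
            arcname = path.relative_to(opex_dir).as_posix()
            # For a directory, ZipFile.write() stores an explicit directory entry.
            archive.write(path, arcname)
            names.append(arcname + "/" if path.is_dir() else arcname)
    return names
```

For example, zip_opex_directory(Path("oxford-photos"), Path("oxford-photos.zip")) would produce a package you could then submit through one of the normal ingest mechanisms.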
Example: Digital Image Preservation
Imagine this scenario: your organisation has been asked to preserve a collection of digital images. At the moment, they're stored on an old network drive which your IT administrator wants to decommission. You're using Preservica in the cloud, so first you'll want to transfer content to an S3 source location.
You could create a ZIP package, and upload that. But there are a lot of these images, so making that package would take time, you'd need somewhere to put it, and you'd have to package everything up before you can start ingesting. You could write a script to make packages for groups of 100 images and upload them (customers have done something similar in the past), but that script is not easy to create at a production level. So you decide to use OPEX as your ingest format, and set up an OPEX ingest workflow in your Preservica.
Just Get This Stuff In Please
This is the simplest way to ingest content. Set your OPEX Ingest workflow up to automatically trigger, and not to require a folder manifest (because you aren't sending any accompanying metadata), start a bulk copy of the content you want to upload into your S3 bucket, and walk away. Assuming the upload doesn't get interrupted, it will gradually copy content up to S3, and Preservica will process and ingest it incrementally, creating the appropriate assets and folders as it goes.
A simple incremental OPEX ingest progressing
Asset Metadata (.opex files)
Before you start this ingest, one of your colleagues tells you that they have a catalogue of titles and descriptions for all these images. Instead of labelling your assets with the original file names from the camera, you'd like each asset to receive a title and description from this catalogue.
You can write a script to take the data from this catalogue file and generate metadata files for each image:
DSCN0013.jpg.opex
<OPEXMetadata xmlns="http://www.openpreservationexchange.org/opex/v1.0">
  <Properties>
    <Title>St. Hugh's (main quad)</Title>
    <Description>An image of the interior of St. Hugh's College, Oxford, from the main entrance</Description>
  </Properties>
</OPEXMetadata>
... and put those metadata files next to the content files. If you run the ingest like this, your assets will pick up the metadata files:
Incremental ingest with asset metadata
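A script of the kind described above might look like the following minimal sketch. It assumes the catalogue is a CSV file with filename, title and description columns; that layout, and the function names, are assumptions made for illustration, so adapt them to whatever your catalogue actually contains. It writes a .opex file next to each image it finds, using the same Properties structure as the example above.

```python
import csv
import xml.etree.ElementTree as ET
from pathlib import Path

OPEX_NS = "http://www.openpreservationexchange.org/opex/v1.0"
# Serialise the OPEX namespace as the default xmlns, as in the example above.
ET.register_namespace("", OPEX_NS)


def build_asset_opex(title: str, description: str) -> bytes:
    """Build an OPEX metadata document holding a Title and a Description."""
    root = ET.Element(f"{{{OPEX_NS}}}OPEXMetadata")
    props = ET.SubElement(root, f"{{{OPEX_NS}}}Properties")
    ET.SubElement(props, f"{{{OPEX_NS}}}Title").text = title
    ET.SubElement(props, f"{{{OPEX_NS}}}Description").text = description
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


def write_opex_from_catalogue(catalogue_csv: Path, content_dir: Path) -> list[Path]:
    """Write <name>.opex beside each catalogued file that exists in content_dir.

    Assumed (hypothetical) CSV columns: filename, title, description.
    """
    written = []
    with open(catalogue_csv, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            content_file = Path(content_dir) / row["filename"]
            if not content_file.is_file():
                continue  # catalogue row with no matching image: skip it
            opex_file = content_file.with_name(content_file.name + ".opex")
            opex_file.write_bytes(build_asset_opex(row["title"], row["description"]))
            written.append(opex_file)
    return written
```

ElementTree escapes special characters in titles and descriptions for you, which is the main reason to prefer it over building the XML with string formatting.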
You can also create .opex metadata files for directories. They should have the same name as the directory and be placed inside it (for example, the oxford-photos directory may have a metadata file at oxford-photos/oxford-photos.opex).
These .opex metadata files may also contain descriptive metadata fragments for both files and directories:
DSCN0013.jpg.opex
<OPEXMetadata xmlns="http://www.openpreservationexchange.org/opex/v1.0">
  <Properties>
    <Title>St. Hugh's (main quad)</Title>
    <Description>An image of the interior of St. Hugh's College, Oxford, from the main entrance</Description>
  </Properties>
  <DescriptiveMetadata>
    <oai_dc:dc xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ oai_dc.xsd"
               xmlns:dc="http://purl.org/dc/elements/1.1/"
               xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
      <dc:title>St. Hugh's (main quad)</dc:title>
      <dc:creator>John Bradley</dc:creator>
      <!-- ... etc -->
    </oai_dc:dc>
  </DescriptiveMetadata>
</OPEXMetadata>
Transfer Ordering and Manifests
If you follow the process described above with images, it will probably work as you expect, because images are quite small. But there is a potential problem. What if DSCN0013.jpg is copied up to the bucket, gets processed by Preservica, and then, some time later, DSCN0013.jpg.opex is sent? The asset has already been created, ignoring the metadata that should have been with it. This can happen with folders too: Preservica creates folders when it starts ingesting assets for the files inside the matching directory, so if the oxford-photos.opex metadata file isn't uploaded before Preservica starts ingesting files inside the directory, a folder will be created before it knows about the metadata.
To reduce the risk, Preservica waits before it attempts to ingest a file. (The delay also helps if you're using a disk rather than cloud storage: files can appear before they're fully written, so we wait for the file to stop being touched before trying to read it.) But no delay is guaranteed to be long enough to avoid this issue: what if you are uploading 30GB video files?
If you're in full control of the copy process, you can avoid this by sending the files in the correct order. The OPEX metadata file for the directory should come first, and for each content file, its metadata file should come before it. In the example above, in the 2010 directory you should send: 2010.opex, DSCN0013.jpg.opex, DSCN0013.jpg, DSCN0014.jpg.opex, DSCN0014.jpg, and so on.
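If you are scripting the copy yourself, the ordering rule above is easy to compute before you start sending anything. The sketch below is one way to do it, assuming the naming conventions described in this article (a directory's metadata file is named after the directory and sits inside it, and a file's metadata file is its name plus .opex). The function name is hypothetical, and the actual upload step is left to whatever copy tool or client you use.

```python
from pathlib import Path


def transfer_order(directory: Path) -> list[Path]:
    """Return everything under `directory` in a metadata-first send order.

    Order: the directory's own .opex, then for each content file its
    .opex (if any) followed by the file itself, then each subdirectory
    handled the same way.
    """
    directory = Path(directory)
    folder_opex = directory / f"{directory.name}.opex"
    files = sorted(p for p in directory.iterdir() if p.is_file())

    # The directory's metadata file goes first, if there is one.
    ordered = [folder_opex] if folder_opex in files else []

    # Content files are everything that is not an .opex metadata file.
    content = [p for p in files if p.suffix != ".opex"]
    for item in content:
        sidecar = item.with_name(item.name + ".opex")
        if sidecar in files:
            ordered.append(sidecar)  # metadata before the content it describes
        ordered.append(item)

    # Recurse into subdirectories after this directory's own files.
    for sub in sorted(p for p in directory.iterdir() if p.is_dir()):
        ordered.extend(transfer_order(sub))
    return ordered
```

Note that a .opex file with no matching content file is left out of the result by this sketch; you may prefer to report those instead, since they usually indicate a naming mistake.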
However, if you're using an automated copying process, you can't guarantee that files will be copied in this order. In particular, our experience is that Cloudberry does not upload content to S3 in a predictable order. In this case, you should send directory OPEX metadata files with manifests written in the Transfer element of the .opex, which list what is expected within the directory, and select Require folder manifests on the ingest workflow.
2010.opex
<OPEXMetadata xmlns="http://www.openpreservationexchange.org/opex/v1.0">
  <Transfer>
    <Manifest>
      <Files>
        <File type="content">DSCN0013.jpg</File>
        <File type="metadata">DSCN0013.jpg.opex</File>
        <File type="content">DSCN0014.jpg</File>
        <File type="metadata">DSCN0014.jpg.opex</File>
        <!-- ... etc -->
      </Files>
    </Manifest>
  </Transfer>
</OPEXMetadata>

oxford-photos.opex
<OPEXMetadata xmlns="http://www.openpreservationexchange.org/opex/v1.0">
  <Transfer>
    <Manifest>
      <Folders>
        <Folder>2010</Folder>
        <Folder>2011</Folder>
        <!-- ... etc -->
      </Folders>
    </Manifest>
  </Transfer>
</OPEXMetadata>
A manifest can contain files (including .opex metadata files for the content files) and folders, and of course you can also include properties and descriptive metadata in these OPEX metadata files if you wish.
When the OPEX Ingest workflow reads a metadata file with a manifest, it won't try to download anything or navigate into any subdirectories that aren't in the manifest. That means you can upload the files in any order you like – as long as you give Preservica the metadata file with the manifest early enough that it knows about it. By selecting the Require folder manifests option, you tell the workflow not to start ingesting from a directory at all until the directory's OPEX metadata file has been written.
Manifests also provide you with a level of robustness. If not everything you were trying to upload was copied successfully into the cloud for processing, the OPEX ingest will generate warnings in the monitoring app to tell you what it expected but never found. This can happen if the upload is interrupted and the workflow completes, but the remaining content is written later. In that case, use Resubmit in the monitoring app to process the OPEX again. We therefore strongly advise that if you're writing a script or process to generate OPEX metadata files, you should write manifests into the directory metadata files.
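As an illustration of that advice, the sketch below builds a manifest for one directory from what is actually on disk, using the Files, File, Folders and Folder elements shown in the examples above. The function name is hypothetical, and placing Folders before Files when a directory has both is an assumption; check the element order against the OPEX schema before relying on it.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

OPEX_NS = "http://www.openpreservationexchange.org/opex/v1.0"
ET.register_namespace("", OPEX_NS)  # write the namespace as the default xmlns


def _tag(name: str) -> str:
    """Qualify an element name with the OPEX namespace."""
    return f"{{{OPEX_NS}}}{name}"


def build_folder_manifest(directory: Path) -> bytes:
    """Return OPEX metadata whose Transfer/Manifest lists the directory's contents."""
    directory = Path(directory)
    own_opex = f"{directory.name}.opex"  # the file this manifest will be saved as
    entries = sorted(directory.iterdir())
    files = [p for p in entries if p.is_file() and p.name != own_opex]
    folders = [p for p in entries if p.is_dir()]

    root = ET.Element(_tag("OPEXMetadata"))
    transfer = ET.SubElement(root, _tag("Transfer"))
    manifest = ET.SubElement(transfer, _tag("Manifest"))
    if folders:
        # Assumption: Folders precedes Files when both are present.
        folders_el = ET.SubElement(manifest, _tag("Folders"))
        for folder in folders:
            ET.SubElement(folders_el, _tag("Folder")).text = folder.name
    if files:
        files_el = ET.SubElement(manifest, _tag("Files"))
        for item in files:
            # .opex sidecars are listed as metadata, everything else as content.
            kind = "metadata" if item.suffix == ".opex" else "content"
            ET.SubElement(files_el, _tag("File"), {"type": kind}).text = item.name
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)
```

The function returns bytes rather than writing the file, so that you can merge in Properties or descriptive metadata first if the directory's .opex needs them as well.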
SourceID and Matching
If you are regularly archiving from a system such as SharePoint, it is useful to be able to match back to a unique reference in the source system.
In the Transfer element of an OPEX metadata file you may also specify a SourceID:
oxford-photos.opex
<OPEXMetadata xmlns="http://www.openpreservationexchange.org/opex/v1.0">
  <Transfer>
    <SourceID>07d658626d4f392b3f012d6c77676dce</SourceID>
  </Transfer>
</OPEXMetadata>
This will create a SourceID identifier in Preservica containing the ID in the source system, but it's also used for matching:
- If a folder with the same source ID already exists, it will be re-used rather than a new folder being created. This lets you ingest a complex hierarchy through multiple OPEXes. If you don't specify source IDs, the name of the folder will be used to look for a match within the appropriate parent folder.
- By default, if an asset with the same source ID already exists anywhere in Preservica, the file won't be processed or ingested. This means you don't have to be so careful about excluding content if you need to rerun an acquisition process. Other options to match assets locally or not to match are possible.
See the workflow documentation for more details about matching.
Monitoring OPEX Ingests
The OPEX ingest process is a workflow, so you can follow it through the normal workflow progress page (Ingest > Running). But it is a workflow that starts lots of child ingest workflows, and following the whole process that way is difficult. We also provide the monitoring app (at /monitor) to EPC customers and above, on request, to give you a view of the ingest of the OPEX without having to inspect multiple workflow contexts.
The Top Panel of the Monitor shows a “card” per OPEX ingested (you will also see standard ingests, since 6.2, and replacement processes, since 6.2.1, in this page). The top left card will be the most recent. The order is left to right and then top to bottom. The card contains:
- The name of the process (e.g. the OPEX directory or ingest package name)
- The status, in icon form
- The number of files/folders ingested
- The size of the ingest
- The number of errors and warnings for the process, if any
The Bottom Panel will show a line per message. You can filter the list of messages with the controls in the centre, or filter by process by clicking on process cards to toggle them in and out of the selection.
This information is also available through an API: /api/processmonitor.
The idea of OPEX is that it isn't Preservica-specific; the information you can put in an OPEX metadata file about a file or directory consists of general concepts such as object properties, descriptive metadata and external identifiers, which could be used by any digital preservation or content management system.
However, when you know you're working with Preservica, or copying information out of one Preservica into another, sometimes you want to take advantage of Preservica-specific features: maybe you want to record a previous generation of content, or you want to link files together into a single asset, or link together a preservation and access representation.
The purpose of PAX (Preservica Asset Exchange) is to represent a simple but complete asset so it can be ingested into or transferred between Preservica instances. It's a zip file with a convention for directory and file structure which corresponds to the representation and (optionally) generation structure in Preservica.
Since 6.1, exports are generated in this OPEX format with Assets structured as PAXes.
Example: Digitised Book Preservation
As well as a collection of pictures, your colleague has also found a digitised copy of a seminar programme. It has a TIFF image for each page of the pamphlet, and a PDF of the whole thing. You could just ingest them as five separate assets, but you want to link them into one. You can do this by creating a PAX:
Content before arrangement
seminar-programme/
  page1.tiff
  page2.tiff
  page3.tiff
  page4.tiff
  programme.pdf

PAX
seminar-programme.pax.zip
  Representation_Preservation/
    page1/
      page1.tiff
    page2/
      page2.tiff
    page3/
      page3.tiff
    page4/
      page4.tiff
  Representation_Access/
    programme/
      programme.pdf

Preservica
seminar-programme (Asset)
  Preservation representation
    page1 (CO)
      page1.tiff (Bitstream in generation 1)
    page2 (CO)
      page2.tiff (Bitstream in generation 1)
    page3 (CO)
      page3.tiff (Bitstream in generation 1)
    page4 (CO)
      page4.tiff (Bitstream in generation 1)
  Access representation
    programme (CO)
      programme.pdf (Bitstream in generation 1)
Creating a PAX for a complex asset
The PAX (seminar-programme.pax.zip) can be placed in an OPEX anywhere a content file is placed, and you can provide .opex metadata for it (in an adjacent file seminar-programme.pax.zip.opex as normal). The directory and file names will be used to name the Content Objects (COs) and bitstreams respectively, and COs will be created in alphabetic order. If the alphabetical folder order is not the one you want, then add a numeric prefix or similar to ensure you get the order you require.
Note that (at least in Preservica 6.2) the ZIP you create must include ZIP entries for the directories. This means you can't use the built-in Windows zip tool to create them: it doesn't create those entries, so you'll get an empty asset if you do. Personally, I like 7-zip.
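If you would rather script the PAX than build it by hand, the following minimal sketch writes the directory entries explicitly, so the result does not depend on how a particular zip tool behaves. It follows the layout in the example above, with one Content Object directory per file, named after the file without its extension. The function names are hypothetical, and the sketch covers only the simple case of one file per Content Object with no XIP document.

```python
import zipfile
from pathlib import Path


def _add_directory(archive: zipfile.ZipFile, name: str) -> None:
    """Write an explicit ZIP entry for a directory (name must end with '/')."""
    info = zipfile.ZipInfo(name)
    # Mark the entry as a directory: Unix mode bits plus the MS-DOS directory flag.
    info.external_attr = (0o40755 << 16) | 0x10
    archive.writestr(info, b"")


def build_pax(pax_path: Path, preservation: list[Path], access: list[Path]) -> list[str]:
    """Create a PAX zip and return its entry names in the order written.

    Files sharing a stem within one representation would collide on the
    Content Object directory name; this sketch does not handle that case.
    """
    layout = {
        "Representation_Preservation": preservation,
        "Representation_Access": access,
    }
    with zipfile.ZipFile(pax_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for representation, files in layout.items():
            if not files:
                continue  # omit a representation that has no content
            _add_directory(archive, f"{representation}/")
            for source in files:
                source = Path(source)
                # One Content Object directory per file, e.g. page1/ for page1.tiff.
                co_dir = f"{representation}/{source.stem}/"
                _add_directory(archive, co_dir)
                archive.write(source, co_dir + source.name)
        return archive.namelist()
```

For the seminar programme, you would pass the four TIFF files as the preservation list and programme.pdf as the access list, and write the result to seminar-programme.pax.zip.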
It's also possible to include an XIP document in a PAX. This lets you specify all the internal structure and XIP properties of the asset, for example renaming or reordering COs, or putting content files in a different structure, or labelling representations. Much of the information you could specify in an XIP document is ignored or, for the asset, overridden by OPEX metadata; see the documentation page for more details.