Open Preservation Exchange (OPEX)
OPEX Format
In its simplest form, an OPEX package is just a structure of files and directories. Each directory is translated to a Preservica folder, and each file becomes an asset.
You may also associate OPEX metadata with any item in an OPEX. For a file, create an OPEX metadata file adjacent to the file, named after the file with the extension .opex appended (for example, for llama.jp2, the OPEX metadata file should be named llama.jp2.opex). For a directory, the OPEX metadata file should be placed inside the directory and given the name of the directory with the same suffix (for example, in a directory named Images, the metadata file should be at Images/Images.opex).
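The naming rule above can be sketched as a small helper. This is only an illustration: the function name opex_metadata_path and its is_directory flag are hypothetical, not part of any Preservica tool.

```python
from pathlib import PurePosixPath


def opex_metadata_path(item: str, is_directory: bool) -> str:
    """Return the expected location of the OPEX metadata file for an item."""
    path = PurePosixPath(item)
    if is_directory:
        # Directory: the metadata file lives inside it, named after the directory.
        return str(path / f"{path.name}.opex")
    # File: the metadata file sits next to it, with .opex appended to the full name.
    return str(path.with_name(f"{path.name}.opex"))
```

For instance, opex_metadata_path("llama.jp2", False) gives llama.jp2.opex, and opex_metadata_path("Images", True) gives Images/Images.opex.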
There are three main parts to the metadata file:
- Properties - This element defines the title and description of the folder or asset, if you want to override the default (based on the name of the directory or file). You may also specify a security descriptor if you don't want the default of 'open'.
- Transfer - This section contains information about the provenance and content of the item. Set the SourceID element to record the id of the object in the source system; this field is used for matching (see below). For a folder, you may include a manifest, which lists all the files and directories which are direct children of this one (to make a fully verified tree, also include manifests in the metadata files in subdirectories at all levels). For a file, you may include fixity information, which will then be verified by the ingest process.
- DescriptiveMetadata - Sub elements of this element will be added to the folder or asset as descriptive metadata fragments. Make sure you set the XML namespace (with the xmlns attribute) on each fragment.
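As a rough sketch of how the three parts fit together, the following builds a metadata document for a single file. The namespace URI, the exact element and attribute names (OPEXMetadata, Fixities, Fixity with type and value), the choice of SHA-256 and the example.org metadata fragment are all assumptions made for illustration; check them against the OPEX schema before relying on them.

```python
import hashlib
import xml.etree.ElementTree as ET

# Assumed namespace URI; confirm against the published OPEX schema.
OPEX_NS = "http://www.openpreservationexchange.org/opex/v1.0"
# Hypothetical namespace for an illustrative descriptive metadata fragment.
FRAGMENT_NS = "http://example.org/hypothetical-metadata"


def build_file_opex(content: bytes, source_id: str, title: str) -> str:
    """Build an OPEX metadata document for one file (element names assumed)."""
    ET.register_namespace("opex", OPEX_NS)
    root = ET.Element(f"{{{OPEX_NS}}}OPEXMetadata")

    # Transfer: provenance (SourceID) plus fixity for the ingest to verify.
    transfer = ET.SubElement(root, f"{{{OPEX_NS}}}Transfer")
    ET.SubElement(transfer, f"{{{OPEX_NS}}}SourceID").text = source_id
    fixities = ET.SubElement(transfer, f"{{{OPEX_NS}}}Fixities")
    digest = hashlib.sha256(content).hexdigest()
    ET.SubElement(fixities, f"{{{OPEX_NS}}}Fixity", {"type": "SHA-256", "value": digest})

    # Properties: override the default title derived from the file name.
    properties = ET.SubElement(root, f"{{{OPEX_NS}}}Properties")
    ET.SubElement(properties, f"{{{OPEX_NS}}}Title").text = title

    # DescriptiveMetadata: each fragment carries its own XML namespace.
    descriptive = ET.SubElement(root, f"{{{OPEX_NS}}}DescriptiveMetadata")
    fragment = ET.SubElement(descriptive, f"{{{FRAGMENT_NS}}}record")
    ET.SubElement(fragment, f"{{{FRAGMENT_NS}}}note").text = "Illustrative fragment"

    return ET.tostring(root, encoding="unicode")
```

Because everything in the metadata file is optional, any of the three sections could be left out of a real document.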
Everything in an OPEX metadata file is optional, so you only need to provide information when the default behaviour is not suitable. However, if you are ingesting incrementally and uploading content into a source with a third-party tool (e.g. uploading content to an S3 bucket with Cloudberry), you will usually need to include manifests with your directories.
See the OPEX metadata schema for more details of how to construct the metadata file. Please note that the OPEX schema is published under a Free and Open-Source Software (FOSS) licence.
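A manifest lists only the direct children of a directory, so the raw material for one can be gathered with a single directory listing. The sketch below collects the names only; the function name direct_children is hypothetical, and how the names are written into the metadata file (including whether the .opex files themselves should be listed) should be checked against the schema.

```python
from pathlib import Path


def direct_children(directory: Path) -> tuple[list[str], list[str]]:
    """Return (file names, folder names) for the direct children of a directory."""
    files: list[str] = []
    folders: list[str] = []
    for child in sorted(directory.iterdir()):
        if child.is_dir():
            # Subdirectories are listed by name only; their own contents
            # belong in their own manifests.
            folders.append(child.name)
        else:
            files.append(child.name)
    return files, folders
```

To build a fully verified tree, the same function would be called once per directory at every level.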
PAX (Preservation Asset Exchange format)
Please note: the following relates to capabilities available in an upcoming release.
Normally, each file in an OPEX will be converted into a simple asset, with one preservation representation containing the single piece of content that is the file itself.
You may also include packaged PAX files to specify a complex or customised asset. Any file with a name ending in .pax.zip will be interpreted as a PAX, and instead of the zip being stored as an asset itself, the contents of the zip will be used to construct a multi-part asset.
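The distinction between a packaged PAX and an ordinary file comes down to the file name. A minimal sketch, assuming the suffix match is case-sensitive (the text does not say either way); both function names are hypothetical:

```python
PAX_SUFFIX = ".pax.zip"


def is_pax(filename: str) -> bool:
    """True if the name ends in .pax.zip (case-sensitive match is an assumption)."""
    return filename.endswith(PAX_SUFFIX)


def split_pax_and_simple(filenames: list[str]) -> tuple[list[str], list[str]]:
    """Separate PAX packages from files that would become simple assets."""
    pax = [name for name in filenames if is_pax(name)]
    simple = [name for name in filenames if not is_pax(name)]
    return pax, simple
```

Note that a plain .zip file without the .pax part falls into the second list.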
The simplest PAXes are arrangements of files and directories within the PAX, according to a convention for directory naming, which specifies different representations within the asset. For example, a simple PAX for an asset with a preservation and access representation of the same information would be similar to:
cover_image.pax.zip
    Representation_Preservation/
        cover_image/
            cover_image.tiff
    Representation_Access/
        cover_image/
            cover_image.jp2
The names of the representation directories are important and must follow this pattern (including capitalisation). Inside each representation directory is a directory for each Content Object (CO) you want to create in the representation, in this case only one. You may also specify multiple representations of the same type by suffixing the directory name with a number (although in this case it's likely that you'd want to include an XIP document to specify a more descriptive label for them):
cover_image.pax.zip
    Representation_Preservation/
        cover_image/
            cover_image.tiff
    Representation_Access_1/
        cover_image/
            cover_image.jp2
    Representation_Access_2/
        cover_image/
            cover_image.jpg
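Because the directory names are case-sensitive, it can help to check them before zipping. The pattern below is inferred from the examples only: it assumes Preservation and Access are the only representation types and that the optional suffix is an underscore followed by digits.

```python
import re

# Pattern inferred from the examples; other representation types are not covered.
REPRESENTATION_DIR = re.compile(r"Representation_(Preservation|Access)(_\d+)?")


def is_representation_dir(name: str) -> bool:
    """Check a top-level PAX directory name against the assumed naming pattern."""
    return REPRESENTATION_DIR.fullmatch(name) is not None
```

A name such as representation_access would be rejected by this check because of its capitalisation.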
You can also use PAX to create a multi-part asset, for example a book where the preservation representation is a TIFF image for each page:
book.pax.zip
    Representation_Preservation/
        page_001/
            page_001.tiff
        page_002/
            page_002.tiff
        page_003/
            page_003.tiff
        etc
    Representation_Access/
        complete_book/
            complete_book.pdf
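A layout like this can be assembled with a standard zip library. The sketch below is one way to do it, assuming each Content Object directory is named after its file without the extension, as in the examples; the function name build_pax and its input shape are hypothetical.

```python
import io
import zipfile


def build_pax(representations: dict[str, dict[str, bytes]]) -> bytes:
    """Build PAX zip bytes from {representation directory: {file name: content}}."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as pax:
        for representation_dir, files in representations.items():
            for filename, content in files.items():
                # One Content Object directory per file, named after the file stem.
                content_object_dir = filename.rsplit(".", 1)[0]
                pax.writestr(f"{representation_dir}/{content_object_dir}/{filename}", content)
    return buffer.getvalue()
```

The returned bytes would then be written to a file whose name ends in .pax.zip, for example book.pax.zip.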
(You may also include generation directories, instead of just files, within a CO directory. These directories should be named Generation_1, Generation_2 etc, and each should contain the file for the relevant generation. Recall that generations represent the same content which has been preserved for file format obsolescence reasons - see the LDM - so newly digitised content is more likely to have multiple representations than multiple generations.)
If you want to include more information, for example labels for representations or generations, you can also include an XIP document at the root level of the PAX. This file should be named after the PAX and have a .xip extension (for example, for a PAX called book.pax.zip, the XIP document should be called book.xip), and must follow the v6 XIP schema (see [LDM]).
The XIP document should contain exactly one InformationObject element, and Representation, ContentObject, Generation and Bitstream elements relating to the full internal content of the PAX. Each Bitstream element should reference a file inside the PAX, by relative path from the root of the package file.
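Two of these rules are easy to check before ingest: the name of the XIP document and the presence of exactly one InformationObject. The sketch below assumes the entity elements are direct children of the XIP root element, and compares local names so that it does not depend on the exact namespace URI; both function names are hypothetical.

```python
import xml.etree.ElementTree as ET

PAX_SUFFIX = ".pax.zip"


def xip_name_for(pax_name: str) -> str:
    """Return the XIP document name expected at the root of the given PAX."""
    if not pax_name.endswith(PAX_SUFFIX):
        raise ValueError(f"not a PAX file name: {pax_name}")
    return pax_name[: -len(PAX_SUFFIX)] + ".xip"


def count_information_objects(xip_xml: str) -> int:
    """Count InformationObject entities among the root's direct children."""
    root = ET.fromstring(xip_xml)
    # Only direct children are counted (assumed layout), so elements of the
    # same name nested inside other entities as references are not included.
    return sum(1 for child in root if child.tag.rsplit("}", 1)[-1] == "InformationObject")
```

A package would pass this check when count_information_objects returns 1 for the document named by xip_name_for.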
Only the following items specified in the XIP document will be used:
- On InformationObject and ContentObject: Title, Description, SecurityDescriptor. The Parent element on ContentObject elements must point to the InformationObject. Content group information will be ignored.
- Descriptive metadata fragments will be included, but in most cases you should put this in the OPEX metadata file alongside the PAX instead.
- On Representation: Type, Name. The InformationObject and ContentObject references must refer to the IO and COs in the document.
- On Generation: EffectiveDate, Label, Active. The Original status of all generations will be set to true, and all characterisation information (format group, formats and properties) will be ignored. The ContentObject and Bitstream references must be correct within the document.
- Bitstreams, including fixities if provided
- Links, but only if the link is between two entities within the document. Links to other Information Objects or other entities outside the PAX will be ignored (these relationships should be specified outside the PAX).
- Identifiers, but to include identifiers on the asset you should use the OPEX metadata instead.
Any other information within the XIP will be ignored. This includes:
- StructuralObjects (folders): the asset will be placed according to its position in the OPEX directory structure and its transfer metadata, like any other asset in an OPEX
- Audit information (events and event actions): any event history relates to the entity's history in the previous system, not the one into which you are ingesting it
- Characterisation information: the asset will be characterised as normal as part of the ingest process
- Links to entities outside the PAX
- Restrictions: any restrictions apply within the context of the previous system, not the one into which you are ingesting the content