Open Preservation Exchange (OPEX)

How to use OPEX for metadata ingest and export

The Open Preservation Exchange (OPEX) format is a mechanism for importing and exporting metadata between two preservation systems, and is the principal mechanism used by Preservica for ingest and export. It is designed to be flexible and to remain useful even when the metadata it carries is incomplete.

Key Principles

  • OPEX facilitates the exchange of metadata attached to files and folders between a Producer and a Consumer. The Producer decides what to export, and the Consumer should handle the metadata however complete or incomplete it is.
  • It is designed to be independent of Preservica but is used by Preservica for both ingest (Preservica is the Consumer) and export (Preservica is the Producer).
  • OPEX metadata is held in an XML file to describe a single file or folder.
  • The OPEX metadata file is associated to the file or folder it describes using a naming convention.
  • OPEX metadata files may be present for all, some, or none of the files and folders in an ingest or export.
  • All sections in the OPEX metadata (Properties, History, Transfer etc) are optional.
  • Within each section each field is also optional as decided by the Producer.
  • For folders, the transfer section and the manifest it optionally contains describe only the next level down, not the entire structure below. This means part of the structure can be taken and used independently of its parents.
  • An OPEX import or export may contain content files as “normal” single files, for example a document, image or video. It may also contain “Preservation Asset Exchange” (PAX) files or folder structures to transfer assets that have a more complex internal structure, for example assets that have multiple content objects (such as a captioned video) or assets that contain migrated files. See Preservation Asset eXchange (PAX) for more information.

Examples of where OPEX is useful

The following are examples where OPEX can be used during information transfer:

Task | How OPEX helps
Exit from one Digital Preservation system to another | The export can include all the content from the original DP system, especially when using PAX for the original and migrated files, and ingest can be validated, ensuring nothing is lost.
A large scale “Extract-Transfer-Load” (ETL) process | The export metadata format can be converted to a format supported by the Consumer, e.g. Preservica, and attached to each item.
Checking a file has not been changed during transfer | The fixity can be re-calculated to make sure the file wasn’t altered during transfer.
Checking all the files were transferred | The folder manifest can be checked to ensure the expected files are present and none have been added.
Adding externally created metadata to an item before ingest | The external metadata can be added to the Descriptive Metadata in the OPEX file.
File or Folder name contains special characters that are not supported on the new or transfer system | The file or folder name can be changed, and the required Title and Original Filename, containing any UTF-8 characters, added to the OPEX file for the asset or folder.

File and Folder Structure

The opex file describing a folder called “foldername” is named “foldername.opex” and sits inside that folder, as one of its children.

The opex file for a content file called “filename.suffix” is “filename.suffix.opex” and sits in the same folder as the content file.

An example structure is shown below. All opex files are optional so the ingest or export structure can contain all, some or none of the opex files.
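The naming convention above can be sketched in a few lines of Python. This is only an illustration: the function name `opex_path_for` and the sample paths are hypothetical, and forward slashes are assumed as the path separator.

```python
from pathlib import PurePosixPath


def opex_path_for(item: str, is_folder: bool) -> str:
    """Return where the OPEX file for `item` is expected under the naming convention."""
    path = PurePosixPath(item)
    opex_name = path.name + ".opex"
    # A folder's OPEX file sits inside the folder; a file's OPEX file sits beside it.
    parent = path if is_folder else path.parent
    return str(parent / opex_name)


print(opex_path_for("Board/Minutes", is_folder=True))
# Board/Minutes/Minutes.opex
print(opex_path_for("Board/Minutes/Mins2023-12-final.docx", is_folder=False))
# Board/Minutes/Mins2023-12-final.docx.opex
```

Because every OPEX file is optional, a Consumer would typically test whether the computed path exists rather than require it.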

OPEX sections

Overview

OPEX is modular and very flexible and contains five optional sections, each with its own purpose.

Section | Description | Use on Preservica ingest / export
Properties | Basic generic properties of the asset or folder. | Map onto basic XIP fields, used in ingest and export.
Transfer | Data to assist the assessment of whether the item is complete and correct. | Check all items in a folder are present and whether file transfer can be trusted, used in ingest and export.
DescriptiveMetadata | Metadata in an external XML schema. There can be several of these sections. | Mapped onto the Preservica descriptive metadata schemas, used in ingest and export.
History | Contains an audit history of what has happened to the item in the producing system. | Optionally added to the export, not used during an ingest.
Relationships | Defines the links between objects. | Added to the export, not used during an ingest.

The order and presence of the sections are optional, and the full layout would look something like this:

<?xml version="1.0"?> 
<OPEXMetadata xmlns="http://www.openpreservationexchange.org/opex/v1.0"> 
  <Properties></Properties> 
  <Transfer></Transfer> 
  <DescriptiveMetadata></DescriptiveMetadata> 
  <History></History> 
  <Relationships></Relationships> 
</OPEXMetadata>

Properties

The properties section holds basic generic descriptors of the asset or folder, such as title and description. The fields are the same for folders and assets and are:

Field | Description | Use on Preservica ingest / export
Title | The title of the asset or folder | Used for the Title field. If missing on ingest, the file name (without suffix) or folder name is used.
Description | The description of the asset or folder | Used for the Description field. If missing on ingest, it is set to blank.
Identifiers | A list of identifiers. | Don’t use this for SourceID identifiers; those should be in Transfer (see below).
Identifier | An identifier with a type and a value | Maps onto the Identifier for the asset or folder.
SecurityDescriptor | A string describing the access control for the asset or folder | Used to map onto the Preservica security tag. If missing on ingest, the parent value is used. If the tag does not exist, the item is not ingested.

A fully formed example is shown here:

<Properties> 
    <Title>Board Minutes December 2023</Title> 
    <Description>Final approved minutes for December</Description> 
    <Identifiers> 
      <Identifier type="Statutory">Board/Mins/2023/Dec/Final</Identifier> 
    </Identifiers> 
    <SecurityDescriptor>confidential</SecurityDescriptor> 
  </Properties> 
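A Producer could generate a Properties section like the one above with Python's standard `xml.etree.ElementTree` module. The sketch below is an assumption about how one might do it, not part of any Preservica tooling; the function name `build_properties_opex` and its parameters are hypothetical. Fields left as `None` are simply omitted, since every field is optional.

```python
import xml.etree.ElementTree as ET

OPEX_NS = "http://www.openpreservationexchange.org/opex/v1.0"
# Serialize OPEX elements without a prefix (this registration is process-wide).
ET.register_namespace("", OPEX_NS)


def q(tag: str) -> str:
    """Qualify a tag name with the OPEX namespace."""
    return f"{{{OPEX_NS}}}{tag}"


def build_properties_opex(title=None, description=None,
                          identifiers=None, security_descriptor=None) -> str:
    """Build an OPEX document containing only a Properties section."""
    root = ET.Element(q("OPEXMetadata"))
    props = ET.SubElement(root, q("Properties"))
    if title is not None:
        ET.SubElement(props, q("Title")).text = title
    if description is not None:
        ET.SubElement(props, q("Description")).text = description
    if identifiers:
        ids = ET.SubElement(props, q("Identifiers"))
        for id_type, value in identifiers.items():
            ET.SubElement(ids, q("Identifier"), {"type": id_type}).text = value
    if security_descriptor is not None:
        ET.SubElement(props, q("SecurityDescriptor")).text = security_descriptor
    return ET.tostring(root, encoding="unicode")


print(build_properties_opex(
    title="Board Minutes December 2023",
    identifiers={"Statutory": "Board/Mins/2023/Dec/Final"},
    security_descriptor="confidential",
))
```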

Transfer

The transfer section is used to allow the ingesting system to validate that what the exporting system sent is fully present and has not been changed during the transfer. There are different fields for files and folders.

Field | Description | Use on Preservica ingest / export
SourceID | A unique identifier from the exporting system (files and folders) | On export this contains the Preservica unique identifier (UUID). On ingest this can optionally be used for de-duplication of assets or folders, either locally (in this folder) or globally (in the entire system). See below for details.
Fixities | A list of fixities for the file (files only) |
Fixity | The checksum of the file. The type attribute identifies the algorithm, either MD5, SHA-1, SHA-256 or SHA-512, and the value attribute contains the checksum. | On export this contains the fixity of the export file, which may be a zip of all generations and representations. On ingest this can be used to validate the transfer. If the fixities calculated during the ingest process are different, the asset is rejected.
Fixity (PAX) | An optional “path” attribute can be added to the fixity, allowing fixities to be defined for files within a Preservation Asset Exchange (PAX) file. This is discussed in greater depth below. | As with simple fixity, this is used to validate ingest of items inside a ZIP.
OriginalFilename | This is the name of the file in the originating system. This enables the transfer file to be renamed to avoid special character clashes during transfer while allowing the original name to be retained. You only need it if you don’t want the actual filename to be used. | On ingest the OriginalFilename is used to set the Bitstream Filename property for the file being ingested, ignoring the file name of the transfer. This does not work with PAX files. On export the OriginalFilename is included if the file name is different, for example to avoid operating system illegal characters.
Manifest | For a folder, this contains a list of the contents of the folder (folders only) | On ingest used to validate that everything is present. On export set to the contents of the folder.
Folders | A list of the sub-folders |
Folder | Each sub-folder of this folder |
Files | A list of the files in this folder. Includes content files mapped onto assets and metadata (opex) files. |
File | Each file in the folder. Has optional attributes of “size” for the number of bytes in the file and “type” to indicate whether this is a content file (an asset) or a metadata file. | The size is used on ingest to validate the file transfer is complete.

A fully formed example for a file with simple fixity is shown here:

<Transfer> 
  <SourceID>6168</SourceID> 
  <Fixities> 
   <Fixity type="MD5" value="338d18af09df7857e061692a59723e97"/> 
   <Fixity type="SHA-1" value="173884f47e2eb7991a6c9929c49e4e515158b087"/> 
  </Fixities> 
  <OriginalFilename>Interesting reaction.jpg</OriginalFilename> 
</Transfer>
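A Consumer could check simple fixities along the following lines. This is a minimal sketch under stated assumptions: the Transfer section sits inside a complete OPEXMetadata document with the namespace shown earlier, the function names are hypothetical, and the mapping from OPEX algorithm names to `hashlib` names is the sketch's own.

```python
import hashlib
import xml.etree.ElementTree as ET

OPEX_NS = "http://www.openpreservationexchange.org/opex/v1.0"
# OPEX "type" values mapped to the names hashlib understands.
HASHLIB_NAMES = {"MD5": "md5", "SHA-1": "sha1", "SHA-256": "sha256", "SHA-512": "sha512"}


def file_digest(path: str, algorithm: str) -> str:
    """Hash a file in chunks so large content files need not fit in memory."""
    if algorithm not in HASHLIB_NAMES:
        raise ValueError(f"unsupported fixity type: {algorithm}")
    digest = hashlib.new(HASHLIB_NAMES[algorithm])
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_mismatches(opex_xml: str, content_path: str) -> list:
    """Return the fixity types whose recorded value differs from the file on disk."""
    root = ET.fromstring(opex_xml)
    mismatches = []
    for fixity in root.iterfind(".//opex:Fixity", {"opex": OPEX_NS}):
        if fixity.get("path"):
            continue  # fixities with a path refer to files inside a PAX
        algorithm = fixity.get("type")
        expected = fixity.get("value", "").lower()
        if file_digest(content_path, algorithm) != expected:
            mismatches.append(algorithm)
    return mismatches
```

An empty list means every recorded fixity matched; anything else would be grounds to reject the asset.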

A fully formed example for a file with fixities inside a PAX is shown here:

<Transfer> 
  <SourceID>616464732-aq</SourceID> 
  <Fixities> 
    <Fixity path="Content/original/mydoc.docx" type="SHA-1"  
            value="173884f47e2eb7991a6c9929c49e4e515158b087"/> 
    <Fixity path="Content/access/mydoc.dot" type="SHA-1"  
            value="392f6ef4318ea5346e02eebdb663138d1cc8b915"/> 
  </Fixities> 
</Transfer>
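Fixities with a path can be checked in a similar way by reading members out of the archive. The sketch below assumes the PAX is a ZIP file whose entry names match the “path” attribute exactly; that matching rule, the function name and the tuple layout are assumptions made for illustration.

```python
import hashlib
import zipfile

# OPEX "type" values mapped to the names hashlib understands.
HASHLIB_NAMES = {"MD5": "md5", "SHA-1": "sha1", "SHA-256": "sha256", "SHA-512": "sha512"}


def pax_fixity_mismatches(pax_zip_path: str, fixities) -> list:
    """Check (path, type, value) triples against the members of a zipped PAX.

    Returns the paths that are missing from the archive or whose digest differs.
    """
    mismatches = []
    with zipfile.ZipFile(pax_zip_path) as pax:
        names = set(pax.namelist())
        for path, algorithm, expected in fixities:
            if path not in names:
                mismatches.append(path)
                continue
            digest = hashlib.new(HASHLIB_NAMES[algorithm])
            with pax.open(path) as member:
                for chunk in iter(lambda: member.read(1024 * 1024), b""):
                    digest.update(chunk)
            if digest.hexdigest() != expected.lower():
                mismatches.append(path)
    return mismatches
```

The triples would be taken from the path, type and value attributes of each Fixity element in the asset's OPEX file.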

A fully formed example for a folder containing two content files and their associated opex files and one child folder is shown here:

<Transfer> 
  <SourceID>0e8ce6d4-f8a5-48d9-9207-873dfe819863</SourceID> 
  <Manifest> 
    <Folders> 
      <Folder>Discussion Documents</Folder> 
    </Folders> 
    <Files> 
      <File size="12039" type="content">Mins2023-12-final.docx</File> 
      <File size="1342" type="metadata">Mins2023-12-final.docx.opex</File> 
      <File size="33245" type="content">Accounts2023-12.xlsx</File> 
      <File size="1432" type="metadata">Accounts2023-12.xlsx.opex</File> 
    </Files> 
  </Manifest> 
 </Transfer>
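A manifest check on the receiving side might look like the sketch below. It compares only the immediate children of the folder, in line with the principle that a manifest describes one level down. The function name and the wording of the problem messages are hypothetical, and the treatment of the folder's own OPEX file (ignored when the manifest does not list it) is an assumption.

```python
import os
import xml.etree.ElementTree as ET

OPEX_NS = "http://www.openpreservationexchange.org/opex/v1.0"
NS = {"opex": OPEX_NS}


def manifest_problems(folder: str, folder_opex_xml: str) -> list:
    """Compare a folder's immediate children with the Manifest in its OPEX file."""
    root = ET.fromstring(folder_opex_xml)
    manifest = root.find("opex:Transfer/opex:Manifest", NS)
    if manifest is None:
        return []  # the manifest is optional, so there is nothing to check

    expected_folders = {f.text for f in manifest.iterfind("opex:Folders/opex:Folder", NS)}
    expected_files = {f.text: f.get("size")
                      for f in manifest.iterfind("opex:Files/opex:File", NS)}

    actual_folders, actual_files = set(), {}
    with os.scandir(folder) as entries:
        for entry in entries:
            if entry.is_dir():
                actual_folders.add(entry.name)
            else:
                actual_files[entry.name] = entry.stat().st_size

    # Assumption: the folder's own OPEX file is not an error when the manifest omits it.
    own_opex = os.path.basename(os.path.normpath(folder)) + ".opex"
    if own_opex not in expected_files:
        actual_files.pop(own_opex, None)

    problems = [f"missing folder: {n}" for n in sorted(expected_folders - actual_folders)]
    problems += [f"unexpected folder: {n}" for n in sorted(actual_folders - expected_folders)]
    problems += [f"missing file: {n}" for n in sorted(expected_files.keys() - actual_files.keys())]
    problems += [f"unexpected file: {n}" for n in sorted(actual_files.keys() - expected_files.keys())]
    for name in sorted(expected_files.keys() & actual_files.keys()):
        size = expected_files[name]  # the size attribute is optional
        if size is not None and int(size) != actual_files[name]:
            problems.append(f"wrong size: {name}")
    return problems
```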

DescriptiveMetadata

This section allows third-party metadata in external schemas to be embedded within the OPEX file. There can be any number of these sections as required, each containing its own self-contained XML document. A fully formed Dublin Core example is shown here:

<DescriptiveMetadata> 
  <oai_dc:dc  
     xmlns:dc="http://purl.org/dc/elements/1.1/"  
     xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"  
     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"  
    xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ oai_dc.xsd"> 
    <dc:title>Board Minutes December 2023</dc:title> 
    <dc:creator>Brian Brain (Company Secretary)</dc:creator> 
    <dc:subject>Corporate governance</dc:subject> 
    <dc:description>Final board minutes of the Dec 2023 meeting</dc:description> 
    <dc:publisher>MyCorporation plc</dc:publisher> 
    <dc:type>Minutes</dc:type> 
    <dc:identifier>25549</dc:identifier> 
  </oai_dc:dc> 
</DescriptiveMetadata>

Another example shows some Tweet metadata:

<DescriptiveMetadata> 
  <tweet xmlns="http://www.preservica.com/tweets/v1"> 
    <id>1263058848107188225</id> 
    <full_text>Looking forward to the iPres conference next month</full_text> 
    <created_at>2020-05-20T10:47:50.000Z</created_at> 
    <screen_name_sender>dPreservation</screen_name_sender> 
    <in_reply_to> 
       <id_str>1263054922037243904</id_str> 
       <screen_name>Preservica</screen_name> 
    </in_reply_to> 
  </tweet> 
</DescriptiveMetadata>
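Because each DescriptiveMetadata section carries a document in its own schema, a Consumer may want to discover which schemas are present before deciding how to map them. The following is a small sketch of that idea using the root element's namespace as the schema key; the function name is hypothetical and the input is assumed to be a complete OPEXMetadata document.

```python
import xml.etree.ElementTree as ET

OPEX_NS = "http://www.openpreservationexchange.org/opex/v1.0"


def descriptive_namespaces(opex_xml: str) -> list:
    """List the namespace of the root element embedded in each DescriptiveMetadata section."""
    root = ET.fromstring(opex_xml)
    found = []
    for section in root.iterfind("opex:DescriptiveMetadata", {"opex": OPEX_NS}):
        for document in section:
            # ElementTree stores qualified tags as "{namespace}localname".
            tag = document.tag
            found.append(tag[1:].split("}", 1)[0] if tag.startswith("{") else "")
    return found
```

For a file containing both examples above, this would return the OAI Dublin Core namespace followed by the tweet namespace.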

History

This section allows the transfer of an audit history that the Producing system asserts is everything that happened to the object whilst it was in the system.

This is currently exported by Preservica but not ingested.

Field | Description | Use on Preservica ingest / export
Events | A list of events that happened to the object. |
Event | A field containing information for a single event. It has attributes of date (when the event happened) and user (who made the change). | Exports optionally contain all event information.
Type | A field within an Event identifying the category of event. | On export, properties include Ingest, Link, Characterise, Re-Characterise, Migrate, UpdateProperties etc.
Action | The specific action that occurred. |
Detail | The specific details of the event in a format that is entirely up to the originating system. | Exports contain all the information needed to make sense of the change in JSON form. This includes metadata changes.

An example is shown here but the <Detail> field can vary significantly depending on the Producer:

<History> 
  <Events> 
    <Event date="2020-06-10T12:44:15Z" user="auto-ingest-user"> 
      <Type>Ingest</Type> 
      <Action>command_create</Action> 
    </Event> 
    <Event date="2020-06-10T12:44:16Z" user="auto-ingest-user"> 
      <Type>Characterise</Type> 
      <Action>command_characterisation</Action> 
      <Detail>{"generations": 
              ["154ed5e9-b031-4754-bd12-841079d34ad0-0"], 
               "toolName":null} 
      </Detail> 
    </Event> 
    <Event date="2020-06-10T13:04:04Z" user="auto-ingest-user"> 
      <Type>Link</Type> 
      <Action>Link</Action> 
      <Detail>{"fromRef":"d240ae38-7bc8-466f-b730-f4ea38832d4e", 
               "toRef":"e2bb56f1-2f21-43df-a0ab-e248a3758982", 
               "linkType":"InReplyTo"} 
      </Detail> 
    </Event> 
    <Event date="2021-01-05T11:15:08Z" user="migration-user"> 
      <Type>Migrate</Type> 
      <Action>command_new_representation</Action> 
      <Detail>{"newType":"Access"}</Detail> 
    </Event> 
    <Event date="2024-01-23T11:08:25Z" user="bob.smith@corporation.com"> 
      <Type>Modified</Type> 
      <Action>UpdateProperties</Action> 
      <Detail>{"changes": 
             [{"propertyName":"description", 
               "oldValue":"Creativity Meeting", 
               "newValue":"Creativity Meeting March 2003"}]} 
      </Detail> 
    </Event> 
  </Events> 
</History>
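A system reading an exported History section might flatten it into plain records as sketched below. The function name and the dictionary keys are hypothetical. Since the Detail format is up to the Producer, the sketch tries JSON first and keeps the raw text when that fails.

```python
import json
import xml.etree.ElementTree as ET

OPEX_NS = "http://www.openpreservationexchange.org/opex/v1.0"
NS = {"opex": OPEX_NS}


def read_events(opex_xml: str) -> list:
    """Return one dict per Event in the History section of a full OPEX document."""
    root = ET.fromstring(opex_xml)
    events = []
    for event in root.iterfind("opex:History/opex:Events/opex:Event", NS):
        detail_text = event.findtext("opex:Detail", default="", namespaces=NS).strip()
        try:
            detail = json.loads(detail_text) if detail_text else None
        except json.JSONDecodeError:
            detail = detail_text  # not JSON: keep the Producer's text as-is
        events.append({
            "date": event.get("date"),
            "user": (event.get("user") or "").strip(),
            "type": event.findtext("opex:Type", namespaces=NS),
            "action": event.findtext("opex:Action", namespaces=NS),
            "detail": detail,
        })
    return events
```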

Relationships

This section specifies the relationship between two objects in the system, for example asset one is a reply to asset two or asset three is an older version of asset four.

This is currently exported by Preservica but not ingested.

Field | Description | Use on Preservica ingest / export
Relationship | The information on the relationship | On export this contains the relationships defined in the system
Type | The type of relationship (user defined) |
Object | The unique identifier of the object the relationship is with |

A fully formed example is shown here:

<Relationships> 
   <Relationship> 
       <Type>NewerVersionOf</Type> 
       <Object>e2bb56f1-2f21-43df-a0ab-e248a3758982</Object> 
   </Relationship> 
</Relationships>

Preservation Asset eXchange (PAX) Files

Digital preservation assets can be more complex than a single content file. The most obvious example is content that is created as a single file, for example a document, image or video. During its preservation period it may be migrated to new formats for different purposes. For example, a large AVI video file may be migrated to a compressed MP4 file for streaming, or an elderly WordPerfect file may be migrated to the latest Microsoft Word format for editing and a PDF file for distribution.

Other forms of information may need several files to represent them, for example a video with captions, a SharePoint Library record with structured column data and a content file, or a Tweet with a database record and images and a video. These individual files may also be migrated creating even more files that are contained inside the asset.

PAX is a format used for the transfer of these assets between information management systems allowing a Producer to tell the Consumer “this set of files are a single asset”. It can be combined with Open Preservation Exchange (OPEX) files to allow the metadata for an asset to be placed in an OPEX file which is linked by naming convention to an asset held in a PAX structure.

Although defined in a system-independent way, PAX is used by Preservica to export digital preservation assets, and users can optionally use it to import assets with a more complex structure.

See How to use PAX to ingest and export Digital Preservation Assets for more information.

Preservica Ingest Processes

Standard Ingest

OPEX files can be included within an ingest package sent to the standard ingest procedure, for example by uploading a ZIP file containing folders, content files and OPEX files.

The OPEX files are then used to attach metadata to the objects within the package.

If included in the OPEX, the fixities and manifests can be used to validate the transfer. Any errors, for example an extra file, a missing file or folder, or a file with the wrong size or fixity, are reported and the ingest is aborted.

OPEX incremental workflow

For larger uploads the folders, content files and OPEX files can be placed in a shared storage location such as an AWS S3 Bucket or a Microsoft Azure Container. The ingest workflow then breaks the upload down into manageable chunks and loads them incrementally and can be set to start ingest before the upload to the shared location is complete. The OPEX files can be used to attach metadata as with the standard workflow. The manifests attached to the folders can also be used to optimize the process.

There is more in the Get Started guide “Introduction to Ingest at Scale using OPEX & Incremental Ingest”.

Preparation and Upload Tool (PUT)

OPEX files can be uploaded into the PUT tool in the same way they can be uploaded directly into ingest.

There is more in the Get Started guide “OPEX Metadata for PUT”.

De-duplication on Ingest

De-duplication can be enabled during ingest by setting specific tenant properties. These then ensure that the same asset or folder is not ingested twice.

For assets, the SourceID in the Transfer section can be used to de-duplicate assets on ingest by setting the ingest.asset.duplicate.check system property. This can be set to:

  • global (the default): files won’t be ingested if there is an existing asset anywhere in the tenancy with this SourceID
  • local: files won’t be ingested if there is an existing asset in the same folder with this SourceID
  • none: no de-duplication, files always ingested
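The effect of the three settings can be illustrated with a small decision function. This is a sketch of the behaviour described in the list above, not Preservica's implementation: the function name, the in-memory list of existing assets, and the choice to always ingest an asset that has no SourceID are all assumptions.

```python
def should_ingest_asset(source_id, target_folder, existing_assets, mode="global"):
    """Decide whether an asset should be ingested under a given duplicate-check mode.

    existing_assets is an iterable of (source_id, folder) pairs already in the system.
    """
    if mode not in ("global", "local", "none"):
        raise ValueError(f"unknown duplicate check mode: {mode}")
    if mode == "none" or source_id is None:
        return True  # assumption: assets without a SourceID are never treated as duplicates
    for existing_id, existing_folder in existing_assets:
        if existing_id != source_id:
            continue
        if mode == "global" or existing_folder == target_folder:
            return False  # same SourceID found within the scope being checked
    return True


existing = [("6168", "Board/Minutes")]
print(should_ingest_asset("6168", "Board/Accounts", existing, mode="global"))  # False
print(should_ingest_asset("6168", "Board/Accounts", existing, mode="local"))   # True
print(should_ingest_asset("6168", "Board/Minutes", existing, mode="local"))    # False
```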

For folders, de-duplication is always applied using the folder name, so a new folder will never be created if one exists with the same path and name. In addition, SourceID can be used for de-duplication by setting the ingest.folder.duplicate.check property. This can be set to:

  • SourceID first (default): If a folder exists with a matching SourceID, ingest into that folder rather than into the location implied by the OPEX path.
  • TitleOnly: Ignore the SourceID and do de-duplication using the folder name only.

Preservica Export Processes

The Preservica export workflows allow users to download a ZIP file containing a folder hierarchy, Preservation Asset Exchange (PAX) compressed files containing the individual files that make up an asset, and OPEX files containing the metadata for the files and folders. The workflows contain options to decide what is put into the download:

  • Compression type: Choose from ZIP (the default), uncompressed ZIP, TAR or uncompressed TAR.
  • Include metadata: This can be set to
    - Metadata: All metadata except audit history added to OPEX files in the download
    - Metadata with audit history: Preservica audit history added to the OPEX files
    - No Metadata: No OPEX files created
  • Include content: This can be set to
    - Content: Include the PAX files containing the asset files
    - Content with migration: Migrate the asset files on exit
    - No content: Don’t create asset files
  • Include all generations. For files that have been migrated to a newer format this decides which of the various generations go into the PAX file. The options are:
    - All: include every generation in the asset file including the original and the latest
    - All active: include all generations that have not been replaced in some way.
    - Latest active: Include the latest generation only.
  • Include parent hierarchy: Check this box to ensure the package contains the full hierarchy of the assets and folders it contains rather than the local hierarchy.