Preservation Asset eXchange (PAX)

How to use PAX to ingest and export Digital Preservation assets

Preservation Asset eXchange (PAX) Files

Digital preservation assets can be more complex than standard content files. The most obvious example is content that is created as a single file, for example a document, image or video, and is then migrated to new formats for different purposes during its preservation period. For example, a large AVI video file may be migrated to a compressed MP4 file for streaming, or an old WordPerfect file may be migrated to the latest Microsoft Word format for editing and to a PDF file for distribution.

Other forms of information may need several files to represent them, for example a video with captions, a SharePoint Library record with structured column data and a content file, or a Tweet with a database record, images and a video. These individual files may also be migrated, creating even more files that are contained inside the asset.

PAX is a protocol for transferring these assets between information management systems, allowing a Producer to tell the Consumer “this set of files is a single asset”. It can be combined with Open Preservation Exchange (OPEX) files, so that the metadata for an asset is placed in an OPEX file that is linked by naming convention to the asset held in a PAX structure. For more on OPEX see How to use OPEX for metadata ingest and export.

Although defined in a system-independent way, PAX is used by Preservica to export digital preservation assets, and users can optionally use it to import assets with a more complex structure.

PAX variants

There are three variants of PAX that allow some flexibility when using PAX for transfer:

Compressed file with no XIP structure file: In this case the meaning of each component is defined by the names of the folders that contain it. The alphanumeric order of the folders or files may be important where the consuming system needs the content objects in a particular order. The compression is ZIP or TAR, and the file has a .pax.zip or .pax.tar suffix.

This is the most common form used for ingest into Preservica.

Folder structure with no XIP structure file: The PAX structure is contained in a local folder structure rather than a compressed file. There is always an accompanying OPEX file at the same level to indicate this is an asset folder rather than a regular folder.

This is most suitable where the asset contains very large files that are slow to compress.

Using an XIP structure file: In this form, the internal structure of the asset is described using an XML file in XIP format that tells the consumer what each of the files inside the structure means, for example which is the original, which is the latest preservation master and which is appropriate for access. This can be used with either the compressed file or the folder structure approach described above.

This is the form of PAX used by Preservica for export.

When is PAX appropriate?

PAX is not always required to transfer Digital Preservation assets. In Preservica it is recommended when you wish to:

  • Export assets where some or all of the files have been migrated from one format to another.
  • Export assets where the detailed internal information on the asset is required in an XML file.
  • Ingest images for which multiple representations have been migrated or generated externally before ingest, for example by a scanning process that has produced both detailed and compressed images.
  • Ingest multi-part assets such as tweets, scanned books, or captioned video.

Preservation Asset Internal Structure

The internal structure of an asset is what gives power to a Digital Preservation system. It allows a single piece of information to appear as a single item in a folder and to have a single set of descriptive metadata or access control on what is internally several files for different purposes.

Not all DP systems support the powerful and rich variety of asset types provided by Preservica.

At the top level within the asset is the “Representation”. Representations are different ‘views’ of the same logical content. Typically, there will be up to two Representations for different purposes:

  • Preservation: This is the digital master set of information. It contains the original file or files ingested and may contain migrated files that are still intended to be the best quality of information that is as close as possible to the original ingested content.
  • Access: This is a potentially lower quality file or files that are created for distribution. Examples include PDF, compressed or downscaled images, or streaming video or audio.

Each Representation can contain one or more “Content Objects”. These are the basic building blocks of the information that makes up the asset. In Simple Assets, which comprise the majority of cases (for example documents, spreadsheets, images, and video), there is only one Content Object per Representation.

In some cases, more than one content object is required to represent the asset information. These Multi-Part Assets include 3D objects, captioned video, SharePoint records (Lists and Libraries), Teams Posts, Tweets, scanned books with multiple pages, and emails with attachments. The combination of files can be identified as a “Representation Format” which allows Preservica to validate the package and to provide a renderer.

For each Content Object there may be several “Generations”. A new generation is created when one format is changed to another for the same purpose. An example might be converting Word for Windows 95 to Word for Windows 2013 for editing or converting a Lotus 123 Spreadsheet to Excel 2013. Each generation typically has one “bitstream”, usually held in a file.

Identifying a Representation Format for Multi-Part Assets

For assets that are made up of multiple content files it is useful to identify the combination of files as a single format. This can be used to validate the content: for a 3D object, for example, a system can check that all the files you would expect, and all the files that are referenced internally, are present. It also allows a system to offer a render tool so that users can display and interact with the entire asset rather than each component, for example displaying multiple images with a book viewer.

Representation formats are identified by the order and type of content files within them, as listed below in the Annex. For example, a Tweet is defined as having an initial JSON file that conforms to the Twitter API protocol, and a number of images, videos and thumbnails.

When using PAX to transfer assets you wish to identify with a specific representation type, the order of content objects may be important. When using XIP the order of objects in the XML is used, and when using files and folders (compressed or uncompressed) the alpha-numeric ordering is used. This is further discussed below.

Example Asset Structures

Below are some examples of internal asset structure used to represent specific situations. PAX transfers are intended to allow the Consumer to understand and re-create these structures.

1. A simple, unmigrated asset with a single content file, for example a JPEG image file, would be structured as follows:

2. An asset that has been migrated to create a new preservation (digital master) generation, for example a WordPerfect document, would look like this:

3. An asset where we have produced just a new access copy for distribution, for example a large TIFF, would look like this:

4. An asset where we have done both #2 and #3 would look like this:

5. An asset with multiple content objects, for example a scanned document with three pages, would look like this:

6. A multi-part asset that has also been migrated to create an access copy for distribution would look like this:

7. An asset with multiple content objects with different types such as a tweet with a video embedded would look like this:
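
The hierarchy these examples rely on (asset, Representation, Content Object, Generation, bitstream) can be modelled in a few lines. The following Python sketch is illustrative only: the class names and file names are hypothetical and are not part of PAX or of any Preservica API. It builds one plausible reading of example 4 above, assuming the access copy is held as its own Content Object inside the Access Representation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Generation:
    bitstreams: List[str]  # usually a single file per generation


@dataclass
class ContentObject:
    title: str
    generations: List[Generation] = field(default_factory=list)


@dataclass
class Representation:
    type: str  # "Preservation" or "Access"
    content_objects: List[ContentObject] = field(default_factory=list)


@dataclass
class Asset:
    title: str
    representations: List[Representation] = field(default_factory=list)


def count_bitstreams(asset: Asset) -> int:
    """Count every file held anywhere inside the asset."""
    return sum(
        len(generation.bitstreams)
        for representation in asset.representations
        for content_object in representation.content_objects
        for generation in content_object.generations
    )


# Example 4: a migrated preservation master plus a separate access copy.
report = Asset(
    title="report",
    representations=[
        Representation("Preservation", [
            ContentObject("report", [
                Generation(["report.wpd"]),   # generation 1: original file
                Generation(["report.docx"]),  # generation 2: migrated master
            ]),
        ]),
        Representation("Access", [
            ContentObject("report", [Generation(["report.pdf"])]),
        ]),
    ],
)

print(count_bitstreams(report))
```

The user sees one asset with one title, while the model holds three files across two Representations.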

Compressed file with no XIP structure file

The format most commonly used for PAX ingest is a compressed ZIP file that contains a folder hierarchy that mimics the internal structure of an asset. The file name is asset.pax.zip. The asset title will be the file name by default, but it can also be defined in an associated OPEX file: for example, alongside 1306107.pax.zip, place a 1306107.pax.zip.opex file that contains a <Title> field.

The folder names at the top level are either Representation_Preservation or Representation_Access, which map onto the appropriate internal representation. They can optionally have a number at the end to allow for multiple representations of the same type, for example Representation_Preservation_1 or Representation_Access_2.

At its simplest the next level down can be a set of files that map onto content files, for example the following would create a content object for each file. Note that in this example the representation will only be identified as a Tweet if the Tweet file comes alpha-numerically first, so care must be taken with the file naming to make this happen.

This would also be used for ingesting both a Preservation and Access representation produced pre-ingest as part of a digitization process.
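
A package of this kind can be assembled with any ZIP tool. The following is a minimal Python sketch using only the standard library; the function name `build_pax` and the scan file names are hypothetical, and only the Representation_Preservation and Representation_Access folder names come from the description above.

```python
import tempfile
import zipfile
from pathlib import Path


def build_pax(pax_path, preservation_files, access_files=()):
    """Write each file directly under the folder for its representation."""
    with zipfile.ZipFile(pax_path, "w", zipfile.ZIP_DEFLATED) as pax:
        for src in preservation_files:
            pax.write(src, f"Representation_Preservation/{Path(src).name}")
        for src in access_files:
            pax.write(src, f"Representation_Access/{Path(src).name}")
    return pax_path


# Demonstration with placeholder files standing in for real scanner output.
workdir = Path(tempfile.mkdtemp())
master = workdir / "scan0001.tif"
access = workdir / "scan0001.jpg"
master.write_bytes(b"placeholder master image")
access.write_bytes(b"placeholder access image")

pax_file = build_pax(workdir / "scan0001.pax.zip", [master], [access])

with zipfile.ZipFile(pax_file) as pax:
    names = sorted(pax.namelist())
print(names)
```

Each file placed directly in a representation folder becomes one content object, so this sketch yields one content object in each of the two representations.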

A scanned book could be done in the same way. In this case the alpha-numeric order of the pages is used when rendering the book in a renderer, so care needs to be taken when naming the files.
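
A common naming pitfall is unpadded page numbers. The sketch below assumes the consuming system orders names by plain string comparison, which is an assumption made here for illustration; the exact collation is not specified in this guide. The helper `zero_pad` is hypothetical.

```python
import re
from pathlib import PurePosixPath


def zero_pad(name: str, width: int = 4) -> str:
    """Left-pad every run of digits in the file stem so that string
    order matches numeric order. The extension is left untouched."""
    path = PurePosixPath(name)
    stem = re.sub(r"\d+", lambda m: m.group().zfill(width), path.stem)
    return stem + path.suffix


pages = ["page1.tif", "page2.tif", "page10.tif"]

unpadded_order = sorted(pages)
padded_order = sorted(zero_pad(p) for p in pages)

print(unpadded_order)  # ['page1.tif', 'page10.tif', 'page2.tif']
print(padded_order)    # ['page0001.tif', 'page0002.tif', 'page0010.tif']
```

With unpadded names page 10 sorts before page 2; padding the numbers to a fixed width keeps the pages in reading order.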

If you wish to transfer multiple generations, each content file can instead become a folder that contains sub-folders named Generation_#, where # is the generation number. Each Generation_# folder contains one file.
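
As a rough illustration of that layout, the following Python sketch computes the path inside the PAX for each generation of one content object. The function name and file names are hypothetical; the Representation and Generation_# folder names follow the convention described above.

```python
def generation_paths(representation, content_name, files):
    """Return one path per generation, numbered from 1 in the order given."""
    return [
        f"{representation}/{content_name}/Generation_{number}/{file_name}"
        for number, file_name in enumerate(files, start=1)
    ]


paths = generation_paths(
    "Representation_Preservation",
    "report",
    ["report.wpd", "report.docx"],  # original first, then the migrated file
)
for path in paths:
    print(path)
```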

Folder structure with no XIP structure file

Where it is too slow to compress a PAX structure it is possible to keep it as a folder with the same structure as for the compressed asset detailed above. There must always be an accompanying OPEX file for the asset folder to indicate this is not a regular folder, but apart from that all the same rules apply as if it was compressed.

Using an XIP structure file

The most complete way of transferring an asset such that its internal structure is specifically defined rather than implied from folder names and file ordering is to include an XIP file that explicitly gives the details of objects and links to content. The XIP file also optionally includes details on the properties of the components, for example the file format and file properties created by the preservation system. The following applies whether the PAX is compressed or presented as a folder structure as described in the previous two sections.

XIP is a complex and strict protocol that is defined in the Annex below. If it is used, the folder structure is completely optional, so the following would be perfectly acceptable.

When exporting assets within a PAX file, Preservica seeks to use the best of both of these approaches. It includes the XIP file but also builds a folder hierarchy to assist in the human interpretation of the contents. A typical example would be:

Annex: Supported Representation Types

The following representations are recognized by Preservica and used to preserve and present the information. Some are a sub-set of others in which case the more specific format wins. New Representation Types are added regularly.

Name | First Content Object | Other Content Objects
Tweet | A Tweet API record in JSON format | Zero to 4 images (JPEG, PNG) and zero to one MP4 videos
Email (standard) | Anything in the “internet mail” format family (EML, MSG) | Zero to many attachments of any format
Email (generic) | Anything in the wider Email format family | Zero to many attachments of any format
Web archive | 2 or more WARC files |
Renderable multi-image | 2 or more renderable images |
Any multi-image | 2 or more images |
Captioned video | MP4 or WebM media file | One or more caption files in VTT or SRT format. Zero or more MP3 audio files.
OBJ 3D object | Wavefront OBJ file | Zero or more Wavefront OBJ files. Zero or more Wavefront Materials files. Zero or more Image files.
glTF V1.0 3D model | glTF (Text) v1.0 | Zero or more glTF (Text) v1.0 files. Zero or more Binary files. Zero or more Image files.
glTF V2.0 3D model | glTF (Text) v2.0 | Zero or more glTF (Text) v2.0 files. Zero or more Binary files. Zero or more Image files.
SharePoint List Item | SharePoint List structured data in JSON format | Zero to many attachments of any format
SharePoint Library Item | SharePoint Library structured data in JSON format | Zero to many attachments of any format

Annex: XIP Structure

A full description of XIP can be found in the Logical Data Model in the documentation. The following is a quick guide only, and explains how XIP documents are interpreted as part of a PAX when ingested into Preservica.

XIP Sections

The XIP file has five types of section (see the example below), each describing a different aspect of the asset structure. The sections are:

<InformationObject> Information about the asset itself

<Representation> Information about each representation

<ContentObject> Information about each content object in a representation

<Generation> Information about each generation of a content object

<Bitstream> Information about each file that makes up a generation

The fields within these sections are defined below.

InformationObject

An XIP inside a PAX must have exactly one InformationObject element, as a PAX represents a single asset.

Field | Rules | Description
Ref | Mandatory | Used to link to Representations and Content Objects but replaced when ingested into Preservica
Title | Mandatory | Will be used instead of the PAX file name but is replaced by Title in an OPEX file if present
Description | Mandatory | Is replaced by Description in an OPEX file if present
SecurityTag | Optional | Must match a tag already in Preservica. Is replaced by SecurityTag in an OPEX file if present
CustomType | Ignored |
Parent | Ignored |

Representation

Field | Rules | Description
InformationObject | Mandatory | Must match the Ref in the Information Object section
Name | Optional |
Type | Mandatory | Usually Preservation or Access
ContentObjects | Mandatory | List of child Content Objects

ContentObject

Field | Rules | Description
Ref | Mandatory | UUID of the Content Object, must match one of the content objects in the ContentObjects list in one Representation
Title | Mandatory | Usually the
SecurityTag | Optional | Must match a tag already in Preservica
CustomType | Optional |
Parent | Mandatory | Must match the Ref in the Information Object section

Generation

Field | Rules | Description
ContentObject | Mandatory | UUID of the Content Object for this generation
Label | Optional |
FormatGroup | Ignored |
EffectiveDate | Mandatory | The date this generation becomes effective, important for ordering generations
Bitstreams | Mandatory | A list of the full pathname of each bitstream, matching the PhysicalLocation and Filename of a Bitstream
Formats | Ignored |
Properties | Ignored |

Bitstream

Field | Rules | Description
Filename | Mandatory | Filename of the file inside the PhysicalLocation folder inside the PAX file
FileSize | Mandatory | Size of the file in bytes
PhysicalLocation | Mandatory | Folder name inside the PAX containing the Filename
Fixities | Mandatory | A list of fixities and their algorithms
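
The FileSize and fixity values for a Bitstream can be computed from the file itself. The following Python sketch uses only the standard library; the function name `bitstream_fields` and the dictionary layout are hypothetical, and SHA1 is used only because it is the algorithm that appears in the example at the end of this annex.

```python
import hashlib
import tempfile
from pathlib import Path


def bitstream_fields(path, algorithm="sha1"):
    """Return the file name, size in bytes and fixity value for one file."""
    path = Path(path)
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        # Read in 1 MiB chunks so that large files are not loaded into memory.
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return {
        "Filename": path.name,
        "FileSize": path.stat().st_size,
        "FixityAlgorithmRef": algorithm.upper(),
        "FixityValue": digest.hexdigest(),
    }


# Demonstration on a three-byte placeholder file.
sample = Path(tempfile.mkdtemp()) / "sample.txt"
sample.write_bytes(b"abc")

fields = bitstream_fields(sample)
print(fields)
```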

Linking Between Objects

Example

<?xml version="1.0" encoding="UTF-8" standalone="yes"?> 
<XIP xmlns="http://preservica.com/XIP/v6.0"> 
	<InformationObject> 
		<Ref>d77255da-9c3c-478e-a330-31b1d840ce1a</Ref> 
		<Title>fmt-40 (Microsoft Word Document 97-2003)[1]</Title> 
		<Description/> 
		<SecurityTag>open</SecurityTag> 
		<Parent>a9d84619-488e-4773-84d2-647a21e44a4f</Parent> 
	</InformationObject> 
	<Representation> 
		<InformationObject>d77255da-9c3c-478e-a330-31b1d840ce1a</InformationObject> 
		<Name>fmt-40 (Microsoft Word Document 97-2003)[1]</Name> 
		<Type>Preservation</Type> 
		<ContentObjects> 
		<ContentObject>d32d889d-1230-4d11-a8ab-b9052f3a9ce6</ContentObject> 
		</ContentObjects> 
	</Representation> 
	<ContentObject> 
		<Ref>d32d889d-1230-4d11-a8ab-b9052f3a9ce6</Ref> 
		<Title>fmt-40 (Microsoft Word Document 97-2003)[1]</Title> 
		<Description/> 
		<SecurityTag>open</SecurityTag> 
		<CustomType>document</CustomType> 
		<Parent>d77255da-9c3c-478e-a330-31b1d840ce1a</Parent> 
	</ContentObject> 
	<Generation original="true" active="true"> 
		<ContentObject>d32d889d-1230-4d11-a8ab-b9052f3a9ce6</ContentObject> 
		<Label>fmt-40 (Microsoft Word Document 97-2003)[1]</Label> 
		<FormatGroup>microsoft-word</FormatGroup> 
		<EffectiveDate>2020-07-27T15:27:03Z</EffectiveDate> 
		<Bitstreams> 
			<Bitstream>Representation_Preservation_1/fmt-40_Microsoft_Word_Document_97-20031/Generation_1/fmt-40_Microsoft_Word_Document_97-20031.doc</Bitstream> 
		</Bitstreams> 
		<Formats> 
			<Format valid="false"> 
				<PUID>fmt/40</PUID> 
				<Priority>1</Priority> 
				<IdentificationMethod>Container</IdentificationMethod> 
				<FormatName>Microsoft Word Document</FormatName> 
				<FormatVersion>97-2003</FormatVersion> 
				<Warnings/> 
			</Format> 
		</Formats> 
		<Properties> 
			<Property> 
				<PUID>prp/26</PUID> 
				<PropertyName>Character Count</PropertyName>
				<Value>70042</Value> 
			</Property> 
		</Properties> 
	</Generation> 
	<Bitstream> 
		<Filename>fmt-40_Microsoft_Word_Document_97-20031.doc</Filename> 
		<FileSize>1560064</FileSize> 
		<PhysicalLocation>Representation_Preservation_1/fmt-40_Microsoft_Word_Document_97-20031/Generation_1</PhysicalLocation> 
		<Fixities> 
			<Fixity> 
				<FixityAlgorithmRef>SHA1</FixityAlgorithmRef> 
				<FixityValue>e11b8fb95649467496356fd635ca70782ffc28c8</FixityValue> 
			</Fixity> 
		</Fixities> 
	</Bitstream> 
</XIP> 
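
The linking rules in the field tables above can be checked mechanically before a package is sent. The following Python sketch uses the standard library XML parser and the namespace from the example above. The function name `check_links` and the short refs in `SAMPLE` (io-1, co-1) are hypothetical placeholders for real UUIDs, and the checks cover only the links described in this annex, not full XIP schema validation.

```python
import xml.etree.ElementTree as ET

NS = {"xip": "http://preservica.com/XIP/v6.0"}


def text(element, path):
    """Return the stripped text of a child element, or '' if it is absent."""
    return (element.findtext(path, default="", namespaces=NS) or "").strip()


def check_links(xip_text):
    """Return a list of human-readable problems; an empty list means OK."""
    root = ET.fromstring(xip_text)
    problems = []
    io_refs = {text(e, "xip:Ref") for e in root.findall("xip:InformationObject", NS)}
    co_refs = {text(e, "xip:Ref") for e in root.findall("xip:ContentObject", NS)}
    files = {
        text(b, "xip:PhysicalLocation") + "/" + text(b, "xip:Filename")
        for b in root.findall("xip:Bitstream", NS)
    }
    if len(io_refs) != 1:
        problems.append("expected exactly one InformationObject")
    for rep in root.findall("xip:Representation", NS):
        if text(rep, "xip:InformationObject") not in io_refs:
            problems.append("Representation points at an unknown InformationObject")
        for co in rep.findall("xip:ContentObjects/xip:ContentObject", NS):
            if (co.text or "").strip() not in co_refs:
                problems.append("Representation lists an unknown ContentObject")
    for co in root.findall("xip:ContentObject", NS):
        if text(co, "xip:Parent") not in io_refs:
            problems.append("ContentObject Parent does not match the InformationObject")
    for gen in root.findall("xip:Generation", NS):
        if text(gen, "xip:ContentObject") not in co_refs:
            problems.append("Generation points at an unknown ContentObject")
        for b in gen.findall("xip:Bitstreams/xip:Bitstream", NS):
            if (b.text or "").strip() not in files:
                problems.append("Generation lists a Bitstream with no Bitstream section")
    return problems


SAMPLE = """<XIP xmlns="http://preservica.com/XIP/v6.0">
  <InformationObject><Ref>io-1</Ref><Title>Report</Title><Description/></InformationObject>
  <Representation>
    <InformationObject>io-1</InformationObject>
    <Type>Preservation</Type>
    <ContentObjects><ContentObject>co-1</ContentObject></ContentObjects>
  </Representation>
  <ContentObject><Ref>co-1</Ref><Title>Report</Title><Parent>io-1</Parent></ContentObject>
  <Generation original="true" active="true">
    <ContentObject>co-1</ContentObject>
    <EffectiveDate>2020-07-27T15:27:03Z</EffectiveDate>
    <Bitstreams>
      <Bitstream>Representation_Preservation/report/Generation_1/report.doc</Bitstream>
    </Bitstreams>
  </Generation>
  <Bitstream>
    <Filename>report.doc</Filename>
    <FileSize>3</FileSize>
    <PhysicalLocation>Representation_Preservation/report/Generation_1</PhysicalLocation>
  </Bitstream>
</XIP>"""

ok_result = check_links(SAMPLE)
broken_result = check_links(SAMPLE.replace("<Parent>io-1</Parent>", "<Parent>io-2</Parent>"))
print(ok_result)
print(broken_result)
```

The first call reports no problems; the second, where the Content Object's Parent no longer matches the Information Object Ref, reports one.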
