Auto Re-Characterise Blog
Jack O'Sullivan
October 28th, 2024
The last few Preservica releases have come with announcements of Automated Digital Preservation (ADP) features starting to appear in our New Generation Interface.
These are in support of the automated application and re-application of migration policy, which is an exciting development for us here at Preservica, and we hope for all our customers. However, it also seemed like a timely moment to remind and update you all on the other part of the ADP rollout, Auto Re-Characterization, which has been happening in the background on our cloud systems for a while now.
What is Auto-ReCharacterization?
We’ve talked about this feature a lot over the last few years, at User Group meetings, Special Interest Group Webinars, in white papers, and even an iPres paper, so hopefully you’re familiar with what we mean by this. But as a quick recap, the information we know about file formats, and the tools available to extract metadata from different file formats, are subject to change and evolution. Ultimately, if you are making preservation or risk decisions on the basis of that information, you want to ensure that you have the best, most up-to-date information possible.
Re-characterizing content that may have been subject to changes in this information is an important part of Digital Preservation, but manually selecting the content and executing the processes is time-consuming and error-prone. Preservica can use “Recommended Processes” published to our Technical Registry to automate this process, selecting relevant content and ensuring that it gets recharacterized.
What is a Recommended Process?
Our Digital Preservation team make recommendations for processes that they think should happen automatically on Preservica systems, specifically, recharacterization processes. These are published to the Technical Registry by creating JSON encoded statements of what they think should happen.
This JSON typically contains a list of file formats, and some event ranges, which together specify that content identified in one of the formats during the specified range should be recharacterized. They may contain further restrictions or criteria for inclusion, for example including content that couldn’t be identified in that range, or including content that was identified solely on the basis of the file extension.
For example, when Canon released version 3 of their camera RAW format, they extended the MPEG “MP4” standard. This format was released some time before a signature for it was added to PRONOM, so by the time the signature arrived, there was a chance that you would have Canon Raw 3 files in your repository that had been identified as MP4. A Recommended Process for this has in fact been created, which looks like:
{
  "id": {
    "guid": "1cc87ffe-00cb-5d3f-87b4-d0f3f745424a",
    "name": "v108-recommended-process-fmt-199-update",
    "namespace": "http://par.preservica.com"
  },
  "description": "Recommended Process for re-characterising fmt/199 - MPEG-4 Media File",
  "processType": "characterise",
  "priority": "low",
  "applicableEventRange": {
    "to": "6.5.0",
    "format": "version",
    "eventType": "characterisation"
  },
  "applicableFormats": [
    {
      "guid": "1d6dc249-b131-5492-bab2-60d98fa73e02",
      "name": "fmt/199",
      "namespace": "http://www.nationalarchives.gov.uk"
    }
  ],
  "originatingEntities": [
    {
      "guid": "f30bcf91-7bdb-5ac5-a21c-2f82363e377a",
      "name": "fmt/1595",
      "namespace": "http://www.nationalarchives.gov.uk"
    }
  ],
  "additionalOptions": [],
  "notes": "fmt/199 MP4 last characterised before PRONOM v101/Preservica 6.5.0 may be an alternative ID outcome, such as fmt/1595 Canon Raw 3"
}
From this, we can see that this applies to content identified as MP4 (PRONOM PUID fmt/199), that was last characterized by Preservica prior to version 6.5.0. The notes for this indicate that the addition of Canon Raw 3 (PRONOM PUID fmt/1595) to PRONOM was the trigger for this to be added.
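To make that scoping rule concrete, here is a minimal Python sketch of how a Recommended Process like the one above could be matched against a file. It is an illustration only, not Preservica's implementation: the function names `parse_version` and `in_scope` are hypothetical, the "to" version is assumed to be an exclusive upper bound (matching "prior to version 6.5.0"), and versions are assumed to be dot-separated integers.

```python
import json

# Trimmed copy of the fmt/199 Recommended Process, keeping only the fields
# this sketch reads. The field names come from the published JSON.
RECOMMENDED_PROCESS = json.loads("""
{
  "processType": "characterise",
  "applicableEventRange": {
    "to": "6.5.0",
    "format": "version",
    "eventType": "characterisation"
  },
  "applicableFormats": [
    {"name": "fmt/199", "namespace": "http://www.nationalarchives.gov.uk"}
  ]
}
""")


def parse_version(text):
    """Turn '6.5.0' into (6, 5, 0) so versions compare numerically.

    Assumption: versions are dot-separated integers.
    """
    return tuple(int(part) for part in text.split("."))


def in_scope(process, puid, last_characterised_version):
    """Hypothetical helper: should this file be recharacterized?

    A file is in scope when its current PUID is listed in
    "applicableFormats" and it was last characterised by a version
    earlier than the range's "to" value (assumed exclusive).
    """
    applicable_puids = {fmt["name"] for fmt in process["applicableFormats"]}
    if puid not in applicable_puids:
        return False

    event_range = process["applicableEventRange"]
    if event_range.get("format") != "version":
        # Only version-based ranges appear in the examples here.
        raise ValueError("this sketch only handles version-based ranges")

    upper_bound = parse_version(event_range["to"])
    return parse_version(last_characterised_version) < upper_bound


if __name__ == "__main__":
    print(in_scope(RECOMMENDED_PROCESS, "fmt/199", "6.4.1"))
    print(in_scope(RECOMMENDED_PROCESS, "fmt/199", "6.5.0"))
```

Comparing versions as integer tuples rather than as strings matters here: as strings, "6.10.0" would sort before "6.5.0" and be wrongly selected.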
In this further example, a new identification signature was added to PRONOM in the v119 update, which applies to Preservica 7.4.0, for x-fmt/40, the AutoCAD dbConnect Template Set file format. It’s a rare file format, with an extension of .dbt.
Any file instances that identified as x-fmt/40 prior to Preservica 7.4.0 would have done so by extension only, rather than the more accurate signature-based identification method. By recharacterizing these file instances, they will now either get a firm identification outcome of x-fmt/40 or they may receive an alternative identification outcome indicating that the original extension-based identification was incorrect.
{
  "id": {
    "guid": "f888725b-c3da-545b-879f-42721741c4ff",
    "name": "recharacterize-x-fmt-40-v119",
    "namespace": "http://par.preservica.com"
  },
  "description": "Recommended Process for re-characterizing x-fmt/40 - AutoCAD dbConnect Template Set",
  "processType": "characterise",
  "priority": "low",
  "applicableEventRange": {
    "to": "7.4.0",
    "format": "version",
    "eventType": "characterisation"
  },
  "applicableFormats": [
    {
      "guid": "e1a5d848-40d3-57bd-bf42-5fec9b48c8f2",
      "name": "x-fmt/40",
      "namespace": "http://www.nationalarchives.gov.uk"
    }
  ],
  "originatingEntities": [
    {
      "guid": "e1a5d848-40d3-57bd-bf42-5fec9b48c8f2",
      "name": "x-fmt/40",
      "namespace": "http://www.nationalarchives.gov.uk"
    }
  ],
  "additionalOptions": [],
  "notes": "x-fmt/40 had new identification signature added in v119"
}
Each time these are published, Preservica’s Automated Digital Preservation feature picks them up and starts the processes described.
How do we make these recommendations?
Our resident File Format Expert, David Clipsham, performs a comprehensive Technology Watch, analysing changes to PRONOM with each release and piecing together what impact those changes may have on how content gets characterized. This work is documented in our Automated Preservation Recommendations Wiki, which is regularly updated with new recommendations.
In addition, for each PRONOM release, we will create the formal, JSON based Recommended Processes required by Preservica’s Technical Registry. We currently have a list of over 650 recommendations!
Most data that ends up in the Preservica Technical Registry is bundled into the application so that it gets added during upgrades. The Recommended Processes, however, are added in an essentially ad-hoc manner while the application is running. This means that they can be published as and when required.
When will we publish these recommendations?
For Preservica’s cloud systems, we actually started our rollout in December 2022, and have been periodically publishing batches of recommendations ever since. In the process, we’ve made our way through over 250 of our list of recommendations, and we believe we have targeted something in the region of a quarter of a million files for recharacterization in that time!
This ADP process has become very much a part of Business As Usual for Preservica.