Getting Started

Using Python with the Preservica Entity APIs (Part 2)

James Carr

May 27th, 2020

In my previous article on using the Preservica Entity API with Python we looked at creating the authentication token used by all the web service calls and then showed how we could use the token to request basic information about the intellectual assets held in the Preservica repository.

We created a simple Python module with two functions: one to create an authentication token and one to return some basic information about an asset, using its primary reference as an identifier. One downside of this approach is that we need to make sure the token has not expired (tokens stop working after 15 minutes), and we need to pass the Preservica server name and the token to every function we call.

A more convenient way of using the REST API is to encapsulate it in a Python class. This allows us to hide the overhead of creating and managing the authentication tokens. We can then build out the functionality of the class to mirror the capabilities of the REST API.

We are going to create a Python class which will store information such as the server name, tenant and user credentials and will be responsible for creating new authentication tokens as needed after they expire.

The following is a sample Python class which we will use throughout the rest of these tutorials. The end result will be a reusable Python library which makes working with the Preservica API straightforward.

If we create a new Python file called entityAPI.py and add the following code to it, we have created a new Python module and class. The token creation function is the same code as we created in part 1, but this time we have used a naming convention (leading and trailing underscores) to show that users of the class do not need to call the function themselves. New tokens will be created on demand.

import requests


class EntityAPI:

    def __init__(self, username, password, tenant, server):
        self.username = username
        self.password = password
        self.tenant = tenant
        self.server = server
        # Request the first token immediately so the object is ready to use.
        self.token = self.__token__()

    def __token__(self):
        """Internal helper: log in and return a new access token."""
        response = requests.post(f'https://{self.server}/api/accesstoken/login?username={self.username}&password={self.password}&tenant={self.tenant}')
        if response.status_code == 200:
            return response.json()['token']
        else:
            print(f"new_token failed with error code: {response.status_code}")
            print(response.request.url)
            raise SystemExit
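
Since tokens stop working after 15 minutes, one complementary option is to record when a token was issued and replace it shortly before that deadline. The following is only a sketch of that idea and not part of the library: TokenCache, fetch_token and the 14 minute margin are assumed names and values, and the clock is injectable so the behaviour can be checked without a Preservica server.

```python
import time


class TokenCache:
    """Caches a token and replaces it once it is older than max_age seconds."""

    def __init__(self, fetch_token, max_age=14 * 60, clock=time.monotonic):
        self._fetch_token = fetch_token   # callable returning a new token string
        self._max_age = max_age           # assumed margin, just under 15 minutes
        self._clock = clock               # injectable so tests can control time
        self._token = None
        self._issued_at = None

    def get(self):
        now = self._clock()
        expired = self._issued_at is None or now - self._issued_at >= self._max_age
        if expired:
            self._token = self._fetch_token()
            self._issued_at = now
        return self._token


# Demonstration with a fake clock and a counter instead of a real login call.
fake_now = [0.0]
issued = []


def fake_fetch():
    issued.append(f"token-{len(issued) + 1}")
    return issued[-1]


cache = TokenCache(fake_fetch, clock=lambda: fake_now[0])
first = cache.get()           # first call fetches a token
fake_now[0] = 60.0
second = cache.get()          # one minute later the cached token is reused
fake_now[0] = 15 * 60.0
third = cache.get()           # past the limit, so a new token is fetched
```

In the class above, fetch_token would be the internal token function, and every request would read the token through get() rather than from a stored attribute.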

To create an instance of the class we import the EntityAPI class from the entityAPI module. The user has to provide the authentication details, server name and tenant, but once these have been supplied they can be forgotten.

from entityAPI import EntityAPI

entity = EntityAPI(username="james.carr@preservica.com", password="ABC1345", tenant="PREVIEW", server="preview.preservica.com")

The entity object that has been created now contains a valid authentication token ready to be used.

In Part 1 we provided a quick overview of the data model and showed how we could fetch information back from assets. If we look at the API documentation, we can see that calls to fetch digital assets (information objects) return very similar information to those that return folders/collections/series (structural objects), i.e. they hold attributes such as Title, Description, Security Tag etc.

Making use of this fact we can define a base class for both assets and folders which holds the common attributes. The Folder and Asset classes are just two different types (sub-classes) of Entity with a different type attribute value.

class Entity:
    def __init__(self, reference, title, description, security_tag, parent, metadata):
        self.reference = reference
        self.title = title
        self.description = description
        self.security_tag = security_tag
        self.parent = parent
        self.metadata = metadata
        self.type = None

class Folder(Entity):
    def __init__(self, reference, title, description, security_tag, parent, metadata):
        super().__init__(reference, title, description, security_tag, parent, metadata)
        self.type = "SO"

class Asset(Entity):
    def __init__(self, reference, title, description, security_tag, parent, metadata):
        super().__init__(reference, title, description, security_tag, parent, metadata)
        self.type = "IO"

We can use the code created in part 1 to fetch metadata attributes from an XML response. This time the function is an internal helper and not one which we expect users of the class to call. It works with data from calls to both assets and folders, and it does not require access to the class variables as its only job is to parse the attributes out of an XML document. It returns a Python dictionary containing the attribute values.

Note we have also extended the function to retrieve information about any parent objects (parents of assets are always folders, and parents of folders are either other folders or nothing for root level folders) and links to descriptive metadata fragments. The XML response from the API call does not include the actual descriptive metadata, but it does include a URL you can use to fetch the descriptive metadata, which is what we will store. We will cover the actual retrieval of the metadata later.

Folders and Assets can contain multiple descriptive metadata documents and each document has a namespace, so we have added information on each URL to a Python dictionary. The key of the dictionary is the URL of the API call to fetch that metadata and the dictionary value is the namespace of the metadata document.

Note that Preservica assets and folders may contain multiple metadata documents within the same namespace, but this is not a problem because python dictionary objects can contain different keys with the same value.
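
To make the shape of this dictionary concrete, here is a small sketch. The URLs and namespaces are made-up values for illustration, and urls_for_schema is a hypothetical helper rather than part of the class.

```python
# Hypothetical fragment URLs (keys) and schema namespaces (values).
metadata = {
    "https://example.com/api/entity/information-objects/abc/metadata/1": "http://purl.org/dc/elements/1.1/",
    "https://example.com/api/entity/information-objects/abc/metadata/2": "http://purl.org/dc/elements/1.1/",
    "https://example.com/api/entity/information-objects/abc/metadata/3": "http://www.loc.gov/mods/v3",
}


def urls_for_schema(metadata, schema):
    """Return every fragment URL whose document uses the given namespace."""
    return [url for url, namespace in metadata.items() if namespace == schema]


# Two documents share the Dublin Core namespace, so two URLs come back.
dublin_core = urls_for_schema(metadata, "http://purl.org/dc/elements/1.1/")
mods = urls_for_schema(metadata, "http://www.loc.gov/mods/v3")
```

Because the URL is the key, two documents in the same namespace never overwrite each other.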

As we showed in the previous article, we can extract attributes from the web service responses by converting them to in-memory XML documents and using XPath expressions to extract values.

import xml.etree.ElementTree


def __entity__(xml_data):
    """Parse the common entity attributes out of an XML response."""
    entity_response = xml.etree.ElementTree.fromstring(xml_data)
    reference = entity_response.find('.//{http://preservica.com/XIP/v6.0}Ref')
    title = entity_response.find('.//{http://preservica.com/XIP/v6.0}Title')
    security_tag = entity_response.find('.//{http://preservica.com/XIP/v6.0}SecurityTag')
    description = entity_response.find('.//{http://preservica.com/XIP/v6.0}Description')
    parent = entity_response.find('.//{http://preservica.com/XIP/v6.0}Parent')
    # Root level folders have no Parent element, in which case find() returns None.
    parent = parent.text if parent is not None else None

    # Map each metadata fragment URL to the namespace (schema) of its document.
    fragments = entity_response.findall(
        './/{http://preservica.com/EntityAPI/v6.0}Metadata/{http://preservica.com/EntityAPI/v6.0}Fragment')
    metadata = {}
    for fragment in fragments:
        metadata[fragment.text] = fragment.attrib['schema']

    return {'reference': reference.text, 'title': title.text, 'description': description.text,
            'security_tag': security_tag.text, 'parent': parent, 'metadata': metadata}
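
The same lookups can be tried offline against a sample document. The XML below is hand-written to match the paths the function searches and is an assumption about the response shape, not a captured response; a real response may contain more elements. The sketch also uses a namespace prefix map with findtext, which returns None when an element is missing.

```python
import xml.etree.ElementTree as ET

# Prefix map so the paths do not need the full namespace in braces.
NS = {"xip": "http://preservica.com/XIP/v6.0",
      "api": "http://preservica.com/EntityAPI/v6.0"}

# Hypothetical sample, deliberately without a Description element.
SAMPLE = """<EntityResponse xmlns="http://preservica.com/EntityAPI/v6.0" xmlns:xip="http://preservica.com/XIP/v6.0">
    <xip:InformationObject>
        <xip:Ref>9bb82d3e-6f52-4dda-a423-1b0fb1f6ee52</xip:Ref>
        <xip:Title>Library_location_map</xip:Title>
        <xip:SecurityTag>open</xip:SecurityTag>
        <xip:Parent>a9e1cae8-ea06-4157-8dd4-82d0525b031c</xip:Parent>
    </xip:InformationObject>
    <AdditionalInformation>
        <Metadata>
            <Fragment schema="http://preservica.com/schema/sample/v1.0">https://example.com/metadata/1</Fragment>
        </Metadata>
    </AdditionalInformation>
</EntityResponse>"""

root = ET.fromstring(SAMPLE)
title = root.findtext(".//xip:Title", namespaces=NS)
parent = root.findtext(".//xip:Parent", namespaces=NS)
description = root.findtext(".//xip:Description", namespaces=NS)  # missing, so None
fragments = {f.text: f.attrib["schema"]
             for f in root.findall(".//api:Metadata/api:Fragment", NS)}
```

Using findtext avoids an AttributeError when an optional element such as Description is absent from a response.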

Putting this all together we can now add the public facing methods to return information about assets and folders. Most of the code for these two functions is the same apart from the actual web service endpoint (URL).

def asset(self, reference):
    """Return an Asset for the information object with the given reference."""
    headers = {'Preservica-Access-Token': self.token}
    request = requests.get(f'https://{self.server}/api/entity/information-objects/{reference}', headers=headers)
    if request.status_code == 200:
        xml_response = request.content.decode('UTF-8')
        entity = __entity__(xml_response)
        # self.Asset resolves only if Asset is an attribute of EntityAPI, for example when the
        # class is defined inside the EntityAPI body; if it is module level, call Asset(...) directly.
        a = self.Asset(entity['reference'], entity['title'], entity['description'], entity['security_tag'], entity['parent'], entity['metadata'])
        return a
    elif request.status_code == 401:
        # The token has probably expired: fetch a new one and repeat the call.
        self.token = self.__token__()
        return self.asset(reference)
    else:
        print(f"asset failed with error code: {request.status_code}")
        print(request.request.url)
        raise SystemExit

def folder(self, reference):
    """Return a Folder for the structural object with the given reference."""
    headers = {'Preservica-Access-Token': self.token}
    request = requests.get(f'https://{self.server}/api/entity/structural-objects/{reference}', headers=headers)
    if request.status_code == 200:
        xml_response = request.content.decode('UTF-8')
        entity = __entity__(xml_response)
        # The same note as for self.Asset applies to self.Folder.
        f = self.Folder(entity['reference'], entity['title'], entity['description'], entity['security_tag'], entity['parent'],
                        entity['metadata'])
        return f
    elif request.status_code == 401:
        # The token has probably expired: fetch a new one and repeat the call.
        self.token = self.__token__()
        return self.folder(reference)
    else:
        print(f"folder failed with error code: {request.status_code}")
        print(request.request.url)
        raise SystemExit

The methods call the relevant URL endpoint, convert the resulting response into an XML document, extract the required attributes and then create either a Folder or Asset object which is returned to the caller.

The other difference from the code in part 1 of this tutorial is that we are now checking for an HTTP status code of 401, which signifies that the call was unauthorised. In this context that usually means that the token has expired.

If the function’s HTTP request returns a 401 error, then we simply request a new token using our stored credentials and run the function again. This way we only create new tokens as needed and the callers of our class never have to worry about providing active tokens.
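
One thing to watch with this pattern is a 401 that is not caused by an expired token, since retrying would then repeat indefinitely. The sketch below shows one way to bound the retries. It is an illustration only: call_with_token_refresh, send and new_token are hypothetical names, and a fake server stands in for the real HTTP call.

```python
def call_with_token_refresh(send, new_token, token, max_retries=1):
    """Call send(token); on a 401 fetch a fresh token and retry a bounded number of times.

    send      -- callable taking a token and returning (status_code, body)
    new_token -- callable returning a fresh token
    Returns (body, token_in_use).
    """
    for attempt in range(max_retries + 1):
        status, body = send(token)
        if status == 200:
            return body, token
        if status == 401 and attempt < max_retries:
            token = new_token()
            continue
        raise RuntimeError(f"request failed with status {status}")


# Fake server that accepts only the token "fresh".
def fake_send(token):
    return (200, "<xml/>") if token == "fresh" else (401, None)


body, token = call_with_token_refresh(fake_send, lambda: "fresh", token="stale")
```

If the refreshed token is rejected as well, the helper raises an error instead of looping.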

The one last part of folders and assets we should cover is returning any descriptive metadata documents attached to them.

If we look at the web service documentation for Information Objects and Structural Objects we see that descriptive metadata is managed in the same way: the get metadata call returns the same response for both assets and folders. This means we only need to provide a single Python function for both use cases.

A typical response from calling the get metadata API call is shown below. The actual descriptive XML we need is found in the Content element.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<MetadataResponse xmlns="http://preservica.com/EntityAPI/v6.0" xmlns:xip="http://preservica.com/XIP/v6.0">
    <xip:MetadataContainer schemaUri="http://preservica.com/schema/sample/v1.0">
        <xip:Ref>dda13399-a6c1-420e-8d47-458062c43209</xip:Ref>
        <xip:Entity>a9e1cae8-ea06-4157-8dd4-82d0525b031c</xip:Entity>
        <xip:Content>
            <SampleContent xmlns="http://preservica.com/schema/sample/v1.0" xmlns:ns3="http://preservica.com/EntityAPI/v6.0">
                <SampleElement xmlns:ns6="http://preservica.com/schema/sample/v1.0" xmlns="">Metadata fragment content</SampleElement>
            </SampleContent>
        </xip:Content>
    </xip:MetadataContainer>
    <AdditionalInformation>
        <Self>https://us.preservica.com/api/entity/information-objects/a9e1cae8-ea06-4157-8dd4-82d0525b031c/metadata/dda13399-a6c1-420e-8d47-458062c43209</Self>
    </AdditionalInformation>
</MetadataResponse>

We can use the following function to call the API, query the response and return the descriptive metadata as a string.

def metadata(self, uri):
    """Fetch a descriptive metadata document by its URL and return it as a string."""
    headers = {'Preservica-Access-Token': self.token}
    request = requests.get(uri, headers=headers)
    if request.status_code == 200:
        xml_response = request.content.decode('UTF-8')
        entity_response = xml.etree.ElementTree.fromstring(xml_response)
        content = entity_response.find('.//{http://preservica.com/XIP/v6.0}Content')
        # The descriptive document is the first child element of Content.
        return xml.etree.ElementTree.tostring(content[0], encoding='utf8', method='xml').decode()
    elif request.status_code == 401:
        # The token has probably expired: fetch a new one and repeat the call.
        self.token = self.__token__()
        return self.metadata(uri)
    else:
        print(f"metadata failed with error code: {request.status_code}")
        print(request.request.url)
        raise SystemExit
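
The parsing step can be checked without calling the server by running it against a trimmed copy of the sample response shown earlier. This is only a sketch of the extraction; the variable names are ours.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample MetadataResponse from the article.
RESPONSE = """<MetadataResponse xmlns="http://preservica.com/EntityAPI/v6.0" xmlns:xip="http://preservica.com/XIP/v6.0">
    <xip:MetadataContainer schemaUri="http://preservica.com/schema/sample/v1.0">
        <xip:Ref>dda13399-a6c1-420e-8d47-458062c43209</xip:Ref>
        <xip:Entity>a9e1cae8-ea06-4157-8dd4-82d0525b031c</xip:Entity>
        <xip:Content>
            <SampleContent xmlns="http://preservica.com/schema/sample/v1.0">
                <SampleElement xmlns="">Metadata fragment content</SampleElement>
            </SampleContent>
        </xip:Content>
    </xip:MetadataContainer>
</MetadataResponse>"""

root = ET.fromstring(RESPONSE)
content = root.find(".//{http://preservica.com/XIP/v6.0}Content")
document = content[0]                             # first child element of Content
as_text = ET.tostring(document, encoding="unicode")
```

Note that ElementTree writes its own namespace prefixes when serialising, so the returned string may not match the stored document character for character, although it is equivalent XML.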

We now have a python class for requesting assets and folders by their reference and can print their attributes.

The following Python script shows the capabilities of the class. We can request either assets or folders by their reference and show their attributes. We can follow the parent reference of an entity to walk up the repository hierarchy until we reach a root folder, which has no parent. We can also output the descriptive metadata documents attached to assets or folders.

from entityAPI import EntityAPI

entity = EntityAPI(username="james.carr@preservica.com", password="ABC1345", tenant="PREVIEW", server="preview.preservica.com")

asset = entity.asset("6a596701-75ae-45b7-933d-355787e25a28")

print(asset.title)
print(asset.description)
print(asset.security_tag)
print(asset.parent)

folder = entity.folder(asset.parent)

print(folder.title)
print(folder.description)
print(folder.security_tag)
print(folder.parent)

while folder.parent is not None:
    folder = entity.folder(folder.parent)
    print(folder.title)


for metadata in asset.metadata:
    print(entity.metadata(metadata))


for metadata in folder.metadata:
    print(entity.metadata(metadata))
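
The walk up the hierarchy can also be wrapped in a small helper. The following is a sketch under stated assumptions: path_to_root and FakeClient are hypothetical names, and the fake client serves folders from a dictionary so the example runs without a repository.

```python
from types import SimpleNamespace


def path_to_root(client, entity):
    """Return the titles from the given entity up to its root folder."""
    titles = [entity.title]
    while entity.parent is not None:
        entity = client.folder(entity.parent)
        titles.append(entity.title)
    return titles


class FakeClient:
    """Stand-in for EntityAPI that serves folders from a dictionary."""

    def __init__(self, folders):
        self._folders = folders

    def folder(self, reference):
        return self._folders[reference]


folders = {
    "f2": SimpleNamespace(title="Fiction", parent="f1"),
    "f1": SimpleNamespace(title="Library", parent=None),
}
asset = SimpleNamespace(title="Novel.pdf", parent="f2")
titles = path_to_root(FakeClient(folders), asset)
```

With the real class, the entity object from the script above would be passed in place of FakeClient.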

We now have a library which is starting to become useful for real tasks, although it's more useful for working from the assets up the hierarchy. A lot of API use cases require enumeration of objects in the repository from the top down, i.e. starting with the top-level collections or Fonds and working down to the assets, and that is what we are going to look at next.

To retrieve a list of entities (either assets or folders) from within a folder we use the children API call on the structural object endpoint.

Looking at the Swagger documentation for this call we see that the response is an XML document containing a list of children. Each child object is defined by its type (asset or folder), its unique reference and its title. There is one other issue we should be aware of when calling this web service: the result is paged. The children returned by a single call may not be the full set, so to get the complete list the service may have to be called multiple times.

The reason for this is that a folder may contain many thousands or even millions of assets and having a non-paged web service could result in a very large XML document being returned. This would make the call slow, put undue load on the server and could cause networking timeouts etc.

As a user of the web service we do have flexibility on deciding the maximum size of a result set that is returned, i.e. the web service allows the caller to specify the maximum number of results in the response.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ChildrenResponse xmlns="http://preservica.com/EntityAPI/v6.0" xmlns:xip="http://preservica.com/XIP/v6.0">
    <Children>
        <Child title="Fiction" ref="a322f2c3-8f70-41ad-beb8-3506eac9fd23" type="SO">https://us.preservica.com/api/entity/structural-objects/a322f2c3-8f70-41ad-beb8-3506eac9fd23</Child>
        <Child title="Non-fiction" ref="b5297141-419c-4cdf-8f1b-602bbe46485b" type="SO">https://us.preservica.com/api/entity/structural-objects/b5297141-419c-4cdf-8f1b-602bbe46485b</Child>
        <Child title="Library_location_map" ref="9bb82d3e-6f52-4dda-a423-1b0fb1f6ee52" type="IO">https://us.preservica.com/api/entity/information-objects/9bb82d3e-6f52-4dda-a423-1b0fb1f6ee52</Child>
    </Children>
    <Paging>
        <Next>https://us.preservica.com/api/entity/structural-objects/a9e1cae8-ea06-4157-8dd4-82d0525b031c/children?max=3&amp;start=3</Next>
        <TotalResults>5</TotalResults>
    </Paging>
    <AdditionalInformation>
        <Self>https://us.preservica.com/api/entity/structural-objects/a9e1cae8-ea06-4157-8dd4-82d0525b031c/children?max=3</Self>
    </AdditionalInformation>
</ChildrenResponse>

Normally with this type of paged web service call, the caller (you) is responsible for keeping track of which pages of results you have seen and which page to request next. In the Preservica API this book-keeping is done for you: the XML response contains a <Next> element with the correct URL to request the next page of results.
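
As a quick offline check, the paging information can be read from a shortened version of the sample response above. This is a sketch only; the response has been reduced to two children and the paging values adjusted to match.

```python
import xml.etree.ElementTree as ET

API = "{http://preservica.com/EntityAPI/v6.0}"

# Shortened version of the sample ChildrenResponse, with two children per page.
RESPONSE = """<ChildrenResponse xmlns="http://preservica.com/EntityAPI/v6.0">
    <Children>
        <Child title="Fiction" ref="a322f2c3-8f70-41ad-beb8-3506eac9fd23" type="SO">https://us.preservica.com/api/entity/structural-objects/a322f2c3-8f70-41ad-beb8-3506eac9fd23</Child>
        <Child title="Library_location_map" ref="9bb82d3e-6f52-4dda-a423-1b0fb1f6ee52" type="IO">https://us.preservica.com/api/entity/information-objects/9bb82d3e-6f52-4dda-a423-1b0fb1f6ee52</Child>
    </Children>
    <Paging>
        <Next>https://us.preservica.com/api/entity/structural-objects/a9e1cae8-ea06-4157-8dd4-82d0525b031c/children?max=2&amp;start=2</Next>
        <TotalResults>5</TotalResults>
    </Paging>
</ChildrenResponse>"""

root = ET.fromstring(RESPONSE)
children = [(c.attrib["type"], c.attrib["title"]) for c in root.findall(f".//{API}Child")]
next_url = root.findtext(f".//{API}Next")            # None when there is no Next element
total = int(root.findtext(f".//{API}TotalResults"))
```

The XML parser turns the escaped &amp;amp; back into a plain ampersand, so the Next URL can be passed straight to the next request.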

All we need to do is keep track of whether we need to request the next page. To do that we can create a small helper class to represent a page of generic results. As long as we don’t have the full set of results yet, the has_more attribute will be True.

class PagedSet:
    def __init__(self, results, has_more, total, next_page):
        self.results = results
        self.has_more = has_more
        self.total = total
        self.next_page = next_page

The results attribute is a Python set, which can hold any type of paged result.

The following method then queries Preservica for the children of a folder given by its unique reference and returns a PagedSet object containing a mixed set of assets and child folders. The response from the children web service only returns limited data about each asset or folder, which is why we have had to set the description, security tag and metadata attributes to None, the Python value which signifies missing data. We do know the parent of the children, since this is just the reference which was passed to the function.

If we want to get the children of the root of the repository, the API provides a special endpoint (/api/entity/root/children), which we can use by passing None as the folder reference. If the caller passes in a URL to the next page, then this signifies that we are requesting subsequent pages and we no longer need to generate the starting URL.

def children(self, reference, maximum=100, next_page=None):
    """Return one page (a PagedSet) of the children of a folder; pass None for the root."""
    headers = {'Preservica-Access-Token': self.token}
    if next_page is None:
        # First page: build the starting URL ourselves.
        if reference is None:
            request = requests.get(f'https://{self.server}/api/entity/root/children?start=0&max={maximum}', headers=headers)
        else:
            request = requests.get(f'https://{self.server}/api/entity/structural-objects/{reference}/children?start=0&max={maximum}', headers=headers)
    else:
        # Subsequent pages: use the URL supplied by the previous response.
        request = requests.get(next_page, headers=headers)
    if request.status_code == 200:
        xml_response = request.content.decode('UTF-8')
        entity_response = xml.etree.ElementTree.fromstring(xml_response)
        childs = entity_response.findall('.//{http://preservica.com/EntityAPI/v6.0}Child')
        result = set()

        next_url = entity_response.find('.//{http://preservica.com/EntityAPI/v6.0}Next')
        total_hits = entity_response.find('.//{http://preservica.com/EntityAPI/v6.0}TotalResults')

        for c in childs:
            # The response only carries type, ref and title, so the other attributes are None.
            if c.attrib['type'] == 'SO':
                f = self.Folder(c.attrib['ref'], c.attrib['title'], None, None, reference, None)
                result.add(f)
            else:
                a = self.Asset(c.attrib['ref'], c.attrib['title'], None, None, reference, None)
                result.add(a)
        # A missing Next element means this was the last page.
        has_more = next_url is not None
        url = next_url.text if has_more else None
        ps = self.PagedSet(result, has_more, total_hits.text, url)
        return ps
    elif request.status_code == 401:
        # The token has probably expired: fetch a new one and repeat the call.
        self.token = self.__token__()
        return self.children(reference, maximum=maximum, next_page=next_page)
    else:
        print(f"children failed with error code: {request.status_code}")
        print(request.request.url)
        raise SystemExit

To use the paginated methods, we can write client code such as the following. We start by passing None as the next page URL to signify we need the first page of results and after that we can pass the URL returned from the last call.

next_page = None
while True:
    root_folders = entity.children(None, next_page=next_page)
    for e in root_folders.results:
        print(f'{e.title} :  {e.reference}')
    if not root_folders.has_more:
        break
    else:
        next_page = root_folders.next_page

This code loops, calling entity.children on each pass. If has_more is False we break out of the loop; otherwise we take the URL of the next page of results and pass it to the function again.
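
If this loop is needed in several places, it can be hidden behind a generator so that client code sees a single stream of results. The sketch below assumes only that a page object has results, has_more and next_page attributes; iter_children and fetch_page are hypothetical names, and two fake pages stand in for the web service.

```python
from types import SimpleNamespace


def iter_children(fetch_page):
    """Yield every result across all pages.

    fetch_page -- callable taking the next-page URL (None for the first page)
                  and returning an object with results, has_more and next_page.
    """
    next_page = None
    while True:
        page = fetch_page(next_page)
        yield from page.results
        if not page.has_more:
            return
        next_page = page.next_page


# Two fake pages keyed by the URL used to request them.
pages = {
    None: SimpleNamespace(results=["a", "b"], has_more=True, next_page="page-2"),
    "page-2": SimpleNamespace(results=["c"], has_more=False, next_page=None),
}
all_children = list(iter_children(lambda url: pages[url]))
```

With the class from this article, the callable could be something like lambda url: entity.children(None, next_page=url).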

Adding this code to our sample client script we now have:

from entityAPI import EntityAPI

entity = EntityAPI(username="james.carr@preservica.com", password="ABC1345", tenant="PREVIEW", server="preview.preservica.com")

asset = entity.asset("6a596701-75ae-45b7-933d-355787e25a28")
print(asset.title)
print(asset.description)
print(asset.security_tag)
print(asset.parent)

folder = entity.folder(asset.parent)
print(folder.title)
print(folder.description)
print(folder.parent)

while folder.parent is not None:
    folder = entity.folder(folder.parent)
    print(folder.title)


for metadata in asset.metadata:
    print(entity.metadata(metadata))


next_page = None
while True:
    root_folders = entity.children(None, maximum=10, next_page=next_page)
    for e in root_folders.results:
        print(f'{e.title} :  {e.reference}')
    if not root_folders.has_more:
        break
    else:
        next_page = root_folders.next_page

The complete source code and an example Python script are available from GitHub.

In the next article we will look at how we can update and create new entities in the Preservica repository.
