Henge tutorial

Henge is a Python package that builds backends for generic decomposable recursive unique identifiers (or, DRUIDs). It was started as a building block for sequence collections (see seqcol), but it can also be used for other data types that need content-derived identifiers.

Record the version used in this tutorial:

import henge
henge.__version__
'0.0.3'

Introduction to object-derived unique identifiers

You can use henge as a basic back-end for a key-value database with value-derived identifiers. A henge is ultimately a database that stores values, allowing them to be looked up.

To introduce the basic idea, say we want to store simple strings and retrieve them later by identifier. We define an algorithm that derives a unique identifier from the string; for example, we may take the md5 digest of the string. We then store the key (the md5 digest) and the value (the string) in a database, which lets us retrieve the string given its identifier.
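The idea above can be sketched in a few lines of plain Python. This is only an illustration of a value-derived key-value store, not henge's implementation: the function names `md5_digest`, `insert`, and `retrieve` and the dict-based `database` are assumptions made for this sketch.

```python
import hashlib

def md5_digest(value: str) -> str:
    """Return the hexadecimal md5 digest of a string."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()

# A plain dict stands in for the database backend (an assumption of this sketch).
database = {}

def insert(value: str) -> str:
    """Store the value under its own digest and return that digest."""
    key = md5_digest(value)
    database[key] = value
    return key

def retrieve(key: str) -> str:
    """Look up a previously stored value by its digest."""
    return database[key]

checksum = insert("TCGA")
print(checksum)            # a 32-character hexadecimal string
print(retrieve(checksum))  # TCGA
```

Because the key is computed from the value, inserting the same string twice yields the same identifier and does not create a second entry.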

Henge defines data types using JSON-schema. Let's define a data type called sequence which is just a string, or a sequence of characters:

!cat "../tests/data/sequence.yaml" 
description: "Schema for a single raw sequence"
type: object
henge_class: sequence
properties:
  sequence:
    type: string
    description: "A sequence of characters"
required:
  - sequence

We construct a henge object that is aware of this data type like this:

h = henge.Henge(database={}, schemas=["../tests/data/sequence.yaml"])
h
Henge object
Item types: sequence
Schemas: {'sequence': {'description': 'Schema for a single raw sequence', 'type': 'object', 'henge_class': 'sequence', 'properties': {'sequence': {'type': 'string', 'description': 'A sequence of characters'}}, 'required': ['sequence']}}

Insert a sequence object which will be stored in the database:

checksum = h.insert({"sequence":"TCGA"}, item_type="sequence")

And you can retrieve it with its checksum:

h.retrieve(checksum)
{'sequence': 'TCGA'}

Introduction to DRUIDs

The power of henge becomes more apparent when we want to store more complicated objects. A DRUID builds on the basic value-derived identifiers by allowing the objects to be decomposable and recursive. In other words: 1) the value stored in the database can have multiple elements (decomposable); and 2) each element may, itself, be an independent value stored individually in the database (recursive).
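To make the two properties concrete, here is a toy sketch of a decomposable, recursive insert. It is not henge's actual algorithm or storage format: the `toy_insert` function, the `DELIM` delimiter, and the way values are joined before hashing are all assumptions chosen to keep the example short.

```python
import hashlib

DELIM = ","  # hypothetical delimiter; henge's real one may differ

def md5(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def toy_insert(db: dict, item) -> str:
    """Store a dict or list in db and return its digest.

    Decomposable: an item is broken into its elements.
    Recursive: a nested dict or list is stored on its own first,
    and only its digest is kept inside the parent.
    """
    values = item.values() if isinstance(item, dict) else item
    parts = []
    for value in values:
        if isinstance(value, (dict, list)):
            parts.append(toy_insert(db, value))  # store child, keep its digest
        else:
            parts.append(str(value))             # simple value, keep as text
    content = DELIM.join(parts)
    digest = md5(content)
    db[digest] = content
    return digest

db = {}
people = [{"name": "Pat", "age": 38}, {"name": "Kelly", "age": 35}]
people_digest = toy_insert(db, people)

# The same person inserted on its own gets the same digest as before,
# so no new entry is created.
pat_digest = toy_insert(db, {"name": "Pat", "age": 38})
print(db[pat_digest])           # Pat,38
print(pat_digest in db[people_digest])  # True
```

The parent entry holds only the digests of its children, which is what lets each child be looked up independently.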

Decomposing: storing multi-property objects

To demonstrate, we'll first show an example with a data type that has more than one property. Let's say we want to make a henge that stores and retrieves objects of type Person. We define a JSON-schema for a Person, which has two attributes: a string name and an integer age:

!cat "../tests/data/person.yaml"
description: "Person"
type: object
henge_class: person
properties:
  name:
    type: string
    description: "String attribute"
  age:
    type: integer
    description: "Integer attribute"
required:
  - name

Now we will create a henge. The schema can be given either as a schema dict object or as a path to a YAML file; here we use a path:

import henge
person_henge = henge.Henge(database={}, schemas=["../tests/data/person.yaml"])
person_henge.item_types
['person']

Use insert to add an item to the henge, providing the object and its type. The henge will use JSON-schema to make sure the object satisfies the schema.

druid1 = person_henge.insert({"name":"Pat", "age":38}, item_type="person")

When you insert an item into the henge, it returns the unique identifier (or, the DRUID) for that item. Then, you can use the unique identifier to retrieve the item from the henge.

person_henge.retrieve(druid1)
{'name': 'Pat', 'age': '38'}

Our schema listed name as a required attribute. Here's what happens if we try to insert non-conforming data:

person_henge.insert({"first_name":"Pat", "age":38}, item_type="person")
Not valid data
Attempting to insert item: {'age': 38}
Item type: person

'name' is a required property

Failed validating 'required' in schema:
    {'description': 'Person',
     'henge_class': 'person',
     'properties': {'age': {'description': 'Integer attribute',
                            'type': 'integer'},
                    'name': {'description': 'String attribute',
                             'type': 'string'}},
     'required': ['name'],
     'type': 'object'}

On instance:
    {'age': 38}

False
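The failure above can be mimicked with a small pure-Python check. This is a simplified stand-in for JSON-schema validation, not the validator henge uses: `check`, `TYPE_MAP`, and the trimmed `PERSON_SCHEMA` are hypothetical names for this sketch, and dropping undeclared properties before checking is an assumption based on the output above, where only `{'age': 38}` is reported.

```python
# Trimmed copy of the person schema shown earlier (descriptions omitted).
PERSON_SCHEMA = {
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name"],
}

# Maps the two JSON-schema type names used here to Python types.
TYPE_MAP = {"string": str, "integer": int}

def check(item: dict, schema: dict) -> list:
    """Return a list of problems; an empty list means the checks passed."""
    # Assumption: properties not declared in the schema are ignored.
    known = {k: v for k, v in item.items() if k in schema["properties"]}
    problems = [
        f"'{key}' is a required property"
        for key in schema.get("required", [])
        if key not in known
    ]
    for key, value in known.items():
        type_name = schema["properties"][key]["type"]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, TYPE_MAP[type_name]):
            problems.append(f"'{key}' is not of type '{type_name}'")
    return problems

bad = check({"first_name": "Pat", "age": 38}, PERSON_SCHEMA)
good = check({"name": "Pat", "age": 38}, PERSON_SCHEMA)
print(bad)   # ["'name' is a required property"]
print(good)  # []
```

A full JSON-schema validator covers many more keywords; this sketch handles only `required` and two simple types.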

Recursion: storing structured data

Next, we'll show an example of a data type that contains other complex data types. Let's define a Family as an array of parents and an array of children:

!cat "../tests/data/family.yaml" 
description: "Family"
type: object
henge_class: family
properties:
  domicile:
    type: object
    henge_class: location
    properties:
      address:
        type: string
  parents:
    type: array
    henge_class: people
    items:
      type: object
      henge_class: person
      properties:
        name:
          type: string
          description: "String attribute"
        age:
          type: integer
          description: "Integer attribute"
      required:
        - name
  children:
    type: array
    henge_class: people
    recursive: true
    items:
      type: object
      henge_class: person
      properties:
        name:
          type: string
          description: "String attribute"
        age:
          type: integer
          description: "Integer attribute"
      required:
        - name
required:
  - parents
recursive:
  - parents
  - children

In our Family object, the parents attribute is required. It is a People object, which is an array of one or more Person objects. The children attribute is optional and is also a People object. Our Family object also has a domicile attribute, which is a Location object that has an address property.

famhenge = henge.Henge(database={}, schemas=["../tests/data/family.yaml"])
famhenge.item_types
['family', 'location', 'people', 'person']

Now, this henge can accommodate objects that conform to this structured data type. Let's build a simple family object and store it in the henge:

myfam = {'domicile': '',
 'parents': [{'name': 'Pat', 'age': 38}, {'name': 'Kelly', 'age': 35}],
 'children': [{'name': 'Oedipus', 'age': 2}]}
myfam_druid = famhenge.insert(myfam, "family")
myfam_druid
'bc43e39e7f589ecda3865b39438905af'

As before, we can retrieve the complete structured data using the digest:

famhenge.retrieve(myfam_druid)
{'domicile': '',
 'parents': [{'name': 'Pat', 'age': '38'}, {'name': 'Kelly', 'age': '35'}],
 'children': [{'name': 'Oedipus', 'age': '2'}]}

One of the powerful features of henge is that, under the hood, it actually stores objects as separate elements, each with its own identifier, and you can retrieve them individually. This becomes more apparent when we use the reclimit argument to limit the number of recursive steps. If we allow no recursion, we'll pull out the digests for the People objects:

famhenge.retrieve(myfam_druid, reclimit=0)
{'domicile': '',
 'parents': '6a9f4378876423f7d032fc86a5eca4d1',
 'children': '98646a8b05f9e0de892e98e256097d40'}

We can recurse one step further to get digests for the Person objects:

famhenge.retrieve(myfam_druid, reclimit=1)
{'domicile': '',
 'parents': ['685a5a70a3d9450e42346bc36ca4ff11',
  '4d3433cc9446fcf5038a21b088013762'],
 'children': ['20393736960360496a40f29877ec1634']}

These identifiers can be used individually to pull individual items from the database:

digest = famhenge.retrieve(myfam_druid, reclimit=1)['parents'][1]
digest
'4d3433cc9446fcf5038a21b088013762'
famhenge.retrieve(digest)
{'name': 'Kelly', 'age': '35'}
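The effect of a recursion limit can be sketched with a toy store. This is not henge's code: the `put`, `insert_*`, and `retrieve` functions, the comma delimiter, and the `(kind, content)` tuples are assumptions for illustration, and the sketch assumes non-empty lists and names without commas. Only the meaning of `reclimit` follows the examples above.

```python
import hashlib

DELIM = ","  # hypothetical delimiter
db = {}      # digest -> (kind, delimited content)

def put(kind: str, content: str) -> str:
    digest = hashlib.md5(content.encode("utf-8")).hexdigest()
    db[digest] = (kind, content)
    return digest

def insert_person(person: dict) -> str:
    return put("person", f"{person['name']}{DELIM}{person['age']}")

def insert_people(people: list) -> str:
    return put("people", DELIM.join(insert_person(p) for p in people))

def insert_family(family: dict) -> str:
    keys = ("parents", "children")
    return put("family", DELIM.join(insert_people(family[k]) for k in keys))

def retrieve(digest: str, reclimit=None):
    """Rebuild an item; reclimit caps how many levels of digests get expanded."""
    kind, content = db[digest]
    if kind == "person":
        name, age = content.split(DELIM)
        return {"name": name, "age": age}  # values come back as strings
    child_digests = content.split(DELIM)
    if reclimit is not None and reclimit <= 0:
        children = child_digests           # limit reached: leave digests as-is
    else:
        remaining = None if reclimit is None else reclimit - 1
        children = [retrieve(d, remaining) for d in child_digests]
    if kind == "people":
        return children
    return dict(zip(("parents", "children"), children))

fam = {"parents": [{"name": "Pat", "age": 38}, {"name": "Kelly", "age": 35}],
       "children": [{"name": "Oedipus", "age": 2}]}
fam_digest = insert_family(fam)

level0 = retrieve(fam_digest, reclimit=0)  # digests of the People objects
level1 = retrieve(fam_digest, reclimit=1)  # lists of Person digests
full = retrieve(fam_digest)                # fully expanded
print(retrieve(level1["parents"][1]))      # {'name': 'Kelly', 'age': '35'}
```

Each call either expands a stored digest one level further or stops and hands back the digest, so any intermediate result can be used as a key for a later lookup.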

You can also insert the sub-components (like People or Person) into the database:

druid1 = famhenge.insert({"name":"Pat", "age":38}, item_type="person")
druid2 = famhenge.insert({"name":"Kelly", "age":35}, item_type="person")
famhenge.retrieve(druid1)
{'name': 'Pat', 'age': '38'}
famhenge.show()
20393736960360496a40f29877ec1634 Oedipus2
20393736960360496a40f29877ec1634_item_type person
20393736960360496a40f29877ec1634_digest_version md5
98646a8b05f9e0de892e98e256097d40 20393736960360496a40f29877ec1634
98646a8b05f9e0de892e98e256097d40_item_type people
98646a8b05f9e0de892e98e256097d40_digest_version md5
685a5a70a3d9450e42346bc36ca4ff11 Pat38
685a5a70a3d9450e42346bc36ca4ff11_item_type person
685a5a70a3d9450e42346bc36ca4ff11_digest_version md5
c92d4c12cd07816d4bd25b9bea4e353f Kelly5
c92d4c12cd07816d4bd25b9bea4e353f_item_type person
c92d4c12cd07816d4bd25b9bea4e353f_digest_version md5
4a93ef901177d13ad9c7edfb9c0c449f 685a5a70a3d9450e42346bc36ca4ff11   c92d4c12cd07816d4bd25b9bea4e353f
4a93ef901177d13ad9c7edfb9c0c449f_item_type people
4a93ef901177d13ad9c7edfb9c0c449f_digest_version md5
b9bf6b773bd476399dafa7a39d9aa041 4a93ef901177d13ad9c7edfb9c0c449f98646a8b05f9e0de892e98e256097d40
b9bf6b773bd476399dafa7a39d9aa041_item_type family
b9bf6b773bd476399dafa7a39d9aa041_digest_version md5
4d3433cc9446fcf5038a21b088013762 Kelly35
4d3433cc9446fcf5038a21b088013762_item_type person
4d3433cc9446fcf5038a21b088013762_digest_version md5
6a9f4378876423f7d032fc86a5eca4d1 685a5a70a3d9450e42346bc36ca4ff11   4d3433cc9446fcf5038a21b088013762
6a9f4378876423f7d032fc86a5eca4d1_item_type people
6a9f4378876423f7d032fc86a5eca4d1_digest_version md5
bc43e39e7f589ecda3865b39438905af 6a9f4378876423f7d032fc86a5eca4d198646a8b05f9e0de892e98e256097d40
bc43e39e7f589ecda3865b39438905af_item_type family
bc43e39e7f589ecda3865b39438905af_digest_version md5