One recurring question I hear is how to version graph data. Versioning here means to keep track of changes made to the graph. These changes could be:
modifying properties on nodes or relationships (add, remove change properties)
structural changes (add or removal of nodes and relationships)
Ian Robinson wrote a blog post about the topic back in 2014. Unfortunately, his blog is deleted and only available via the internet archive. Many of the ideas here are derived from his post. If you prefer watching videos, there is a great talk from the Nodes 2019 online conference by Lju Lazarevic on that subject. There is also a textual version of that talk, where Lju is also investigating bi-temporal versioning.
There are many different reasons to track changes to the graph and the selected strategy has the fit that needs. Esp, if you want to keep an audit log of changes, the solution will most likely differ from the options discussed here.
Depending on the volatility of your data, keeping old versions might increase the size of the graph significantly. In graph modelling, you usually structure your model for optimized querying. If you are mostly interested in the current state of the graph and only seldom access the history, you will want to structure the versioning in a way that the current state is easy to access.
The general idea proposed by Ian is to separate structure (identity) from state. The following picture shows a single Product node with 2 versions:
|I use colour-coding in the examples. Blue indicates identity nodes/relationships, and red indicates state (change) relationships/nodes.|
Here, all information about the Product is moved to the ProductState nodes. For every change to the Product, a new ProductState node is created. The current state can be found by following the un-terminated STATE relationship (the one without a 'to' property). Depending on the change frequency on the Product node, this can be an expensive operation, as one would have to filter all STATE relationships to find the current state of the Product.
If you need the current state most of the time, keeping the state in the node itself (in our case, the Product) simplifies access. The next picture describes such a model.
Reading the state and history becomes easier in this model, but at the price of more logic when changing data. In this example, I also used a linked list for the history of changes. With this, we only need to store the time of the change in the relationship. Listing the history becomes a simple traversal from the Product node.
Tracking changes to the structure of a graph means tracking changes to the relationships in the graph. For this, we can add
to temporal attributes to the relationships. A missing
to attribute would indicate the current relationship. If no relationship without
to exists then we have no path from the source node to the target node in the current graph.
Depending on the queries you ask your graph and the frequency at with the relationships change, this might not be practical. A solution would be to have 'versioned' relationships with
to attributes and your 'normal' relationships as you would have them in an un-versioned graph. See the next picture for an example:
In this example, changes to the
qt attribute lead to a new version of the relationship. Changing the type of relationships leads to expressive and faster queries, but requires more effort when writing. APOCs apoc.refactor.rename.type() can help with rewriting the relationship.
Modeling the history of a node in a linked list is of particular interest if the state changes often, as Neo4j currently can’t index properties on relationships. Ordering / filtering large amount of relationships might become unpractical.
To track deletions of nodes and keep their history, we have to implement some sort of 'soft deletions', e.g. mark them as deleted, but keep them in the graph. Every relationship touching the node needs to be marked as not valid any more (see above). With this, traversals should not reach the node. For the node itself, we have the following options:
add a boolean flag 'deleted' and a deletion time to the node. This implies, that every query has to be changed to take the flag into account. Missing this, we might include the node into the result for a query asking for the current state of the graph. On the other hand, this option is easy to implement.
add a marker Label 'Deleted'. Basically the same as above, with a slightly different query syntax.
rename the Label and add a deletion time attribute to the node. This is more complex at the deletion time, but we do not need to change any queries that only look at the current state of the graph.
|Changing the type and labels of the nodes/relationships also allows us to restrict access to these nodes using schema based access control introduced in Neo4j 4.0|
To keep track of changes in the graph, we need to detect changes. This could be done in application code, but this is error-prone and easy to forget. Another approach is to use Neo4j versions of triggers. These are plugins written in Java and registered with the Neo4j instance at startup. The Neo4j API allows registering transaction event listeners. As the name implies, these are part of a transaction and have access to all data changes in the transaction. If you are looking for an example implementation, have a look at neo4j-versioner-core.