Semedia Logo
RDFSync efficient synchronization of RDF models
By Giovanni Tummarello, Christian Morbidoni

Introduction

Remote RDF model synchronization is a procedure by which one efficently updates a local model based on a remote one. By updating one can mean "make identical to the remote" (classic RSync) but in this case it might mean "patch" with information from a remote model not present in the current one.

Classic procedures such as RSync cannot be applied to RDF models since they usually serialize nondeterministically. Even if a deterministic serialization algorithm such as [2] is used, the performance of RSync remain very poor, as minimal modifications in the model might cause completely different serializations.

In this paper we'll illustrate an algorithm for efficient RDF Synchronization called RDFSync.

The RDFSync algorithm

According to the Minimum Self Contained Graph (MSGs) [1] theory, a graph can be decomposed univocally into a list of MSG (usually a set, but not necessarily so).

Each MSG can then be serialized canonically using algorithms such as [2] and hashed to an appropriate number of bits to reasonably avoid collision (math and common sense say 128-160 bits will do).

In the RDFSync algorithm, ordered lists of such hashes are created independently at each end and then either exchanged directly, (which can be shown to be most efficient for large changes) or RSynced, (highly efficient for small changes).

Once the 2 lists are available to the requesting host, a simple ordered list diff is computed so to obtain:

  • The list of MSGs to delete from the local model
  • The list of MSGS to request from the other host

The other side then simply sends the requested MSG and the synchronization is over.


Experimental results


Traffic wise the procedure is highly efficient and greatly outperforms any other alternative we could think of.

The following graphs illustrate the traffic generated in a RDFSynchronization of 2 remote models. Traffic is shown as function of the number of different MSGs in the models which originally contains a base knowledge of 100 identical MSGs (of various sizes and nature, carrying 0 to 3 blank and 1 to 8 triples). Results have been averaged over 10 runs made each with a different randomly generated base knowledge and delta MSGs.

Image


In case of larger DBs the efficiency is even grater. These are the results of the same experiment performed on a larger initial DB.

Image


The Canonics plot refers to the best alternative way we could think of, RSyncing the canonically serialized models.

The other plots refer to 3 flavours of RSync :

MD5 creates a list of MD5 of the MSG decomposition of the graph then applies RSync to synchronize such list. It then transfers the missing MSGs. As expected this is the most efficient method for small deltas

NotRSync is as the previous but exchanges the whole list instead of applying RSync. Most efficient for large deltas

MD5Canonics creates an ordered list of MSGs expressed not as MD5 but as RDFXML. This might be considered as an alternative form of canonical serialization. It is still more efficient than standard canonical serialization+rsync but worse than the other techniques

ISSUES:

Complexity
Computationally speaking, evaluating the MSGs each time one wants to RDFSync is a time consuming procedure (although still O(n) with n the number of triples). Solution is to use caching techniques so that each time a new information is inserted the MSG is calculated. MSGs are by definition immutable (mutating one can be seen as removing and adding another) so this is possible with great efficiency.

Special cases
There might be issues in certain situations where the canonical serialization procedure will behave in a non deterministic way. I.e. the case for very complex blank node composition, something which is luckly not very likely to happen in regular use cases.
The theory can be extended to handle these just fine. The key point is that there is a _finite_ number of possible serializations so it is just a matter of computing all the possible hashes and taking in consideration the equivalence of some in the synchronization procedure. This is something the current release does not address (it might in the future) but the worse thing that could happen in this extreme condition is that such MSGs appeared duplicated after an RDFSync.

Implementation

RDFSync has been implemented and is distributed as part of the RDFContextTools.
results of the implementation have been validated with external RDF model comparison tools such as [6]


Conclusions

The need for remote RDF Synchronization has been highlighted since early SW postings such as [3]. We believe that the proposed procedure opens the way to a number of interesting uses. Beside the obvious ones related to scalability of RDF based web applications, RDFSync enables efficient P2P exchanges of RDF thus greatly improving performances of data exchanging P2P SW algorithms and applications such as RDFGrowth [4] and DBin [5] respectively.

Acknowledgments

Thanks go to M. Leggieri, F.Piazza and P.Puliti.

References

[1] G.Tummarello, C. Morbidoni, P. Puliti, F. Piazza Signing Individual Fragments of an RDF Graph, WWW2005 poster track
[2] J. Carroll Signing RDF Graphs Tech Report: HPL-2003-142: Graphs
[3] T. Berners-Lee? and D. Connolly http://www.w3.org/DesignIssues/Diff
[4] G. Tummarello, C. Morbidoni, J. Petersson, F. Piazza, P. Puliti, "RDFGrowth, a P2P annotation exchange algorithm for scalable Semantic Web applications", Mobiquitous, August 23 -25, 2004, Boston, MA
[5] The DBin Project (cache)
[6] RDF Utils (cache) Reto Bachmann-Gmür

Created by: admin last modification: Tuesday 21 of March, 2006 23:29:12 UTC by admin




login
Login
user
pass
Search
in: