IBM System G Native Graph Store Overview

Further documentation of Native Store:
(1) Software Documentation



Native Store

IBM System G Native Store provides an efficient graph data store solution that can handle various graphs, property graphs, and RDF-like graphs, in terms of storage, analytics, and visualization. It includes both Scale Up (which fully utilizes memory and storage of single machine) and Scale Out (which distributes to many machines) features. Graph Database is one of the 8 categergroies of IBM System G Toolkits.


Native Store Solution
Fig 1. System G Native Store Solution

Native store not only offers persistent graph storage, but also sequential/concurrent/distributed graph runtimes, a set of C++ graph programming APIs, a CLI command set (gShell), a socket client, a socket client GUI, and some visualization toolkit.

db_arch
Fig 2. IBM System G Native Store Overview


Native Store Introduction

IBM System G Native Store provides several in-memory and on-disk storages for property graphs. This web primarily introduces System G Native Store, a C++ template based high performance implementation of graph stores. The Native Store is designed according to the data access patterns in various graph applications and the architecture of commodity processors, aiming at offering highly efficient graph data access for Big Data analytics. Native Store supports sequenetial, concurrrent, and distributed graph runtime libraries in IBM System G middleware, and provides

  • C++ APIs for developers
  • a command based gShell for high leqvel users
  • a JNI layer for translating the native APIs into TinkerPop APIs
  • a SPARQL layers based on JENA through the TinkerPop layers for RDF queries
For users of different levels, the gShell further provides highly intuitive database operations, supporting
  • local interactive mode in CLI
  • socket-based remote operation mode
  • a Web-based interactive mode
  • local Qt GUI frontend.


roadmap
Fig 3. IBM System G Native Store Roadmap


Chacteristics towards better performance:

There are couple of reasons explaining the advantages of IBM System G Native Store:

  • First, native store is designed from scratch for highly efficient graph computing. Its data structures are optimized at multiple layers (disk storage, graph-specific file caching, cpu-cache amenable in-memory graph data structure, optimized scheduler for concurrent graph access, etc.) towards graph computing characteristics, such as the irregular data access patterns. For example, our cpu-cache amenable in-memory graph data structure improves cache hit rate. The L1D cache hit rate is pretty high, which is not common for general graph representations.

  • Second, the C/C++ based kernel of the graph store is easy to incorporate with various hardware innovations. Compared to many open source competitors, our solution offers more opportunties for performance optimization. In constrast, the Java-based packages have challengs to be optimized when beyond the JVM layer.

  • Third, unlike some open source graph databases where only a graph front-end is provided, but a non-graph back-end is used, we organize data always using graph across layers. Therefore, at each layer, we can optimize operations towards graph computing. It is straightforward to see that, if the back-end is not a graph, some graph behaviors may not be well handled. For example, some open source graph database's back-end heavily relies on indices. Although indexing helps, it can also introduce overheads when the graph is very dynamic, where many indices must be updates at runtime. Native graph store, such as our native store and Neo4j, have back-end as a graph, where graph data access and traversal is straightforward.

  • At last, for implementing such cross-layer optimization for graphs, we build our team by having researchers with backgrounds in graph analytics, systems, databases, POWER architecture, high performance computing, compilers. The multi-disciplineary team ensures the success of the solution.

Details of the performance improvement, especially based on Caching, Scheduling, and Graph Structure, can be seen on IBM System G Native Store paper published at IEEE BigData 2014.




System G Native Store Evaluation

Published on IEEE BigData 2014, we experimentally showed the impact of data storage on the performance of graph analytics using a straightforward graph query for visualizing recommendation. We measured the performance on two applications (visualization and recommendation) of a production system of IBM knowledgeView, which consists of about 72,300 user vertices, 82,100 document vertices, and over 1,740,000 edges. Given a vertex of a document d, the graph query creates a subgraph consists of top 100 most relevant documents to d, and we built an edge between the found document and d; then, for each found document we find the top 10 documents again. We implemented the query on graphs that were stored in various storages, including a K/V store (Berkeley DB) with Titan, a HBase with Titan, a graph store (Neo4j) and IBM System G Native Store.

Even though the implemented algorithm was the same, the performance of execution time varies significantly on different DBs. Neo4j and System G Native Store showed superior performance (1.2 and 0.7 seconds, respectively), since the data is organized both in memory and on disk as a graph, resulting in efficient graph traversal and updates. The performance of Titan (6.8 seconds with BerkeleyDB and 5.7 seconds with HBase) was not as good as Neo4j and System G, probably due to the fact that it is only a graph representation on the interface level but not storing data as graph structure.

The performances of the collaborative filtering application, which returns the top 10 related docs, are: Neo4j: 0.07 sec, Titan BerkeleyDB: 0.28 sec, Titan HBase: 0.41 sec, System G GBase: 0.20 sec, and System G Native Store: 0.015 sec.

Neo4j is based on JVM, so it is hard to deep optimize on particular platforms. System G shows improved performance compared, because of our efforts on optimizing the cache and scheduling performance. The performance of graph traversal in various databases is shown below:

experiment
Fig 4. Graph traversal time in various databases


Further Experimental Comparison with Neo4j and Titan BDB

We compared primitive graph operations using System G Native Store, Neo4j, and Titan over Berkeley DB. These operations are used as building blocks for developing various graph analytics. Therefore, the performance of such primitives are highly important for developing efficient graph analytic solutions. We conducted all the experiments on a server with Intel Hashwell (Xeon E5-2697 v2) processors at 2.7 GHz. The memory size is 256 GB. The OS is the RedHat Linux.

We first studied the following three primitive operations:

  • AddVtx: add a vertex to the graph
  • AddEdge: add an edge to the graph
  • QryVtx: find a vertex based on its property

We compared the operations wrt different graph sizes, where the number of vertices ranges from 100,000 to 1,000,000, and the average node degree is 10. The number for plotting the figures are the average results. We actually measured the time for adding all vertices and adding all edges. Thus, the result is not biased toward some vertices or edges, such as those with a few neighbors. We calculated the average and show them below.

We further increased the graph sizes from 1M vertices to 10M and repeated the experiment. The results are plotted as follows, which exhibit consistent comparison as what we can observe from the above figures. Note that we do not have the experimental data for Titan over Berkeley DB, the machine hanged and did not return results for traversing the graph over 3 million edges. We had to terminate the machine after waiting for 24 hours.


The bottom-right figure shows the ratio of query times for Neo4j and Titan BDB vs Native Store. They are several times slower than Native Store.

The results showed that Neo4j is faster than Native Store only in adding vertices. The reason is that Native Store persists every single vertex adding; while Neo4j caches the data and writes it back (instead of write through). In terms of adding edges and query vertices, it clearly shows than Native Store achieves much better performance. The time is approximately in constant; while such time for Neo4j increases as the graph gets larger.

For Titan, it shows similar performance with Native Store for inserting vertices and edges, but lower performance for query vertices. By looking at the Titan source code, we realized two fundamental design issues might be the reason: (1) Titan's underlying data structure is not linked data, thus we can not directly follow the links when traversing a graph; (2) Titan strongly relies on indexing, which can adversely impact the database performance when the graph is large and dynamic.

Second, let's study the throughput of graph traversal. After the graph is populated by adding vertices and edges, it is ready for running graph analytics. Many graph analytics requires traversal of subgraphs. So, the traversal peformance can be highly critical. In the following experiments, we start from a vertex and travse an ego network subgraph in BFS manner for three hops. Then, we sort the vertices at three hops away from the starting vertex. We measured the number of edges traversed in each query (TEPS) and show the results accordingly. The result is shown in the following figure. As it shows, Native Store exhibits the best performance, and Neo4j is the second. Note that the curve for Native Store look a little bit more fluctuated. This is caused by the very low execution of our implementation, so any system noise can make the fluctuation look stronger.


We further increased the graph sizes and obtained consistent observations on graph traversal throughput as follows.


Third, let's study the scalability. It is worth noting that all the above comparisons were done using a SINGLE thread in System G Native Store. Native Stores offers three runtime libraries: a single-thread version, a concurrent (multithreading) version, and a distributed version (with PAMI), all with almost the same graph APIs, for greatly reducing your porting efforts. There is no such choice available in Neo4j or BerkeleyDB. Using multithreading, Native Store has increasing throughput, even though we outperforms Neo4j a lot already with a single thread. The following figure shows the scalability of Native Store on some third-party queries on a LDBC social network graph benchmark.


The above evaluation clearly shows the superior performance of native store compared to the open source solutions.

(Update: 1/31/15) We kept receiving requests to compare performance on various datasets and applications. For insatnce, in Summer/Fall 2014, we were requested to evaluate an application that needs to find paths in a property graph of 12.2 million edges and 2.2 million vertices. The goal was to find and keep trace of all the paths from a start node (drug) to a target node (sympton) in this graph of depth 23. In that experiment which measured the average time of 100 use cases applied on these graph DBs: Titan HBase took 2682 seconds, Titan BerkeleyDB took 794 seconds, Neo4j took 8.3 seconds, System G Native Store with Java interface took 6.2 seconds, and System G Native Store with native interface took 4.2 seconds. Note that neither Neo4 nor System G needed tuning. The default configuration of Titan did not return results. Our engineer (senior technical staff member) spent 3 weeks to tune Titan until we got the above best results.


Note that the above scientific experiments and technical discussions were done by researchers in IBM T. J. Watson Research Center by investigating open source codes to understand why they performed the way as observed. It is a best-effort research investigation, which does not imply any policy.






Examples of Native Store REST APIs

Each gShell command can be converted into a REST-like APIs. Here are show a few exampels. More details can be found at the Cloud webpage.
  • List all stores and their types. This command will list all existing stores (opened or not opened) and their respective graph type (directed, undirected, pred_directed). Note that this command will not load any store if they are not opened already.
    http://yourserver/nsREST?cmd=list_all
  • Create a store. Assuming we want to create a graph store called "", and we want use it to store an undirected graph. Note that a graph is not necessarily to be connected in gShell; that is, a graph store can maintain a set of small graphs, but all of them must be of the same type (directed, undirected, or pred_directed). The command for creating such a store is:
    http://yourserver/nsREST?cmd=create&storename=<store_name>&graphtype=<type> 
  • To get the number of vetices in a graph, we use the following command.
     http://yourserver/nsREST?storename=<store_name>&cmd=get_num_vertices
  • To get the number of neighbors of a vertex.
    http://yourserver/nsREST?storename=<store_name>&cmd=get_num_neighbors&vertex_id=<vertex_id>
  • To query all neighbors of a vertex, we use the following command. We just need to provide a vertex ID.
    http://yourserver/nsREST?storename=<store_name>&cmd=query_neighbors&vertex_id=<vertex_id>
  • Find the vertex with the maximum node degree. If the condition i and val are given, it finds the vertex only from those satisfying the condition, i.e., the i-th property of the vertex is equal to val. It also finds the vertices with the top #n number of degrees.
    http://yourserver/nsREST?storename=kv11&cmd=find_vertex_max_degre
  • Find n vertices randomly from a graph. This command is for users to get some vertices, so that they can use such vertices as start points for certain analytics. is a number, say 10.
     http://yourserver/nsREST?storename=<store_name>&cmd=find_random_vertices&n=<#n>
  • A user controled Breadth First Search (BFS) can be invoked by the following commands. is an arbitrary vertex in the graph stored in , and #hops shows the maximum allowed BFS levels. gives the maximum number of vertices to traverse at each BFS level. This is for visualization purpose, where we do not want to visualize all the edges for dense vertices. The output is in JSON format.
    http://yourserver/nsREST?storename=<store_name>&cmd=bfs_visual&root=<vertex_id>&hops=<#hops>&max_breadth=<#max_breadth>



Native Store Demo Online (please refere to IBM System G on Cloud webpage for the latest updates/decriptions/links.)

Click any one of the following images to access an online demonstration of native store, where a user can upload graph data in various formats, such as the CSV format, and then perform queries. The query result can be visulized, if the command has a JSON output.

The online demo webs are actually built on top of gShell. An easy way to try out the demo is to invoke the collaborative fitlering in gShell. We can use the dataset kv11test3, a pre-loaded graph of IBM KnowledgeView dataset in CSV format. Then, in the query text box, the following command can be typed in:


centroid_visual Q727718S84162V06 10 10 json

This is an instance of the gShell command:


centroid_visual \ \<#rank1\> \<#rank2\> [json]

The first argument for this command is also the queried vertex ID, but it performs a 2 hops colFilter and got the top \<#rank1\> nodes, and then for each of them perform 2 hops again and get the top 10 nodes. The results are of the top #rank from the 1st colFilter and the top \<#rank2\> of the remaining colFilter are aggregated to return. The result can be formatted into JSON if the optional argument is specified.


demo  
demo_lite
Fig 5. IBM System G Native Store Online Demo (Click to invoke)

Then, a user can click/choose Text to see the JSON output, or click/choose Visualize to visualize the results.


Note: the gShell version in the online demo is older than the latest one that you can downloaded. The documentation describes the latest version of the Native Store, so some command may not work in the demo.




TinkerPop over Native Store

TinkerPop Blueprints

IBM System G has a JNI layer to translate the Native Store graph APIs into the TinkerPop APIs. Therefore, JAVA graph applications built on top of the TinkerPop Blueprint can be ported onto the IBM System G Native Store. Therefore, various Open Source tools can be integrated into the IBM System G.

IBM System G supports Java clients through an in-process JNI layer that maps the concepts and methods of the multi-property C++ layer to Java static methods. This has allowed the System G team to leverage existing open source Java-based code bases to implement additional graph features. As a result, IBM System G users have the ability to access their graphs through Groovy, Gremlin and SPARQL. In most cases Java clients do not program to the JNI interface but instead program to a TinkerPop Blueprints layer built upon the JNI methods.

System G provides TinkerPop Blueprints interfaces to both it's high-performance C++ implementations and its HBase-based GBase graphs. These gives customers the ability to use the TinkerPop suite against their NativeStore and GBase graphs. Clients have found the most valuable of these to be Gremlin. We have high hopes for TinkerPop Rexster, but so far we've found the Rexster (2.3, 2.4) public implementations to be unstable and have not found time to diagnose and patch the problems. The System-G team itself has found the TinkerPop test suite to be the most valuable portion of the TinkerPop suite and has found it handy for rapidly validating portions of IBM System G.

Trying it out

If you'd like to try out TinkerPop over NativeStore, you can simply use maven to include it in your project on Linux
  <properties>
    <nexus.url>http://vcqa-1.pok.ibm.com:8081/nexus</nexus.url>
    <nexus.snapshots>${nexus.url}/content/repositories/snapshots/</nexus.snapshots>
    <nexus.internal>${nexus.url}/content/repositories/releases/</nexus.internal>
    <nexus.forwala>${nexus.url}/content/repositories/thirdparty/</nexus.forwala>
  </properties>
  <repositories>
        <repository>
          <id>nexus_internal2</id>
          <url>${nexus.internal}</url>
          <releases><enabled>true</enabled></releases>
          <snapshots><enabled>false</enabled></snapshots>
        </repository>
        <repository>
          <id>nexus_forwala</id>
          <url>${nexus.forwala}</url>
          <releases><enabled>true</enabled></releases>
          <snapshots><enabled>false</enabled></snapshots>
        </repository>
        <repository>
          <id>nexus_snapshots2</id>
          <url>${nexus.snapshots}</url>
          <releases><enabled>false</enabled></releases>
          <snapshots><enabled>true</enabled></snapshots>
        </repository>
  </repositories>
  <dependencies>
  ...
    <dependency>
        <groupId>com.ibm.research.systemg</groupId>
        <artifactId>NativeBlueprints</artifactId>
        <version>2.4.0-SNAPSHOT</version>
    </dependency>
     ...
  </dependencies>
      
or you might find it easier to simply pull down some of our sample code from git and modify it to suit your needs :

      git clone git://git01.pok.ibm.com/SGAPISamples       
      
A Cloud trial version can be found here.


Gremlin over Native Store

marko$ ./bin/gremlin.sh \,,,/ (o o) -----oOOo-(_)-oOOo----- gremlin>

Since Tinkerpop is supported, Gremling is running. Gremlin is a domain specific language for traversing graphs. By using Gremlin, it is possible make use of a REPL (command line/console) to interactively traverse a graph. Gremlin was designed to work with a type of graph called a property graph. By using Gremlin, it is possible make use of a REPL (command line/console) to interactively traverse a graph. Since Native Store provides Tinkerpop/Blueprints interface via JNI, Gremlin is running on Native Store. In the follow webpage, we provides instructions on installing Native Store Gremlin and using the Gremlin Console. Getting Started with Gremlin (/db/doc/gremlin.html)


SPARQL over Native Store

JENA

A JENA based SPARQL query engine is installed on top of the System G Native Store. The queries are translated into graph APIs in TinkerPop, and then translated into the Native Store APIs. Thus, Native Store opens an opportunity to leverage several Open Source graph solutions for some rapid application developement. System G provides SPARQL support. The initial implementation of this was based on TinkerPop's SailGraph and GraphSail classes. Current work is based on Apache Jena and is focusing on improving the performance of that implementation.