IBM System G Native Graph Store Overview

Further documentation of Native Store is available in the Software Documentation.


Native Store

IBM System G Native Store provides an efficient graph data store solution that handles simple graphs, property graphs, and RDF-like graphs, covering storage, analytics, and visualization.

Fig 1. System G Native Store Solution

Native Store not only offers persistent graph storage, but also sequential/concurrent/distributed graph runtimes, a set of C++ graph programming APIs, a CLI command set (gShell), a socket client, a socket client GUI, and a visualization toolkit.

Fig 2. System G Native Store Overview


Native Store Introduction

System G Native Store provides several in-memory and on-disk storage options for property graphs. This page primarily introduces System G Native Store, a C++ template-based, high-performance implementation of graph stores. The Native Store is designed around the data access patterns of various graph applications and the architecture of commodity processors, aiming to offer highly efficient graph data access for Big Data analytics. Native Store supports sequential, concurrent, and distributed graph runtime libraries in the System G middleware, and provides

  • C++ APIs for developers
  • a command-based gShell for high-level users
  • a JNI layer for translating the native APIs into TinkerPop APIs
  • a SPARQL layer based on Jena, built on the TinkerPop layer, for RDF queries
For users of different levels, the gShell further provides highly intuitive database operations, supporting
  • local interactive mode in CLI
  • socket-based remote operation mode
  • a Web-based interactive mode
  • a local Qt GUI frontend


Fig 3. System G Native Store Roadmap


System G Native Store Evaluation

We experimentally show the impact of data storage on the performance of graph analytics using a straightforward graph query for visualizing recommendations. We used a real data set from IBM KnowledgeView, which consists of about 72,300 user vertices, 82,100 document vertices, and over 1,740,000 edges. Given the vertex of a document d, the query creates a subgraph consisting of the top 100 documents most relevant to d, building an edge between each found document and d; then, for each found document, it finds the top 10 documents in the same way. We implemented the query on graphs stored in various storages, including two relational databases (Oracle and DB2), a K/V store (Berkeley DB) with Titan, HBase with Titan, a graph store (Neo4j), and the proposed System G. Even though the implemented algorithm is the same, the execution times vary significantly. The two relational databases (RDBs) are much slower, possibly due to the overhead of table joins. The graph stores (Neo4j and System G) show superior performance (1.2 and 0.7 seconds, respectively), since the data is organized both in memory and on disk as a graph, resulting in efficient graph traversal and updates. The striking contrast justifies the importance of a graph data store for large scale graph analytics. Neo4j is based on the JVM, so it is hard to deeply optimize for particular platforms. System G shows improved performance over Neo4j in this experiment because of our efforts in optimizing cache performance. The performance of graph traversal in the various databases is shown below:

Fig 4. Graph traversal time in various databases
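
To make the query concrete, here is a minimal sketch in Blueprints-style Java. It assumes relevance is measured by the number of users a document shares with d in the user-document bipartite graph; the topRelevant helper and the "recommends" edge label are illustrative assumptions, not part of the Native Store API.

    import com.tinkerpop.blueprints.Direction;
    import com.tinkerpop.blueprints.Graph;
    import com.tinkerpop.blueprints.Vertex;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class RecommendationQuery {

        // Hypothetical relevance ranking: score each candidate document by the
        // number of users it shares with d, and return the top k. The actual
        // relevance measure used in the experiment is application-specific.
        static List<Vertex> topRelevant(Vertex d, int k) {
            Map<Vertex, Integer> score = new HashMap<>();
            for (Vertex user : d.getVertices(Direction.BOTH)) {
                for (Vertex doc : user.getVertices(Direction.BOTH)) {
                    if (!doc.equals(d)) {
                        score.merge(doc, 1, Integer::sum);
                    }
                }
            }
            List<Map.Entry<Vertex, Integer>> ranked = new ArrayList<>(score.entrySet());
            ranked.sort((a, b) -> b.getValue() - a.getValue());
            List<Vertex> top = new ArrayList<>();
            for (int i = 0; i < Math.min(k, ranked.size()); i++) {
                top.add(ranked.get(i).getKey());
            }
            return top;
        }

        // The query from the text: top 100 documents relevant to d, an edge from
        // d to each of them, then the top 10 for each found document.
        static void buildRecommendationSubgraph(Graph g, Vertex d) {
            for (Vertex doc : topRelevant(d, 100)) {
                g.addEdge(null, d, doc, "recommends");
                for (Vertex doc2 : topRelevant(doc, 10)) {
                    g.addEdge(null, doc, doc2, "recommends");
                }
            }
        }
    }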


Further Experimental Comparison with Neo4j and Titan BDB

We compare some primitive graph operations using System G Native Store, Neo4j, and Titan over Berkeley DB. These operations are used as building blocks for developing various graph analytics; therefore, the performance of such primitives is highly important for developing efficient graph analytic solutions. We conducted all the experiments on a server with Intel Haswell (Xeon E5-2697 v2) processors at 2.7 GHz and 256 GB of memory, running Red Hat Linux.

We first study the following three primitive operations:

  • AddVtx: add a vertex to the graph
  • AddEdge: add an edge to the graph
  • QryVtx: find a vertex based on its property

We compare the operations with respect to different graph sizes, where the number of vertices ranges from 100,000 to 1,000,000 and the average node degree is 10. The numbers plotted in the figures are averages: we measured the time for adding all vertices and all edges, so the results are not biased toward particular vertices or edges, such as those with few neighbors. We calculated the averages and show them below.
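
As a sketch of how such a micro-benchmark can be driven, the following Java program times the three primitives through the TinkerPop Blueprints interface. TinkerGraph is used here only as a stand-in; the real experiment plugs the Native Store (or Neo4j/Titan) Blueprints implementation behind the same Graph interface, and the batch sizes and timing method below are illustrative rather than the exact experimental setup.

    import com.tinkerpop.blueprints.Graph;
    import com.tinkerpop.blueprints.Vertex;
    import com.tinkerpop.blueprints.impls.tg.TinkerGraph;
    import java.util.Random;

    public class PrimitiveBench {
        public static void main(String[] args) {
            // TinkerGraph is a stand-in; swap in the store under test here.
            Graph g = new TinkerGraph();
            int n = 100000, degree = 10;
            Random rnd = new Random(42);

            // AddVtx: add n vertices, timing the whole batch.
            long t0 = System.nanoTime();
            Vertex[] vs = new Vertex[n];
            for (int i = 0; i < n; i++) {
                vs[i] = g.addVertex(null);
                vs[i].setProperty("name", "v" + i);
            }
            long addVtx = System.nanoTime() - t0;

            // AddEdge: random edges until the average degree reaches 10.
            t0 = System.nanoTime();
            int m = n * degree / 2;
            for (int i = 0; i < m; i++) {
                g.addEdge(null, vs[rnd.nextInt(n)], vs[rnd.nextInt(n)], "link");
            }
            long addEdge = System.nanoTime() - t0;

            // QryVtx: find vertices by a property value.
            t0 = System.nanoTime();
            int q = 1000;
            for (int i = 0; i < q; i++) {
                g.getVertices("name", "v" + rnd.nextInt(n)).iterator().next();
            }
            long qryVtx = System.nanoTime() - t0;

            System.out.printf("AddVtx %.2f us  AddEdge %.2f us  QryVtx %.2f us (per op)%n",
                    addVtx / 1e3 / n, addEdge / 1e3 / m, qryVtx / 1e3 / q);
        }
    }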

We further increased the graph sizes from 1M vertices to 10M and repeated the experiment. The results are plotted as follows and are consistent with the observations from the figures above. Note that we do not have experimental data for Titan over Berkeley DB: the machine hung and did not return results when traversing graphs with over 3 million edges, and we had to terminate the run after waiting for 24 hours.


The bottom-right figure shows the ratio of query times for Neo4j and Titan BDB versus Native Store; both are several times slower than Native Store.

The results show that Neo4j is faster than Native Store only in adding vertices. The reason is that Native Store persists every single vertex insertion, while Neo4j caches the data and writes it back later (write-back instead of write-through). For adding edges and querying vertices, Native Store clearly achieves much better performance: its time stays approximately constant, while Neo4j's increases as the graph gets larger.

Titan shows performance similar to Native Store for inserting vertices and edges, but lower performance for querying vertices. By looking at the Titan source code, we identified two fundamental design issues that might be the reason: (1) Titan's underlying data structure is not linked data, so links cannot be followed directly when traversing a graph; (2) Titan relies strongly on indexing, which can adversely impact database performance when the graph is large and dynamic.

Second, let's study the throughput of graph traversal. After the graph is populated by adding vertices and edges, it is ready for running graph analytics. Many graph analytics require traversal of subgraphs, so traversal performance can be highly critical. In the following experiments, we start from a vertex and traverse an ego network subgraph in BFS manner for three hops. Then, we sort the vertices that are three hops away from the starting vertex. We measured the number of edges traversed in each query (TEPS) and show the results in the following figure. As it shows, Native Store exhibits the best performance, with Neo4j second. Note that the curve for Native Store looks somewhat more fluctuated: because the execution time of our implementation is very low, even small system noise makes the fluctuation look stronger.
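
The traversal pattern can be sketched as follows in Blueprints-style Java; the sorting key (vertex ID) and the convention of counting every scanned edge toward TEPS are assumptions for illustration.

    import com.tinkerpop.blueprints.Direction;
    import com.tinkerpop.blueprints.Vertex;
    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class ThreeHopBfs {
        // Traverse the ego network of root out to three hops in BFS manner,
        // counting every scanned edge so throughput can be reported in TEPS,
        // then sort the vertices found exactly three hops away.
        static long bfsThreeHops(Vertex root) {
            Set<Object> visited = new HashSet<>();
            visited.add(root.getId());
            List<Vertex> frontier = new ArrayList<>();
            frontier.add(root);
            long edges = 0;
            for (int hop = 0; hop < 3; hop++) {
                List<Vertex> next = new ArrayList<>();
                for (Vertex v : frontier) {
                    for (Vertex w : v.getVertices(Direction.BOTH)) {
                        edges++;  // every scanned edge counts toward TEPS
                        if (visited.add(w.getId())) {
                            next.add(w);
                        }
                    }
                }
                frontier = next;
            }
            // Sort the three-hop frontier (here simply by vertex ID).
            frontier.sort((a, b) -> a.getId().toString().compareTo(b.getId().toString()));
            return edges;
        }
    }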


We further increased the graph sizes and obtained consistent observations on graph traversal throughput as follows.


Third, let's study scalability. It is worth noting that all the above comparisons were done using a SINGLE thread in System G Native Store. Native Store offers three runtime libraries: a single-thread version, a concurrent (multithreading) version, and a distributed version (with PAMI), all with almost the same graph APIs, which greatly reduces porting effort. No such choice is available in Neo4j or Berkeley DB. Using multithreading, Native Store's throughput increases further, even though it already outperforms Neo4j by a wide margin with a single thread. The following figure shows the scalability of Native Store on some third-party queries on an LDBC social network graph benchmark.


The above evaluation clearly shows the superior performance of Native Store compared to the open source solutions. There are a couple of reasons explaining the advantages of System G Native Store:

  • First, Native Store is designed from scratch for highly efficient graph computing. Its data structures are optimized at multiple layers (disk storage, graph-specific file caching, a CPU-cache-amenable in-memory graph data structure, an optimized scheduler for concurrent graph access, etc.) for graph computing characteristics, such as irregular data access patterns. For example, our CPU-cache-amenable in-memory graph data structure improves the cache hit rate; the L1D cache hit rate is quite high, which is uncommon for general graph representations.
  • Second, the C/C++ based kernel of the graph store easily incorporates various hardware innovations. Compared to many open source competitors, our solution offers more opportunities for performance optimization. In contrast, the Java-based packages (Titan, Neo4j) are difficult to optimize below the JVM layer.
  • Third, unlike Titan, where a graph front-end is provided but a non-graph back-end is used, we organize data as a graph across all layers. Therefore, at each layer, we can optimize operations for graph computing. It is straightforward to see that, if the back-end is not a graph, some graph behaviors may not be handled well. For example, the Titan back-end relies heavily on indices. Although indexing helps, it can also introduce overhead when the graph is very dynamic, where many indices must be updated at runtime. Native graph stores, such as our Native Store and Neo4j, have a graph as the back-end, where graph data access and traversal are straightforward.
  • Finally, to implement such cross-layer optimization for graphs, we built our team with researchers whose backgrounds span graph analytics, systems, databases, POWER architecture, high performance computing, and compilers. This multidisciplinary team ensures the success of the solution.



Examples of Native Store REST APIs

Each gShell command can be converted into a REST-like API. Here we show a few examples; more details can be found at this webpage. A sketch of invoking one of these endpoints from Java follows the list.
  • List all stores and their types. This command lists all existing stores (opened or not) and their respective graph types (directed, undirected, pred_directed). Note that this command does not load a store if it is not already opened.
    http://yourserver/nsREST?cmd=list_all
  • Create a store. Assume we want to create a graph store called <store_name> and use it to store an undirected graph. Note that a graph in gShell need not be connected; that is, a graph store can maintain a set of small graphs, but all of them must be of the same type (directed, undirected, or pred_directed). The command for creating such a store is:
    http://yourserver/nsREST?cmd=create&storename=<store_name>&graphtype=<type> 
  • To get the number of vertices in a graph, we use the following command.
     http://yourserver/nsREST?storename=<store_name>&cmd=get_num_vertices
  • To get the number of neighbors of a vertex.
    http://yourserver/nsREST?storename=<store_name>&cmd=get_num_neighbors&vertex_id=<vertex_id>
  • To query all neighbors of a vertex, we use the following command. We just need to provide a vertex ID.
    http://yourserver/nsREST?storename=<store_name>&cmd=query_neighbors&vertex_id=<vertex_id>
  • Find the vertex with the maximum node degree. If the arguments i and val are given, it searches only among vertices satisfying the condition, i.e., vertices whose i-th property equals val. It can also find the vertices with the top n degrees.
    http://yourserver/nsREST?storename=kv11&cmd=find_vertex_max_degre
  • Find n vertices at random from a graph. This command lets users obtain some vertices to use as starting points for certain analytics. Here <#n> is a number, say 10.
     http://yourserver/nsREST?storename=<store_name>&cmd=find_random_vertices&n=<#n>
  • A user-controlled Breadth First Search (BFS) can be invoked by the following command. <vertex_id> is an arbitrary vertex in the graph stored in <store_name>, and <#hops> gives the maximum number of BFS levels; <#max_breadth> gives the maximum number of vertices to traverse at each BFS level. This limit exists for visualization purposes, where we do not want to visualize all the edges of dense vertices. The output is in JSON format.
    http://yourserver/nsREST?storename=<store_name>&cmd=bfs_visual&root=<vertex_id>&hops=<#hops>&max_breadth=<#max_breadth>
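
Since these are plain HTTP GET endpoints, they can be invoked from any HTTP client. Below is a minimal Java sketch; the server name, store name, and vertex ID are placeholders to be replaced with real values.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class NsRestExample {
        public static void main(String[] args) throws Exception {
            // Replace "yourserver", the store name, and the vertex ID.
            String url = "http://yourserver/nsREST?storename=mystore"
                    + "&cmd=query_neighbors&vertex_id=42";
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body());  // JSON result from gShell
        }
    }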



Native Store Demo Online (under construction for IBM Cloud)

Click any one of the following images to access an online demonstration of Native Store, where a user can upload graph data in various formats, such as CSV, and then perform queries. The query result can be visualized if the command has a JSON output.

The online demo pages are actually built on top of gShell. An easy way to try out the demo is to invoke collaborative filtering in gShell. We can use the dataset kv11test3, a pre-loaded graph of the IBM KnowledgeView dataset in CSV format. Then, the following command can be typed into the query text box:


centroid_visual Q727718S84162V06 10 10 json

This is an instance of the gShell command:


centroid_visual <vertex_id> <#rank1> <#rank2> [json]

The first argument is the queried vertex ID. The command performs a 2-hop collaborative filtering (colFilter) and takes the top <#rank1> nodes; then, for each of them, it performs another 2-hop filtering and takes the top <#rank2> nodes. The top <#rank1> results from the first filtering and the top <#rank2> results from the subsequent filterings are aggregated and returned. The result is formatted as JSON if the optional json argument is specified.


Fig 5. System G Native Store Online Demo (Click to invoke)

Then, a user can click/choose Text to see the JSON output, or click/choose Visualize to visualize the results.


Note: the gShell version in the online demo is older than the latest one that you can download. The documentation describes the latest version of the Native Store, so some commands may not work in the demo.




TinkerPop over Native Store

TinkerPop Blueprints

System G has a JNI layer that translates the Native Store graph APIs into the TinkerPop APIs. Therefore, Java graph applications built on top of TinkerPop Blueprints can be ported onto the System G Native Store, and various open source tools can be integrated into IBM System G.

System G supports Java clients through an in-process JNI layer that maps the concepts and methods of the multi-property C++ layer to Java static methods. This has allowed the System G team to leverage existing open source Java-based code bases to implement additional graph features. As a result, System-G users have the ability to access their graphs through Groovy, Gremlin and SPARQL. In most cases Java clients do not program to the JNI interface but instead program to a TinkerPop Blueprints layer built upon the JNI methods.

System G provides TinkerPop Blueprints interfaces to both its high-performance C++ implementations and its HBase-based GBase graphs. This gives customers the ability to use the TinkerPop suite against their NativeStore and GBase graphs. Clients have found the most valuable of these to be Gremlin. We have high hopes for TinkerPop Rexster, but so far we have found the public Rexster implementations (2.3, 2.4) to be unstable and have not found time to diagnose and patch the problems. The System G team itself has found the TinkerPop test suite to be the most valuable portion of the TinkerPop suite and has found it handy for rapidly validating portions of System G.
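
For a flavor of what programming against the Blueprints layer looks like, here is a minimal sketch. TinkerGraph is used as a stand-in; with the NativeBlueprints dependency below, the Native Store Blueprints graph class (see the artifact's own documentation for its exact name) would be instantiated instead.

    import com.tinkerpop.blueprints.Direction;
    import com.tinkerpop.blueprints.Edge;
    import com.tinkerpop.blueprints.Graph;
    import com.tinkerpop.blueprints.Vertex;
    import com.tinkerpop.blueprints.impls.tg.TinkerGraph;

    public class BlueprintsExample {
        public static void main(String[] args) {
            // Stand-in Blueprints implementation; swap in the Native Store
            // graph class when using the NativeBlueprints artifact.
            Graph g = new TinkerGraph();

            Vertex alice = g.addVertex(null);
            alice.setProperty("name", "Alice");
            Vertex report = g.addVertex(null);
            report.setProperty("title", "Q3 report");

            Edge authored = g.addEdge(null, alice, report, "authored");
            authored.setProperty("year", 2014);

            // Traverse: print who authored what.
            for (Vertex v : g.getVertices()) {
                for (Edge e : v.getEdges(Direction.OUT, "authored")) {
                    System.out.println(v.getProperty("name") + " authored "
                            + e.getVertex(Direction.IN).getProperty("title"));
                }
            }
            g.shutdown();
        }
    }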

Trying it out

If you'd like to try out TinkerPop over NativeStore, you can simply use Maven to include it in your project on Linux:
  <properties>
    <nexus.url>http://vcqa-1.pok.ibm.com:8081/nexus</nexus.url>
    <nexus.snapshots>${nexus.url}/content/repositories/snapshots/</nexus.snapshots>
    <nexus.internal>${nexus.url}/content/repositories/releases/</nexus.internal>
    <nexus.forwala>${nexus.url}/content/repositories/thirdparty/</nexus.forwala>
  </properties>
  <repositories>
        <repository>
          <id>nexus_internal2</id>
          <url>${nexus.internal}</url>
          <releases><enabled>true</enabled></releases>
          <snapshots><enabled>false</enabled></snapshots>
        </repository>
        <repository>
          <id>nexus_forwala</id>
          <url>${nexus.forwala}</url>
          <releases><enabled>true</enabled></releases>
          <snapshots><enabled>false</enabled></snapshots>
        </repository>
        <repository>
          <id>nexus_snapshots2</id>
          <url>${nexus.snapshots}</url>
          <releases><enabled>false</enabled></releases>
          <snapshots><enabled>true</enabled></snapshots>
        </repository>
  </repositories>
  <dependencies>
  ...
    <dependency>
        <groupId>com.ibm.research.systemg</groupId>
        <artifactId>NativeBlueprints</artifactId>
        <version>2.4.0-SNAPSHOT</version>
    </dependency>
     ...
  </dependencies>
      
or you might find it easier to simply pull down some of our sample code from git and modify it to suit your needs:

      git clone git://git01.pok.ibm.com/SGAPISamples       
      
See git01.pok.ibm.com for more information about our git repository.


Gremlin over Native Store

    marko$ ./bin/gremlin.sh

             \,,,/
             (o o)
    -----oOOo-(_)-oOOo-----
    gremlin>

Since TinkerPop is supported, Gremlin is available as well. Gremlin is a domain-specific language for traversing graphs, designed to work with property graphs. Using Gremlin, it is possible to use a REPL (command line/console) to interactively traverse a graph. Since Native Store provides the TinkerPop/Blueprints interface via JNI, Gremlin runs directly on Native Store. The following page provides instructions on installing Native Store Gremlin and using the Gremlin Console: Getting Started with Gremlin (/db/doc/gremlin.html)
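
Gremlin (2.x) can also be embedded in Java through GremlinPipeline. Here is a minimal sketch, again using TinkerGraph as a stand-in for a Native Store graph exposed through Blueprints:

    import com.tinkerpop.blueprints.Graph;
    import com.tinkerpop.blueprints.Vertex;
    import com.tinkerpop.blueprints.impls.tg.TinkerGraph;
    import com.tinkerpop.gremlin.java.GremlinPipeline;

    public class GremlinExample {
        public static void main(String[] args) {
            // Stand-in graph; a Native Store graph exposed through Blueprints
            // can be traversed the same way.
            Graph g = new TinkerGraph();
            Vertex alice = g.addVertex(null);
            alice.setProperty("name", "alice");
            Vertex bob = g.addVertex(null);
            bob.setProperty("name", "bob");
            g.addEdge(null, alice, bob, "knows");

            // Gremlin, Java dialect: the names of the vertices alice knows.
            GremlinPipeline<Vertex, Object> pipe =
                    new GremlinPipeline<Vertex, Object>(alice).out("knows").property("name");
            for (Object name : pipe) {
                System.out.println(name);  // prints "bob"
            }
        }
    }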


SPARQL over Native Store

JENA

A Jena-based SPARQL query engine is installed on top of the System G Native Store. Queries are translated into TinkerPop graph APIs, and then into the Native Store APIs. Thus, Native Store opens an opportunity to leverage several open source graph solutions for rapid application development. System G provides SPARQL support: the initial implementation was based on TinkerPop's SailGraph and GraphSail classes, while current work is based on Apache Jena and focuses on improving the performance of that implementation.
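
For a flavor of the SPARQL path, here is a minimal Jena (3.x) sketch against an in-memory model; with the System G integration, the model would instead be backed by Native Store through the TinkerPop layer, and the example data below is purely illustrative.

    import org.apache.jena.query.Query;
    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.query.QueryFactory;
    import org.apache.jena.query.QuerySolution;
    import org.apache.jena.query.ResultSet;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.rdf.model.Resource;

    public class SparqlExample {
        public static void main(String[] args) {
            // In-memory model as a stand-in for a Native Store backed graph.
            Model model = ModelFactory.createDefaultModel();
            Resource alice = model.createResource("http://example.org/alice");
            alice.addProperty(model.createProperty("http://example.org/knows"),
                    model.createResource("http://example.org/bob"));

            String q = "SELECT ?s ?o WHERE { ?s <http://example.org/knows> ?o }";
            Query query = QueryFactory.create(q);
            try (QueryExecution qexec = QueryExecutionFactory.create(query, model)) {
                ResultSet results = qexec.execSelect();
                while (results.hasNext()) {
                    QuerySolution sol = results.next();
                    System.out.println(sol.get("s") + " knows " + sol.get("o"));
                }
            }
        }
    }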