IBM System G Native Graph Store Overview
IBM System G Native Store provides an efficient graph data store solution that handles various graph types, including property graphs and RDF-like graphs, in terms of storage, analytics, and visualization. It includes both Scale Up (fully utilizing the memory and storage of a single machine) and Scale Out (distributing across many machines) features. Graph Database is one of the eight categories of IBM System G Toolkits.
Native Store not only offers persistent graph storage, but also sequential/concurrent/distributed graph runtimes, a set of C++ graph programming APIs, a CLI command set (gShell), a socket client, a socket client GUI, and a visualization toolkit.
Native Store Introduction
IBM System G Native Store provides several in-memory and on-disk storage options for property graphs. This page primarily introduces System G Native Store, a C++ template-based, high-performance implementation of graph stores. The Native Store is designed around the data access patterns of various graph applications and the architecture of commodity processors, aiming to offer highly efficient graph data access for Big Data analytics. Native Store supports the sequential, concurrent, and distributed graph runtime libraries in the IBM System G middleware, and provides:
- C++ APIs for developers
- a command-based gShell for high-level users
- a JNI layer for translating the native APIs into TinkerPop APIs
- a SPARQL layer based on Jena, built on the TinkerPop layer, for RDF queries
- local interactive mode in CLI
- socket-based remote operation mode
- a Web-based interactive mode
- a local Qt GUI frontend
Characteristics for better performance:
There are several reasons for the performance advantages of IBM System G Native Store:
- First, Native Store is designed from scratch for highly efficient graph computing. Its data structures are optimized at multiple layers (disk storage, graph-specific file caching, a CPU-cache-amenable in-memory graph data structure, an optimized scheduler for concurrent graph access, etc.) for the characteristics of graph computing, such as irregular data access patterns. For example, our CPU-cache-amenable in-memory graph data structure improves the cache hit rate, achieving a high L1 data cache hit rate that is uncommon for general graph representations.
- Second, the C/C++-based kernel of the graph store is easy to integrate with various hardware innovations. Compared to many open source competitors, our solution offers more opportunities for performance optimization. In contrast, Java-based packages are difficult to optimize beyond the JVM layer.
- Third, unlike some open source graph databases that provide only a graph front-end over a non-graph back-end, we organize data as a graph across all layers. Therefore, at each layer, we can optimize operations for graph computing. If the back-end is not a graph, some graph behaviors may not be handled well. For example, some open source graph databases' back-ends rely heavily on indices. Although indexing helps, it can also introduce overhead when the graph is highly dynamic, since many indices must be updated at runtime. Native graph stores, such as our Native Store and Neo4j, use a graph as the back-end, where graph data access and traversal are straightforward.
- Finally, to implement such cross-layer optimization for graphs, we built our team with researchers whose backgrounds span graph analytics, systems, databases, POWER architecture, high performance computing, and compilers. This multidisciplinary team underpins the success of the solution.
Details of the performance improvements, especially those based on caching, scheduling, and graph structure, can be found in the IBM System G Native Store paper published at IEEE BigData 2014.
System G Native Store Evaluation
In the paper published at IEEE BigData 2014, we experimentally showed the impact of data storage on the performance of graph analytics using a straightforward graph query for visualizing recommendations. We measured the performance of two applications (visualization and recommendation) on a production system of IBM knowledgeView, which consists of about 72,300 user vertices, 82,100 document vertices, and over 1,740,000 edges. Given the vertex of a document d, the graph query creates a subgraph consisting of the top 100 documents most relevant to d, building an edge between each found document and d; then, for each found document, it finds that document's top 10 documents. We implemented the query on graphs stored in various back-ends, including a K/V store (Berkeley DB) with Titan, HBase with Titan, a graph store (Neo4j), and IBM System G Native Store.
Even though the implemented algorithm was the same, execution times varied significantly across the databases. Neo4j and System G Native Store showed superior performance (1.2 and 0.7 seconds, respectively), since their data is organized both in memory and on disk as a graph, resulting in efficient graph traversal and updates. The performance of Titan (6.8 seconds with BerkeleyDB and 5.7 seconds with HBase) was not as good as that of Neo4j and System G, probably because Titan is a graph representation only at the interface level rather than a true graph-structured store.
The execution times of the collaborative filtering application, which returns the top 10 related documents, are: Neo4j: 0.07 sec, Titan BerkeleyDB: 0.28 sec, Titan HBase: 0.41 sec, System G GBase: 0.20 sec, and System G Native Store: 0.015 sec.
Neo4j is based on the JVM, so it is hard to deeply optimize for particular platforms. System G shows improved performance through cache optimization, scheduling, and its native C/C++ implementation.
TinkerPop over Native Store
IBM System G has a JNI layer that translates the Native Store graph APIs into the TinkerPop APIs. Therefore, Java graph applications built on top of TinkerPop Blueprints can be ported onto the IBM System G Native Store, and various open source tools can be integrated into IBM System G.
IBM System G supports Java clients through an in-process JNI layer that maps the concepts and methods of the multi-property C++ layer to Java static methods. This has allowed the System G team to leverage existing open source Java-based code bases to implement additional graph features. As a result, IBM System G users have the ability to access their graphs through Groovy, Gremlin and SPARQL. In most cases Java clients do not program to the JNI interface but instead program to a TinkerPop Blueprints layer built upon the JNI methods.
System G provides TinkerPop Blueprints interfaces to both its high-performance C++ implementations and its HBase-based GBase graphs. This gives customers the ability to use the TinkerPop suite against their Native Store and GBase graphs. Clients have found Gremlin to be the most valuable of these. We have high hopes for TinkerPop Rexster, but so far we have found the public Rexster implementations (2.3, 2.4) to be unstable and have not found time to diagnose and patch the problems. The System G team itself has found the TinkerPop test suite to be the most valuable portion of the TinkerPop suite, and has found it handy for rapidly validating portions of IBM System G.
Gremlin over Native Store
Gremlin is a domain-specific language for traversing graphs. It was designed to work with a type of graph called a property graph. By using Gremlin, it is possible to use a REPL (command line/console) to interactively traverse a graph. Since Native Store provides a TinkerPop Blueprints interface via JNI, Gremlin runs on Native Store.
SPARQL over Native Store
A Jena-based SPARQL query engine is installed on top of the System G Native Store. Queries are translated into graph APIs in TinkerPop, and then into the Native Store APIs. Thus, Native Store opens an opportunity to leverage several open source graph solutions for rapid application development. The initial implementation of SPARQL support was based on TinkerPop's SailGraph and GraphSail classes. Current work is based on Apache Jena and focuses on improving the performance of that implementation.