Machine Reasoning Platform
Building a large system from heterogeneous components, each with its own hardware and software requirements, presents numerous complex problems. The complexity grows further when components evolve over time and are built by a disparate collection of researchers and developers. In this document, we describe some of our design goals and the infrastructure we have created to meet those goals:
- Plug and play component architecture: To enable a system to evolve and to be easily adapted to new domains, there needs to be a simple way to add new components to the system or to reconfigure it. This is also extremely useful in a research system, which evolves over time as new breakthroughs occur.
- Standard component configuration syntax: As there are likely to be components added to the system that come from many sources, there needs to be a uniform way to configure each.
- Scalability: Components that comprise a complex analytic system can often require vast resources. Meanwhile, many components can be run in parallel. Any system designed for such components must therefore support parallel execution, as well as the ability to run components on remote machines that provide the resources they require.
- Provenance and Reusability of results: As analytic components are updated and enhanced over time, it is necessary to be able to compare the outputs of different versions of a component, or to compare the outputs of two competing algorithms. To do this, the versions or algorithms being compared need to run against the same input, which is likely the output of some prior analytic. This means that output sets need to be saved, and there needs to be a way to track which version of which component produced which set of outputs.
- Ability to test components against prior versions: Once a saved set of results data is known and useful, there needs to be a way to start the system with those results so that testing and comparisons can be performed. The functionality needed to satisfy this requirement includes the ability to switch data sets quickly and easily, as well as the ability to pick up processing at a specific point in the pipeline.
- Heterogeneous component support: Some components are best built on top of specific technologies, e.g., Hadoop. Therefore, it is necessary to allow as much freedom as possible to the implementer to choose the appropriate implementation.
- Legacy component support: As analytics may already exist that perform required functions, these legacy components need to be inserted into the system. Ideally, this should be possible with minimal interface changes, if any, to those components.
To address the above design goals, we took a two-pronged approach:
- A Data Manager to act as a broker to persistent storage (database, file system, etc.), thereby providing a single point of access and a way to maintain provenance and reusability of data.
- A Flow Manager to act as a wrapper for the various types of components so they can be used within the UIMA framework, and to act as a bridge between those components and the Data Manager.
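The brokering role of the Data Manager can be illustrated with a minimal sketch. This is not the actual Data Manager: the class name `DataManagerSketch`, the `Provenance` key, and all method names are hypothetical, and an in-memory map stands in for persistent storage. It only shows how saving each output under a (component, version, run) key makes it possible to retrieve the results of a specific version later and compare them with another.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Optional;

/** Hypothetical sketch of a broker that stores outputs together with their provenance. */
class DataManagerSketch {

    /** Provenance key: which version of which component produced an output, in which run. */
    static final class Provenance {
        final String component;
        final String version;
        final String runName;

        Provenance(String component, String version, String runName) {
            this.component = component;
            this.version = version;
            this.runName = runName;
        }

        @Override
        public boolean equals(Object other) {
            if (!(other instanceof Provenance)) return false;
            Provenance p = (Provenance) other;
            return component.equals(p.component)
                    && version.equals(p.version)
                    && runName.equals(p.runName);
        }

        @Override
        public int hashCode() {
            return Objects.hash(component, version, runName);
        }
    }

    // In-memory stand-in for persistent storage (database, file system, etc.).
    private final Map<Provenance, String> store = new LinkedHashMap<>();

    /** Saves an output under the provenance of the component that produced it. */
    void save(String component, String version, String runName, String output) {
        store.put(new Provenance(component, version, runName), output);
    }

    /** Looks up the output of a specific component version in a specific run. */
    Optional<String> load(String component, String version, String runName) {
        return Optional.ofNullable(store.get(new Provenance(component, version, runName)));
    }

    /** Lists the versions of a component that have saved output, in insertion order. */
    List<String> versionsOf(String component) {
        List<String> versions = new ArrayList<>();
        for (Provenance p : store.keySet()) {
            if (p.component.equals(component) && !versions.contains(p.version)) {
                versions.add(p.version);
            }
        }
        return versions;
    }

    public static void main(String[] args) {
        DataManagerSketch dm = new DataManagerSketch();
        dm.save("tokenizer", "1.0", "baseline", "a b c");
        dm.save("tokenizer", "1.1", "candidate", "a b c d");
        System.out.println(dm.versionsOf("tokenizer"));
        System.out.println(dm.load("tokenizer", "1.0", "baseline").orElse("<missing>"));
    }
}
```

Because every component reads and writes through one broker in this sketch, two versions of a component can be compared simply by loading both outputs by key.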
The Flow Manager supports two classes of components: those that were not written specifically to be part of the system, including those that already exist (i.e., "legacy" components), and those that are written specifically to be incorporated into the system. Both types of components must be able to be mixed together seamlessly in an analytic pipeline. Each of the two classes uses a different component "wrapper"; the two are collectively referred to as the "UIMAWrapper":
- UIMAShellWrapper for incorporating any arbitrary program that can be run from a Unix/Linux shell.
- UIMAMethodWrapper for incorporating components that have been written to implement the UIMAWrapped Java interface.
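The idea behind the two wrappers can be sketched in plain Java, without any UIMA dependency. Everything here is an assumption made for illustration: the `WrappedComponent` interface is hypothetical and is not the signature of the actual UIMAWrapped interface, and the conventions that properties are passed to a shell program as environment variables and that exit status 0 means success are choices of this sketch, not documented behavior of UIMAShellWrapper. The sketch also assumes a Unix-like host where `sh` is available.

```java
import java.io.IOException;
import java.util.Locale;
import java.util.Map;
import java.util.Properties;

class WrapperDemo {
    /** Runs each stage in order and stops at the first failure. */
    static boolean runPipeline(Properties props, WrappedComponent... stages) {
        for (WrappedComponent stage : stages) {
            if (!stage.run(props)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("RUN_NAME", "demo");
        props.setProperty("INPUT_TEXT", "hello");
        // A Java component and a shell command mixed in one pipeline.
        boolean ok = runPipeline(props,
                new UpperCaseComponent(),
                new ShellComponent("test \"$RUN_NAME\" = demo"));
        System.out.println(ok + " " + props.getProperty("OUTPUT_TEXT"));
    }
}

/** Hypothetical common contract; the real UIMAWrapped interface may look different. */
interface WrappedComponent {
    /** Runs the component with the given properties and reports success or failure. */
    boolean run(Properties props);
}

/** Stand-in for the shell-wrapper idea: runs an arbitrary shell command line. */
class ShellComponent implements WrappedComponent {
    private final String commandLine;

    ShellComponent(String commandLine) {
        this.commandLine = commandLine;
    }

    @Override
    public boolean run(Properties props) {
        ProcessBuilder pb = new ProcessBuilder("sh", "-c", commandLine);
        // Assumption: properties reach the program as environment variables,
        // so their names should be valid shell variable names.
        Map<String, String> env = pb.environment();
        for (String name : props.stringPropertyNames()) {
            env.put(name, props.getProperty(name));
        }
        pb.inheritIO(); // let the program's output appear on the console
        try {
            // Assumption: exit status 0 means the program succeeded.
            return pb.start().waitFor() == 0;
        } catch (IOException e) {
            return false; // the program could not be started
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}

/** Stand-in for the method-wrapper idea: the component is ordinary Java code. */
class UpperCaseComponent implements WrappedComponent {
    @Override
    public boolean run(Properties props) {
        String input = props.getProperty("INPUT_TEXT", "");
        props.setProperty("OUTPUT_TEXT", input.toUpperCase(Locale.ROOT));
        return true;
    }
}
```

Hiding both kinds of component behind one contract is what lets the pipeline driver treat a legacy program and a purpose-built Java component identically.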
An advantage of using UIMA to build pipelines is that both individual components and sub-pipelines can be deployed as services, either on a single host or distributed, as needed. The deployed services can then be run in parallel or duplicated, depending on system requirements. All of this is specified in UIMA deployment descriptors, which augment analysis engine (AE) descriptors.
In typical UIMA-based systems, the data that components process (their inputs and outputs) flows through the CAS. As mentioned earlier, this is not an acceptable requirement for this system. Instead, all of this data is managed by the Data Manager, and Java properties files are used to describe each input and the output of each component. The Flow Manager also uses properties to parameterize each run of the system (e.g., the name of the run and which data set to use). These properties are what flow through the pipeline in the CAS, so they are available to each component at run time. The CAS is also used to store information about each component that has already executed as part of a run, such as whether or not the component completed successfully, how it was configured, and various run statistics.
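The role of properties described above can be sketched as follows. This is a plain-Java stand-in and uses no UIMA API: the class `RunContextSketch`, the property keys (`run.name`, `data.set`, `input`, `output`), and the status fields are all hypothetical. It shows run-level properties being combined with a component's own properties, and a per-component status record being kept for the run, under the assumption that a component's keys take precedence over run-level keys.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Hypothetical stand-in for the information that travels through the pipeline. */
class RunContextSketch {

    /** Outcome of one executed component, as might be recorded for a run. */
    static final class ComponentStatus {
        final String component;
        final boolean succeeded;
        final long elapsedMillis;

        ComponentStatus(String component, boolean succeeded, long elapsedMillis) {
            this.component = component;
            this.succeeded = succeeded;
            this.elapsedMillis = elapsedMillis;
        }
    }

    /** Run-level settings such as the run name and the data set to use. */
    final Properties runProperties = new Properties();

    /** Status of every component that has executed so far in this run. */
    final List<ComponentStatus> history = new ArrayList<>();

    /** Parses text in Java properties-file syntax. */
    static Properties parse(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected when reading from a string
        }
        return p;
    }

    /** Combines run-level and component-level properties; component keys win (an assumption). */
    Properties effectiveProperties(Properties componentProperties) {
        Properties merged = new Properties();
        merged.putAll(runProperties);
        merged.putAll(componentProperties);
        return merged;
    }

    /** Records the outcome of a component that has just executed. */
    void record(String component, boolean succeeded, long elapsedMillis) {
        history.add(new ComponentStatus(component, succeeded, elapsedMillis));
    }

    /** Reports whether every component recorded so far completed successfully. */
    boolean allSucceeded() {
        for (ComponentStatus status : history) {
            if (!status.succeeded) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        RunContextSketch ctx = new RunContextSketch();
        ctx.runProperties.putAll(parse("run.name=nightly\ndata.set=corpusA\n"));
        Properties tokenizer = parse("input=raw-text\noutput=tokens\n");
        Properties effective = ctx.effectiveProperties(tokenizer);
        System.out.println(effective.getProperty("run.name") + " reads "
                + effective.getProperty("input") + " from " + effective.getProperty("data.set"));
        ctx.record("tokenizer", true, 12L);
        System.out.println("all succeeded: " + ctx.allSucceeded());
    }
}
```

Keeping only small property sets and status records in the object that flows between components, while the bulk data stays behind a separate broker, mirrors the division of labor the text describes between the CAS and the Data Manager.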