FacetAtlas: Visualizing Multifaceted Text Documents as Graphs

Documents are typically unstructured in nature. Visualizing the content of a document corpus and the relationships between documents requires that these unstructured artifacts be transformed into a structured form. We proposed a multifaceted entity relational data model to represent this information in a structured way. This figure illustrates the processing pipeline used to transform a set of raw unstructured documents into our data model.

The first stage in the transformation pipeline is facet segmentation. During this stage, each document is segmented into facet snippets. While various techniques could be used, we typically employ a topic modeling technique such as LDA and treat each topic as a facet. When processing documents with a well-defined structure, we directly use the sections to define facet snippets.
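The section-based case can be illustrated with a minimal Python sketch. It assumes each section heading sits on its own line and that the list of headings is known in advance; the function name `segment_by_sections` and the sample document are hypothetical and not taken from the FacetAtlas implementation.

```python
def segment_by_sections(text, headings):
    """Split a structured document into facet snippets keyed by section heading.

    `headings` is an assumed, caller-supplied list of section titles
    that are treated as facets.
    """
    snippets = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped in headings:
            # A heading line opens a new facet snippet.
            current = stripped
            snippets[current] = []
        elif current is not None and stripped:
            # Non-empty body lines belong to the most recent heading.
            snippets[current].append(stripped)
    return {facet: " ".join(lines) for facet, lines in snippets.items()}


# Hypothetical sample document with two sections.
doc = """Symptoms
Frequent thirst and fatigue.
Treatment
Insulin therapy and diet changes.
"""

facets = segment_by_sections(doc, ["Symptoms", "Treatment"])
```

For unstructured documents, a topic model would take the place of the heading lookup, with each topic playing the role of a facet.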

Entity extraction is the second transformation pipeline stage. In this step, a named entity recognition algorithm is applied to each facet's document snippet to generate a set of typed entities. Domain-specific ontology models are used to recognize meaningful entities for each facet.
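As a rough illustration of this stage, the sketch below replaces a real named entity recognizer with a simple dictionary lookup. The per-facet vocabulary `ONTOLOGY`, the entity types, and the function `extract_entities` are all hypothetical stand-ins for a domain-specific ontology model; a production system would use a trained recognizer instead.

```python
import re

# Hypothetical mini-ontology: facet -> {surface form: entity type}.
ONTOLOGY = {
    "Symptoms": {"thirst": "symptom", "fatigue": "symptom"},
    "Treatment": {"insulin": "drug", "diet": "lifestyle"},
}


def extract_entities(facet, snippet, ontology=ONTOLOGY):
    """Return (entity, type) pairs found in a facet snippet, in order of first appearance."""
    vocab = ontology.get(facet, {})
    tokens = re.findall(r"[a-z]+", snippet.lower())
    found = []
    for tok in tokens:
        pair = (tok, vocab.get(tok))
        # Keep only tokens the facet's vocabulary recognizes, without duplicates.
        if tok in vocab and pair not in found:
            found.append(pair)
    return found


entities = extract_entities("Symptoms", "Frequent thirst and fatigue.")
```

Looking up entities per facet mirrors the idea that the same word may be meaningful in one facet and irrelevant in another.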

The third stage in the processing pipeline is relation building. In this stage, connections between extracted entities are established using two types of relations: internal relations and external relations. An internal relation connects entities within the same facet. An external relation connects entities from different facets.
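The distinction between the two relation types can be sketched as follows. The text does not state how connections are established, so this example assumes, purely for illustration, that two entities are related when they co-occur in the same document; the function `build_relations` and the input layout are hypothetical.

```python
from itertools import combinations


def build_relations(documents):
    """Classify entity pairs as internal or external relations.

    `documents` is a list of dicts mapping facet name -> list of entity names.
    Assumption: entities are related if they co-occur in one document.
    """
    internal, external = set(), set()
    for doc in documents:
        tagged = [(facet, ent) for facet, ents in doc.items() for ent in ents]
        for a, b in combinations(tagged, 2):
            if a == b:
                continue  # skip duplicate mentions of the same entity
            edge = tuple(sorted([a, b]))  # undirected edge in canonical order
            if a[0] == b[0]:
                internal.add(edge)  # same facet
            else:
                external.add(edge)  # different facets
    return internal, external


# Hypothetical document after segmentation and entity extraction.
docs = [{"Symptoms": ["thirst", "fatigue"], "Treatment": ["insulin"]}]
internal, external = build_relations(docs)
```

Here the pair of symptoms forms an internal relation, while each symptom paired with the treatment entity forms an external relation.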

Multiple visualizations have been developed for the above data model. FacetAtlas focuses on representing the multifaceted internal relationships within different information facets. SolarMap is another design that focuses on visualizing external relationships to show rich information context.

Related Publications:

  • "FacetAtlas: Multifacet Visualization for Rich Text Corpora", IEEE Transactions on Visualization and Computer Graphics. InfoVis 2010.[paper]