Automatic Knowledge-based Interpretation of Text Annotation (AKITA)

Conceptual Data Flow

The overall goal of AKITA is to provide an open-source software framework that integrates text annotation, indexing and retrieval with knowledge capture and inference, in support of text mining and business intelligence applications. The basic concept of operations is shown in the figure above:

  1. Automatic Annotation: The incoming data (streaming media such as web pages and blogs) are annotated using a variety of automatic methods:
    1. Words and phrases from general and domain-specific dictionaries are used to label instances of those words and phrases as members of a particular class (e.g., company names or person names).
    2. Ranges of text that match pre-defined patterns (e.g., "<person>, <role> of <company>") are labeled as instances of particular relations (e.g., HAS_ROLE(PERSON,ROLE,COMPANY)).
    3. More complex grammars containing rules for combining tokens and annotations into larger structures can also be used to parse the text (e.g., a set of English grammar rules that can determine the phrase structure of a given sentence).
  2. Knowledge Capture and Inference: Text annotations are interpreted as instances of entities, relationships and events in the domain knowledge base, using a general knowledge base that contains basic concepts (e.g., persons, organizations, relationships and events) and rules of inference. The rules include pre- and post-conditions for domain events: for example, manufacture implies possess as a post-condition, and export implies possess as a pre-condition.
  3. In-Task Feedback: The analyst may query the domain knowledge base for evidence of certain indicators about the entities, relationships and events of interest. While interacting with the system, she may correct items retrieved from the knowledge base that the system interpreted incorrectly, or she may interactively highlight content in a document that should have been interpreted as a domain concept but was missed by the system.
  4. Semi-Supervised Machine Learning: The system makes use of the feedback from the analyst to automatically update the dictionaries, patterns and rules used for automatic annotation.
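
Steps 1 and 2 above can be illustrated with a minimal Python sketch. It is not AKITA's implementation: the dictionary contents, the fact representation, and every name in it (DICTIONARIES, Annotation, annotate_with_dictionaries, match_has_role, infer) are assumptions made for illustration only. It covers dictionary labeling, the "<person>, <role> of <company>" pattern, and the two example inference rules.

```python
import re
from dataclasses import dataclass

# Hypothetical toy dictionaries; AKITA's actual dictionaries are not shown in the source.
DICTIONARIES = {
    "PERSON": ["Jane Doe"],
    "ROLE": ["CEO"],
    "COMPANY": ["Acme Corp"],
}


@dataclass(frozen=True)
class Annotation:
    start: int   # character offset where the labeled span begins
    end: int     # character offset just past the labeled span
    label: str   # class name, e.g. "PERSON"
    text: str    # the matched dictionary entry


def annotate_with_dictionaries(text, dictionaries=DICTIONARIES):
    """Step 1.1: label every occurrence of a dictionary entry with its class."""
    annotations = []
    for label, entries in dictionaries.items():
        for entry in entries:
            for m in re.finditer(re.escape(entry), text):
                annotations.append(Annotation(m.start(), m.end(), label, entry))
    return sorted(annotations, key=lambda a: a.start)


def match_has_role(text, annotations):
    """Step 1.2: label '<person>, <role> of <company>' spans as HAS_ROLE relations."""
    relations = []
    for i in range(len(annotations) - 2):
        p, r, c = annotations[i:i + 3]
        if (p.label, r.label, c.label) != ("PERSON", "ROLE", "COMPANY"):
            continue
        # The literal text between the annotations must match the pattern.
        if text[p.end:r.start] == ", " and text[r.end:c.start] == " of ":
            relations.append(("HAS_ROLE", p.text, r.text, c.text))
    return relations


# Step 2: the two example rules from the text, as event -> implied predicate.
POST_CONDITIONS = {"MANUFACTURE": "POSSESS"}  # holds after the event
PRE_CONDITIONS = {"EXPORT": "POSSESS"}        # must hold before the event


def infer(facts):
    """Add the facts implied by pre- and post-conditions (single pass, no chaining)."""
    inferred = set(facts)
    for predicate, *args in facts:
        for rules in (POST_CONDITIONS, PRE_CONDITIONS):
            if predicate in rules:
                inferred.add((rules[predicate], *args))
    return inferred


if __name__ == "__main__":
    sentence = "Jane Doe, CEO of Acme Corp, spoke today."
    spans = annotate_with_dictionaries(sentence)
    print(match_has_role(sentence, spans))
    print(infer({("MANUFACTURE", "Acme Corp", "widgets")}))
```

The sketch ignores the temporal difference between pre- and post-conditions and treats both as plain implications; a real knowledge base would presumably track when each inferred fact holds.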

For more information about AKITA, contact Eric Nyberg (ehn@cs.cmu.edu).
