Ontology extension and News analysis

January 19th, 2012

What are ontologies?

The Semantic Web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries (W3C, 2011). Semantic technology is a general term for any software that involves some kind and level of understanding of the meaning of the information it deals with (Semantic Technology and Linked Data Annotation, 2011). Ontologies are considered one of the pillars of the Semantic Web (Semantic Web. Ontology, 2011).
Gruber (1993) defined an ontology as an explicit specification of a conceptualization. According to Gruber (1993), ontologies consist of the following main components: concepts, relations, functions, axioms and instances. Ontologies enable effective domain knowledge representation, knowledge sharing and knowledge reuse (Chandrasekaran et al., 1999).

Cyc Knowledge Base

Currently, Cyc operates on one of the largest knowledge bases in the contemporary IT world.
It is described as “a formalized representation of a vast quantity of fundamental human knowledge: facts, rules of thumb, and heuristics for reasoning about the objects and events of everyday life” (Cycorp, what’s in Cyc, 2011) and is divided into a large number of “microtheories”, each of which represents a set of assumptions for a particular knowledge domain.
At the present time, the Cyc KB contains nearly two hundred thousand terms and an average of several dozen hand-entered assertions about/involving each term. New assertions are continually added to the KB by human knowledge enterers. Additionally, term-denoting functions allow for the automatic creation of millions of non-atomic terms, such as (LiquidFn Nitrogen); and Cyc adds a vast number of assertions to the KB by itself as a product of the inference process (Cycorp, what’s in Cyc, 2011).

Ontology extension task

OntoPlus is a novel methodology for semi-automatic ontology extension. It makes use of ontology content, ontology structure and co-occurrence information.

Ontology content of a particular concept is defined as the available textual representation of that concept. It includes the natural language concept denotation (lexical entries for the concept) and textual comments about the concept. Ontology structure of a particular concept is defined as the neighboring concepts involved in hierarchical and non-hierarchical relations with that concept. For instance, the Cyc concept CommonStock has the associated content “Share; Ordinary share; The collection of Stock whose instances represent owners (stockholders) who have only a residual claim on an Organization’s assets after all debts and claims generated by PreferredStocks have been met.” and the associated structure “Equities; shares; stocks; issues; class B stocks; class A stocks…”.

Co-occurrence information is represented by the occurrence of two or more concepts within a defined textual block. The available textual information is used to find the co-occurrences between existing ontology concepts and new domain concepts suggested for ontology extension.

In our research, ontology extension stands for adding new concepts to the existing ontology, or for augmenting the existing textual representation of the relevant concepts already present in the ontology with newly available textual information – extending the concept comments, or changing or adding a concept denotation. By ontology population we mean adding new instances of concepts (e.g., LehmanBrothers as Business, MingchunSun as Person) or relation instances (e.g., positionOfPersonInOrganization MingchunSun LehmanBrothers Economist) to the ontology.

The OntoPlus methodology combines the ontology content, the ontology structure information and the co-occurrence data between existing and candidate ontology concepts. It consists of the following phases:

1. Domain information identification. The user identifies the appropriate domain keywords. In this phase a domain-relevant glossary, containing terms with descriptions, is also determined.
2. Extraction of the relevant domain ontology subset from multi-domain ontology. The relevant domain ontology subset is obtained based on the specified domain information. The domain keywords are mapped to the natural language representation of the ontology domain information and a set of the relevant domains of interest is identified. Further, ontology concepts defined in these domains are extracted.
3. Domain relevant information preprocessing. The information from the domain-relevant glossary and the extracted relevant ontology subset is linguistically preprocessed. The preprocessing phase includes tokenization, stop-word removal and stemming. Textual information is represented using a bag-of-words representation with TFIDF weighting, and the similarity between two text segments is calculated as the cosine similarity between their bag-of-words representations. For each term from the domain-relevant glossary we compose a bag-of-words aggregating the preprocessed textual information from (1) the glossary term name and (2) the term comment. For each concept from the extracted relevant ontology subset the following information is considered: (1) the ontology concept content, consisting of the preprocessed natural language concept denotation and the concept comment; (2) the ontology concept structure, consisting of the preprocessed natural language concept denotation and the natural language denotations of hierarchically and non-hierarchically related concepts. In addition, for relation identification, we compose two further bags-of-words for each ontology concept: one with the natural language denotation of the concept and the natural language denotations of its superclasses, the other with the natural language denotation of the concept and the natural language denotations of its subclasses.
4. Composing the list of potential concepts and relationships for ontology extension. The ranked list of the relevant concepts and possible relationships suitable for ontology extension is composed in this phase. The cosine similarity between the glossary term and the ontology concept content is calculated and weighted with a user-defined weight. The cosine similarity between the glossary term and the ontology concept structure is calculated and weighted in the same way. We use Jaccard similarity to measure the co-occurrence of the glossary term and the ontology concept.
Ontology concepts with a similarity larger than a defined similarity threshold are suggested to the user.
To propose an equivalence relationship we use the string-edit distance between glossary term names and the related concept names. In the case of equivalence, the user can extend the textual representation of the related ontology concept. A hierarchical subclass relationship is proposed when the similarity between the glossary term and the subclasses of the related concept is higher than the similarity between the glossary term and the superclasses of the related concept.
5. User validation. The user validates the candidate entries, which consist of the glossary terms, the existing ontology concepts and the relationships between glossary terms and ontology concepts. In the case of an equivalence relationship the user can extend the textual representation of the existing ontology concept by adding a comment, or by adding or changing the natural language denotation. In the case of hierarchical relationships the user can add subclasses to the existing ontology concepts. If the nature of the relationship is not clear, the user can create an associative relationship or choose any other relationship between a glossary term and an existing ontology concept. Finally, a list of the validated entries is created in the relevant format.
6. Ontology extension. The new concepts and the relationships between concepts are added to the ontology.
7. Ontology reuse. The ontology reuse phase serves as the connection link between separate ontology extension processes. As a part of the new extension process, we reuse the previously extended ontology.
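
The similarity computation in phases 3 and 4 can be sketched roughly as follows. This is a minimal illustration in Python, not the OntoPlus implementation: the function names, the tiny stop-word list, the smoothed IDF formula and the weighted-sum combination of the three similarities are all assumptions made for the example, and stemming is omitted for brevity.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; the source does not name one).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "is", "who", "have", "for"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stop words (stemming omitted)."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tfidf_vectors(texts):
    """Build one sparse TF-IDF vector (term -> weight) per input text."""
    token_lists = [tokenize(t) for t in texts]
    n_docs = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        # Smoothed IDF (assumed variant) so terms found in every text keep some weight.
        vectors.append({term: tf * (math.log(n_docs / doc_freq[term]) + 1.0)
                        for term, tf in counts.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def jaccard(blocks_a, blocks_b):
    """Jaccard similarity between the sets of text blocks two items occur in."""
    union = blocks_a | blocks_b
    return len(blocks_a & blocks_b) / len(union) if union else 0.0


def combined_score(term_text, content_text, structure_text,
                   term_blocks, concept_blocks, weights):
    """Weighted sum of content, structure and co-occurrence similarity (assumed form)."""
    term_vec, content_vec, structure_vec = tfidf_vectors(
        [term_text, content_text, structure_text])
    w_content, w_structure, w_cooc = weights
    return (w_content * cosine(term_vec, content_vec)
            + w_structure * cosine(term_vec, structure_vec)
            + w_cooc * jaccard(term_blocks, concept_blocks))


# Hypothetical glossary entry compared with the CommonStock example from the text.
score = combined_score(
    "common stock: ordinary shares of a company",
    "Share; Ordinary share; The collection of Stock whose instances represent owners",
    "Equities; shares; stocks; issues; class B stocks; class A stocks",
    term_blocks={1, 2, 3},      # ids of text blocks mentioning the glossary term
    concept_blocks={2, 3, 4},   # ids of text blocks mentioning the ontology concept
    weights=(0.4, 0.4, 0.2),    # user-defined weights, chosen arbitrarily here
)
print(round(score, 3))
```

In such a sketch, concepts whose combined score exceeds a user-chosen threshold would be the ones suggested to the user, and the weights would be tuned per domain.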

The OntoPlus methodology transforms textual information into a structured, conceptualized form. It can be applied to different domains and different information sources, and it enables the extension of very large multi-domain ontologies.

Pipeline for business news analysis

In our research we have also proposed a pipeline for business news analysis, which applies the OntoPlus methodology for ontology population. The proposed pipeline consists of the following phases: news website definition; news crawling; concept, entity, event and fact extraction from news; concept, entity, event and fact mapping to the Cyc KB; question definition; and question answering.
1. News website definition. In the first phase of the pipeline for business news analysis a list of websites, which contain business news, is defined by the user (e.g., business or financial analyst).
2. News crawling. The news articles are crawled from the RSS feeds of the provided websites, after which news cleaning is performed. Every news item is stored as a separate text file.
3. Concept, entity, event, fact extraction from news. In this phase a set of financial concepts is extracted from business news. Using the N-gram extractor from the TextGarden tools (Text-Garden, 2011), it is possible to get all N-grams from the textual news collection and map them to the terms in the Harvey financial glossary (Harvey, 2003).
With a fact extraction service, such as the OpenCalais tool, we are able to extract the information about entities, events and facts present in our news collection.
4. Concept, entity, event, fact mapping to the Cyc KB. In this phase ontology extension and ontology population are performed. With the OntoPlus methodology we are able to extend the Cyc KB with the terms from the financial glossary that occur in our news collection. For ontology population we have created a set of mappings between OpenCalais entity, event and fact types and Cyc concepts – collections and predicates. We also apply the OntoPlus methodology for concept disambiguation in the ontology population process.
5. Question definition. A set of questions involving reasoning aspects is composed. For the business news analysis we have composed business-related questions.
6. Question answering. The questions are asked using the Cyc reasoning interface and Cyc proofs are analyzed.
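
The N-gram extraction and glossary mapping of phase 3 can be illustrated with a short, self-contained sketch. It does not use the TextGarden tools or their API; the functions `ngrams` and `glossary_terms_in_news`, the sample glossary and the sample sentence are hypothetical and only show the general idea of matching word N-grams against glossary term names.

```python
import re


def ngrams(text, max_n=3):
    """Yield all word N-grams of length 1..max_n from the lowercased text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def glossary_terms_in_news(news_text, glossary, max_n=3):
    """Return the glossary terms whose names occur as N-grams in the news text."""
    by_lowercase_name = {term.lower(): term for term in glossary}
    found = {by_lowercase_name[g] for g in ngrams(news_text, max_n)
             if g in by_lowercase_name}
    return sorted(found)


# Hypothetical glossary excerpt and news sentence.
glossary = {"Common stock", "Preferred stock", "Hedge fund"}
news = "The bank converted its preferred stock into common stock last week."
matched = glossary_terms_in_news(news, glossary)
print(matched)
```

Exact matching of lowercased N-grams is the simplest possible choice; a real pipeline would probably also apply the stemming and stop-word removal described earlier before matching.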

Experiments and results

In order to evaluate the proposed OntoPlus methodology for ontology extension and the pipeline for business news analysis, we have conducted a series of experiments on the data sources, addressing different aspects of the proposed methodologies. We conducted three types of experiments:
– Tagging experiments;
– Ranking experiments;
– Question answering experiments.
The proposed OntoPlus methodology has been evaluated using the well-known Cyc ontology and textual material from two domains – the financial domain and the fisheries & aquaculture domain.

The question answering experiments evaluate the pipeline for business news analysis. They demonstrate the capacity of Cyc to answer business news related questions before and after the extension of the Cyc Knowledge Base.

Tagging experiments

The tagging experiments have been carried out in the financial domain. They show how the tagging of business news with ontology components improves after the ontology is extended with the domain-relevant glossary. For the tagging experiments we have calculated the precision and recall of news tagging before and after adding terms to the ontology.
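
As a reminder of how these two measures are computed, here is a minimal sketch. The tag sets are invented toy data, not results from the experiments, and all concept names other than CommonStock are hypothetical.

```python
def precision_recall(predicted_tags, gold_tags):
    """Precision and recall of a set of predicted tags against a gold-standard set."""
    predicted, gold = set(predicted_tags), set(gold_tags)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy example: tags a human annotator would assign to one news item.
gold = {"CommonStock", "Bond", "HedgeFund", "Dividend"}
# Tags produced before and after a (hypothetical) ontology extension.
tags_before = {"Bond", "Person"}
tags_after = {"Bond", "CommonStock", "HedgeFund", "Person"}

print(precision_recall(tags_before, gold))
print(precision_recall(tags_after, gold))
```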

Ranking experiments

We have paid special attention to our ranking experiments and to the evaluation of the OntoPlus methodology for ontology extension. The ranking experiments are conducted in two domains: the financial domain and the fisheries & aquaculture domain. They demonstrate how, using the OntoPlus methodology, we can semi-automatically extend a large lexical ontology with new concepts and identify the corresponding relationships between existing ontology concepts and domain glossary terms.
For the evaluation of the proposed OntoPlus methodology we have used two evaluation techniques – manual evaluation by human experts and a gold standard based approach (Dellschaft and Staab, 2008).
The evaluation of the OntoPlus methodology is performed at the lexical, taxonomic (concept hierarchy) and non-taxonomic levels. For the lexical evaluation, the mapping of the glossary terms to the existing ontology concepts is evaluated. At the taxonomic level, the suggested hierarchically related concepts and the suggested superclass-subclass relationships are evaluated. Finally, at the non-taxonomic level, the suggested associatively related concepts and associative relationships are evaluated. While the gold standard based approach is used for the lexical and taxonomic evaluation, manual evaluation is used at the non-taxonomic level.
Maedche and Staab (2002) have used the normalized string edit distance to identify how similar two ontologies are. The normalized string edit distance between ontology concept denotations and glossary term names is used as a baseline measure in the evaluation of the proposed OntoPlus methodology.
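
For readers unfamiliar with the measure, a normalized string edit distance can be sketched as below. The Levenshtein distance is standard; dividing by the length of the longer string and lowercasing both inputs are assumptions made for this example and are not necessarily the normalization used in the experiments or by Maedche and Staab (2002).

```python
def edit_distance(a, b):
    """Levenshtein distance: minimal number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def normalized_edit_distance(a, b):
    """Edit distance scaled to [0, 1] by the longer string (assumed normalization)."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


# A glossary term name compared with two concept denotations.
print(normalized_edit_distance("Common stock", "common stock"))
print(normalized_edit_distance("Common stock", "Preferred stock"))
```

A value of 0 means the two names are identical after lowercasing, so a low distance can be read as a hint that a glossary term and an ontology concept are equivalent.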
In order to define how successful the proposed methodology is in practice, we have used a number of evaluation measures commonly used for ontology learning evaluation, as follows.
Precision of the top suggested concept is the percentage of the glossary terms for which an equivalent, hierarchical, associative or otherwise related ontology concept has obtained the highest position in the ranked list of suggested related concepts.
Learning Accuracy (Hahn and Schnattinger, 1998) shows the degree to which the proposed methodology correctly predicts the superclass for the candidate ontology concept (represented by a glossary term) to be learned.
In addition, we have used the hit rate measure known from the evaluation of recommendation systems. The hit rate reports the number of hits and their positions within the top N suggestions (Deshpande and Karypis, 2004).
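
Two of these ranking measures can be illustrated on toy data. The sketch below assumes that a "hit" means at least one correct concept appears among the top N suggestions for a glossary term, and that the hit rate is the fraction of glossary terms with a hit; the function names, the rankings and all concept names other than CommonStock are hypothetical.

```python
def precision_at_top(ranked_suggestions, relevant):
    """Fraction of glossary terms whose first suggested concept is a correct one."""
    if not ranked_suggestions:
        return 0.0
    hits = sum(1 for term, ranked in ranked_suggestions.items()
               if ranked and ranked[0] in relevant.get(term, set()))
    return hits / len(ranked_suggestions)


def hit_rate(ranked_suggestions, relevant, n):
    """Fraction of glossary terms with at least one correct concept in the top n."""
    if not ranked_suggestions:
        return 0.0
    hits = sum(1 for term, ranked in ranked_suggestions.items()
               if any(c in relevant.get(term, set()) for c in ranked[:n]))
    return hits / len(ranked_suggestions)


# Toy ranked suggestion lists per glossary term, with the correct concepts for each.
ranked = {
    "common stock": ["CommonStock", "PreferredStock", "Bond"],
    "hedge fund": ["Bond", "HedgeFund", "CommonStock"],
    "dividend": ["Bond", "CommonStock", "PreferredStock"],
}
relevant = {
    "common stock": {"CommonStock"},
    "hedge fund": {"HedgeFund"},
    "dividend": {"Dividend"},
}

print(precision_at_top(ranked, relevant))
print(hit_rate(ranked, relevant, n=2))
```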

Question answering experiments

The pipeline for business news analysis is evaluated within the question answering experiments. These experiments reveal the capacity of Cyc to answer business news related questions before and after the extension and population of the Cyc Knowledge Base. As in the tagging experiments, because the available news collection included only business and financial news (and no fisheries & aquaculture news), question answering has been conducted only in the business and financial domain. For the question answering experiments we have calculated the precision of ontology population and the precision of news related question answering.

Our findings

We have found that the best results are achieved by combining content, structure and co-occurrence information, where the combination of weights depends on the domain. In our case, ontology content and structure are more important than co-occurrence for data in the financial domain, whereas ontology content and co-occurrence have higher importance for data in the fisheries & aquaculture domain.

We have also found that extending the Cyc Knowledge Base according to the proposed OntoPlus methodology, and populating it with entities, events and facts extracted from business news, allows users to perform question answering based on the extended and populated ontology.

References
ASFA thesaurus, http://www4.fao.org/asfa/asfa.htm (accessed June 2010).

Chandrasekaran, B.; Josephson, J. R.; Benjamins, R. V. What are Ontologies and why do we need them? In: IEEE Intelligent Systems and Their Applications 14, 20-26 (1999).

Cycorp, Inc., what’s in Cyc, http://www.cyc.com/cyc/technology/whatiscyc_dir/whatsincyc (accessed July 2011).

Dellschaft, K.; Staab, S. Strategies for the Evaluation of Ontology Learning. In: Cimiano P. and Buitelaar P. (eds.) Bridging the Gap between Text and Knowledge – Selected Contributions to Ontology Learning and Population from Text (IOS Press, 2008).

Deshpande, M.; Karypis, G. Item-Based Top-N Recommendation Algorithms. ACM Trans. Inf. Syst. 22/1, 143-177 (2004).

Gruber, T. R. A translation approach to portable ontologies. Knowledge Acquisition 5/2, 199-220 (1993).

Hahn, U.; Schnattinger, K. Towards Text Knowledge Engineering. In: Proceedings of the AAAI’98 (1998).

Harvey, C. R. Yahoo Financial Glossary (Fuqua School of Business, Duke University, 2003).

Maedche, A.; Staab, S. Measuring similarity between ontologies. In: Proceedings of the European Conference on Knowledge Acquisition and Management – EKAW-2002 (Madrid, Spain, 2002).

Semantic Technology and Linked Data Annotation, http://gate.ac.uk/wiki/TrainingCourseMay2011/4th-gate-training.pdf (accessed August 2011).

Semantic Web. Ontology, http://semanticweb.org/wiki/Ontology (accessed August 2011).

W3C, http://www.w3.org/2001/sw (accessed August 2011).

Wikipedia, http://en.wikipedia.org/wiki (accessed June 2011).

Data

For the Cyc extension software and data used in the experiments, please contact Inna Novalija (inna.koval@ijs.si).
