Feb 22, 2014

I recently read a paper on building, maintaining, and using knowledge bases. I expect it to influence my future research significantly, so I would like to share my point of view and thoughts on it.

To begin with, an ontology, in information science, is a formal representation of knowledge. Such a representation can benefit many fields and applications, e.g., search, query understanding, recommendations, advertising, and social mining. However, there has been little or no documentation of the whole life cycle of an ontology, including building, maintaining, and using it.

In the paper, the authors try to answer numerous questions, e.g., what the pitfalls of maintaining a large knowledge base (KB) are, how users influence the system, and how continuous updates and integration should be handled. The following are among the most notable design decisions:

  • Construction of a global ontology-like KB based on Wikipedia, i.e., the KB attempts to cover the entire world, capturing all important concepts, instances, and relations. The approach chosen for constructing a global KB has obvious disadvantages for domain-specific construction. When building a domain-specific ontology, e.g., for computer science (CS), it is unclear how to restrict the Wikipedia mining process efficiently so that it captures only CS-related topics.
  • Enrichment of the KB with additional sources, i.e., the need to enlarge the set of instances in the KB led to involving additional sources with more specific information. Combining several sources of knowledge always leads to the need for aligning/merging, a process that requires information about the context as well as human intervention. On the other hand, only a limited number of resources were processed to enlarge the KB. In my opinion, mining scientific articles in various domains would enrich the KB even further and provide the necessary depth of knowledge in state-of-the-art domains.
  • Relationship extraction from Wikipedia, i.e., Wikipedia pages connected to the KB concepts are analysed extensively with well-known natural language processing techniques to obtain free-form relations for concept pairs. Free-form means that there is no predefined set of possible relations in the KB. This gives a certain freedom but might hamper search, as the number of distinct relations may grow without bound.
  • KB updates are performed as a rerun of the KB construction pipeline from scratch. Clearly, this has several disadvantages:
    • Rerunning the whole KB construction takes substantial time.
    • As the KB is curated by a human (an analyst), it is not clear how and to what extent such curation should be reused after a rerun. Moreover, the question is how to apply preceding analyst interventions to the newly generated KB. In particular, an analyst might change an edge weight; however, if this edge is not present in the newly constructed wiki DAG (or a node has been renamed), the change won’t be utilised.
    • Regardless of the construction process, an incremental update of the KB seems more logical in terms of speed and of easing the analysts' work. Additionally, it is unclear how a single person can curate a KB of the entire world. In my opinion, the curation work should be spread across multiple people, e.g., crowdsourced.
    • It is unclear how conflicts, e.g., controversial relations, are resolved when combining different sources of information.
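
To make the curation problem from the list above concrete, here is a minimal Python sketch of replaying stored analyst edits onto a freshly rebuilt graph. It is my own illustration, not the paper's mechanism: the `WeightEdit` record, the `replay_edits` function, the dict-of-dicts graph layout, and the node names are all hypothetical. The point is only that an edit keyed by node names silently stops applying once an edge disappears or a node is renamed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WeightEdit:
    """One analyst intervention: set the weight of the edge parent -> child.

    Hypothetical record format, not taken from the paper.
    """
    parent: str
    child: str
    weight: float


def replay_edits(graph, edits):
    """Apply every edit whose edge still exists in the rebuilt graph.

    graph: dict mapping parent -> dict mapping child -> weight (assumed layout).
    Returns the edits that could not be applied, so a person can review them.
    """
    orphaned = []
    for edit in edits:
        children = graph.get(edit.parent)
        if children is not None and edit.child in children:
            children[edit.child] = edit.weight
        else:
            orphaned.append(edit)
    return orphaned


# Graph produced by a rerun; "Algorithm" was renamed to "Algorithms" upstream.
rebuilt = {
    "Science": {"Computer science": 0.6},
    "Computer science": {"Algorithms": 0.7},
}

# Edits recorded against the previous version of the KB.
edits = [
    WeightEdit("Science", "Computer science", 0.9),
    WeightEdit("Computer science", "Algorithm", 0.8),
]

orphaned = replay_edits(rebuilt, edits)
for edit in orphaned:
    print(f"could not replay: {edit.parent} -> {edit.child}")
```

In this toy run the first edit survives the rerun and the second one is reported as orphaned because of the rename. An incremental update would avoid most of these cases, since unchanged nodes keep their identity.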

The aforementioned system design imposes some limitations. First, relation types are not managed and may lack the expressiveness required for some applications, e.g., explainedBy, modelIn, methodsIn, importantIn, etc. This information can be extracted not only from Wiki pages, infoboxes and templates, but also from the templates at the bottom of the page. Second, the DAG construction from the cyclic graph extracted from Wikipedia requires additional verification. The constructed model for weight dissemination in the cyclic graph includes only three parameters, i.e., co-occurrences of the terms on the Web and in the Wiki lists, and name similarity. Third, the KB model might benefit from the analysis of scientific papers and the integration of this information into the KB. Finally, the system might benefit from data curation by means of crowdsourcing to improve accuracy and facilitate user contribution.
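
To illustrate what turning a cyclic graph into a DAG involves, here is a rough Python sketch. It does not reproduce the paper's weight model, which I do not restate here. I assume that the three signals are combined linearly (the `edge_score` function and its coefficients are hypothetical) and I swap in a simple greedy strategy: add edges from the highest score down and drop any edge that would close a cycle.

```python
def edge_score(web_cooccurrence, list_cooccurrence, name_similarity,
               coeffs=(1.0, 1.0, 1.0)):
    """Hypothetical linear combination of the three signals named in the text."""
    a, b, c = coeffs
    return a * web_cooccurrence + b * list_cooccurrence + c * name_similarity


def reachable(adjacency, start, goal):
    """Depth-first search: is there a directed path from start to goal?"""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(adjacency.get(node, ()))
    return False


def build_dag(scored_edges):
    """Greedily keep the strongest edges that do not create a cycle.

    scored_edges: dict mapping (parent, child) -> score.
    Returns (adjacency, dropped), where dropped lists the rejected edges.
    """
    adjacency, dropped = {}, []
    ordered = sorted(scored_edges.items(), key=lambda item: item[1], reverse=True)
    for (parent, child), _score in ordered:
        # Adding parent -> child closes a cycle if child already reaches parent.
        if parent == child or reachable(adjacency, child, parent):
            dropped.append((parent, child))
        else:
            adjacency.setdefault(parent, set()).add(child)
    return adjacency, dropped


# Toy category cycle with made-up signal values.
scored = {
    ("Science", "Computer science"): edge_score(0.8, 0.5, 0.2),
    ("Computer science", "Algorithms"): edge_score(0.6, 0.4, 0.1),
    ("Algorithms", "Science"): edge_score(0.1, 0.0, 0.1),
}

dag, dropped = build_dag(scored)
# The weakest edge, Algorithms -> Science, would close the cycle and is dropped.
```

Even in this toy form the result depends entirely on the scores, which is why I think a model with only three parameters deserves additional verification, for example by checking the dropped edges against analyst judgement.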