Jun 212014

I haven’t posted for a while, I hope to change it these days:)

I am continuing working on amazing project about scientific knowledge conceptualisation. The first step on the way to reach nirvana with my project is to create a methodology for ontology creation ( aka __repr__ of the knowledge base). Talking about knowledge base and its construction will clearly lead to the problem of disambiguation and necessity of removing duplicates in the knowledge base.

Obviously, construction of the ontology expects the presence of different sources as possible input, e.g., different classifications, glossaries, concepts from various papers. All those sources might produce equivalent concepts but represent/name them differently.

Here, I would like to talk about the simplest approach to resolve disambiguations, misspelling various way to write the same thing – syntactic resolution. Conflicts resolution for multiple source of information on the syntactic level includes the following: detect whether two concepts are the same based on the syntactic structure of the aforementioned.

To tackle this problem, I’ve developed a tool for multiple sources duplicate deletion with human curation upon uncertain pairs of concepts. The tool works as follows:

  • Create an index over all possible words appearing in the concepts (with the following components: its name, its stemmed form, form without special symbols), aka inverted index but with pointers to the term in which the word appeared.
  • List of possible term candidates is created for every pair that is suspicious.
  • Each pair is scored according to the following criteria:
    • Number of equal words (lemmatised)
    • Order of the words, or how the position of words differs
    • Absence or presence of extra first or last word, as it might change the meaning a lot, e.g., Weather and Weather Model, Learning and Adaptive learning.
    • One of the things that is missing is synonyms check

Created set of terms is searched in the article’s corpora. Here, whole article text is normalised (lemmatised, simplified(w/o stop words) etc.) together with the terms in the initial seed. Each term occurrence in the article is detected and accumulated in the maps, e.g., {norm_term: [(doc, start_pos, end_pos, context), …]}

When the occurrences are retrieved the most related pairs are detected, simply by computing a weighted co-occurrence for a pair:

  • Each pair (a, b) can acquire a max of 1 weight in an article.
  • If a pair (a, b) appear in one sentence – the weight is increased by 1, else it is increased by a weight proportional to the distance between terms in the article.
  • Finally, each weight is computed as follows:
    • normalised by a min(count(a), count(b)) for accumulated weight that is smaller than this min
    • or else return 1

Here is a basic representation of the graph that will appear after applying the procedure above.


Out of the available pairs generated from the articles, we select the most close to each other (i.e., pairs with the maximal score – sum(weights)/#of articles where pair occur) and proceed further with them.

Next problem, when the initial seed is filtered, terms are extracted from the text and pairs are determined, is to leverage the syntactic structure of a sentence, i.e. parse tree, to extract possible extensions to the seed ontology and detect possible (subj -> predicate -> obj) triplets, where subj and obj are noun phrases and are possible present in an ontology, and predicate is a possible relations between subj and obj. Here we solve three problems, detect a predicate between suggested pairs of terms, expand terms (e.g., human -> human health), detect a predicate between a given terms and possible other terms that is not in our seed.

The procedure is as follows and shown on the figure below:

  • Split the article into sentences.
  • Parse the sentences with the parser (dependency one)
  • Identify concepts (either form defined list of extract with some state-of-the-art NER)
  • Look up for the aka NpVpNp syntactic structure (the set of patterns can be extended, as it is in [1])
  • Finally compare the predicate (Vp) with the verb ontology and produce triplets and possible concepts expansions (aka bigger noun phrase for the extracted concept, e.g. climate -> climate change).
  • Each triplet extracted are aggregated over the whole corpora and the triplet with the most evidence is suggested to be confirmed by the user.


Together with relation extraction between pair of concepts, we can deal with one concepts and determine possible relations it might be involved in – as it is shown on the example below.


With pink defined concepts are drawn. Green are the concepts that we determined by searching for relations to the “pink concept”, e.g. having concept storm we examined the syntactic structure and found the predicate causes and objects/concept climate change and vice versa.

Overall, motivated by the recent knowledge base construction [2, 3, 4], I’ve looked up hot to build knowledge bases (here is only basics). Talking about the volumes of data that are to be processed, obviously aka MapReduce approach should be further established. Additionally to the scalability, incremental updates to the constantly evolving system should be build-in.

Overall, relation extraction and concepts expansion showed a good performance with ~83 % precision for both tasks, a 40 % recall for the relation extraction task.

[1] K. Fundel, R. Küffner, and R. Zimmer, “RelEx—Relation extraction using dependency parse trees,” Bioinformatics, vol. 23, no. 3, pp. 365–371, Feb. 2007.
[2] F. Niu, C. Zhang, C. Re, and J. W. Shavlik, “DeepDive: Web-scale Knowledge-base Construction using Statistical Learning and Inference.,” in VLDS, 2012, vol. 884, pp. 25–28.
[3] O. Deshpande, D. S. Lamba, M. Tourn, S. Das, S. Subramaniam, A. Rajaraman, V. Harinarayan, and A. Doan, “Building, Maintaining, and Using Knowledge Bases: A Report from the Trenches,” in Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, New York, NY, USA, 2013, pp. 1209–1220.
[4] F. Niu, C. Zhang, C. Ré, and J. Shavlik, Elementary: Large-scale Knowledge-base Construction via Machine Learning and Statistical Inference. .