Jun 21 2014
 

I haven’t posted for a while; I hope to change that these days :)

I am continuing to work on an amazing project about scientific knowledge conceptualisation. The first step on the way to nirvana with this project is to create a methodology for ontology creation (aka the __repr__ of the knowledge base). Talking about a knowledge base and its construction inevitably leads to the problem of disambiguation and the necessity of removing duplicates from the knowledge base.

Obviously, construction of the ontology assumes different sources as possible input, e.g., different classifications, glossaries, and concepts from various papers. All these sources might produce equivalent concepts but represent/name them differently.

Here, I would like to talk about the simplest approach to resolving ambiguities, misspellings, and the various ways of writing the same thing – syntactic resolution. Conflict resolution for multiple sources of information at the syntactic level boils down to the following: detect whether two concepts are the same based on their syntactic structure.

To tackle this problem, I’ve developed a tool for deleting duplicates across multiple sources, with human curation of uncertain pairs of concepts. The tool works as follows (a minimal sketch in code follows the list):

  • Create an index over all words appearing in the concepts (each concept contributing its name, its stemmed form, and its form without special symbols) – essentially an inverted index, but with pointers to the terms in which each word appears.
  • A list of possible term candidates is created for every suspicious pair.
  • Each pair is scored according to the following criteria:
    • Number of equal words (lemmatised).
    • Order of the words, i.e., how much the positions of the words differ.
    • Absence or presence of an extra first or last word, as it might change the meaning a lot, e.g., Weather vs. Weather Model, Learning vs. Adaptive learning.
    • (One thing that is still missing is a synonym check.)
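
To make this concrete, here is a minimal sketch of the indexing and scoring step. The helper names (build_index, candidate_pairs, score_pair), the plain regex normalisation standing in for a real stemmer/lemmatiser, and the particular score weights are my own illustrative choices, not the exact implementation of the tool.

```python
import re
from collections import defaultdict
from itertools import combinations

def words(concept):
    """Lowercase, drop special symbols, split into words.
    A real stemmer/lemmatiser would be plugged in here."""
    return re.sub(r"[^a-z0-9 ]", " ", concept.lower()).split()

def build_index(concepts):
    """Inverted index: word -> set of concepts in which the word appears."""
    index = defaultdict(set)
    for concept in concepts:
        for w in words(concept):
            index[w].add(concept)
    return index

def candidate_pairs(index):
    """Concepts sharing at least one word form a 'suspicious' pair."""
    pairs = set()
    for sharing in index.values():
        pairs.update(combinations(sorted(sharing), 2))
    return pairs

def score_pair(a, b):
    """Combine shared words, word order, and the extra first/last word signal."""
    wa, wb = words(a), words(b)
    shared = len(set(wa) & set(wb)) / max(len(wa), len(wb))
    order = sum(x == y for x, y in zip(wa, wb)) / max(len(wa), len(wb))
    boundary = 0.5 if (wa[0] != wb[0] or wa[-1] != wb[-1]) else 0.0
    return shared + order - boundary   # "Weather" vs "Weather Model" gets penalised

concepts = ["Weather", "Weather Model", "Learning", "Adaptive learning"]
for a, b in sorted(candidate_pairs(build_index(concepts))):
    print(f"{a} | {b}: {score_pair(a, b):.2f}")
```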

The created set of terms is then searched for in the article corpus. Here, the whole article text is normalised (lemmatised, simplified, i.e., stop words removed, etc.) together with the terms in the initial seed. Each term occurrence in an article is detected and accumulated in a map, e.g., {norm_term: [(doc, start_pos, end_pos, context), …]}.
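
A rough sketch of this step is below; the function names and the crude regex normalisation (standing in for lemmatisation and stop-word removal) are assumptions for illustration.

```python
import re
from collections import defaultdict

def normalise(text):
    """Crude, length-preserving normalisation: lowercase and blank out punctuation.
    Lemmatisation and stop-word removal would be added here in the real pipeline."""
    return re.sub(r"[^a-z0-9 ]", " ", text.lower())

def find_occurrences(articles, terms, context_chars=40):
    """Build {norm_term: [(doc, start_pos, end_pos, context), ...]}."""
    occurrences = defaultdict(list)
    for doc, text in articles.items():
        norm_text = normalise(text)
        for term in terms:
            norm_term = " ".join(normalise(term).split())
            for m in re.finditer(re.escape(norm_term), norm_text):
                context = norm_text[max(0, m.start() - context_chars):m.end() + context_chars]
                occurrences[norm_term].append((doc, m.start(), m.end(), context))
    return occurrences

articles = {"doc1": "Climate change may amplify every storm season; a storm hits the coast."}
print(find_occurrences(articles, ["Climate change", "storm"]))
```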

When the occurrences are retrieved, the most related pairs are detected simply by computing a weighted co-occurrence for each pair (a sketch in code follows the list):

  • Each pair (a, b) can acquire a weight of at most 1 per article.
  • If a pair (a, b) appears in one sentence, the weight is increased by 1; otherwise it is increased by a weight that decreases with the distance between the terms in the article.
  • Finally, each weight is computed as follows:
    • the accumulated weight is normalised by min(count(a), count(b)) if it is smaller than this minimum,
    • otherwise the weight is set to 1.
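
Here is how that weighting could look in code; it consumes the occurrence map from the previous sketch. The exact distance decay (1 / (1 + distance)) and the data shapes are my own assumptions.

```python
from collections import defaultdict

def pair_weights(occurrences, sentence_spans):
    """occurrences: {term: [(doc, start, end, context), ...]} from the sketch above.
    sentence_spans: {doc: [(sent_start, sent_end), ...]}.
    Returns (weights, per_article) with per_article[(a, b)][doc] capped at 1.0."""
    per_article = defaultdict(lambda: defaultdict(float))
    counts = {t: len(occ) for t, occ in occurrences.items()}
    terms = sorted(occurrences)
    for i, a in enumerate(terms):
        for b in terms[i + 1:]:
            for doc_a, sa, ea, _ in occurrences[a]:
                for doc_b, sb, eb, _ in occurrences[b]:
                    if doc_a != doc_b:
                        continue
                    same_sentence = any(s <= sa and ea <= e and s <= sb and eb <= e
                                        for s, e in sentence_spans.get(doc_a, []))
                    w = 1.0 if same_sentence else 1.0 / (1.0 + abs(sa - sb))
                    cell = per_article[(a, b)]
                    cell[doc_a] = min(cell[doc_a] + w, 1.0)   # at most 1 per article
    weights = {}
    for (a, b), by_doc in per_article.items():
        acc = sum(by_doc.values())
        m = min(counts[a], counts[b])
        weights[(a, b)] = acc / m if acc < m else 1.0
    return weights, per_article
```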

Here is a basic representation of the graph that will appear after applying the procedure above.

[Figure: co-occurrence graph example]

Out of the available pairs generated from the articles, we select the ones closest to each other (i.e., the pairs with the maximal score, sum(weights) / number of articles in which the pair occurs) and proceed with them.
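
A small sketch of this ranking, using the per-article weights returned by the previous sketch:

```python
def rank_pairs(per_article):
    """per_article: {(a, b): {doc: weight}} from the sketch above.
    Score = sum(weights) / number of articles in which the pair occurs."""
    scores = {pair: sum(by_doc.values()) / len(by_doc) for pair, by_doc in per_article.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```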

The next problem, once the initial seed is filtered, terms are extracted from the text, and pairs are determined, is to leverage the syntactic structure of a sentence, i.e., its parse tree, to extract possible extensions to the seed ontology and detect possible (subj -> predicate -> obj) triplets, where subj and obj are noun phrases that are possibly present in the ontology, and predicate is a possible relation between subj and obj. Here we solve three problems: detect a predicate between a suggested pair of terms, expand terms (e.g., human -> human health), and detect a predicate between a given term and other possible terms that are not in our seed.

The procedure is as follows; it is shown in the figure below, and a rough sketch in code follows the list:

  • Split the article into sentences.
  • Parse the sentences with a dependency parser.
  • Identify concepts (either from a predefined list or extracted with some state-of-the-art NER).
  • Look for the Np-Vp-Np syntactic structure (the set of patterns can be extended, as it is in [1]).
  • Finally, compare the predicate (Vp) with the verb ontology and produce triplets and possible concept expansions (i.e., a bigger noun phrase for the extracted concept, e.g., climate -> climate change).
  • Each extracted triplet is aggregated over the whole corpus, and the triplets with the most evidence are suggested to the user for confirmation.
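
Below is a rough sketch of the Np-Vp-Np extraction, using spaCy as a stand-in dependency parser; the model name, dependency labels, and expansion heuristic are assumptions for illustration, and the actual pipeline may use a different parser and a richer pattern set, as in [1].

```python
import spacy

nlp = spacy.load("en_core_web_sm")   # any dependency parser would do here

def extract_triplets(text, seed_concepts):
    """Np-Vp-Np pattern: a verb with a nominal subject and object.
    Noun chunks around seed concepts give candidate expansions (climate -> climate change)."""
    seed = {c.lower() for c in seed_concepts}
    triplets, expansions = [], []
    doc = nlp(text)
    for token in doc:
        if token.pos_ != "VERB":
            continue
        subjects = [t for t in token.children if t.dep_ in ("nsubj", "nsubjpass")]
        objects = [t for t in token.children if t.dep_ in ("dobj", "obj")]
        for s in subjects:
            for o in objects:
                # in the real pipeline the verb lemma is then compared against a verb ontology
                triplets.append((s.lemma_.lower(), token.lemma_.lower(), o.lemma_.lower()))
    for chunk in doc.noun_chunks:
        hits = {t.lemma_.lower() for t in chunk} & seed
        if hits and chunk.text.lower() not in seed:
            expansions.append((sorted(hits)[0], chunk.text.lower()))
    return triplets, expansions

print(extract_triplets("Severe storms cause climate change.", ["storm", "climate"]))
```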

[Figure: relation extraction pipeline]

Together with relation extraction between pairs of concepts, we can take a single concept and determine the possible relations it might be involved in, as shown in the examples below.

[Figures: examples 1–3]

Predefined concepts are drawn in pink. Green concepts are those we determined by searching for relations to the “pink concept”, e.g., having the concept storm, we examined the syntactic structure and found the predicate causes and the object/concept climate change, and vice versa.

Overall, motivated by recent work on knowledge base construction [2, 3, 4], I’ve looked into how knowledge bases are built (this covers only the basics). Given the volumes of data to be processed, a MapReduce-style approach should clearly be adopted. In addition to scalability, incremental updates should be built into the constantly evolving system.

Overall, relation extraction and concept expansion showed good performance, with ~83% precision for both tasks and 40% recall for the relation extraction task.

[1] K. Fundel, R. Küffner, and R. Zimmer, “RelEx—Relation extraction using dependency parse trees,” Bioinformatics, vol. 23, no. 3, pp. 365–371, Feb. 2007.
[2] F. Niu, C. Zhang, C. Ré, and J. W. Shavlik, “DeepDive: Web-scale Knowledge-base Construction using Statistical Learning and Inference,” in VLDS, 2012, vol. 884, pp. 25–28.
[3] O. Deshpande, D. S. Lamba, M. Tourn, S. Das, S. Subramaniam, A. Rajaraman, V. Harinarayan, and A. Doan, “Building, Maintaining, and Using Knowledge Bases: A Report from the Trenches,” in Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, New York, NY, USA, 2013, pp. 1209–1220.
[4] F. Niu, C. Zhang, C. Ré, and J. Shavlik, “Elementary: Large-scale Knowledge-base Construction via Machine Learning and Statistical Inference.”

 

Feb 22 2014
 

Recently, I was reading a paper about building, maintaining and using knowledge bases. I feel that this paper will influence my future research significantly; therefore, I would like to discuss my point of view and thoughts about it.

To begin with, an ontology, in information science, is a formal representation of knowledge. Having such a representation may be beneficial to multiple fields and applications, e.g., search, query understanding, recommendations, advertising, social mining, etc. However, there has been little or no documentation of the whole life cycle of an ontology, including building, maintaining, and using it.

In the paper, the authors try to answer numerous questions, e.g., what are the pitfalls of maintaining a large knowledge base (KB), what is the influence of the user on the system, how should continuous updates and integration be handled, etc. The following choices are among their most notable design decisions:

  • Construction of a global ontology-like KB based on Wikipedia, i.e., the KB attempts to cover the entire world, capturing all important concepts, instances, and relations. The approach chosen for constructing a global KB has obvious disadvantages for domain-specific construction. In the case of building a domain-specific ontology, e.g., for computer science (CS), it is unclear how to efficiently limit the Wikipedia mining process to capture only CS-related topics.
  • Enrichment of the KB with additional sources, i.e., the need to enhance the set of instances in the KB has paved the way for involving additional sources with more specific information. Combining several sources of knowledge always leads to the necessity of aligning/merging, a process that requires information about the context and human intervention. On the other hand, only a limited number of resources were processed to enlarge the KB. In my opinion, mining scientific articles in various domains would enrich the KB even further and give the necessary depth of human knowledge in state-of-the-art domains.
  • Relationship extraction from Wikipedia, i.e., Wikipedia pages connected to KB concepts are analysed extensively with well-known natural language processing techniques to get free-form relations for concept pairs. Free-form means that there is no predefined set of possible relations in the KB. This gives a certain freedom but might limit search, as the number of such relations may grow without bound.
  • KB updates are performed by rerunning the KB construction pipeline from scratch. Clearly, this imposes several disadvantages:
    • Rerunning the whole KB construction takes substantial time.
    • As the KB is curated by a human (analyst), it is not clear how and to what extent such curation should be reused after the rerun. Moreover, the question is how to apply previous analyst interventions to the newly generated KB. In particular, an analyst might change edge weights; however, if an edge is not present in the newly constructed wiki DAG (or a node is renamed), that intervention won’t be utilised.
    • Regardless of the construction process, an incremental update of the KB seems more logical in terms of speed and of facilitating the analysts’ work. Additionally, it is unclear how a single person can curate a KB of the entire world. In my opinion, the aforementioned curation burden should be spread across multiple people, e.g., crowdsourced.
    • It is unclear how conflicts are resolved when combining different sources of information, e.g., controversial relations, etc.

The aforementioned system design imposes some limitations. First, relation types are not managed and may lack the expressiveness required by some applications, e.g., explainedBy, modelIn, methodsIn, importantIn, etc. This information can be extracted not only from Wiki pages, infoboxes, and templates, but also from the navigation templates at the bottom of pages. Second, the DAG construction from the cyclic graph extracted from Wikipedia requires additional verification. The model constructed for weight dissemination in the cyclic graph includes only three parameters, i.e., co-occurrences of the terms on the Web and in Wiki lists, and name similarity. Third, the KB model might benefit from scientific paper analysis and the integration of this information into the KB. Finally, the system might benefit from data curation by means of crowdsourcing to improve accuracy and facilitate user contribution.

Oct 20 2013
 

The field of distributed video streaming shows a large gap in handling churn and adversarial behavior. Attempts to eliminate adversarial influence were made by the cool guys from the University of Texas at Austin with their BAR Gossip protocol. However, churn is not covered or resolved by the authors.

In our work, we are trying to embrace both dynamicity and adversarial behavior from fewer than 30% of the nodes. The video streaming system is built on top of the framework developed by the LPD lab at EPFL. This framework makes use of highly expensive Byzantine agreement protocols over small subsets of the whole system – clusters. This approach guarantees both resilience to the adversary and consensus within an honest majority of nodes. Additionally, a certain level of randomization is added to handle adversarial attacks, e.g., nodes constantly change their location in the topology, jumping from one cluster to another.

This constant movement poses a certain challenge for maintaining the streaming application.

Background

Previously, streaming systems were classified into two categories: tree-based (push) and mesh-based (pull) [link]. However, both of them have their own advantages and disadvantages. The former is not suitable for handling churn, while the latter is robust to churn but cannot guarantee the effectiveness of content distribution.

Even though the mesh-based topology is designed with churn in mind, it is not robust against adversarial influence and is left exposed to the outside world. As opposed to mesh systems, our framework supports dynamic node placement, thus maintaining better fault tolerance.

At the same time, two main strategies for content dissemination exist: pull [link] and push. Pull strategies are proven to produce less overhead on the system [link]; however, in a fully dynamic and partially corrupted network, the pull strategy cannot provide the required fault tolerance under constant node replacement. Given the main requirement of sustaining performance in a dynamic, fault-tolerant system, we sacrifice the network-overhead advantage and adopt the push strategy.

Another class of systems are those that can actually handle a Byzantine adversary and try to sustain dynamicity, e.g., BAR Gossip [link] and FlightPath [link].

The former, BAR Gossip, is based on the simple gossip model, but it introduces verifiable pseudo-randomness for peer selection. The system is easy to implement. The main differences from the simple gossip model are the pseudo-random peer selection and the use of credible threats instead of reputation for peers. The credible threats mechanism is based on the fact that, upon suspicion, any node can send a POM (proof of misbehavior) message; therefore, if a rational node thinks it can be suspected, it might decide to actually forward some messages. The system operates in rounds and uses balanced exchange (a node exchanges information with others if it received something it did not already have in the current round) and an optimistic push protocol.
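
As a toy illustration of the idea of checkable peer selection (this is not BAR Gossip's actual construction, which relies on signed, verifiable pseudo-randomness), deterministic selection from public inputs lets anyone recompute which peers a node was supposed to contact:

```python
import hashlib

def select_peers(node_id, round_no, all_peers, fanout=2, seed="stream-seed"):
    """Derive this round's gossip partners deterministically from public inputs,
    so any other node can recompute the choice and spot deviations."""
    others = sorted(p for p in all_peers if p != node_id)
    chosen = []
    for i in range(fanout):
        digest = hashlib.sha256(f"{seed}:{node_id}:{round_no}:{i}".encode()).digest()
        chosen.append(others[int.from_bytes(digest[:8], "big") % len(others)])
    return chosen

print(select_peers("node-7", 42, [f"node-{i}" for i in range(10)]))
```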

The latter, FlightPath, sustains dynamicity in the infrastructure alongside fault tolerance and relies on an epsilon-Nash equilibrium. In such an equilibrium, rational nodes will deviate only if they expect to benefit by more than a factor of epsilon from such behavior. In the system, the source sends two types of messages: stream updates (the actual content) and linear digests (authentication of the updates). This system also relies on a verifiable pseudo-random algorithm and uses history update messages.

System Model and Problem Definition

We consider a network consisting of a dynamic collection of nodes, with an adversary model similar to [link]. Nodes have equal roles and are connected to each other in the [link] fashion.

The underlying system architecture looks as follows:

[Figure: EffectiveStreaming system architecture]

 

Basically, all nodes are grouped into small clusters within which a heavy consensus protocol can run relatively fast. Nodes change their places in the system constantly. Each connection is TCP, which will be replaced by UDP in the future. When the source starts streaming the data, it broadcasts it to everyone in its cluster and everyone in the neighboring clusters. Clusters are organized in Hamiltonian cycles, and for the sake of good expandability the number of these cycles is redundant (usually 2 or 3). As you can see, this is quite a preliminary description.
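
As a toy sketch of the dissemination step under the assumptions above (the naming and data shapes are mine, not the framework's API), one push hop looks roughly like this:

```python
def push_targets(source, clusters, neighbors):
    """clusters: {cluster_id: [node ids]}, neighbors: {cluster_id: [adjacent cluster ids]}
    (adjacency given by the redundant Hamiltonian cycles). Returns the nodes that
    receive a stream chunk pushed by `source` in one hop."""
    own = next(cid for cid, members in clusters.items() if source in members)
    targets = set(clusters[own]) - {source}
    for cid in neighbors.get(own, []):
        targets.update(clusters[cid])
    return sorted(targets)

clusters = {"A": ["a1", "a2", "src"], "B": ["b1", "b2"], "C": ["c1"]}
neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
print(push_targets("src", clusters, neighbors))
```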

I am planning to add more details on the implementation, the performance evaluation, and a comparison to other existing systems in the next posts.