Jul 162014

I am heading straight into the candidacy exam in a month. It is quite a challenging time and therefore I wanted to start writing about one of the papers I chose for the exam – Elementary: Large-scale Knowledge-base Construction via Machine Learning and Statistical Inference.

Elementary is a Knowledge Base Construction (KBC) that encapsulates a range variety of resources and techniques under one umbrella – statistical inference. This system employs Markov logic, probabilistic graphical models, to incorporate both rule based approach and ML (such as Conditional Random Fields) to construct the KB with a certain uncertainty ;). On the other hand, it leverages the dual decomposition to decompose the optimisation functions as a trade-off between uncertainty and scalability. Here, they try to optimise the overall cost of the system, which is cost of having false positive and false negative inference.

The system is claimed to be a universal framework for KBC, however, it is not clear how some problems, apart from classification, segmentation and clustering, could be solved. The main intuition for this statement, is that the only problems that can be solved are those that have Markov Logic Network(MLN) representation. Apart from this, it is also unclear how the further maintaining and updating is performed, as it is well known that KBs are tend to evolve rapidly over time. Moreover, it does not support ‘cold’ start as, first, entities should be manually chosen, second, relation extraction is fully relies on the distant supervision. Basically, concept extraction should be managed by either user provided functionality or user should be an expert in the field and know the most interesting entities.

High-level process of KBC is as follows:

  1. Entities and relations should be chosen prior to the execution
  2. Corpus of the documents should be chosen
  3. This is a feature extraction (FE) step. Basically, we are up to the filling two tables: entities mentions, relation mentions (not only relations we want to infer but also some that will help to deduce inferred relations).
  4. This step manipulate with the collected evidence and put it all to the MLN representation, domain rules, ML (like CRF, logistic regression, clustering etc.). Finally, rule weighting is performed
  5. Now statistical inference over the weighted MLN can be run. As a result, we make inference on the parts of the future KB
  6. Iterate over 3 and 4 again if the results are not satistiable

And now some more details about the system and underlying approach.

As it was mentioned earlier the system relies on the MLN and consists of three parts:

  • schema is used to specify which data will be provided and generated, together with the information on the instances that are to be inferred.
  • evidence is a database representation of a schema, i.e., instances of the schema tables, or in other words, presence of the schema in the corpora (extracted concepts, co-occurrence, relations etc)
  • rules serve two purposes, constrain the inference (from domain knowledge etc) and guide the inference process (conclusions from the first order logic representation of the rules). Rules are weighted by the level  of certainty it guarantees (could be either learned or heuristically set).

Knowing what we would like to get from the system, a lot of tasks could be classified to one of the problem classes: classification(logistic regression), segmentation (conditional random fields), correlated clustering. Aforementioned tasks can be split into smaller subtasks and reduced together (when their subtasks are solved independently) with certain level of confidence. Subtasks reduction is performed with dual decomposition []. This gives us flexibility in two ways:

  • split bigger tasks into smaller one’s to reach significant speed up and efficiency
  • different/multiple input sources could be treated by different processed and further combined

Therefore, both conflicting input sources and scalability can be solved simultaneously.

Finally, rules, constrains, execution paths with different inputs are represented as MLN. Apparently, in order to accounted, each component should be represented in first order logic and any adjustment to the rules and immediately applied with the further processing, i.e., optimal path to the inferred variable will ensure each rule satisfiability with certain level of confidence (rule weight). Statistical inference is applied on top of the MLN representation so that guaranty the most probable inference for a variable. First, the rules can be scoped, i.e., the number of rules can be minimised by manually adding a scoping rule to reduce the space. For example, you can specify that there are N-1 adjacent pairs for the query Adjacent(element, element). Second, a weight for the rule might be set proportional to some of the parameters in the rule, e.g., making a rule more valuable if it contains the mention of the word “Obama”.

Evidence is a presence in the corpora any of the schema items. In other words, if we have co-occurrence predicate in the schema we would like to obtain co-occurrence statistics for concepts/instances. Each document treated as a set of possible overlapping text spans. Based on this notion, three tables are populated (evidences):

  • entity-mention M_E for each entity E, e.g., PERSON(Julia, TextSpans(doc_i, positions)). This table is filled for the entities that are chosen by the user
  • relationship-mention R_E for each relation (with the arity of the relation representation, e.g., Mention(a, b) – has arity 2)
  • relationship table which can be obtained from the two previous tables.
  • Finally, each text span is characterised with the feature vector: F_m(T[E], V_m), where V_ms are the features and T[E] is the list of text spans [entities].

Apparently, the necessity to have features puts a requirement to have annotated data, which could be filled with the distant supervision patterns.

The system basically leverages three main subproblem, i.e., entity linking and relation extraction on top of former and domain knowledge incorporation. Last subproblem is mainly manually curated unlike the first two due to their nature.

Entity extraction example: canonical names, string matching, regular expressions and NER (named entity recognition) can be used as a source of features for the MLN. After extracting the features, rules for the entity linking could be determined. Constrains could be set as well, e.g., if mention is an entity -> then it should have been detected by the NER etc.

Relation extraction could be simplified to the following. Commonly classification on the linguistic patterns is used. Among features are the following: pattern similarity, frequency based filtering, l1-norm regularisation [1]. Multiple approaches could be incorporated, however, after the rules for the inference should be specified alike in the entity extraction example.

To make it clearer let’s consider an example. We have a set of rules where x is a key phrase and y is a label for the sentence. We would like to infer if a “Ronaldo” is in a sentence that the other key-phrase is also sport related :

  1. Mention(x,y) and x == “Ronaldo” => SportMention(x, y)                                   | w = 4
  2. SportMention(x, y) and Mention(x1, y) and x1 != x=> SportMention(x1, y)  | w = 3
  3. Mention(x_i, y) and each x_i != “Ronaldo” => not SportMention(x1, y)        | w = 10

Let’s say we are interested in inferring SportMention having a set of evidences, i.e., we would like to deduce a relation SportMention between x and y having a set of rules/constrains and set of evidences.

Set of rules can be represented as a MLN with activation functions proportional to the weight of the rules (btw, rule weights can be either learned or heuristically set – here I did it heuristically and proportionally to the level of believe I have in the rule:)). Basically, having an annotated corpus we would like to compute the misses

Screen Shot 2014-07-23 at 21.46.46

At this stage, we can simply impose the following problem: which node inference has the most weight and which combination of rules minimises the overall error for some training examples, knowing the dependencies and constrains.

Coming back the core subproblems of the system: classification, segmentation and correlated clustering. These are the approaches that are matched to the problems in the KBC and that enable decomposition of the problem. So that, KBC could be solved in parallel.

[1] Zhu, J., Nie, Z., Liu, X., Zhang, B., & Wen, J. (2009). Statsnowball: A statistical approach to extracting entity relationships. In Proceedings of the International Conference on World Wide Web (pp. 101–110).