Wednesday, July 16, 2008

Pronto 0.2 released

As you may have noticed C&P recently released the next minor version of Pronto – our probabilistic DL reasoner built on top of Pellet – Pronto 0.2. In what follows I’ll briefly summarizes the main improvements comparably to 0.1.
First off, Pronto 0.2 – is an optimization/bug fixing release meaning that there are no reasonably important new features (at least as far end users are concerned). We have very much been concerned with performance/scalability issues which were (and some still are) rather glaring in 0.1.
Here I must say that the success w.r.t. performance and scalability were quite different. There’re pretty substantial performance improvements, i.e. Pronto 0.2 is more predictable and few times faster than Pronto 0.1 when running on the same set of samples. Simply put, if Pronto 0.1 is capable of reasoning with some probabilistic ontology, Pronto 0.2. will do it much faster. How much faster depends on the KB.
Regarding scalability, the progress has been less impressive meaning that it still can’t scale substantially beyond ~15 generic (TBox) conditional constraints. Being a little more technical here: it’s polynomially better than Pronto 0.1 but that doesn’t entirely solve the problem with exponential complexity of probabilistic DL reasoning. More news about the ongoing progress in that direction is expected towards the end of summer 2008.
Now let’s turn to the nature of the performance improvements. The main is the new lexicographic entailment algorithm that replaced the original Lukasiewicz TLexEnt procedure [1] that was based on brute force search for the preferred models (recall that Pronto is a non-monotonic reasoner, so it infers knowledge from the set of preferred models of the KB, not from the set of all models). The new algorithm is based on the idea of carefully identifying and eliminating small pieces of the KB that can’t be satisfied by the preferred models. This is similar to computing maximal satisfiable (minimal unsatifiable) subsets of ontology – techniques that are used for debugging classical OWL ontologies. The algorithm helps to make reasoning a lot more predictable and robust, i.e. reduces the difference between the “easy” ontologies and “hard” ones that previously required 10-15x reasoning time (in spite of having the same size). More detailed technical description of the algorithm can soon be found in our paper that was recently accepted to ISWC’08 [2].
There are also other optimization techniques most of which try to reduce the excessive amount of classical DL reasoning performed during the probabilistic entailment.
Improvements (i.e. robustness and speed-up) are illustrated on the following diagram (taken from the ISWC paper). It compares the performance of Pronto 0.1 and Pronto 0.2 on a set of randomly generated probabilistic ontologies. Y axis shows the reasoning time and X axis shows the metric that we developed to assess “hardness” of specific probabilistic KBs before reasoning with them. Basically we wanted to understand what makes KBs hard for reasoning and it seemed that “connectivity”, i.e., number of all possible subsumption relations between OWL classes in probabilistic statements might matter (the higher connectivity - the easier must be the KB). The results show that in addition to simply being faster Pronto 0.2 is also more robust as there are no “hard outliers” – KBs that take a lot more reasoning time comparably to others of the same size. And it’s more predictable, i.e. by using the metric (which is fast to compute) one can roughly estimate reasoning time (if using Pronto 0.2).
I am hoping to make the evaluation results repeatable by other people. To achieve this we started a small project PREVAL_DL which is a reasoner-independent tool for evaluating the performance of P-SHIQ(D) reasoners. It can generate instances of probabilistic reasoning problems and run your favorite P-SHIQ(D) reasoner through them (Pronto and ContraBovemRufum are the only available so far).

We are currently working on the scalability problem. Hopefully soon we’ll have a better idea of how large probabilistic ontologies can still be reasoned with. Scalability improvements after that point will likely be concerned with partitioning/decomposing probabilistic KBs onto smaller probabilistically independent fragments.


[1] T. Lukasiewicz, Expressive Probabilistic Description Logics, Artificial Intelligence, 172(6-7), 852-883, April 2008

[2] P. Klinov and B. Parsia, Optimization and Evaluation of Reasoning in Probabilistic Description Logic: Towards a Systematic Approach, accepted to the International Semantic Web Conference, Karlsruhe, October 2008

Friday, November 23, 2007

Few extra features for P-DL

I'm starting using P-DL as an abbreviation for Probabilistic DL. Hope that's clear enough.

In my last post I whined about the complications of adding complex probabilistic formulas to P-DL without any good justification that they are actually needed. Let me now propose something else that seems to be easier to implement and might even be more useful.
Recall that Pronto needs numbers (intervals) for each of its conditional constraints. In some cases, however, it might make sense to express *relative* probabilistic constraints w/o any reference to numbers. For example: Given the symptoms, the chance of having disease A is twice as high as the chance of having B. So, we may not be willing to commit to numbers (or may not even know the numbers!) but still want to express the constraint. Why not to allow this?
This can be represented by expressions of the form:
k1*P(psi1|phi1)[l1,u1]+...+kn*P(psin|phin)[ln,un] >= 0
The semantics is also pretty clear - we just need to impose an extra restriction on the possible probability distributions. But this is just *one* inequality so allowing such expressions is no harder than supporting constraints themselves. That's good.
Next step - to implement it? Yes, I need to sort out nominals first.. =yawn=

Follow-up on complex formulas

In my previous post I raised few questions about allowing complex probabilistic formulas in probabilistic DL. The main issue is that neither P-SAT nor TLogEnt seem to be tractable because of no obvious translation to a single LP instance would do the job.
It can, however, be solved by a number of LP instances. Basically, the idea (as suggested by Lukasiewicz in private communication) is to convert the formula to DNF and non-deterministically solve N LP instances (where N is the number of disjuncts).
This doesn't look tractable. To make things even worse, some of conjuncts might be negations which means even a single disjunct might have to be broken onto several LP instances! For example, assume we have smth like:
(psi1|phi1)[l1,u1] \or ((psi2|phi2)[l2,u2] \and \neg{(psi1|phi1)[l3,u3]})
The way this can be handled is to generate 3 LP instances:
- check if LP for (psi1|phi1)[l1,u1] is solvable. If yes, then stop and answer "yes"
- if no, check that LP for (psi2|phi2)[l2,u2] is solvable. If no, stop with "no"
- if yes, check that LP for (psi3|phi3)[l3,u3] is solvable. If no, stop with "yes", otherwise - stop with "no".
Horrible, isn't it? Of course, there might be heuristics as to minimize the number of LPs to be solved. In addition generated LP instances can be reused. Pronto experience shows that generation of an LP problem is a lot more expensive than to solve it.

Thursday, November 01, 2007

Probabilistic Description Logic and Complex Formulas

It's been a while since I implemented the first version of Pronto and only now I realized that P-SHIN (which is a basic of Pronto) doesn't allow complex probabilistic expressions (call them formulas or axioms, assertions, whatever). What I mean by complex expression is simply a combination of conditional constraints linked using conventional logical connectives - and, or, neg, impl, with obvious relationships. Consider the example:
In Pronto, we can say that for some probabilistic individual, its probability of being a penguin is 90% - (Penguin|owl:Thing)[0.9;0.9] (so called PABox axiom). We can also say smth else, like (Alive|owl:Thing)[0;0.1]. It's pretty easy to observe that both are in fact equivalent to (Penguin|owl:Thing)[0.9;0.9] \and (Alive|owl:Thing)[0;0.1] according to the obvious semantics of \and. Okay, but why shall we stop here and not allow something like: (Penguin|owl:Thing)[0.9;0.9] \or (Alive|owl:Thing)[0;0.1]? Or even (Penguin|owl:Thing)[0.9;0.9] -> (Alive|owl:Thing)[0;0.1]? In other words, why not to built complex expressions from the conditional constraints similarly to how propositional formulas are built from atoms.
Something similar was proposed by Lukasiewicz in his paper "Probabilistic Default Reasoning with Conditional Constraints" where he inductively builds so called strict probabilistic formulas from strict conditional constraints. Note, that he allows only strict constraints to participate in such expressions and not defaults (probably because it's semantically unclear what should be a conjunction/negation of defeasible statements). Okay, given that PABoxes in Pronto never have defeasible constraints, we can do the same.
However, in his paper on Probabilistic DL P-SHOQ, Lukasiewicz did not propose that. Why? Well, there might be several reasons. The first (and the most obvious) is that it's not quite clear if satisfiablity of a probabilistic default theory *plus* of a set of such complex probabilistic formulas can be decided more or less efficiently (yeah, I know, that it's hard to speak about efficiency when dealing with huge linear systems, but still..). For example, given the semantics of "I|=A or B iff I|=A or I|=B" where both A a and B are conditional constraints, it seems that it might be necessary to non-deterministically solve two linear systems . Negation is also tricky (though can be omitted for the moment).
So, seems to be an interesting research question. But more importantly, do we actually need such complex probabilistic ABoxes? Would someone from, say, medical domain, be interested in the ability to express something like: "either John has flu with probability at least 90% or he fakes his illness with probability between 70% and 80%"? If yes, then we can try to think about reasonably efficient generalizations of satisfiability/entailment procedures.

PS. Hum, Lukasiewicz doesn't say anything about deciding satisifiability of strict probabilistic formulas w.r.t. probabilistic default theories in "Probabilistic Default Reasoning with Conditional Constraints". Maybe he did elsewhere?

PPS. Does any one know an easy way to generate pictures from TeX expressions and embed them into blog posts similarly to Wikipedia pages?

Sunday, October 21, 2007

[Pronto] on its ability to generalize

When reading a perfect paper-survey by Simon Parsons on current approaches to handling imperfect knowledge in DB and KB [1], I noticed a fairly interesting example. It says: "... Amarger et al. [2] are essentially doing a series of propositions, whose relationships are stated using imprecisely stated probabilities, and inferring new probabilistic relationships between them. Thus given the information that between 70% and 90% students do sport, and between 85% and 90% of students are young, they show how to conclude what proportion of young people do sport..."
This got me curious. Can Pronto do something like that? Formally, given that (Sportsman|Student)[0.7;0.9] and (Young|Student)[0.85;0.9] can it conclude anything meaningful about (Sportsman|Young)[l,u]? Well, quick check revealed that it can't, that answer was [0;1] - _cant_say_anything_ which is expected, of course. Note, that neither Sportsman nor Young is a sub-class of Student, so why would we want to conclude anything about the relationship between potentially broader classes? From a very cautious, rigorously probabilistic viewpoint, we cannot. What if the probability of Student is zero, for instance?
Another query would be something like (StudentSportsman|YoungStudent)[l,u] (i.e. same thing but Sportsman and Young converted to obvious subclasses of Student). This query results in [0.7;0.9] which isn't a surprise either. Indeed, suppose we get a random individual who is a young student. What do we know about him? Well, he's a student. Anything else? Can we gain something from (YoungStudent|Student)[0.85;0.9]? Nope. So, we're forced to an unpleasant conclusion that the result will be fully determined by (StudentSportsman|Student)[0.7;0.9].
This is somewhat discouraging - yet another evidence that probabilities don't always produce the results that humans would expect. Normally, having a sample data, e.g., students and knowing how many of them are young and/or do sports, you might get a rough estimate about youngsters in general. It's debatable whether it would be a good estimate, but you *do get* some idea. Here, you get nothing because of being too cautious.
Does it mean that Pronto entailment relation is too weak? Well, it may or may not be, depending on what it's used for.

[1] S. Parsons, "Current Approaches to Handling Imperfect Information in Data and Knowledge Bases," IEEE Transactions on Knowledge and Data Engineering, 8(3), pp. 353-372, 1996
[2] A. Amarger, D. Dubois and H. Prade, "Constraint Propagation with Imprecise Conditional Probabilities, " Proceedings of the 7th Conference on Uncertainty in Artificial Intelligence, pp. 26-34

Thursday, October 18, 2007

Comments on HCLS/Uncertainty Use Cases

As some people might know from my Clark&Parsia weblog post, I've spent my summer practical training on extending Pellet with probabilistic capabilities. Now, we're trying to promote the tool by showing how it might help people in real-life cases. Some of such cases were summarized by W3C Semantic Web Healthcare and Life Sciences Interest Group. Let me comment on those:

# Hypothesis Uncertainty
* Mutations in the alpha synuclein could cause Parkinsons Disease

The problem here is how to quantify "could cause". As long as this is done, a constraint can be added using evidence class "AlphaSynucleinMutation" and conclusion class "ParkinsonDisease". Pronto supports 2nd level uncertainty (meta-uncertainty - uncertainty in probability estimation), so this should not be a huge problem.

* Hypotheses of relationships based on statistical analysis of microarray data associated with p-values, confidence intervals, etc.
* Gene Ontology Evidence codes in support of a particular GO annotation of a gene
* Evidence classes in the OBO Evidence Ontology

It would be interesting to take a look at the ontology, but it seems that evidence classes if turned into OWL classes might serve as perfect evidences for conditional constraints. Hypotheses will be conclusions.

# Interpretation/Classification Uncertainty

* The patient has elevated cholesterol based on his reading of X mg/dl
* Given the same set of symptoms, Doctor X and Y come up with diagnosis of mild and severe disease respectively

Ok, let's say ABCD is our compound evidence class (where A,B,C,D a classes-symptoms), MD - mild disease class and SD - severe disease class. Then, I guess there should by some sort of statistics that could be expressed in a pair of constraints - (MD|ABCD)[l_1,u_1] and (SD|ABCD)[l_2, u_2]. Having an individual _a_, both doctors add a probabilistic ABox assertion (ABCD|Top)[l_a,u_a] for _a_ (or maybe even strict one - _a_:ABCD). Till now everything is perfectly supported by Pronto. Then, Pronto will compute results (MD|Top)[l_amd, u_amd] and (SD|Top)[l_asd, u_asd]for _a_. The question is, what doctors should conclude? They may either favour one diagnosis to another basing on probabilities or override the probabilities for _a_ (say, (MD|Top)[0.2, 0.4] and doctor X explicitly adds (MD|Top)[0.7, 0.8] for _a_). Having enough number of *overriden individuals* one may even adjust generic constraints (MD|ABCD) and (SD|ABCD)

* True/False Positive/Negative rates of patient classifications and diagnoses. Use of measures such as Precision, Recall, PPV, NPV, etc.

Not sure I understand this

# Prediction-oriented Uncertainty

* A person with the BRCA1 gene has a disposition towards Breast Cancer with 70% probability in the future

See BRCA model

# Belief oriented uncertainty

* It is believed to the best of our knowledge that a particular gene is not implicated in a particular disease

This is the problem of independence that is not currently captured in Pronto by any means. Basically, it doesn't hurt the representation (besides of making it less explicit). As long as there're no constraints connecting the gene and the disease (even indirectly), no knowledge about the presense of the gene affects the reasoning results concerning the disease. The problem is performance because the reasoning engine _doesn't know_ that the gene is irrelevant to the disease and will take it into account when constraining the probability interval on the disease.

* Associated non-monotonicity with the above, i.e., if more knowledge is available, the statement could be proven false.

Pronto is inherently non-monotonic, i.e. it supports defeasible knowledge. Unfortunately, as long as independence can't be explicitly represented by an axiom, the axiom can't be defeated :) Instead, if the gene _is known_ to be implicated in the disease, this statement might be overriden (defeated) for some particular individual (for whome we know something else, say, the presense of another gene that diminishes the harm of the former).

I must say that Pronto currently doesn't support any belief updating mechanism, e.g. accumulated knowledge about certain individuals doesn't by any means affect generic default knowledge. I can imagine this to be desirable. This might be implemented on top of the core representation and reasoning servises. This also correlates with learning conditional constraints from data.

# Data Source based Uncertainty

* Samples from the same patient are analyzed by different labs. Lab 1 results show an 80% probability of Disease 1, whereas Lab2 shows a 90% probability for the same.

That's the perfect example of uncertainty intervals: constraint (Disease1|Evidence)[0.8;0.9] would naturally model the case.

* If the Cleveland Clinic says that Avandia is bad for Diabetes, the statement has a higher value of certainty as opposed to an individual Dr. X

Pronto operates with intervals (see the prev. bullet) but it doesn't assume any bias in distribution of actual probabilities w/in the intervals. In other words, if the probability that A is subsumed by B is w/in [l,u], it means that it can be anywhere between l,u and the representational language doesn't allow to specify whether it's "more closer" to _l_ or to _u_. Equipping Pronto with some predefined set of distributions, e.g., normal, etc., might be an interesting problem. Obviously, it won't affect satisfiability and consistency, but we might be able to compute smth like expectations for inferences.

# Data Uncertainty

* Approximate location of a clinical feature, e.g, tumor in spatial location in the human body as captured in radiological image or any other digital artifact
* Data inconsistency and incompletenes encountered in Healthcare and Drug Databases

This is again related to learning certainty intervals from data which can be incomplete, say, in a relational table some attribute values are either unknown (missing) or unreliable. One way of computing probabilities from such data is to use the approach similar to rule induction from incomplete information tables (see http://lightning.eecs.ku.edu/c95-perugia.pdf) Basically, if we could come up with approximations of classes based on the data we have, we can approximate the subsumption relationship between them and represent lower and upper approximations as probabilities. After that Pronto should be able to reason.

* Data uncertainty introduced due to sampling errors, sampling rates, etc.)
* Data uncertainty introduced due to the limitations (least count error?) of the device measuring patient characteristics (e.g., temperature)
* Data uncertainty introduced due to limitation of instruments used to collect experimental data, e.g., micro-arrays

All these are related to handling incompleteness. In general, this situation is often referred to as _granularity_ of knowledge. It's barely feasible to capture the knowledge on the finest granularity level due to measurement limitations - that's one of the reasons we have intervals for probabilities, not just probabilities. These questions aren't directly related to Pronto but are rather more generic issues of quantifying and representing uncertainty as well as computing approximate concepts from uncertain data (see, e.g., http://www2.cs.uregina.ca/~yyao/concept_lattice/fca_app.pdf)

Me back again

Yeah, it's been another 8-9 months since my last post. I must admit that I'm not very organized and can't write here on a regular basis (as I should because posting certainly helps in organizing and structuring ideas). Oh well.. Another try?

Sunday, January 21, 2007

Mereology: Set and Class

...just wondering if anyone is proficient in Lesniewski mereology and understands the motivation behind the "set-of" property? Let me remind you:
Object x is said to be a set of objects with property m iff:
for any y if y-object-m and y-ingr-x then there exist objects z and t such that:
  1. z-ingr-y
  2. z-ingr-t
  3. t-ingr-x
  4. t-object-m

So it can be read smth like:

x is a set of objects with property m iff for any ingredient of x having the property m, it has to have an ingredient that is an ingredient of that part of x that has property m.

Yeah, don't tell me it' horrible. I know. So, what was Lesniewski intuition when he came up with it?

Homepage

It finally happened! My new homepage (http://homepages.uc.edu/~klinovp) is now accessible. Yes, there're lots of "under construction" pages, but don't worry, one day they'll in place too.

Monday, January 15, 2007

Welcome back, Pavel!

well, at some point I started thinking that I'd never be back... No because of being incredibly busy, it's just laziness that was holding me back. And nonetheless I'm back!
Smth has happened during my absence here. The biggest thing is that I've passed PhD Qualifier and now can dive deeply in my research. Yes-yes, I outscored all other PhD students in Computer Science and Engineering! Are they too bad or am I exceptionally good? Neither is true, I suppose... It just happened, it's in the past now, it's a part of the history.
Hopefully, I'll be more regular here now. We'll see.