As some people might know from my Clark&Parsia weblog post, I spent my summer practical training extending Pellet with probabilistic capabilities. Now we're trying to promote the tool by showing how it might help people in real-life cases. Some such cases were summarized by the W3C Semantic Web Healthcare and Life Sciences Interest Group. Let me comment on those:

# Hypothesis Uncertainty

* Mutations in the alpha synuclein could cause Parkinsons Disease

The problem here is how to quantify "could cause". Once that is done, a conditional constraint can be added with evidence class "AlphaSynucleinMutation" and conclusion class "ParkinsonDisease". Pronto supports second-level uncertainty (meta-uncertainty, i.e. uncertainty in the probability estimate itself), so this should not be a huge problem.

* Hypotheses of relationships based on statistical analysis of microarray data associated with p-values, confidence intervals, etc.
* Gene Ontology Evidence codes in support of a particular GO annotation of a gene
* Evidence classes in the OBO Evidence Ontology

It would be interesting to take a look at the ontology, but it seems that its evidence classes, if turned into OWL classes, could serve directly as the evidence of conditional constraints. The hypotheses would be the conclusions.

# Interpretation/Classification Uncertainty

* The patient has elevated cholesterol based on his reading of X mg/dl
* Given the same set of symptoms, Doctor X and Y come up with diagnosis of mild and severe disease respectively

OK, let's say ABCD is our compound evidence class (where A, B, C and D are symptom classes), MD is the mild disease class and SD is the severe disease class. Then, I guess, there should be some sort of statistics that could be expressed as a pair of constraints: (MD|ABCD)[l_1, u_1] and (SD|ABCD)[l_2, u_2]. Given an individual _a_, both doctors add a probabilistic ABox assertion (ABCD|Top)[l_a, u_a] for _a_ (or maybe even a strict one, _a_:ABCD). Up to this point everything is fully supported by Pronto. Pronto will then compute the results (MD|Top)[l_amd, u_amd] and (SD|Top)[l_asd, u_asd] for _a_.

The question is what the doctors should conclude. They may either favour one diagnosis over the other based on the probabilities, or override the probabilities for _a_ (say, Pronto computes (MD|Top)[0.2, 0.4] and doctor X explicitly adds (MD|Top)[0.7, 0.8] for _a_). Given a sufficient number of *overridden individuals*, one may even adjust the generic constraints (MD|ABCD) and (SD|ABCD).
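
To make the two-doctors case a bit more tangible, here is a minimal Python sketch of the elementary bounds on (MD|Top) for _a_ that follow from the probability axioms alone, assuming that (MD|ABCD)[l_1, u_1] and (ABCD|Top)[l_a, u_a] are the only things known about _a_. It does not reproduce Pronto's reasoning procedure, and the names `Interval`, `unconditional_bounds` and `effective_interval` are made up for illustration; they are not part of any Pronto API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Interval:
    """A closed probability interval [lower, upper] (hypothetical helper type)."""
    lower: float
    upper: float

    def __post_init__(self):
        if not 0.0 <= self.lower <= self.upper <= 1.0:
            raise ValueError("expected 0 <= lower <= upper <= 1")


def unconditional_bounds(cond: Interval, evidence: Interval) -> Interval:
    """Bounds on P(MD) from P(MD|ABCD) in `cond` and P(ABCD) in `evidence`.

    Assumes nothing else is known about the individual.
    Lower: P(MD) >= P(MD and ABCD) >= l_1 * l_a.
    Upper: P(MD) <= P(MD and ABCD) + P(not ABCD) <= 1 - l_a * (1 - u_1).
    """
    lower = cond.lower * evidence.lower
    upper = 1.0 - evidence.lower * (1.0 - cond.upper)
    return Interval(lower, upper)


def effective_interval(computed: Interval, override: Optional[Interval]) -> Interval:
    """An explicit per-individual assertion, if present, replaces the computed one."""
    return computed if override is None else override


# Strict assertion a:ABCD, i.e. P(ABCD) = 1 for the individual a.
strict = unconditional_bounds(Interval(0.2, 0.4), Interval(1.0, 1.0))
# Uncertain evidence: the symptoms are present with probability in [0.9, 1.0].
uncertain = unconditional_bounds(Interval(0.2, 0.4), Interval(0.9, 1.0))
# Doctor X overrides the computed interval for a.
final = effective_interval(strict, Interval(0.7, 0.8))
```

With the strict assertion the bounds collapse to the generic interval [l_1, u_1]; with uncertain evidence they widen, because _a_ may fall outside ABCD, where nothing is known about MD.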

* True/False Positive/Negative rates of patient classifications and diagnoses. Use of measures such as Precision, Recall, PPV, NPV, etc.

I'm not sure I understand this one.

# Prediction-oriented Uncertainty

* A person with the BRCA1 gene has a disposition towards Breast Cancer with 70% probability in the future

See the BRCA model.

# Belief oriented uncertainty

* It is believed to the best of our knowledge that a particular gene is not implicated in a particular disease

This is the problem of independence, which Pronto currently does not capture in any way. That does not really hurt the representation (apart from making it less explicit). As long as there are no constraints connecting the gene and the disease (even indirectly), knowledge about the presence of the gene does not affect the reasoning results concerning the disease. The problem is performance: the reasoning engine _doesn't know_ that the gene is irrelevant to the disease and will still take it into account when constraining the probability interval for the disease.

* Associated non-monotonicity with the above, i.e., if more knowledge is available, the statement could be proven false.

Pronto is inherently non-monotonic, i.e. it supports defeasible knowledge. Unfortunately, as long as independence can't be explicitly represented by an axiom, that axiom can't be defeated :) Conversely, if the gene _is known_ to be implicated in the disease, this statement might be overridden (defeated) for some particular individual (one for whom we know something else, say, the presence of another gene that diminishes the harm of the former).

I must say that Pronto currently doesn't support any belief updating mechanism, e.g. accumulated knowledge about particular individuals doesn't in any way affect generic default knowledge. I can imagine this being desirable. It might be implemented on top of the core representation and reasoning services. It is also related to learning conditional constraints from data.

# Data Source based Uncertainty

* Samples from the same patient are analyzed by different labs. Lab 1 results show an 80% probability of Disease 1, whereas Lab2 shows a 90% probability for the same.

That's a perfect example for uncertainty intervals: the constraint (Disease1|Evidence)[0.8, 0.9] would naturally model the case.

* If the Cleveland Clinic says that Avandia is bad for Diabetes, the statement has a higher value of certainty as opposed to an individual Dr. X

Pronto operates with intervals (see the previous bullet), but it doesn't assume any bias in the distribution of actual probabilities within the intervals. In other words, if the probability that A is subsumed by B is within [l, u], it can be anywhere between _l_ and _u_, and the representation language does not allow one to specify whether it is closer to _l_ or to _u_. Equipping Pronto with some predefined set of distributions, e.g. normal, might be an interesting problem. Obviously, it won't affect satisfiability and consistency, but we might be able to compute something like expectations for inferences.
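
As a rough illustration of what "expectations for inferences" could look like, the sketch below computes the expected probability within an interval under two assumed distributions. Pronto supports nothing of this kind today, and the choice of distributions is mine: I use uniform and triangular rather than normal simply because they are bounded to the interval, and the function names are hypothetical.

```python
def expected_uniform(lower: float, upper: float) -> float:
    """Expected value if the probability is uniformly distributed on [lower, upper]."""
    if not 0.0 <= lower <= upper <= 1.0:
        raise ValueError("expected 0 <= lower <= upper <= 1")
    return (lower + upper) / 2.0


def expected_triangular(lower: float, mode: float, upper: float) -> float:
    """Expected value of a triangular distribution on [lower, upper] peaking at `mode`."""
    if not 0.0 <= lower <= mode <= upper <= 1.0:
        raise ValueError("expected 0 <= lower <= mode <= upper <= 1")
    return (lower + mode + upper) / 3.0


# The two-lab interval [0.8, 0.9] with no bias towards either lab.
unbiased = expected_uniform(0.8, 0.9)
# The same interval, assuming the more trusted source reported 0.9.
biased = expected_triangular(0.8, 0.9, 0.9)
```

Placing the mode at the value reported by the more trusted source is one simple way to say that the true probability is closer to _u_ than to _l_; the interval itself, and hence satisfiability and consistency, stays unchanged.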

# Data Uncertainty

* Approximate location of a clinical feature, e.g., tumor in spatial location in the human body as captured in radiological image or any other digital artifact
* Data inconsistency and incompleteness encountered in Healthcare and Drug Databases

This is again related to learning certainty intervals from data that can be incomplete: say, in a relational table some attribute values are either unknown (missing) or unreliable. One way of computing probabilities from such data is to use an approach similar to rule induction from incomplete information tables (see http://lightning.eecs.ku.edu/c95-perugia.pdf). Basically, if we could come up with approximations of classes based on the data we have, we could approximate the subsumption relationship between them and represent the lower and upper approximations as probabilities. After that, Pronto should be able to reason.
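
Here is a minimal sketch of how an interval could be derived from a table with missing values. It is not the rule-induction algorithm from the linked paper, just a worst-case/best-case count over all possible completions of the unknown cells. The row format and the name `conditional_bounds` are assumptions made for illustration.

```python
from typing import Iterable, Optional, Tuple

# (member of evidence class?, member of conclusion class?); None means unknown.
Row = Tuple[Optional[bool], Optional[bool]]


def conditional_bounds(rows: Iterable[Row]) -> Optional[Tuple[float, float]]:
    """Bounds on the frequency of the conclusion among evidence rows.

    Returns None if no row is definitely in the evidence class.
    """
    rows = list(rows)
    if not any(e is True for e, _ in rows):
        return None

    # Worst case: count only rows certainly in both classes, and put every
    # row that may be an evidence-only row into the denominator.
    low_num = sum(1 for e, d in rows if e is True and d is True)
    low_den = sum(1 for e, d in rows if e is True or (e is None and d is not True))

    # Best case: every row that may be in both classes is counted as such,
    # and only rows certainly in evidence-but-not-conclusion count against it.
    high_num = sum(1 for e, d in rows if e is not False and d is not False)
    high_den = high_num + sum(1 for e, d in rows if e is True and d is False)

    return low_num / low_den, high_num / high_den


table = [
    (True, True),
    (True, None),   # conclusion unknown
    (True, False),
    (None, True),   # evidence unknown
    (None, None),   # both unknown
    (False, True),
]
bounds = conditional_bounds(table)
```

The resulting pair can be read as the [l, u] of a conditional constraint (Conclusion|Evidence)[l, u]; with a complete table the two bounds coincide and the interval degenerates to a point.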

* Data uncertainty introduced due to sampling errors, sampling rates, etc.
* Data uncertainty introduced due to the limitations (least count error?) of the device measuring patient characteristics (e.g., temperature)
* Data uncertainty introduced due to limitation of instruments used to collect experimental data, e.g., micro-arrays

All of these are related to handling incompleteness. In general, this situation is often referred to as the _granularity_ of knowledge. It is barely feasible to capture knowledge at the finest level of granularity because of measurement limitations; that is one of the reasons we have intervals for probabilities, not just point probabilities. These questions aren't directly related to Pronto; they are more generic issues of quantifying and representing uncertainty, as well as of computing approximate concepts from uncertain data (see, e.g., http://www2.cs.uregina.ca/~yyao/concept_lattice/fca_app.pdf).