I guess it kinda makes sense to try to summarise what I've been doing for the last two years. I'm not expecting anyone to seriously read this (since the only people who are supposed to be interested are my advisers but they know all that stuff anyway) but it may help me sorta look back and re-assess my efforts.
So in Summer 2008 it became fairly obvious that we needed a principally new approach to make Pronto usable (I'm not using the word *scalable* here). At that time Pronto was able to deal with 15-18 probabilistic statements. There was no hope to handle more because the linear systems constructed in the process of reasoning were *always* exponential. So it wasn't a problem of solving them, it was a problem of even representing them in memory. All attempts to shrink their size appeared futile.
But then we came across column generation. Its idea is that one does not need the entire linear system to be computed before solving it. Simply put, you get some subset of variables, solve the restricted system, check if it's optimal (i.e. if the solution is the same as would be for the full system) and if not - you generate and add new columns. Otherwise you stop. Obviously this may work iff there's a hope of stopping early, i.e. before you generate an exponential number of variables. Fortunately, the theory of LP gives us that hope: any optimal solution contains only a polynomial number of non-zero variables. Since a variable being zero is the same as non-existent variable, we only need to generate a polynomial number of variables. In the ideal world.
In the real world there're two problems. First, you don't know which variables you need. Second, it's hard to generate those columns which are good (or are likely to be good). There's a whole theory of column generation talking about how to solve these two problems. We'll call the first problem The Convergence Problem (CP) and the second - The Column Generation Problem (CGP). One thing to notice here is that they are related: the better you solve CGP (i.e. the better columns you generate), the earlier you'll stop (i.e. the convergence is faster).
So we started looking into column generation in Fall 2008. At first it wasn't clear what those columns (or variables) represent. They actually correspond to *possible worlds* over which probabilistic models are defined. So the CGP problem could be formulated as the problem of generating possible worlds which are likely to have non-zero probability for a given KB and the query (entailment).
At that point I stumbled at the first major difficulty: I needed to generate not just worlds (which are basically conjunctive class expressions over the relevant signature) but *possible* worlds or expressions that would be satisfiable given the KB. Well, a column is an array of rationals (i.e. vector of coefficients). It corresponds to a class expression. Formulating constraints on those rationals which would ensure satisfiability of the expression is NOT easy. Their structure depends on the KB and it was unclear if they could be captured using a set of linear inequalities.
But we found the way to go. The basic idea was that we don't necessarily need all those constraints in advance. We could generate a column, get the corresponding class expression and check its satisfiability. If it's not satisfiable then we could find out why (i.e. get the combination of rationals that cause the unsatisfiability) and prohibit such combination using a linear inequality. This is our hybrid "generate-n-check" column generation algorithm.
The second big headache was the CGP itself. It was originally formulated a constraint optimization problem (COP). The good news was that its size was polynomial in the number of probabilistic statements. The bad news was that it was still too computationally hard. Our constraint solver never got sufficiently scalable despite using all sorts of advanced algorithms (like variations of the Russian Doll Search, partitioning, etc.). Bounding in the branch-n-bound algorithm was crap. Pronto could scale up to 50 constraints or so. Still a nice improvement over 15-18 but pretty frustrating given 3-4 months that went into it. Spring 2009 looked pretty bleak.
And then came the break through. I'm deeply grateful to Dr. Paul Rubin - an operation research scientist - whose invaluable help on the Google op-research board led me to the idea of finding a better formulation for the CP problem. Something easier than COP. More scalable. Something for which good algorithms and tools were available. Like MILP (mixed-integer linear programming) with all the highly optimized solvers, like CPLEX, GLPK, MOSEK, GAMS, etc-etc.
It felt great and weird. Great because Pronto started to fly (er.. comparatively to the previous versions). Weird, because all my COP efforts were worthless (yeah, got skills and experience but still...)
There was still a lot work since then. We've implemented inconsistency analysis and pinpointing algorithms which would find all minimal sets of inconsistent probabilistic statements (a non-trivial stuff). We've improved the non-monotonic entailment algorithms. We've put a lot of efforts into solving the convergence problem (which still kicks my ass, to be honest). We've investigated theoretical aspects of P-SHIQ (our probabilistic logic) and showed its relation to the standard first-order probabilistic logic.
I should find time to blog on all these things. That's all for now.
2 comments:
Great post; good to catch up on all this work. The way you tell the story of the work it has a really nice, engaging, almost suspenseful quality.... Looking forward to more.
Good review. Two years in >2000 words ;)
Post a Comment