
Probabilistic Inference

Given a graphical model for ${\bf X}$, we often wish to answer probabilistic queries of the form $p({\bf y}\vert{\bf z})$, where ${\bf Y}$ (the ``target'' variables) and ${\bf Z}$ (the ``input'' variables) are disjoint subsets of ${\bf X}$. This task is known in the graphical modeling community as probabilistic inference. An important special case of probabilistic inference is the determination of $p({\bf x})$ for given instances ${\bf x}$ of ${\bf X}$--a key component of density estimation, which has numerous applications.

Given a consistent dependency network for ${\bf X}$, we can perform probabilistic inference by converting it to a Markov network, triangulating that network (forming a decomposable graphical model), and then applying one of the standard algorithms for probabilistic inference in the latter representation--for example, the junction tree algorithm of Jensen, Lauritzen, and Olesen (1990). Alternatively, we can use Gibbs sampling (e.g., Geman and Geman, 1984; Neal, 1993; Besag, Green, Higdon, and Mengersen, 1995; Gilks, Richardson, and Spiegelhalter, 1996), which we examine in some detail.

First, let us consider the use of Gibbs sampling for recovering the joint distribution $p({\bf x})$ of a consistent dependency network for ${\bf X}$. In one simple version of Gibbs sampling, we initialize each variable to some arbitrary value. We then repeatedly cycle through each variable $X_1, \ldots, X_n$, in this order, and resample each $X_i$ according to $p(x_i\vert{\bf x}\setminus x_i)=p(x_i\vert{\bf pa}_i)$. We call this procedure an ordered Gibbs sampler. As described by the following theorem, this ordered Gibbs sampler recovers the joint distribution for ${\bf X}$.
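To make the procedure concrete, the following is a minimal Python sketch of an ordered Gibbs sampler for a hypothetical consistent dependency network over two binary variables. The joint distribution below is an illustrative choice of ours, not from the text; its two conditionals $p(x_1\vert x_2)$ and $p(x_2\vert x_1)$ play the role of the local distributions.

```python
import random
from collections import Counter

random.seed(0)

# Hypothetical joint p(x1, x2) over two binary variables. A consistent
# dependency network for it stores the conditionals p(x1 | x2), p(x2 | x1).
joint = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.4}

def conditional(i, other_value):
    """p(X_i = 1 | X_other = other_value), read off the joint."""
    num = sum(p for s, p in joint.items() if s[i] == 1 and s[1 - i] == other_value)
    den = sum(p for s, p in joint.items() if s[1 - i] == other_value)
    return num / den

# Ordered Gibbs sampler: cycle through X1, X2 in this order, resampling
# each variable from its local distribution given the current state.
x = [0, 0]                       # arbitrary initial state
counts = Counter()
burn_in, n_samples = 1000, 200_000
for t in range(burn_in + n_samples):
    for i in (0, 1):
        x[i] = 1 if random.random() < conditional(i, x[1 - i]) else 0
    if t >= burn_in:
        counts[tuple(x)] += 1

# The empirical distribution of the kept samples approximates p(x).
estimate = {s: c / n_samples for s, c in counts.items()}
for s in sorted(joint):
    print(s, round(estimate.get(s, 0.0), 3), joint[s])
```

Because the local distributions here are the exact conditionals of a single positive joint, the network is consistent, and the empirical frequencies converge to that joint.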


Theorem 2: An ordered Gibbs sampler applied to a consistent dependency network for ${\bf X}$, where each $X_i$ is finite (and hence discrete) and each local distribution $p(x_i\vert{\bf pa}_i)$ is positive, defines a Markov chain with a unique stationary joint distribution for ${\bf X}$ equal to $p({\bf X})$ that can be reached from any initial state of the chain.


Proof: Let ${\bf x}^t$ be the sample of ${\bf x}$ after the $t^{th}$ iteration of the ordered Gibbs sampler. The sequence ${\bf x}^1,{\bf x}^2,\ldots$ can be viewed as samples drawn from a homogeneous Markov chain with transition matrix ${\bf P}$ having elements ${\bf P}_{ij}= p({\bf x}^{t+1}=j\vert{\bf x}^{t}=i)$. The matrix ${\bf P}$ is the product ${\bf P}^1 \cdot {\bf P}^2 \cdot \ldots \cdot {\bf P}^n$, where ${\bf P}^k$ is the ``local'' transition matrix describing the resampling of $X_k$ according to the local distribution $p(x_k\vert{\bf pa}_k)$. The positivity of the local distributions guarantees the positivity of ${\bf P}$, which in turn guarantees the irreducibility of the Markov chain. Consequently, there exists a unique joint distribution that is stationary with respect to ${\bf P}$. The positivity of ${\bf P}$ also guarantees that this stationary distribution can be reached from any starting point. That this stationary distribution is equal to $p({\bf x})$ is proved in the Appendix. $\Box$
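The structure of this argument can be checked numerically on a small example. The sketch below (our own construction, using a hypothetical two-variable joint) builds the local transition matrices ${\bf P}^1$ and ${\bf P}^2$, forms their product ${\bf P}$, and power-iterates a row vector to confirm that the unique stationary distribution of ${\bf P}$ is the joint $p({\bf x})$.

```python
import itertools

# Hypothetical positive joint over two binary variables (illustration only).
joint = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.4}
states = list(itertools.product((0, 1), (0, 1)))

def cond(i, val_i, val_other):
    """p(X_i = val_i | X_other = val_other), read off the joint."""
    num = sum(p for s, p in joint.items() if s[i] == val_i and s[1 - i] == val_other)
    den = sum(p for s, p in joint.items() if s[1 - i] == val_other)
    return num / den

def local_matrix(i):
    """P^i: resample X_i from its conditional; the other variable is unchanged."""
    return [[cond(i, t[i], s[1 - i]) if s[1 - i] == t[1 - i] else 0.0
             for t in states] for s in states]

def matmul(A, B):
    return [[sum(A[r][k] * B[k][c] for k in range(len(B)))
             for c in range(len(B[0]))] for r in range(len(A))]

# P = P^1 . P^2, one full sweep of the ordered Gibbs sampler.
P = matmul(local_matrix(0), local_matrix(1))

# Power-iterate pi <- pi P from a uniform start; positivity of P makes the
# fixed point unique and reachable from any initial distribution.
pi = [0.25] * 4
for _ in range(200):
    pi = [sum(pi[r] * P[r][c] for r in range(4)) for c in range(4)]

print([round(v, 4) for v in pi])  # converges to [0.3, 0.1, 0.2, 0.4] = p(x)
```

Note that every entry of ${\bf P}$ is positive here: one sweep can move from any state $(a,b)$ to any state $(a',b')$ with probability $p(a'\vert b)\,p(b'\vert a') > 0$, which is the irreducibility used in the proof.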


After a sufficient number of iterations, the samples in the chain will be drawn from the stationary distribution for ${\bf X}$. We use these samples to estimate $p({\bf X})$. Much has been written about how long to run the chain before keeping samples (the ``burn in'') and how many samples to use to obtain a good estimate (e.g., Gilks, Richardson, and Spiegelhalter, 1996). We do not discuss the details here.

Next, let us consider the use of Gibbs sampling to compute $p({\bf y}\vert{\bf z})$ for particular instances ${\bf y}$ and ${\bf z}$, where ${\bf Y}$ and ${\bf Z}$ are arbitrary disjoint subsets of ${\bf X}$. A naive approach uses an ordered Gibbs sampler directly. During the iterations, only samples of ${\bf X}$ where ${\bf Z}={\bf z}$ are used to compute the conditional probability. An important difficulty with this approach is that if either $p({\bf y}\vert{\bf z})$ or $p({\bf z})$ is small (a common occurrence when ${\bf Y}$ or ${\bf Z}$ contain many variables), many iterations are required for an accurate probability estimate.

A well-known approach for estimating $p({\bf y}\vert{\bf z})$ when $p({\bf z})$ is small is to fix ${\bf Z}={\bf z}$ during ordered Gibbs sampling. It is not difficult to generalize the proof of Theorem 2 to show that this modified ordered Gibbs sampler has a stationary distribution and yields the correct conditional probability given a consistent dependency network.

When ${\bf y}$ is rare because ${\bf Y}$ contains many variables, we can use the independencies encoded in a dependency network along with the law of total probability to decompose the inference task into a set of inference tasks on single variables. For example, consider the dependency network $\left[ X_1 \ \ X_2 \leftrightarrow X_3 \right]$. Given the independencies specified by this network, we have

\begin{displaymath}
p(x_1,x_2,x_3) = p(x_1) \ p(x_2) \ p(x_3\vert x_2),
\end{displaymath}

and can obtain $p(x_1,x_2,x_3)$ by computing each term separately. The determination of the first term requires no Gibbs sampling--the distribution can be read directly from the local distribution for $X_1$ in the dependency network. The remaining two terms can be determined using modified ordered Gibbs samplers, each with a singleton target ($x_2$ and $x_3$, respectively). Note that this approach has the additional advantage that some terms may be obtained by direct lookup, thereby avoiding some Gibbs sampling. In general, given a consistent dependency network for ${\bf X}$ and disjoint sets of variables ${\bf Y}\subseteq {\bf X}$ and ${\bf Z}\subseteq {\bf X}$, we can obtain $p({\bf y}\vert{\bf z})$ for a particular instance of ${\bf y}$ and ${\bf z}$ as follows:




Algorithm 1:

1   ${\bf U}:= {\bf Y}$ (* the unprocessed variables *)
2   ${\bf P}:= {\bf Z}$ (* the processed and conditioning variables *)
3   ${\bf p}:= {\bf z}$ (* the values for ${\bf P}$ *)
4   While ${\bf U}\neq \emptyset$
5       Choose $X_i \in {\bf U}$ such that $X_i$ has no more parents in ${\bf U}$ than any variable in ${\bf U}$
6       If all the parents of $X_i$ are in ${\bf P}$
7           $p(x_i\vert{\bf p}) := p(x_i\vert{\bf pa}_i)$
8       Else
9           Use a modified ordered Gibbs sampler to determine $p(x_i\vert{\bf p})$
10      ${\bf U}:= {\bf U}- X_i$
11      ${\bf P}:= {\bf P}+ X_i$
12      ${\bf p}:= {\bf p}+ x_i$
13  Return the product of the conditionals $p(x_i\vert{\bf p})$


The key step in this algorithm is step 7, which bypasses Gibbs sampling when it is not needed. This step is justified by Equation 1.
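The following is a minimal Python sketch of Algorithm 1 for the three-variable network $\left[ X_1 \ \ X_2 \leftrightarrow X_3 \right]$ discussed above. The local distributions are hypothetical choices of ours, and helper names such as `gibbs_prob` and `algorithm1` are illustrative, not from the paper.

```python
import random

random.seed(1)

# Hypothetical consistent dependency network [X1  X2 <-> X3] over binary
# variables, built from the joint p(x1) p(x2) p(x3 | x2) below.
p_x1 = {0: 0.6, 1: 0.4}
p_x2 = {0: 0.7, 1: 0.3}
p_x3_given_x2 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}

parents = {0: (), 1: (2,), 2: (1,)}   # X2 and X3 are each other's parents

def p_x2_given_x3(x2, x3):
    """p(x2 | x3), derived from the joint so the network is consistent."""
    den = sum(p_x2[v] * p_x3_given_x2[v][x3] for v in (0, 1))
    return p_x2[x2] * p_x3_given_x2[x2][x3] / den

def local(i, x):
    """Local distribution p(x_i | pa_i) evaluated at full state x."""
    if i == 0:
        return p_x1[x[0]]
    if i == 1:
        return p_x2_given_x3(x[1], x[2])
    return p_x3_given_x2[x[1]][x[2]]

def gibbs_prob(i, target, fixed, iters=50_000, burn=500):
    """Modified ordered Gibbs sampler: variables in `fixed` are clamped;
    estimates p(X_i = target | fixed)."""
    x = [fixed.get(k, 0) for k in range(3)]
    hits = 0
    for t in range(burn + iters):
        for k in range(3):
            if k in fixed:
                continue
            x[k] = 1
            p1 = local(k, tuple(x))          # p(X_k = 1 | rest)
            x[k] = 1 if random.random() < p1 else 0
        if t >= burn and x[i] == target:
            hits += 1
    return hits / iters

def algorithm1(y_assign, z_assign):
    """Algorithm 1: p(y | z) as a product of single-variable conditionals,
    using direct lookup (step 7) whenever all parents are processed."""
    U = set(y_assign)                        # unprocessed target variables
    fixed = dict(z_assign)                   # processed + conditioning values
    prob = 1.0
    while U:
        # step 5: pick the variable with fewest parents remaining in U
        i = min(sorted(U), key=lambda j: sum(1 for q in parents[j] if q in U))
        if all(q in fixed for q in parents[i]):
            x = [0, 0, 0]                    # step 7: direct lookup
            for k, v in fixed.items():
                x[k] = v
            x[i] = y_assign[i]
            term = local(i, tuple(x))
        else:
            term = gibbs_prob(i, y_assign[i], dict(fixed))  # step 9
        prob *= term
        U.remove(i)
        fixed[i] = y_assign[i]
    return prob

# p(x1=1, x2=1, x3=1): only the x2 term needs Gibbs sampling; the x1 and
# x3 terms are read directly from the local distributions.
print(round(algorithm1({0: 1, 1: 1, 2: 1}, {}), 3))
```

In this run, the true value is $p(x_1{=}1)\,p(x_2{=}1)\,p(x_3{=}1\vert x_2{=}1) = 0.4 \times 0.3 \times 0.8 = 0.096$, and only the middle term requires sampling, illustrating the lookup shortcut of step 7.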


Journal of Machine Learning Research, 2000-10-19