The techniques of a-posteriori image restoration and iterative image feature extraction are described and compared. Image feature extraction methods known as Graduated Non-convexity (GNC), Variable Conductance Diffusion (VCD), Anisotropic Diffusion, and Biased Anisotropic Diffusion (BAD) which extract edges from noisy images, are compared with a restoration/feature extraction method known as Mean Field Annealing(MFA). All are shown to be performing the same basic operation: image relaxation. This equivalence shows the relationship between energy minimization methods and spatial analysis methods and between their respective parameters of temperature and scale. As a result of the equivalence, VCD is demonstrated to minimize a cost function, and that cost is specified explicitly. Furthermore, operations over scale space are shown to be a method of avoiding local minima.
We represent an image by a random process, F, which generates images, and we assume that some image, f, was generated by this process.We now measure that image with some imperfect imaging modality. Noise corrupts this measurement, to produce a measured image g. Given this measured image, we then seek the image, f, which maximizes the posterior conditional probability
. We will refer to our estimate of f by
. In order to solve this problem, we make use of Bayes' rule
where the denominator, Z, is independent of f and therefore does not affect the maximization process. We refer to p(f|g) as the "conditional density" and to P(f) as the "prior".
For purposes of this discussion, assume the only corruption is additive Gaussian noise with zero mean and variance σ2. Since the difference between pixels fi and gi can only be due to the noise, the conditional density will be the probability of (fi-gi), simply the probability of the noise:
Now assuming the noise is independent, we find for the entire image,
Finding the estimated image
, which maximizes this density, is a trivial process. Clearly,
= g is maximal. In more complex restoration problems, we need to consider blur, and in that case, replace fi in (3) with the result of a distortion operating on the image at that point. We do not consider that case here.
The conditions under which an image may be represented as a Markov Random Field are well documented[1,8,15]. This representation allows us to write a Gibbs distribution for the prior probability of an image f:
where the sum is taken over "cliques", sets of connected pixels.We define the cliques to be adjacent pairs of pixels, horizontal and vertical.
The problem of noise removal can now be seen to be equivalent to a "Maximum a-Posteriori" (MAP) restoration problem (once we choose a suitable form for Vc) in which the only corrupting process is noise. Substituting (3) and (4) into equation (1), taking logarithms, and changing the sign, we produce a minimization problem: Find the f which minimizes the sum of the noise and prior log probabilities:
We satisfy the desire for edge-preserving smoothing by choosing a suitable form for Vc. For cliques
, we choose
noting that (except for pixels on the boundary), any one pixel in the image will occur in exactly four cliques. Furthermore, the potential function, Vc, associated with each clique is symmetric. We may therefore change the index set from cliques to pixels,
where
represents the neighborhood of pixel i; we ignore the factor of 2 that this introduces since in practice this term is multiplied by an application-specific coefficient.
The form of Eq. (7) for the prior potential is thoroughly justified in the literature. Let Λi represent some scalar measure of the brightness variation about pixel i. Then our penalty function follows a recommendation of Besag [1] "to encourage smooth variation," which says that Vi "should be strictly increasing" in the absolute value of Λι and if "occasional abrupt changes" are expected, it should "quickly reach a maximum," a criterion also used by Geman and McClure[9] and by Hebert and Leahy [13]. Finally, Blake and Zisserman [6] use the "clipped parabola" of Figure 1 which is of the same basic form.
It is comforting that the result of the MAP formulation results in an objective function so intuitively correct: the "noise term", Hn, simply says the restored image
should resemble the measured image g. The prior term simply says the restored image should be smooth except for abrupt discontinuities at edges.
A difficulty which arises immediately from a form such as Figure 1 or Eq. (7) is that the resulting objective function is not convex, and a simple minimization algorithm such as gradient descent is not guaranteed to converge to a global minimum. Blake and Zisserman solve this problem by approximating their prior with a piecewise-smooth function which depends on a parameter, p. Reducing p from 1 to 0 steadily changes the approximating function until it becomes precisely equal to the desired function. This produces a family of prior energies, illustrated in Figure 1.
The process of gradually reducing p is referred to as "Graduated non-convexity", and begins by minimizing a function which is convex, and therefore has a unique minimum. Then, from that minimum, the local minimum is tracked continuously as p is reduced from 1 to 0.
Mean Field Annealing (MFA) is a technique for finding a good minimum of complex functions which typically have many minima. The mean field approximation of statistical mechanics allows a continuous representation of the energy states of a collection of particles. In the same sense, MFA approximates the stochastic algorithm called "simulated annealing"(SA) which has been shown to converge in probability to the global minimum, even for non-convex problems[8]. MFA is derived using the analogy to physics[7] in slightly different ways in [14], which applies the technique to restoration of piecewise-constant images, and in [5], where the technique is applied to images with varying gray values. In [4], it is shown that the analogy to physics is unnecessary, and that MFA is derivable from more fundamental information-theoretic considerations.
The non-convexity issue is dealt with in MFA in a similar way to that of GNC, by adjusting a parameter, the τ of Eq. (7). Because of the analogy to statistical mechanics, τ is referred to as "temperature"(1). The process of reducing τ deforms the objective function (a homotopy) and the minimum is tracked across the resulting set of functions, avoiding most local minima.
The reader is referred to the above references for the derivation of MFA. Furthermore, since GNC and MFA can be shown to perform equivalent computations[3], we now focus only on MFA to understand its behavior more thoroughly and relate it to diffusion. We begin by noting that the argument of the exponent in Eq. (7) is an approximation to a first spatial derivative. Since we may wish to be more general than a simple first derivative, we replace fi -fj with
denoting a convolution with some edge-sensitive kernel h (e.g. derivative of a Gaussian) centered at pixel i.
We may then rewrite the prior term as
(8)
.
The kernel h may be chosen to emphasize problem-dependent image characteristics. This general form has been used to remove noise from piecewise-constant[14] and piecewise-linear[5,2] images.
Consider only the prior term. To perform gradient descent, we observe that
where hrev denotes the mirror image of the kernel h. Remembering that convolution with h is used to estimate the derivative of f, we write
for
resulting in
(10)
.
In the equation above, we have lumped the constants together into κ; set the annealing control parameter, τ, to 1 for clarity; and have made use of the observation that for centered first derivative kernels,
. We will discuss the impact of τ in the next section.
Now consider the use of
in a gradient descent algorithm. In the simplest implementations of gradient descent, f is updated by
where fk denotes the value of f at iteration k, and where α is some small constant (or, in more sophisticated algorithms, a function of the Hessian of H). Rewriting (11),
we note that the right hand side of (12) then represents a change in f between iterations k and k+1, and in fact bears a strong resemblance to the form of the time derivative of f. We make this similarity explicit by defining that iteration k is calculated at time t and iteration k+1 calculated at time
. (In similar contexts, t is sometimes known as the "evolution parameter"). Since t is an artificially introduced parameter, without physically meaningful units, we may scale it by any convenient proportionality constant, and we have
where we have allowed the constant α to be renamed
. Substituting this (re)definition into (10), the derivative of the MFA prior term may be rewritten
(14)
.
Keeping Eq. (14) in mind, let us now digress to discuss a popular image feature extraction algorithm.
Variable Conductance Diffusion (VCD)[11,19,20] is a powerful method of image feature extraction in which blurring is allowed to occur except at edges. The term "edge" may be interpreted to mean other image features of interest. For example, Whitaker[21] operates on the gradient of the image rather than the image itself and allows smoothing to occur except where ridges (sharp changes of gradient direction) occur. This approach provides a robust way to extract a central axis of a region.
VCD operates by simulating the diffusion equation
where t is time, The diffusion equation models the flow of some quantity (the most commonly used example is heat) through a material with conductivity c. In the case of variable conductance diffusion, the conductance becomes a function of the spatial coordinates, in this instance parameterized by i, and is a property of the local image intensities.
To smooth except at edges, we let ci be small if i is an edge pixel, i.e., if a selected image property is locally nonuniform. If ci is small, (in the heat transfer analogy) little heat flows, and in the image, little smoothing occurs. If, on the other hand, ci is large, then much smoothing is allowed in the vicinity of pixel i.
5.0 Equivalence of MFA and VCD
Substituting
in Eq. (14) shows that Eq. (14) is precisely the form of the diffusion equation used in VCD [11,19,20]. We have thus shown an equivalence between VCD and MFA without the noise term, (the Hn of Eq. (5)). This equivalence provides for the union of two schools of thought concerning image analysis: The first (optimization) school considers the properties that an image ought to have. It then sets up an optimization problem whose solution is the desired image. The second (process) school is more concerned with determining the appropriate spatial analysis to apply. Adaptive filtering, diffusion, template matching, etc. are more concerned with the process itself than with its affect on some hypothetical "energy function" of the image. The result of this section demonstrates that for this particular form of edge-preserving smoothing at least, the two schools are precisely equivalent.
The above equivalence considered only the prior term of the MFA objective function. Addition of the noise term converts an image feature extraction algorithm into a constrained restoration algorithm. Nordström [19] also observed a similarity between diffusion techniques and regularization (optimization) methods. He noted that "the anisotropic diffusion method" (which Whitaker calls VCD) "does not seek an optimal solution of any kind." We have produced here an objective function for which VCD does provide the minimum even though it was not previously appreciated in VCD. Nordström then proceeds to "unify from the original outlook quite different methods of regularization and anisotropic diffusion." He defines a cost function whose behavior is an anisotropic diffusion, and argues for the necessity of adding a "stabilizing cost" to "restrict the space of possible estimated image functions." At this point the reader should not be surprised to learn that the form of the stabilized cost is
which we showed in Eq. (5) is a measure of the effect of Gaussian noise in a blur-free imaging system. Thus we see that Biased Anisotropic Diffusion (BAD)[19] may be thought of as an MAP restoration of an image. This observation now permits researchers in VCD/BAD to consider use of different stabilizing costs, if they have additional information concerning the noise generation process, (e.g., Poisson noise uses a different noise term [12].
A scale space representation may be defined as a 3D data structure in which level i is generated by convolving the original image with a Gaussian of variance σi2. This variance then become the "scale parameter." Clearly, at high levels of scale, (σi large), only the largest features are visible.
Whitaker [21] utilizes scale in a diffusion process by computing the conductance initially at high values of scale. He notes "By decreasing the scale at which the gradient is measured over time, one can obtain boundary information that reflects both small scale and large scale gradient information. This results in the multi-scale anisotropic diffusion equation
in which "G[σ]⊗" denotes convolution with a Gaussian kernel of a particular size σ(t), which generally decreases as the process evolves."
Recalling Eq. (8), we note that VCD over scale space adjusts a parameter, σ, and MFA likewise adjusts ("anneals") a parameter, τ, to avoid local minima. One immediately suspects a relationship between these two parameters.
It is difficult to determine the precise relationship between temperature and scale because of the nonlinear nature of the anisotropic diffusion. For short times and small regions however, it is possible to discover a useful relationship between the two. For a region of the image which is small enough to be approximately homogeneous in conductance and for time short enough that effects of edges do not propagate into the region of consideration, we can compare the two ideas quantitatively and gain some additional insight into both.
With the assumption above, after we run both algorithms for time δt, we find the resulting image is a Gaussian convolved with the original image f0:
From (14) and (15) we observe that the only difference between VCD and MFA is the annealing term τ in the conductance. For VCD, this is
and for MFA we have
and for MFA
Here, we observe a fundamental difference between these two algorithms: VCD is formulated in a way that derives its conductance from a scaled version of the original image, whereas the conductance in MFA is determined from the current image. However, we can relate these two processes by substituting (18) into (19):
Then, for MFA
where σm =
. And we note that (21) is identical to (19) except for the division by τ2.
Thus we see that the conductance in MFA can be related to a scaled version of the original image as in VCD. As τ is reduced during the annealing process, the conductance changes, just as it does in scale-space VCD, however, it changes in a data-dependent way, continually increasing in regions where
is small, that is, away from edges. However, whenever
is large, e.g. edges, the conductance rapidly falls off.
Although we have cast our results in terms of MFA, the results are more far-reaching than that particular algorithm. First, we have shown an equivalence between the problems of image optimization and diffusion. Similar observations have been made by Nordström[19]; Geiger and Yuille came to a similar conclusion [10] for an energy function requiring explicit line processes. Since in [14], we showed that the line processes are not required, their result may now be interpreted more generally. In an image optimization problem (in particular, restoration), one defines a criterion function which is to be minimized, and applies some minimization scheme to find a global (or at least a good) minimum. This paper has shown that an image restoration problem may be considered as a combination of desires: 1) to preserve the information in the original image, and produce a resulting image which resembles that original (or the result of an operator on it) in some way and 2) to possess certain properties, such a local smoothness except at boundaries.
Second, we have shown that scale space and annealing have much in common, and in some applications, they are essentially equivalent. This understanding applies not only to image optimization methods using MFA, but to other methods [6,8,22] as well.
Finally, we note that in [5], we demonstrated that MFA operations could be calculated by a two-layer, locally connected, recurrent neural network. From this paper one may then conclude that GNC and VCD are likewise implementable by straightforward neural networks.
The equivalence between diffusion and optimization is helpful in understanding the performance of both forms of algorithms. We find the fixed points and long term behavior of diffusion algorithms are clearer in the light of their relationship with a steadily decreasing objective functions; it is also useful to see what properties are actually being minimized in the development and application of an ad-hoc technique. When we design explicit optimization methods, we find that seeing the execution of the minimizer in terms of diffusion helps to illuminate the dynamic behavior of the optimizer and to improve the way it treats relations between spatial features and the time evolution of the algorithm.
Scale variation in spatial analysis algorithms and temperature control in annealing algorithms are closely related. Finally, nonlinear operations (here, exponentials) occurring in all the algorithms that were discussed appear to be absolutely essential to success. The Kolmogorov theorem [17] demonstrates sufficiency by stating that a linear operation followed by the appropriate nonlinearity allows computation of arbitrary mapping. Here, we hypothesize that such nonlinearity is also necessary for general approaches to image restoration. Such a relationship between sophisticated numerical information processing and nonlinear functions is supported by the numerous successes of neural network applications: neural networks also depend on nonlinear transfer functions.