Parzen Window Density Estimation
A slecture by ECE student Ben Foster
Partly based on the ECE662 Spring 2014 lecture material of Prof. Mireille Boutin.
Coverage
- Brief introduction to non-parametric density estimation, specifically Parzen windowing
- Brief introduction to the theory that Parzen windowing is based on
- Visualizations of Parzen windows and Parzen window-based classification
- Brief discussion of the advantages and disadvantages of the Parzen windowing method
Introduction/Motivation
So far in our study of pattern recognition and classification we have primarily focused on the use of discriminant functions as a means of classifying data. That is, for a set of classes $ \omega_c $, we classify a sample $ \vec{X}_i $ to class $ \omega_c $ if that class is the most probable given what we know about the sample.
The phrase "most probable" implies that we are dealing with probability distributions defined in the usual way, which is correct. As one might guess, the probability distributions used to map samples to classes are not always obvious in form and/or easy to obtain. Maximum likelihood estimation (MLE) and Bayesian parameter estimation are fairly broad categories of methodologies that attempt to estimate the parameters of the underlying distributions of the data, and the expectation-maximization (EM) algorithm is an oft-used particular method of estimating these parameters.
However, not all density-estimation methods depend on prior knowledge of the distributions of the data to be classified. "Non-parametric" methods eschew assumptions about the distribution of the data to varying degrees, thus circumventing some of the issues associated with parametric methods. Though a number of non-parametric density-estimation methods are widely employed, this lecture will focus on one of the most popular: Parzen window estimation. The following survey will hopefully shed some light on the pros and cons of the Parzen window method in particular, as well as the advantages and disadvantages of non-parametric estimation in general. Additionally, a direct application of Parzen window estimation to a classification problem will be presented and discussed.
Parzen Windows
- Convergence to $ p(\vec{X}) $ - Are we actually constructing a p.d.f.?
The motivation for choosing a non-parametric method for estimating a density function largely derives from a lack of prior information about the density function that corresponds to an experimenter's data. Without a substantial amount of information about the distribution of the data (and the conditional distributions of the data belonging to each class), it is nearly impossible to do parametric density estimation, namely maximum likelihood estimation (MLE) and Bayesian parameter estimation (BPE). Recall that for MLE, the estimated parameter vector $ \hat{\theta} $ corresponds to the value of $ \vec{\theta} $ that maximizes the likelihood function, i.e.:

$ \hat{\theta} = \underset{\vec{\theta}}{\operatorname{argmax}} \; p(\vec{X}_1, \dots, \vec{X}_n \,|\, \vec{\theta}) $
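As a quick numeric illustration of MLE (a hedged sketch, not part of the original lecture: it assumes Gaussian data, and all variable names are illustrative), the Gaussian MLEs for the mean and variance are the sample mean and the biased sample variance:

```python
import numpy as np

# Sketch: for n i.i.d. Gaussian samples, the MLE of the mean is the
# sample mean and the MLE of the variance is the (biased) sample
# variance.  The true parameters below are illustrative choices.
rng = np.random.default_rng(0)
true_mu, true_sigma = 2.0, 1.5
X = rng.normal(true_mu, true_sigma, size=10_000)

mu_hat = X.mean()                     # argmax of the likelihood in mu
var_hat = ((X - mu_hat) ** 2).mean()  # MLE of sigma^2 (divides by n, not n-1)

print(mu_hat, var_hat)  # should land near 2.0 and 2.25
```

With enough samples the estimates concentrate around the true parameters, which is exactly the kind of guarantee we lose when the parametric family is unknown.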
Also recall that BPE involves using Bayes' rule to obtain the conditional distribution of the parameter vector $ \vec{\theta} $ given the data $ \vec{X}_1, \dots, \vec{X}_n $:

$ p(\vec{\theta} \,|\, \vec{X}_1, \dots, \vec{X}_n) = \frac{p(\vec{X}_1, \dots, \vec{X}_n \,|\, \vec{\theta}) \, p(\vec{\theta})}{\int p(\vec{X}_1, \dots, \vec{X}_n \,|\, \vec{\theta}) \, p(\vec{\theta}) \, d\vec{\theta}} $
Clearly, both of these methods depend strongly on knowledge of the conditional distributions of the data $ \vec{X} $, which, again, is not always readily available. Additionally, it is not necessarily true that the true distribution of a given dataset is well described by one of the commonly used p.d.f.'s.
In both cases, it seems that one would be well served to construct a p.d.f. from the data at hand and thereby circumvent these issues. If we do not already have an idea of the p.d.f. that describes the data, why can't we work backwards and use the data to construct a p.d.f.? However, it is not immediately obvious that such a procedure gives a robust result (or is even possible).
It turns out that if we are careful in our definition, we can in fact construct such a p.d.f. Chapter 4, Section 2 of Duda, Hart, and Stork's Pattern Classification [1] provides such a definition, which is the basis for the following analysis. We start by observing very generally that the probability P that a vector $ \vec{X} $ will lie in a region $ \mathcal{R} $ is given by

$ P = \int_{\mathcal{R}} p(\vec{X}) \, d\vec{X} $
If we now draw n samples $ \vec{X}_1, \dots, \vec{X}_n $, we can express the probability that k of these n fall in $ \mathcal{R} $ with the binomial law:

$ P_k = \binom{n}{k} P^k (1-P)^{n-k} $
We also know that for the binomial law the expected value of k is equal to the number of samples drawn multiplied by the probability of success, i.e.:

$ E[k] = nP $
which is easily rearranged into

$ P = \frac{E[k]}{n} \approx \frac{k}{n} $
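The claim that the relative frequency k/n is a good estimate of P can be checked numerically. A minimal sketch, assuming a standard normal density and the illustrative region $ \mathcal{R} = [0, 0.5] $:

```python
import numpy as np
from math import erf, sqrt

# Sketch: draw n samples from N(0, 1) and count the k that fall in the
# region R = [0, 0.5]; the fraction k/n should concentrate near
# P = integral of p over R.  Region and sample count are illustrative.
rng = np.random.default_rng(1)
n = 100_000
X = rng.normal(size=n)

k = np.count_nonzero((X >= 0.0) & (X <= 0.5))
P_true = 0.5 * erf(0.5 / sqrt(2))  # Phi(0.5) - Phi(0) for N(0, 1)

print(k / n, P_true)  # both near 0.19
```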
We assume that k/n is a good estimate of P since the binomial distribution is known to peak strongly near its mean. Now comes the key assumption of our derivation. Let us say that the region $ \mathcal{R} $ is very small, so small that the function p does not vary within it in a significant way. If we define V as the volume enclosed by $ \mathcal{R} $, then our assumption about $ \mathcal{R} $'s size allows us to write

$ P = \int_{\mathcal{R}} p(\vec{X}) \, d\vec{X} \simeq p(\vec{X}) V $
Now that we have the above expression in hand and know that P is approximately equal to k/n, we can write our desired result:

$ p(\vec{X}) \simeq \frac{k/n}{V} $
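Putting the pieces together, the estimate $ p(\vec{X}) \simeq (k/n)/V $ can be computed directly from samples. A minimal 1-D sketch, assuming standard normal data; the window width h and all names are illustrative choices:

```python
import numpy as np

# Sketch: estimate a 1-D density at a point x0 by counting the k of n
# samples that fall in a small window of width h centered at x0 (so the
# "volume" V equals h in one dimension), then forming (k/n)/V.
rng = np.random.default_rng(2)
n = 200_000
X = rng.normal(size=n)  # samples from N(0, 1)

def density_estimate(x0, h):
    """(k/n)/V with a centered window of width h (V = h in 1-D)."""
    k = np.count_nonzero(np.abs(X - x0) <= h / 2)
    return (k / n) / h

p_hat = density_estimate(0.0, 0.1)
p_true = 1 / np.sqrt(2 * np.pi)  # true N(0, 1) density at 0, ~0.399

print(p_hat, p_true)
```

This hard-edged counting window is the simplest instance of the idea; the Parzen window method generalizes it by replacing the indicator count with a smooth kernel function.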