Difference between revisions of "Parzen Window Density Estimation" - Rhea

Revision as of 15:45, 29 April 2014

Parzen Window Density Estimation

A slecture by ECE student Ben Foster

Partly based on the ECE662 Spring 2014 lecture material of Prof. Mireille Boutin.

still working...

Coverage

Brief introduction to non-parametric density estimation, specifically Parzen windowing
Brief introduction to the theory that Parzen windowing is based on
Visualizations of Parzen windows and Parzen window-based classification

Introduction/Motivation

So far in our study of pattern recognition and classification we have primarily focused on the use of discriminant functions as a means of classifying data. That is, for a set of classes $\omega_c$ , we choose to classify a sample $\vec{X}_i$ to a class c if that class is most probable given what we know about the sample.

The phrase ``most probable" implies that we are dealing with probability distributions defined in the normal way, which is correct. As one might guess, the probability distributions that are used to map samples to classes are not always of immediately obvious character and/or easy to obtain. Maximum likelihood estimation (MLE) and Bayesian parameter estimation are fairly broad categories of methodologies that attempt to estimate the parameters of the underlying distributions of data, and the expectation-maximization (EM) algorithm is an oft-used particular method of estimating these parameters.

However, not all density-estimation methods are dependent on having prior knowledge of the distributions of the data to be classified. ``Non-parametric" methods eschew assumptions about the distribution of data to varying degrees, thus circumventing some of the issues associated with more Bayesian methods. Though there are a number of non-parametric density-estimation methods that are widely employed, this lecture will focus on one of the most popular: Parzen window estimation. The following survey of the method will hopefully shed some light on the pros and cons of the Parzen window method individually. Additionally, a direct application of Parzen window estimation to a classification problem will be briefly discussed.

Parzen Windows

Convergence to $p(\vec{X})$ - Are we actually constructing a p.d.f.?

The motivation for choosing a non-parametric method for estimating a density function is largely derived from a lack of prior information about the density function that corresponds to an experimenter's data. Without a substantial amount of information about the distribution of data (and conditional distributions of data belonging to each class) it is near impossible to do parametric density estimation, namely maximum likelihood estimation (MLE) and Bayesian parameter estimation (BPE). Recall that for MLE, the estimated parameter vector $\hat{\theta}$ corresponds to the value of $\vec{\theta}$ that maximizes the likelihood function, i.e.:

\hat{\theta} = \underset{\theta}{\operatorname{argmax}}\prod\limits_{k=1}^{n}p(\vec{X}_k | \vec{\theta})

Also recall that BPE involves using Bayes' rule to obtain the conditional distribution of the parameter vector $\theta$ given the data $\hat{X}$ :

P(\hat{\theta}|\vec{X}) = \frac{P(\vec{X}|\vec{\theta})P(\vec{\theta})}{\int P(\vec{X}|\vec{\theta})P(\vec{\theta})d\vec{\theta}}

Clearly, both of these methods depend strongly on knowledge of the conditional distributions of the data $\vec{X}$ which, again, is not always readily available. Additionally it is not necessarily true that the true distribution of a given dataset is described well by one of the commonly-used p.d.f.'s.

In both cases, it seems that one could be well-served to try to construct a p.d.f. based on the data at hand to circumvent these issues. If we do not already have an idea of the p.d.f. that describes the data, why can't do things backwards and use the data to construct a p.d.f.? However it is not immediately obvious that such a procedure gives a robust result (or is even possible).

It turns out that if we are careful in our definition, we can in fact construct such a p.d.f.. Chapter 4, Section 2 of Duda, Hart, and Stork's Pattern Classification [1] provides such a definition which is the basis for the following analysis. We start by observing very generally that the probability P that a vector $\vec{X}$ will lie in a region $\mathcal{R}$ is given by

P = \int_{\mathcal{R}} p(\vec{X})d\vec{X}

If we now draw n samples X $$ _1 $$ ,...,X $$ _n $$ , we can express the probability that k of these n fall in $\mathcal{R}$ with the binomial law:

P_k = {n\choose k}P^k(1-P)^{n-k}

We also know that for the binomial law the expected value for k is equal to the number of samples drawn multiplied by the probability of success, i.e.:

$ E[k] = nP $

which is easily rearranged into

P = \frac{k}{n}

We assume that the above is a good estimate for P since the binomial distribution is know to peak strongly near its mean. Now comes the key assumption for our derivation. Let us say that the region $\mathcal{R}$ is very small, so small that the function p does not vary within it in a significant way. Now we can define the voxel V as the volume enclosed by $\mathcal{R}$ which, along with our assumption about $\mathcal{R}$ 's size, allows us to write

\int_{\mathcal{R}} p(\vec{X})d\vec{X}\approx p(\vec{X})\cdot V

Now that we have the above expression in hand and know that P is equal to k/n, we can write our desired result:

p(\vec{X})\approx\frac{k/n}{V}

It is important to note that the above equation is an approximation. It is very difficult to establish true equality in the limit as V goes to zero. Since we would like to think of this density estimation practically, imagine that we have a finite number of samples n and take V to zero. Then there is an incredibly small chance that one of the samples actually falls within V, meaning that our estimate of $p(\vec{X})$ goes to zero. Even if a sample does lie within V, it is easy to see that in such a case $p(\vec{X})$ goes to infinity as V goes to zero. Neither of these cases is helpful in our search for the true $p(\vec{X})$ .

If we allow the number of samples n to go to infinity, however, we are able to establish convergence of the previous equation to equality. Let us rewrite the previous equation (with equality instead of approximation) as a sequence that changes with n:

p_n(\vec{X}) = \frac{k_n/n}{V_n}

This sequence only converges to $p(\vec{X})$ if three conditions are met:

$\underset{n\to\infty}{\lim}V_n = 0$
$\underset{n\to\infty}{\lim} k_n = \infty$
$\underset{n\to\infty}{\lim} k_n/n = 0$

We have already established that we need condition 1 to be true if we would like to establish true equality in our most recent equation. This condition is necessary so that we are in fact finding the exact value of $p(\vec{X})$ (not the average value over some space with finite volume). The second condition is needed so that $\frac{k_n}{n}$ converges to P, and the third condition ensures that the total number of samples k that lie in $\mathcal{R}$ are an infinitesimal fraction of the total number of samples n. One way of making sure that these three conditions are satisfied is by defining V in terms of n in such a way that V $$ _n $$ shrinks as n grows. One popular function that satisfies this is $V_n = \frac{1}{\sqrt{n}}$ ; this function helps to form the foundation of the Parzen window method that is popular for non-parametric density estimation.

Density Estimation - What does our p.d.f. look like?

Now that we have established that we can design a function that converges to the true p.d.f. of our data given a high enough number of samples (n), we would like to flesh out the bare-bones results of our analysis into a more practical form. As per the treatment in [1], let us claim that V $$ _n $$ is the volume corresponding to a hypercube with edge length h $$ _n $$ . Now we can define a window function $\varphi(\vec{X})$ such that

\varphi(\vec{X}) \geq 0

and

\int\varphi(\vec{X})d\vec{X} = 1

It should be clear that $\varphi(\vec{X})$ is a density function that has the same dimensionality as $\vec{X}$ .