(24 intermediate revisions by 2 users not shown) | |||
Line 1: | Line 1: | ||
<br> | <br> | ||
<center><font size="4"></font> | <center><font size="4"></font> | ||
− | <font size="4">Generation of | + | <font size="4">Generation of normally distributed random numbers from two categories with different priors </font> |
− | A [ | + | A [http://www.projectrhea.org/learning/slectures.php slecture] by Jonghoon Jin |
− | Partly based on the [[ | + | Partly based on the [[2014_Spring_ECE_662_Boutin_Statistical_Pattern_recognition_slectures|ECE662 Spring 2014 lecture]] material of [[User:Mboutin|Prof. Mireille Boutin]]. |
</center> | </center> | ||
---- | ---- | ||
Line 11: | Line 11: | ||
---- | ---- | ||
− | <font size="5"> | + | <font size="5">Source PDF file download</font> |
− | + | This page is built based on MediaWiki + LaTeX math package parser. Some of contents were modified due to its limited compatibility. The original PDF file was written in PDFTeX and can be downloaded from the following link. | |
− | + | <center>[[Media:slecture.pdf|Download PDF file generated by PDFTeX]]</center> | |
− | |||
− | + | <font size="5">Introduction</font> | |
− | + | ||
− | + | ||
− | + | In the most of pattern recognition, decision and classification problems, the concept of training becomes popular with the advance of Machine Learning and computing power that can afford to expensive computatioal costs. Such a trainable system is a record-holder for many open problems and its powerful capability is successfully built based on the massive volume of information in a dataset. Due to the importance of data, people not only try to obtain a big quantity of data, but also concentrate a good quality of data. Often, a real-world data contains either high variance or high bias hence it is difficult to estimate true density and using such data does not necessarily result in good performance. In the circumstance, a naive assumption about the class distribution helps us synthesize data so that we can train models with a consistent dataset. Among many distributions, Normal distribution is frequently used in many literatures. This tutorial will explain how to generate binary data classes from normal distributions with prior probabilities in a perspective on numerical experiments. | |
− | + | It consists of the following sections: | |
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | # Prior selection : How to set priors? | |
+ | # Normal distribution : How it work? Which is more efficient? | ||
+ | #* Central limit theorem | ||
+ | #* Inverse transform sampling method | ||
+ | #* Box-Muller transform | ||
+ | #* Ziggurat Algorithm | ||
+ | # Conversion from normalized distribution | ||
+ | # Conclusion | ||
<br> | <br> | ||
Line 40: | Line 40: | ||
To be specific, prior probability affects the generation of samples for binary classification in a way described in the following. First, sampling data from uniform distribution <math>(0, 1)</math> and let the class priors be <math>Prob(\omega_1)</math> and <math>Prob(\omega_2)</math> respectively where the condition <math>Prob(\omega_1) + Prob(\omega_2) = 1</math> holds. If a random sample is drawn in the range <math>[0, Prob(\omega_1)]</math>, label the sample as class 1, then, continue to generating a normal random number based on the class 1 statistics <math>(\mu, \sigma)</math>. Otherwise, the sample belongs to <math>[Prob(\omega_2), 1]</math> and should be labeled as class 2, then, move onto the normal random number generation step with the class 2 statistics like the same way as we did for class 1. In summary, | To be specific, prior probability affects the generation of samples for binary classification in a way described in the following. First, sampling data from uniform distribution <math>(0, 1)</math> and let the class priors be <math>Prob(\omega_1)</math> and <math>Prob(\omega_2)</math> respectively where the condition <math>Prob(\omega_1) + Prob(\omega_2) = 1</math> holds. If a random sample is drawn in the range <math>[0, Prob(\omega_1)]</math>, label the sample as class 1, then, continue to generating a normal random number based on the class 1 statistics <math>(\mu, \sigma)</math>. Otherwise, the sample belongs to <math>[Prob(\omega_2), 1]</math> and should be labeled as class 2, then, move onto the normal random number generation step with the class 2 statistics like the same way as we did for class 1. In summary, | ||
− | \[ X_i = \ | + | <center><math>X_i = X_i^{1} \sim N(\mu_1, \sigma_1) \quad X_i \in [0, Prob(\omega_1)]</math></center><br> |
+ | <center><math>X_i = X_i^{2} \sim N(\mu_2, \sigma_2) \quad \text{otherwise}</math></center> | ||
− | + | where the superscript of <math>X_i</math> indicates the label of the class. | |
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | where the superscript of | + | |
<br> | <br> | ||
Line 54: | Line 50: | ||
Once you decide the class of the sample, now it is time to generate actual random number that follows the class statistics determined in the prior selection step. The normal distribution is frequently used in many statistical models and being able to draw samples from the normal distribution lies at the heart of many probabilistic learning algorithms. The probability distribution function for the normal distribution is defined as | Once you decide the class of the sample, now it is time to generate actual random number that follows the class statistics determined in the prior selection step. The normal distribution is frequently used in many statistical models and being able to draw samples from the normal distribution lies at the heart of many probabilistic learning algorithms. The probability distribution function for the normal distribution is defined as | ||
− | \ | + | <center><math>y = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{\left(x-\mu\right)^2}{2\sigma^2}\right\}</math></center> |
− | + | where <math>\mu</math> is the distribution mean and <math>\sigma</math> is the standard deviation. | |
− | + | ||
− | + | ||
− | + | ||
− | + | ||
It is easy to generate an uniform distribution, but usually generation of a normal distribution requires additional steps rather than having a single step solution. Central limit theorem, inverse transform sampling method, Box-Muller transformation method and Ziggurat algorithm are examples to generate such distribution, given a source of uniform distribution. Since each of methods has its pros and cons, the optimal selection among those algorithm may differs under different conditions. | It is easy to generate an uniform distribution, but usually generation of a normal distribution requires additional steps rather than having a single step solution. Central limit theorem, inverse transform sampling method, Box-Muller transformation method and Ziggurat algorithm are examples to generate such distribution, given a source of uniform distribution. Since each of methods has its pros and cons, the optimal selection among those algorithm may differs under different conditions. | ||
Line 69: | Line 61: | ||
Central limit theorem states that when a sufficiently large number of samples drawn from independent random variables, the arithmetic mean of their distributions will be have a normal distribution as commonly known as a bell-shaped distribution. This implies that sampling from identical and independent uniform distributions will result in a distribution which will be normally distributed as the number of samples involved are large enough. This is shown in the following | Central limit theorem states that when a sufficiently large number of samples drawn from independent random variables, the arithmetic mean of their distributions will be have a normal distribution as commonly known as a bell-shaped distribution. This implies that sampling from identical and independent uniform distributions will result in a distribution which will be normally distributed as the number of samples involved are large enough. This is shown in the following | ||
− | + | <center><math>\lim_{n \to \infty}\frac{X_1 + \dots + X_n}{n} \approx \mu</math><br></center> | |
− | + | <center><math>\lim_{n \to \infty}\frac{X_1^2 + \dots + X_n^2}{n} \approx \sigma^2 + \mu^2</math></center> | |
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | where <math>\left\{X_1, \dots, X_n\right\}</math> be a random sample of size <math>n</math> and <math>\mu</math>, <math>\sigma</math> represents the mean and standard deviation for a normal distribution. However, in practice approaching a normal probabilistic distribution to a high accuracy using only central limit theorem requires an impractically large number of samples. Therefore, using this method to generate normal distribution is expensive and not plausible. | |
<br> | <br> | ||
Line 85: | Line 73: | ||
Random numbers that follow a normal distribution can be obtained by the following procedure. First, it starts from Poisson distribution described below. | Random numbers that follow a normal distribution can be obtained by the following procedure. First, it starts from Poisson distribution described below. | ||
− | \ | + | <center><math>p(k) = P(X = k) = e^{-\lambda} \frac{\lambda^k}{k!}, k \ge 0 </math></center> |
− | + | <math>\lambda</math> and <math>k</math> are all deterministic since <math>\lambda</math> is a given mean and <math>k</math> is a <math>n</math>th point of Poisson process. If we can simulate <math>X = N(1)</math> where <math>\left\{N(t) : t \ge 0 \right\}</math>, Poisson random number <math>X</math> can be obtained by using the formula | |
− | + | ||
− | + | ||
− | \ | + | <center><math> X = N(1) - 1 = min \left\{n \ge 1 | t_n > 1 \right\} - 1</math></center> |
− | <math>\ | + | where <math>t_n = X_1 + \dots + X_n</math>. The relation above comes from the exponential term in Poission distribution <math>e^{-\lambda}n</math>, that is <math>X_i = - \frac{1}{\lambda} \ln{u_i}</math>. Using such a relation with additivity of logarithm helps to generate a Poisson random number easily by consecutively adding <math>X_i</math> (equivalently, multiplying uniform random numbers <math>u_i</math>). Details are described as |
− | \ | + | <center><math>X = \min\left\{ n \ge 1 : \sum_{i=1}^{n} \ln(u_i) < -\alpha \right\} - 1 </math></center><br> |
+ | <center><math>\qquad = \min\left\{ n \ge 1 : \ln\left(\prod_{i=1}^{n} u_i\right) < -\alpha \right\} - 1 </math></center><br> | ||
+ | <center><math>\qquad = \min\left\{ n \ge 1 : \prod_{i=1}^{n} u_i < e^{-\alpha} \right\} - 1</math></center> | ||
− | + | where the product with new <math>u_i</math> drawn from <math>U \sim \text{uniform}(0, 1)</math> is repeated until we find <math>X = \prod_{i=1}^{n} u_i < e^{-\alpha}</math>. Note that for large <math>\lambda</math> in a Poisson, its distribution is approximately normal with mean of <math>\lambda</math> and variance of <math>\lambda</math> by the central limit theorem. As a result, a normal distribution can be generated once we obtain a Poisson distribution based on the samples from uniform distribution using inverse transform sampling method. | |
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
This method involves computing the quantile function of the distribution which utilizes the cumulative distribution function and then inverting that function. This is why the terminology is called ``inverse''method.'' For a discrete distribution, summation over all individual samples drawn from uniform distribution yields desired distributions. However, for a continuous cases, integration over probability density function is needed like the same way as we did in discrete domain, but it is impossible to obtain an analytical solution for most distributions. This shortcoming makes this method computationally inefficient in continuous domain and the alternative such as Box-Muller transform can be used. | This method involves computing the quantile function of the distribution which utilizes the cumulative distribution function and then inverting that function. This is why the terminology is called ``inverse''method.'' For a discrete distribution, summation over all individual samples drawn from uniform distribution yields desired distributions. However, for a continuous cases, integration over probability density function is needed like the same way as we did in discrete domain, but it is impossible to obtain an analytical solution for most distributions. This shortcoming makes this method computationally inefficient in continuous domain and the alternative such as Box-Muller transform can be used. | ||
Line 118: | Line 92: | ||
<font size="4">Box-Muller Transform</font> | <font size="4">Box-Muller Transform</font> | ||
− | The method generates a normal distribution given a source of uniform distribution. Main key of this method is to utilize the relation between Cartesian and polar coordinates. The polar form picks two samples from the interval, | + | The method generates a normal distribution given a source of uniform distribution. Main key of this method is to utilize the relation between Cartesian and polar coordinates. |
+ | The polar form picks two samples from the interval, [-1, +1], and maps them to two independent samples that are normally distributed without the use of sine or cosine functions. | ||
+ | The relation between two different domains can be characterized as few equations | ||
− | \ | + | <center><math>r^2 = x_1^2 + x_2^2, \qquad \tan\theta = \frac{x_2}{x_1}</math></center><br> |
+ | <center><math>x_1 = r \cos\theta, \qquad x_2 = r \sin\theta</math></center> | ||
− | + | where <math>\theta \in [0, 2\pi]</math> and we assume <math>|r| \le 1</math> since we are going to use <math>r \sim \text{uniform}(0, 1)</math> as a basic building block. | |
− | + | Such region covers the area contained in the unit circle in polar coordinate. | |
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | where <math>\theta \in [0, 2\pi]</math> and we assume <math>|r| \le 1</math> since we are going to use <math>r \sim \text{uniform}(0, 1)</math> as a basic building block. Such region covers the area contained in the unit circle in polar coordinate. | + | |
Box-Muller sampling is based on the joint distribution of two independent random variables <math>x_1 \sim N(0, 1), x_2 \sim N(0,1)</math> in Cartesian coordinate. If we convert such distributions in polar coordinate, the joint distribution <math>p(x_1, x_2)</math> becomes | Box-Muller sampling is based on the joint distribution of two independent random variables <math>x_1 \sim N(0, 1), x_2 \sim N(0,1)</math> in Cartesian coordinate. If we convert such distributions in polar coordinate, the joint distribution <math>p(x_1, x_2)</math> becomes | ||
− | + | <center><math>p(x_1, x_2) = p(x_1) p(x_2) = \frac{1}{2\pi}\exp\left\{-\frac{x_1^2}{2}\right\}\exp\left\{-\frac{x_2^2}{2}\right\}</math></center><br> | |
− | + | <center><math>= \frac{1}{2\pi}\exp\left\{-\frac{x_1^2 + x_2^2}{2}\right\}</math></center><br> | |
− | + | <center><math>= \frac{1}{2\pi}\exp\left\{-\frac{r_2}{2}\right\}</math></center> | |
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
From the last line of the equation, we can see that the distribution follows an exponential distribution with respect to <math>r^2</math>. Precisely, distributions in polar coordinate are: | From the last line of the equation, we can see that the distribution follows an exponential distribution with respect to <math>r^2</math>. Precisely, distributions in polar coordinate are: | ||
− | + | <center><math>r^2 \sim \text{exponential}\left(\frac{1}{2}\right), \qquad \theta \sim \text{uniform}\left(0, 2\pi\right)</math></center> | |
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
Since the angle <math>\theta</math> already follows an uniform distribution, it can be simply drawn from an uniform distribution. However, the additional work is needed to generate the exponential distribution. The connection between uniform and exponential can be formulated as follows: | Since the angle <math>\theta</math> already follows an uniform distribution, it can be simply drawn from an uniform distribution. However, the additional work is needed to generate the exponential distribution. The connection between uniform and exponential can be formulated as follows: | ||
− | + | <center><math>\text{exponential}\left(\lambda\right) = \frac{-\log\left(\text{uniform}(0, 1)\right)}{\lambda}</math></center> | |
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
Then, combining with the equation in terms of <math>r^2</math> and solving for <math>r</math> gives, | Then, combining with the equation in terms of <math>r^2</math> and solving for <math>r</math> gives, | ||
− | + | <center><math>r \sim \sqrt{-2 \log\left(\text{uniform}(0, 1)\right)}</math></center> | |
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | Now all variables <math>r</math> and <math>\theta</math> can be expressed in terms of uniform distributions. By back-tracking above procedure, we will be able to | + | Now all variables <math>r</math> and <math>\theta</math> can be expressed in terms of uniform distributions. By back-tracking above procedure, we will be able to successively generate a normal distribution from out of uniform distributions. This way to generate normal random numbers is known as Box-Muller transformations with polar method, which makes Box-Muller transformation more efficient by avoiding direct calculations of cosine and sine functions. |
<br> | <br> | ||
Line 184: | Line 129: | ||
This method partitions a target distribution into a small blocks so that the whole distribution can be described as an union of blocks which is | This method partitions a target distribution into a small blocks so that the whole distribution can be described as an union of blocks which is | ||
− | \ | + | <center><math>Z = \bigcup_{i=0}^C B_i</math></center> |
− | + | where the area of each <math>B_i</math> is a rectangle whose width extends from <math>x=0</math> to <math>x= x_i</math> horizontally and whose height vertically extends from <math>f(x_i)</math> to <math>f(x_{i+1})</math> (details are shown in the following equation). | |
− | + | ||
− | + | ||
− | + | <center><math>B_i = 1 \quad \left\{(x, y) | 0 < x < x_i, f(x_i) < y < f(x_{i+1})\right\}</math></center><br> | |
− | + | <center><math>B_i = 0 \quad \text{otherwise}</math></center> | |
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
Here <math>x_i</math>s are <math>x</math>-values that monotonically increases from the center to the tail for a given normal distribution. | Here <math>x_i</math>s are <math>x</math>-values that monotonically increases from the center to the tail for a given normal distribution. | ||
− | The algorithm first choose a random block <math>i</math> among all possible <math>B_i</math>s. Then, it draws a random number <math>u_0</math> from uniform <math>(0,1)</math> and calculate <math>z</math> by multiplying the x-axis index <math>i</math> with <math>u_0</math>. If the randomly generated <math>z</math> belongs to any of partitioned blocks, <math>z</math> returns as a normal random number. Otherwise, it again calculates a similar step for <math>y</math>-axis to check whether the sample belongs to wedge or not. Since understanding of the principle of this algorithm requires background knowledge, details are out of scope for this slecture and please refer to the paper | + | The algorithm first choose a random block <math>i</math> among all possible <math>B_i</math>s. Then, it draws a random number <math>u_0</math> from uniform <math>(0,1)</math> and calculate <math>z</math> by multiplying the x-axis index <math>i</math> with <math>u_0</math>. If the randomly generated <math>z</math> belongs to any of partitioned blocks, <math>z</math> returns as a normal random number. Otherwise, it again calculates a similar step for <math>y</math>-axis to check whether the sample belongs to wedge or not. Since understanding of the principle of this algorithm requires background knowledge, details are out of scope for this slecture and please refer to the paper titled ``The Ziggurat Method for Generating Random Variables''.'' |
The critical values such as the edges of the rectangle (<math>x_i</math> or <math>f(x_i)</math>) used in Ziggurat method are stored in memory, which improves efficiency in terms of computational cost. Moreover, it does not need to evaluate an exponential function all the time while Box-Muller method computes the exponential (or logarithm) every time. The ziggurat algorithm will calculate an exponential function only if samples are in the wedge. For these reason, the ziggurat algorithm is the fastest way to generate but relatively difficult to implement it is best used when large quantities of random numbers are needed. | The critical values such as the edges of the rectangle (<math>x_i</math> or <math>f(x_i)</math>) used in Ziggurat method are stored in memory, which improves efficiency in terms of computational cost. Moreover, it does not need to evaluate an exponential function all the time while Box-Muller method computes the exponential (or logarithm) every time. The ziggurat algorithm will calculate an exponential function only if samples are in the wedge. For these reason, the ziggurat algorithm is the fastest way to generate but relatively difficult to implement it is best used when large quantities of random numbers are needed. | ||
Line 208: | Line 145: | ||
<font size="5">Conversion from normalized distribution</font> | <font size="5">Conversion from normalized distribution</font> | ||
− | Any normal distribution with <math>\mu_0</math> and <math>\sigma_0</math> can be easily converted to another normal distribution with a custom mean <math>\mu_1</math> and standard deviation <math>\sigma_1</math>. For convenience and resource saving purpose, random numbers are generated from normalized normal distribution that follows <math>N(0, 1)</math> from a | + | Any normal distribution with <math>\mu_0</math> and <math>\sigma_0</math> can be easily converted to another normal distribution with a custom mean <math>\mu_1</math> and standard deviation <math>\sigma_1</math>. For convenience and resource saving purpose, random numbers are generated from normalized normal distribution that follows <math>N(0, 1)</math> from a single unified source, then such numbers are converted to follow a custom normal distribution simply using the following equation. |
− | + | <center><math>Z = \frac{X - \mu_0}{\sigma_0} \quad \text{where } Z \sim N(0, 1) \text{ and } X \sim N(\mu_0, \sigma_0)</math></center><br> | |
− | + | <center><math>X = \sigma_i Z + \mu_i \quad \text{where } Z \sim N(0, 1) \text{ and } X \sim N(\mu_i, \sigma_i)</math></center> | |
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | < | + | |
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
<br> | <br> | ||
<font size="5">Conclusion</font> | <font size="5">Conclusion</font> | ||
− | Now we understand how to generate | + | Now we understand how to generate normally distributed random numbers from two categories with different priors using samples taken from uniform distributions. The generation of random numbers with a normal distribution combined with the prior selection step discussed in the earlier section provides statistically consistent datasets, which will improve the accuracy of trainable systems. While class selection based on the prior probabilities is relatively simple, the generation of random numbers with a normal distribution has many issues to be considered because of the trade-off between its computational cost and the accuracy. The four methods covered in this work have their own pros and cons, but practically efficiency and complexity are the most important concerns to be considered when deciding an optimal model for given task. |
<br> | <br> | ||
<font size="5">References</font> | <font size="5">References</font> | ||
− | + | # Dong-U Lee, Wayne Luk, John Villasenor and Peter Cheung, ``A Hardware Gaussian Noise Generator for Channel Code Evaluation | |
− | + | # Raul Toral and Amitabha Chakrabarti, ``Generation of Gaussian distributed random numbers by using a numerical inversion method | |
− | + | # Hassan Edrees, Brian Cheung, McCullen Sandora, David Nummey and Deian Stefan ``Hardware-Optimized Ziggurat Algorithm for High-Speed Gaussian Random Number Generators | |
− | + | # Central Limit Theorem, wikipedia, http://en.wikipedia.org/wiki/Central_limit_theorem | |
− | + | # Inverse Transform Sampling, wikipedia, http://en.wikipedia.org/wiki/Inverse_transform_sampling | |
− | + | # Box-Muller Transform, wikipedia, http://en.wikipedia.org/wiki/Box_Muller_transform | |
− | + | # Box-Muller Transformation, mathworld, http://mathworld.wolfram.com/Box-MullerTransformation.html | |
− | + | # Emiliano A. Valdez, Lecture note, http://www.math.uconn.edu/~valdez/math3634s09/Math3634-Weeks4to5.pdf | |
− | + | # Karl Sigman, Lecture note, http://www.columbia.edu/~ks20/4404-Sigman/4404-Notes-ITM.pdf | |
− | + | # Henrik Schoioler, Lecture note, http://www.control.auc.dk/~henrik/undervisning/DES/lec03.pdf | |
− | + | # http://theclevermachine.wordpress.com/2012/09/11/sampling-from-the-normal-distribution-using-the-box-muller-transform/ | |
− | |||
− | |||
− | |||
− | |||
<br> | <br> | ||
− | |||
− | |||
− | |||
− | |||
---- | ---- |
Latest revision as of 09:40, 22 January 2015
Generation of normally distributed random numbers from two categories with different priors
A slecture by Jonghoon Jin
Partly based on the ECE662 Spring 2014 lecture material of Prof. Mireille Boutin.
Source PDF file download
This page is built based on MediaWiki + LaTeX math package parser. Some of contents were modified due to its limited compatibility. The original PDF file was written in PDFTeX and can be downloaded from the following link.
Introduction
In the most of pattern recognition, decision and classification problems, the concept of training becomes popular with the advance of Machine Learning and computing power that can afford to expensive computatioal costs. Such a trainable system is a record-holder for many open problems and its powerful capability is successfully built based on the massive volume of information in a dataset. Due to the importance of data, people not only try to obtain a big quantity of data, but also concentrate a good quality of data. Often, a real-world data contains either high variance or high bias hence it is difficult to estimate true density and using such data does not necessarily result in good performance. In the circumstance, a naive assumption about the class distribution helps us synthesize data so that we can train models with a consistent dataset. Among many distributions, Normal distribution is frequently used in many literatures. This tutorial will explain how to generate binary data classes from normal distributions with prior probabilities in a perspective on numerical experiments.
It consists of the following sections:
- Prior selection : How to set priors?
- Normal distribution : How it work? Which is more efficient?
- Central limit theorem
- Inverse transform sampling method
- Box-Muller transform
- Ziggurat Algorithm
- Conversion from normalized distribution
- Conclusion
Prior Selection
Prior probability gives an idea on which class is more likely to be observed among all possible classes. For binary data classes with normal distributions, it is important to draw samples based on priors before investigating how to generate distributions. This is because in practice the number of samples contains implicit information about prior probability and also empirical prior is different from the true prior used to determine class selection where the empirical prior is calculated by only looking at the ratio of class 1 and class 2 for binary classification example.
To be specific, prior probability affects the generation of samples for binary classification in a way described in the following. First, sampling data from uniform distribution $ (0, 1) $ and let the class priors be $ Prob(\omega_1) $ and $ Prob(\omega_2) $ respectively where the condition $ Prob(\omega_1) + Prob(\omega_2) = 1 $ holds. If a random sample is drawn in the range $ [0, Prob(\omega_1)] $, label the sample as class 1, then, continue to generating a normal random number based on the class 1 statistics $ (\mu, \sigma) $. Otherwise, the sample belongs to $ [Prob(\omega_2), 1] $ and should be labeled as class 2, then, move onto the normal random number generation step with the class 2 statistics like the same way as we did for class 1. In summary,
where the superscript of $ X_i $ indicates the label of the class.
Normal Distribution Generation
Once you decide the class of the sample, now it is time to generate actual random number that follows the class statistics determined in the prior selection step. The normal distribution is frequently used in many statistical models and being able to draw samples from the normal distribution lies at the heart of many probabilistic learning algorithms. The probability distribution function for the normal distribution is defined as
where $ \mu $ is the distribution mean and $ \sigma $ is the standard deviation.
It is easy to generate an uniform distribution, but usually generation of a normal distribution requires additional steps rather than having a single step solution. Central limit theorem, inverse transform sampling method, Box-Muller transformation method and Ziggurat algorithm are examples to generate such distribution, given a source of uniform distribution. Since each of methods has its pros and cons, the optimal selection among those algorithm may differs under different conditions.
Central Limit Theorem
Central limit theorem states that when a sufficiently large number of samples drawn from independent random variables, the arithmetic mean of their distributions will be have a normal distribution as commonly known as a bell-shaped distribution. This implies that sampling from identical and independent uniform distributions will result in a distribution which will be normally distributed as the number of samples involved are large enough. This is shown in the following
where $ \left\{X_1, \dots, X_n\right\} $ be a random sample of size $ n $ and $ \mu $, $ \sigma $ represents the mean and standard deviation for a normal distribution. However, in practice approaching a normal probabilistic distribution to a high accuracy using only central limit theorem requires an impractically large number of samples. Therefore, using this method to generate normal distribution is expensive and not plausible.
Inverse Transform Sampling Method
Inverse transform sampling method is a fundamental method in sampling for random numbers and can generate any types of probabilistic distributions given a source of uniform distributions. This method involves computing the quantile function. Which of a probabilistic distribution is the inverse of its cumulative distribution. Basically, it generates random numbers $ u $ from the uniform distribution in the interval $ [0, 1] $. Then, using equations for the desired probabilistic distribution, it computes the value that satisfies $ F(x) = u $. Finally, the number $ x $ drawn from $ F $ density distribution is considred as a random sample.
Random numbers that follow a normal distribution can be obtained by the following procedure. First, it starts from Poisson distribution described below.
$ \lambda $ and $ k $ are all deterministic since $ \lambda $ is a given mean and $ k $ is a $ n $th point of Poisson process. If we can simulate $ X = N(1) $ where $ \left\{N(t) : t \ge 0 \right\} $, Poisson random number $ X $ can be obtained by using the formula
where $ t_n = X_1 + \dots + X_n $. The relation above comes from the exponential term in Poission distribution $ e^{-\lambda}n $, that is $ X_i = - \frac{1}{\lambda} \ln{u_i} $. Using such a relation with additivity of logarithm helps to generate a Poisson random number easily by consecutively adding $ X_i $ (equivalently, multiplying uniform random numbers $ u_i $). Details are described as
where the product with new $ u_i $ drawn from $ U \sim \text{uniform}(0, 1) $ is repeated until we find $ X = \prod_{i=1}^{n} u_i < e^{-\alpha} $. Note that for large $ \lambda $ in a Poisson, its distribution is approximately normal with mean of $ \lambda $ and variance of $ \lambda $ by the central limit theorem. As a result, a normal distribution can be generated once we obtain a Poisson distribution based on the samples from uniform distribution using inverse transform sampling method.
This method involves computing the quantile function of the distribution which utilizes the cumulative distribution function and then inverting that function. This is why the terminology is called ``inversemethod. For a discrete distribution, summation over all individual samples drawn from uniform distribution yields desired distributions. However, for a continuous cases, integration over probability density function is needed like the same way as we did in discrete domain, but it is impossible to obtain an analytical solution for most distributions. This shortcoming makes this method computationally inefficient in continuous domain and the alternative such as Box-Muller transform can be used.
Box-Muller Transform
The method generates a normal distribution given a source of uniform distribution. Main key of this method is to utilize the relation between Cartesian and polar coordinates. The polar form picks two samples from the interval, [-1, +1], and maps them to two independent samples that are normally distributed without the use of sine or cosine functions. The relation between two different domains can be characterized as few equations
where $ \theta \in [0, 2\pi] $ and we assume $ |r| \le 1 $ since we are going to use $ r \sim \text{uniform}(0, 1) $ as a basic building block. Such region covers the area contained in the unit circle in polar coordinate.
Box-Muller sampling is based on the joint distribution of two independent random variables $ x_1 \sim N(0, 1), x_2 \sim N(0,1) $ in Cartesian coordinate. If we convert such distributions in polar coordinate, the joint distribution $ p(x_1, x_2) $ becomes
From the last line of the equation, we can see that the distribution follows an exponential distribution with respect to $ r^2 $. Precisely, distributions in polar coordinate are:
Since the angle $ \theta $ already follows an uniform distribution, it can be simply drawn from an uniform distribution. However, the additional work is needed to generate the exponential distribution. The connection between uniform and exponential can be formulated as follows:
Then, combining with the equation in terms of $ r^2 $ and solving for $ r $ gives,
Now all variables $ r $ and $ \theta $ can be expressed in terms of uniform distributions. By back-tracking above procedure, we will be able to successively generate a normal distribution from out of uniform distributions. This way to generate normal random numbers is known as Box-Muller transformations with polar method, which makes Box-Muller transformation more efficient by avoiding direct calculations of cosine and sine functions.
Ziggurat Algorithm
Ziggurat method is the fastest algorithm for generating normal random numbers. It is more efficient than Box-Muller method but is more complicated algorithm. Due to the efficiency and speed, it has been implemented in various libraries such as GNU scientific library and MatLab.
This method partitions a target distribution into a small blocks so that the whole distribution can be described as an union of blocks which is
where the area of each $ B_i $ is a rectangle whose width extends from $ x=0 $ to $ x= x_i $ horizontally and whose height vertically extends from $ f(x_i) $ to $ f(x_{i+1}) $ (details are shown in the following equation).
Here $ x_i $s are $ x $-values that monotonically increases from the center to the tail for a given normal distribution.
The algorithm first choose a random block $ i $ among all possible $ B_i $s. Then, it draws a random number $ u_0 $ from uniform $ (0,1) $ and calculate $ z $ by multiplying the x-axis index $ i $ with $ u_0 $. If the randomly generated $ z $ belongs to any of partitioned blocks, $ z $ returns as a normal random number. Otherwise, it again calculates a similar step for $ y $-axis to check whether the sample belongs to wedge or not. Since understanding of the principle of this algorithm requires background knowledge, details are out of scope for this slecture and please refer to the paper titled ``The Ziggurat Method for Generating Random Variables.
The critical values such as the edges of the rectangle ($ x_i $ or $ f(x_i) $) used in Ziggurat method are stored in memory, which improves efficiency in terms of computational cost. Moreover, it does not need to evaluate an exponential function all the time while Box-Muller method computes the exponential (or logarithm) every time. The ziggurat algorithm will calculate an exponential function only if samples are in the wedge. For these reason, the ziggurat algorithm is the fastest way to generate but relatively difficult to implement it is best used when large quantities of random numbers are needed.
Conversion from normalized distribution
Any normal distribution with $ \mu_0 $ and $ \sigma_0 $ can be easily converted to another normal distribution with a custom mean $ \mu_1 $ and standard deviation $ \sigma_1 $. For convenience and resource saving purpose, random numbers are generated from normalized normal distribution that follows $ N(0, 1) $ from a single unified source, then such numbers are converted to follow a custom normal distribution simply using the following equation.
Conclusion
Now we understand how to generate normally distributed random numbers from two categories with different priors using samples taken from uniform distributions. The generation of random numbers with a normal distribution combined with the prior selection step discussed in the earlier section provides statistically consistent datasets, which will improve the accuracy of trainable systems. While class selection based on the prior probabilities is relatively simple, the generation of random numbers with a normal distribution has many issues to be considered because of the trade-off between its computational cost and the accuracy. The four methods covered in this work have their own pros and cons, but practically efficiency and complexity are the most important concerns to be considered when deciding an optimal model for given task.
References
- Dong-U Lee, Wayne Luk, John Villasenor and Peter Cheung, ``A Hardware Gaussian Noise Generator for Channel Code Evaluation
- Raul Toral and Amitabha Chakrabarti, ``Generation of Gaussian distributed random numbers by using a numerical inversion method
- Hassan Edrees, Brian Cheung, McCullen Sandora, David Nummey and Deian Stefan ``Hardware-Optimized Ziggurat Algorithm for High-Speed Gaussian Random Number Generators
- Central Limit Theorem, wikipedia, http://en.wikipedia.org/wiki/Central_limit_theorem
- Inverse Transform Sampling, wikipedia, http://en.wikipedia.org/wiki/Inverse_transform_sampling
- Box-Muller Transform, wikipedia, http://en.wikipedia.org/wiki/Box_Muller_transform
- Box-Muller Transformation, mathworld, http://mathworld.wolfram.com/Box-MullerTransformation.html
- Emiliano A. Valdez, Lecture note, http://www.math.uconn.edu/~valdez/math3634s09/Math3634-Weeks4to5.pdf
- Karl Sigman, Lecture note, http://www.columbia.edu/~ks20/4404-Sigman/4404-Notes-ITM.pdf
- Henrik Schoioler, Lecture note, http://www.control.auc.dk/~henrik/undervisning/DES/lec03.pdf
- http://theclevermachine.wordpress.com/2012/09/11/sampling-from-the-normal-distribution-using-the-box-muller-transform/
Questions and comments
If you have any questions, comments, etc. please post them on this page.