How to generate random data two classes Minwoong Kim review - Rhea

Questions and Comments for: How to Generate N-D Gaussian Data in Two Category Case

A slecture by Minwoong Kim

Please leave a comment below if you have any questions, notice any errors, or would like to discuss a topic further.

Hyungju Andy Park Review

This slecture explains two ways to synthetically generate N-dimensional Gaussian data according to the class priors in MATLAB, using two classes in 2D as an example. The first way computes the sample size of each class by multiplying its prior by the total number of samples, and then passes that size to the "mvnrnd" function, together with the specified mean and covariance, to generate the 2D Gaussian data for that class. The second way uses a uniform random variable with vector operations. A uniform random vector with as many elements as the whole dataset is generated first, with each element in the range (0, 1). Then a vector operation such as "find" returns the indices of the elements that are less than or equal to class 1's prior; the remaining indices, whose elements are greater than class 1's prior, belong to class 2. These two index sets are then used with "mvnrnd", along with each class's mean and covariance, to generate the data for each class. The MATLAB plots showed that with the first way the ratio of the class sample counts exactly matched the ratio of the priors, whereas with the second way it matched only approximately.
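
The difference between the two ways of choosing the class sizes can be sketched as follows. This is a minimal illustration, not the slecture's MATLAB code: it uses Python's standard library in place of "mvnrnd", the function names (sample_gaussian_2d, class_sizes_fixed, class_sizes_random) are hypothetical, and the sample count, prior, means and covariances are assumed values chosen only for the example.

```python
import math
import random


def sample_gaussian_2d(n, mean, cov, rng):
    """Draw n points from a 2-D Gaussian (stand-in for MATLAB's mvnrnd).

    Uses a hand-written Cholesky factor of the symmetric 2x2 covariance.
    """
    (a, b), (_, c) = cov                 # cov = [[a, b], [b, c]]
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(c - l21 * l21)
    points = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        points.append((mean[0] + l11 * z1,
                       mean[1] + l21 * z1 + l22 * z2))
    return points


def class_sizes_fixed(n_total, prior1):
    """First way: class size = prior * total number of samples."""
    n1 = round(n_total * prior1)
    return n1, n_total - n1


def class_sizes_random(n_total, prior1, rng):
    """Second way: one uniform draw per sample; class 1 if draw <= prior."""
    draws = [rng.random() for _ in range(n_total)]
    n1 = sum(1 for u in draws if u <= prior1)
    return n1, n_total - n1


# Assumed example values (not taken from the slecture).
rng = random.Random(0)                   # fixed seed so the run is repeatable
N, PRIOR1 = 1000, 0.3
MEAN1, COV1 = (0.0, 0.0), ((1.0, 0.0), (0.0, 1.0))
MEAN2, COV2 = (3.0, 3.0), ((1.0, 0.5), (0.5, 2.0))

fixed_sizes = class_sizes_fixed(N, PRIOR1)
random_sizes = class_sizes_random(N, PRIOR1, rng)

for label, (n1, n2) in (("fixed", fixed_sizes), ("random", random_sizes)):
    class1 = sample_gaussian_2d(n1, MEAN1, COV1, rng)
    class2 = sample_gaussian_2d(n2, MEAN2, COV2, rng)
    print(label, len(class1), len(class2), len(class1) / N)
```

With the first way the class 1 fraction equals the prior up to rounding. With the second way the class 1 count is a binomial random variable with parameters N and the prior, so it varies from run to run with standard deviation sqrt(N * P1 * (1 - P1)), which is about 14.5 samples for the assumed N = 1000 and P1 = 0.3.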
Overall, the explanation is simple enough to follow even for readers without deep knowledge of random variables and probability. The MATLAB demonstrations (figures and code) are good in that readers can directly implement what was explained on their own.
Here are some suggestions. It would have been better if, after explaining the methods, the author had shared his own thoughts on the advantages and limitations of each one, e.g., which would make more sense in real-world examples. That would help readers decide which method to use for their own applications and problems. Along the same lines, it would be helpful if the author had explained whether the difference between the two methods affects classification accuracy.


