
Is more data better?

Dimensionality Reduction and Data Extraction

· Data mining

In the era of Big Data, people are captivated by the tens of thousands of data points produced every day, hoping to find some hidden pattern behind the facts. The paradox of big data, however, is that manual errors and an excess of variables inside the data can easily mislead you about the “truth”.

Data reduction is a solution for further data analysis and data mining: it helps you simplify the data set so you can focus on the data that is more likely to carry the meaning you care about and less likely to carry the noise that distracts you from your purpose.

This process is called reduction in variables, or fields. Why would you need to do that? First, practicality: it saves time when running your analysis and saves disk space when the data set is enormous. Second, noise reduction: meaningless information hides inside the data and distracts your attention, so removing irrelevant variables and focusing on the most relevant ones makes it easier to find regularities and to interpret the results. A nice analogy for this kind of simplification is the idea of projecting a shadow: an object in a higher-dimensional space casts a shadow into a lower-dimensional one, and you can still tell what it is.

Statistically, data reduction helps us avoid multicollinearity: when your variables are correlated with each other, they can make your model unstable. It also frees up degrees of freedom, which matters especially when you have only a few hundred cases. Finally, it helps you avoid overfitting, where a model fits your sample data well but doesn't generalize to other situations.
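To make the multicollinearity point concrete, here is a minimal sketch (not from the original post, using NumPy) in which two nearly redundant variables are strongly correlated, while the principal component scores computed from them are uncorrelated by construction.

```python
# Minimal sketch: multicollinearity in the raw variables vs. uncorrelated PCA scores.
import numpy as np

rng = np.random.default_rng(0)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)   # x2 is almost a copy of x1
X = np.column_stack([x1, x2])

print(np.corrcoef(X, rowvar=False))        # off-diagonal close to 1 -> multicollinearity

# PCA via SVD of the centered data: the resulting component scores are uncorrelated
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T
print(np.corrcoef(scores, rowvar=False))   # off-diagonal close to 0
```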


One of the big decisions you have to make with data reduction is choosing an algorithm. There are two broad categories to choose from. The first is linear methods. The most common linear method is principal component analysis, or PCA. PCA reduces the number of dimensions by finding the lower-dimensional space that retains as much of the data's variability as possible. For example, imagine running a best-fit line through a scatter plot and drawing a perpendicular segment from each point to that line. If we rotate the axes so that this line becomes the new x-axis, almost all of the variability now lies along x and very little along y, so the y dimension can be dropped with little loss.
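As a concrete illustration of that projection idea, here is a minimal sketch assuming scikit-learn is available: it fits PCA to a two-dimensional cloud of points that lie roughly along a line and keeps only the single direction that preserves the most variance.

```python
# Minimal sketch: project a 2-D point cloud onto its first principal component.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 2 * x + rng.normal(scale=0.5, size=200)   # points roughly along a line
X = np.column_stack([x, y])

pca = PCA(n_components=1)
scores = pca.fit_transform(X)                  # coordinates along the first component

print(pca.explained_variance_ratio_)           # e.g. ~0.98: most variability kept in 1-D
print(pca.components_[0])                      # direction of the new axis
```

With points this strongly aligned, a single component typically explains the bulk of the variance, which is exactly the situation where dropping the second dimension loses very little information.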


There are a couple of considerations to keep in mind when using PCA, summarized in the figure below.

[Figure: considerations when using PCA]

The second category is nonlinear methods for dimensionality reduction. Nonlinear methods are very common in fields like computer vision and machine learning. Here there are a few choices, for example kernel PCA, Isomap, locally linear embedding, and maximum variance unfolding. Their results, however, are often hard to interpret. If the output only feeds a black-box model, that may not matter, but if humans need to make sense of it, you will want to emphasize interpretability. A rough sketch of two of these methods follows.
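Here is a minimal sketch assuming scikit-learn, applying two of the methods mentioned above (kernel PCA and Isomap) to the classic S-curve data set, which a purely linear projection cannot unfold; the parameter values are just placeholders for this example.

```python
# Minimal sketch: two nonlinear dimensionality-reduction methods on the S-curve data.
from sklearn.datasets import make_s_curve
from sklearn.decomposition import KernelPCA
from sklearn.manifold import Isomap

X, color = make_s_curve(n_samples=1000, random_state=0)

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=2.0)
X_kpca = kpca.fit_transform(X)        # 2-D embedding from kernel PCA

iso = Isomap(n_components=2, n_neighbors=10)
X_iso = iso.fit_transform(X)          # 2-D embedding preserving geodesic distances

print(X_kpca.shape, X_iso.shape)      # both (1000, 2)
```

Note that parameters such as the kernel width (gamma) and the number of neighbors strongly affect the embedding, which is part of why these methods are harder to interpret than plain PCA.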