
Algorithms for classification

Human-in-the-loop or black box?

· Data mining

When it comes to tasks such as sorting posts into categories or filtering spam out of your inbox, the question is which algorithm to use for the classification. There is one very general piece of advice: start by asking whether humans are involved in the decision or not.


For human-in-the-loop work, for example when making decisions about a business practice, we can use decision trees, random forests, k-NN, or naïve Bayes. These algorithms are easier to interpret, so they help us understand how each result was reached.

 

A decision tree splits the data into groups according to the variables, then into subgroups, and keeps branching until it can’t split any further. Each endpoint is called a terminal node. A collection of decision trees makes up a random forest. Trees are very easy to implement, very flexible, and very easy to interpret.
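To make the splitting idea concrete, here is a minimal sketch of a decision tree in plain Python. It assumes numeric variables and uses Gini impurity to choose each split, which is one common criterion but not the only one. The function names and the toy email data (number of links, number of exclamation marks) are made up for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 when every label in the group is the same."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Try every variable and threshold; return (feature, threshold, impurity) or None."""
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [lab for row, lab in zip(rows, labels) if row[feature] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:
                continue  # a split must put cases on both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[2]:
                best = (feature, threshold, score)
    return best

def build_tree(rows, labels):
    """Keep branching until a group is pure or cannot be split any further."""
    if gini(labels) == 0.0:
        return labels[0]  # terminal node: only one category left
    split = best_split(rows, labels)
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # terminal node: majority vote
    feature, threshold, _ = split
    left = [(r, l) for r, l in zip(rows, labels) if r[feature] <= threshold]
    right = [(r, l) for r, l in zip(rows, labels) if r[feature] > threshold]
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree([r for r, _ in left], [l for _, l in left]),
        "right": build_tree([r for r, _ in right], [l for _, l in right]),
    }

def predict(tree, row):
    """Follow the branches down to a terminal node."""
    while isinstance(tree, dict):
        side = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[side]
    return tree

# Hypothetical training data: (number of links, number of exclamation marks)
rows = [(0, 1), (1, 0), (5, 3), (7, 1), (6, 4), (0, 0)]
labels = ["ham", "ham", "spam", "spam", "spam", "ham"]
tree = build_tree(rows, labels)
print(tree)
print(predict(tree, (8, 2)))
```

The printed tree is a nested dictionary, so a person can read off exactly which variable and threshold led to each decision. A random forest would train many such trees on resampled data and let them vote; that step is not shown here.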

 

K-NN, or k-nearest neighbors, places your data in a multidimensional space where each variable is a dimension. The goal is to locate your case in this space and pick out the k cases closest to it. How many similar cases to look at is your choice, e.g. k = 5, 7, 24, or 26. Once you have them, you simply look at which category those cases are in and assign your case to the most common one.
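The procedure is short enough to sketch in a few lines of plain Python. This version assumes Euclidean distance and a simple majority vote; the function name and the toy points are made up for illustration.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=5):
    """Classify `query` by majority vote among its k nearest training cases."""
    # Sort all training cases by their Euclidean distance to the query case
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical two-variable data set with two categories
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["ham", "ham", "ham", "spam", "spam", "spam"]

print(knn_predict(points, labels, (2, 2), k=3))
print(knn_predict(points, labels, (7, 7), k=3))
```

With two categories, an odd k avoids tied votes. In practice the variables usually need to be put on a comparable scale first, since the distance treats every dimension equally.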

 

Naïve Bayes starts from the overall prior probability of a group, for example the probability that a person in the general population has a physical disease versus not having it, and then updates it into a posterior probability with each new piece of information. It is called naïve because it ignores the relationships between the predictor variables, but it still works well in many situations.
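The update from prior to posterior can be sketched as follows. This is a minimal illustration for yes/no predictors, not a full implementation; the function name and all of the probabilities are invented for the example.

```python
def naive_bayes_posterior(priors, likelihoods, evidence):
    """Return P(class | evidence), treating the predictors as independent.

    priors:      {class: P(class)}
    likelihoods: {class: {feature: P(feature is present | class)}}
    evidence:    {feature: True or False}
    """
    scores = {}
    for cls, prior in priors.items():
        score = prior
        # The "naive" step: multiply per-feature probabilities as if
        # the features had no relationship to each other.
        for feature, present in evidence.items():
            p = likelihoods[cls][feature]
            score *= p if present else (1.0 - p)
        scores[cls] = score
    total = sum(scores.values())
    return {cls: score / total for cls, score in scores.items()}

# Hypothetical numbers: 1% of the population has the disease
priors = {"disease": 0.01, "healthy": 0.99}
likelihoods = {
    "disease": {"fever": 0.9, "cough": 0.8},
    "healthy": {"fever": 0.1, "cough": 0.2},
}
posterior = naive_bayes_posterior(priors, likelihoods, {"fever": True, "cough": True})
print(posterior)
```

With these made-up numbers, seeing both symptoms raises the probability of disease from the 1% prior to roughly 27%, and every factor in that calculation can be inspected by a person.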


Black-box methods are more sophisticated and trickier to interpret. SVM (support vector machine) is one algorithm in this group. It tries to find a hyperplane (the higher-dimensional analogue of a straight line) in a very high-dimensional space that cleanly separates two different groups. Another very sophisticated approach is the artificial neural network, or ANN, which builds a nonlinear model out of multiple layers of equations to predict the outcome. We just put data in, wait for the processing, and get the outputs. If accuracy is paramount, no person needs to consult the reasoning, and you are willing to accept a very complicated, opaque system, then SVM and ANN can be acceptable.
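The prediction step of both methods is simple to write down; the opaque part is the training that finds the weights, which is not shown here. The sketch below uses hand-picked, hypothetical weights purely for illustration: a linear decision rule of the kind an SVM ends up with, and a tiny two-layer network that computes XOR, a pattern no single straight line can separate.

```python
def svm_decision(weights, bias, x):
    """Report which side of the hyperplane w.x + b = 0 the case x falls on."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score >= 0 else -1

def relu(v):
    """A common nonlinearity: negative values become zero."""
    return max(0.0, v)

def tiny_network(x1, x2):
    """Two layers of equations with hand-picked (not trained) weights."""
    # Hidden layer: two nonlinear units
    h1 = relu(x1 + x2)
    h2 = relu(x1 + x2 - 1.0)
    # Output layer: a linear combination of the hidden units
    return h1 - 2.0 * h2

# Hypothetical hyperplane x1 + x2 - 1.5 = 0
print(svm_decision([1.0, 1.0], -1.5, (1, 1)))
print(svm_decision([1.0, 1.0], -1.5, (0, 0)))

# The network outputs 1 only when exactly one input is 1 (XOR)
for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, tiny_network(a, b))
```

Even in this toy case the weights say little about why an input gets its label. Real models have thousands or millions of such weights, which is what makes them hard to interpret.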

Keep these in mind and you’ll be able to produce more informative and useful analyses for your data science projects.