Machine Learning - Who's the Boss?

In the Machine Learning field, there are two types of algorithms that can be applied to a set of data to solve different kinds of problems: Supervised and Unsupervised learning algorithms. Both of these have in common that they aim to extract information or gain knowledge from the raw data that would otherwise be very hard and unpractical to do. This is because we live in very dynamic environments with changing parameters and vast amounts of data being gathered. This data hides important patterns and correlations that are sometimes impossible to deduce manually, and where computing power and smart algorithms excel. They are also heavily dependent on the quantity and quality of the input data, and as such, evolve in their output and accuracy as more and better input data becomes available.

In this article we will walk through what constitues Supervised and Unsupervised Learning. An overview of the language and terms is presented, as well as the general workflow used for machine learning tasks.

Supervised Learning

In supervised machine learning we have a set of data points or observations for which we know the desired output, class, target variable or outcome. The outcome may take one of many values called classes or labels. A classic example is that given a few thousand emails for which we know whether they are spam or ham (their labels), the idea is to create a model that is able to deduce whether new, unsean emails are spam or not. In other words, we are creating a mapping function where the inputs are the email’s sender, subject, date, time, body, attachments and other attributes, and the output is a prediction as to whether the email is spam or ham. The target variable is in fact providing some level of supervision in that it is used by the learning algorithm to adjust parameters or make decisions that will allow it to predict labels for new data. Finally of note, when the algorithm is predicting labels of observations we call it a classifier. Some classifiers are also capable of providing a probability of a data point belonging to class in which case it is often referred to a probabilistic model or a regression - not to be confused with a statistical regression model.

Lets take this as an example in supervised learning algorithms. Given the following dataset, we want to predict on new emails whether they are spam or not. In the dataset below, note that the last column, Spam?, contains the labels for the examples.

Subject Date Time Body Spam?
I has the viagra for you 03/12/1992 12:23 pm Hi! I noticed that you are a software engineer
so here’s the pleasure you were looking for…
Important business 05/29/1995 01:24 pm Give me your account number and you’ll be rich. I’m totally serial Yes
Business Plan 05/23/1996 07:19 pm As per our conversation, here’s the business plan for our new venture Warm regards… No
Job Opportunity 02/29/1998 08:19 am Hi !I am trying to fill a position for a PHP … Yes
[A few thousand rows ommitted]
Call mom 05/23/2000 02:14 pm Call mom. She’s been trying to reach you for a few days now No

A common workflow approach, and one that I’ve taken for supervised learning analysis is shown in the diagram below:

The process is:

  1. Scale and prepare training data: First we build input vectors that are appropriate for feeding into our supervised learning algorithm.
  2. Create a training set and a validation set by randomly splitting the universe of data. The training set is the data that the classifier uses to learn how to classify the data, whereas the validation set is used to feed the already trained model in order to get an error rate (or other measures and techniques) that can help us identify the classifier’s performance and accuracy. Typically you will use more training data (maybe 80% of the entire universe) than validation data. Note that there is also cross-validation), but that is beyond the scope of this article.
  3. Train the model. We take the training data and we feed it into the algorithm. The end result is a model that has learned (hopefully) how to predict our outcome given new unknown data.
  4. Validation and tuning: After we’ve created a model, we want to test its accuracy. It is critical to do this on data that the model has not seen yet - otherwise you are cheating. This is why on step 2 we separated out a subset of the data that was not used for training. We are indeed testing our model’s generalization capabilities. It is very easy to learn every single combination of input vectors and their mappings to the output as observed on the training data, and we can achieve a very low error in doing that, but how does the very same rules or mappings perform on new data that may have different input to output mappings? If the classification error of the validation set is very big compared to the training set’s, then we have to go back and adjust model parameters. The model will have essentially memorized the answers seen in the training data, loosing its generalization capabilities. This is called overfitting, and there are various techniques for overcoming it.
  5. Validate the model’s performance. There are numerous techniques for achieving this, such as ROC analysis and many others. The model’s accuracy can be improved by changing its structure or the underlying training data. If the model’s performance is not satisfactory, change model parameters, inputs and or scaling, go to step 3 and try again.
  6. Use the model to classify new data. In production. Profit!

Unsupervised Learning

The kinds of problems that are suited for unsupervised algorithms may seem similar, but are very different to supervised learners. Instead of trying to predict a set of known classes, we are trying to identify the patterns inherent in the data that separate like observations in one way or another. Viewed from 20 thousand feet, the main difference is that we are not providing a target variable like we did in supervised learning.

This marks a fundamental difference in how both types of algorithms operate. On one hand, we have supervised algorithms which try to minimize the error in classifying observations, while unsupervised learning algorithms don’t have such luxuries because there are no outcomes or labels. Unsupervised algorithms try to create clusters of data that are inherently similar. In some cases we don’t necessarily know what makes them similar, but the algorithms are capable of finding these relationships between data points and group them in significant ways. While supervised algorithms aim to minimize the classification error, unsupervised algorithms aim to create groups or subsets of the data where data points belonging to a cluster are as similar to each other as possible, while making the difference between the clusters as high as possible.

Another main difference is that in a clustering problem, the concept of “Training Set” does not apply in the same way as with supervised learners. Typically we have a dataset that is used to find the relationships in the data that buckets them in different clusters. We could of course apply the same clustering model to new data, but unless it is too unpractical to do so (perhaps for performance reasons), we will most certainly want to rerun the algorithm on new data as it will typically find new relationships within the data that may surface up given the new observations.

As a simple example, you could imagine clustering customers by their demographics. The learning algorithm may help you discover distinct groups of customers by region, age ranges, gender and other attributes in such way that we can develop targeted marketing programs. Another example may be to cluster patients by their chronic diseases and comorbidities in such a way that targeted interventions can be developed to help manage their diseases and improve their lifestyles.

For unsupervised learning, the process is:

  1. Scale and prepare raw data: As with supervised learners, this step entails selecting features to feed into our algorithm, and scaling them to build a suitable data set.
  2. Build model: We run the unsupervised algorithm on the scaled dataset to get groups of like observations.
  3. Validate: After clustering the data, we need to verify whether it cleanly separated the data in significant ways. This includes calculating a set of statistics on the resulting clusters (such as the within group sum of squares), as well as analysis based on domain knowledge, where you may measure how certain attributes behave when aggregated by the clusters.
  4. Once we are satisfied with the clusters created there is no need to run the model with new data (although you can). Profit!

Step zero

A common step that I have not outlined above and should be performed when working on any such problem is to get a strong understanding for the characteristics of the data. This should be a combination of visual analysis (for which I prefer the excellent ggplot2 library) as well as some basic descriptive statistics and data profiling such as quartiles, means, standard deviation, frequencies and others. R’s Hmisc package has a great function for this purpose called describe.

I am convinced that not performing this step is a non starter for any datamining project. It will allow you to identify missing values, general distributions of data, early outlier detection, among many other characteristics that drive the selection of attributes for your models.

Wrapping up

This is certainly quite a bit of info, especially if these terms are new to you. To summarize:

Supervised Learning Unsupervised Learning
Objective Classify or predict a class. Find patterns inherent to the data, creating cluster of like data points. Dimensionality Reduction.
Example Implementations Neural Networks (Multilayer Perceptrons, RBF Networks and others, Support Vector Machines, Decision Trees (ID3, C4.5 and others), Naive Bayes Classifiers K-Means (and variants), Hierarchical Clustering, Kohonen Self Organizing Maps
Who’s the Boss? The target variable or outcome. The relationships inherent to the data.

Hopefuly this article shows the main differences between Unsupervised and Supervised Learning. On followup posts we will dig into some of the specific implementations of these algorithms with examples in R and Ruby