Tuesday, 23 July 2013

Algorithms 101

Bright North have started to run sessions on Fridays, the idea being for the team to learn more about what people in other roles are doing. In that context, I gave a brief presentation about algorithms, with just enough information for people to better understand what data scientists are talking about when they discuss algorithms!

To do so, I decided to explain how decision trees are built and used to make predictions. Decision trees are a good family of algorithms to use to illustrate some general principles because:

  • They are widely used
  • They are based on simple principles
  • They yield easy-to-interpret results

The first concept discussed was the difference between regression and classification:

  • In regression, we are trying to predict a numerical value (e.g. income)
  • In classification, we are trying to predict to which class an observation belongs (e.g. spam/not spam)

Decision trees can be used for both regression and classification!
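To make the distinction concrete, here is a tiny sketch with made-up data (the feature names and values are purely illustrative, not from the talk): the same kind of observations can feed either task, depending on what the target is.

```python
# Hypothetical email features: each dict is one observation.
emails = [
    {"num_links": 12, "has_attachment": True},
    {"num_links": 0,  "has_attachment": False},
]

# Classification target: a class label per observation.
spam_labels = ["spam", "not spam"]

# Regression target: a numerical value per observation
# (e.g. predicting a person's yearly income from other variables).
incomes = [28_000.0, 52_000.0]
```

The algorithm is trained to map the features to the target; only the type of target changes between the two tasks.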

We then discussed the need for a training dataset and a test dataset:

  • The training dataset contains data for all the observed variables including the one we want to predict. We will need it so that our algorithm can “learn” how to make predictions
  • The test dataset also contains data for all the variables, but we hide the data for the variable we want to predict. We will use it to assess how good a job our prediction algorithm is doing
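A minimal sketch of such a split, using a made-up dataset of (features, target) pairs (the 80/20 ratio and the data are illustrative assumptions):

```python
import random

# Hypothetical dataset: each row is (features, target),
# where the target is the variable we want to predict.
data = [((x, x % 3), x > 5) for x in range(10)]

random.seed(42)  # fixed seed so the split is reproducible
random.shuffle(data)

split = int(0.8 * len(data))      # 80% for training, 20% for testing
train, test = data[:split], data[split:]

# The algorithm "learns" from the training targets; the test targets
# are held back and only used to score the final predictions.
```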

Based on those first elements, we then moved on to the main topic: how to build a decision tree (induction, or learning). This is where we take the training dataset and apply the chosen algorithm to build a model, which we will then use to make predictions. To keep things as simple as possible, we only introduced the main ideas of a generic tree induction algorithm:

  • Starting from a top node, we consider every variable and every possible split point
  • We estimate the quality of those splits (e.g. misclassification error, or purity measures such as entropy or the Gini index)
  • We pick the best possible split and then repeat the operation, adding a new level to the tree
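The split-search step above can be sketched in a few lines. This is a simplified illustration using the Gini index as the quality measure (the function names and the toy data are my own, not from the talk):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 for a pure node, higher for mixed ones."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Try every variable and every split point; keep the one giving
    the lowest weighted Gini impurity over the two child nodes."""
    n = len(rows)
    best = (None, None, float("inf"))  # (variable index, threshold, score)
    for var in range(len(rows[0])):
        for threshold in sorted({r[var] for r in rows}):
            left  = [l for r, l in zip(rows, labels) if r[var] <  threshold]
            right = [l for r, l in zip(rows, labels) if r[var] >= threshold]
            if not left or not right:
                continue  # skip splits that leave one side empty
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (var, threshold, score)
    return best

# Toy data: one variable, two classes that separate cleanly at 10.0.
rows = [(1.0,), (2.0,), (10.0,), (11.0,)]
labels = ["a", "a", "b", "b"]
var, threshold, score = best_split(rows, labels)  # → (0, 10.0, 0.0)
```

A real implementation would then recurse into each child node with the rows that fell on that side of the split.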

The last point discussed was how to decide when to stop growing a tree and why it is important not to “learn” too much from the training set. This allowed us to briefly discuss the concept of overfitting. The idea is to build a model general enough to do a good prediction job on unseen data (which is where our test set comes in!).
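Stopping rules are one common way to keep a tree from memorising the training set. A minimal sketch (the thresholds and names here are illustrative assumptions, not what any particular library uses):

```python
# Without stopping rules, the tree keeps splitting until every leaf
# is pure, i.e. it memorises the training data and overfits.
MAX_DEPTH = 3     # hypothetical limit on tree depth
MIN_SAMPLES = 5   # hypothetical minimum node size before splitting

def should_stop(depth, labels):
    """Stop growing this branch of the tree?"""
    return (depth >= MAX_DEPTH
            or len(labels) < MIN_SAMPLES
            or len(set(labels)) == 1)   # node is already pure
```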

We finished with a quick live demo in R using CART to further highlight some of the points discussed above.

Iris Species Classification

A lot of concepts were introduced in half an hour! But all in all, the initial objective was met: Bright Sparks not working in the data team will now have an idea of what we are talking about when we discuss a “training set”, “learning” or “overfitting”!