Machine Learning: Supervised Learning

Jonathan Ho
20 min read · Apr 11, 2024


Introduction

Imagine a world where computers learn from data, uncover hidden patterns, and make predictions without being explicitly programmed. Welcome to the realm of machine learning, where algorithms empower machines to mimic human-like intelligence and revolutionize industries across the globe. This article is part of my machine learning (ML) series, where I provide an overview of the premise underlying ML processes and algorithms!

In recent years, the field of ML has surged to the forefront of technological innovation, driving transformative advancements in areas ranging from healthcare and finance to entertainment and transportation. At its core, machine learning is the science of teaching computers to learn from data, iteratively improving their performance on specific tasks without being explicitly programmed. Data science helps, and occasionally supplants, fundamental science by learning, rather than discovering, the laws of nature. With its diverse range of methods and approaches, machine learning offers a powerful toolkit for extracting structure from big noisy data, making predictions, and automating decision-making processes. In this article, we will explore the fundamental methods and models in supervised learning.

The Machine Learning Workflow

The machine learning workflow encompasses several essential stages. A brief overview goes something like this:

· Data processing: relevant datasets are collected, cleaned, and transformed to ensure quality and compatibility.

· Data visualization and exploration: exploratory data analysis (EDA) and feature visualization give insight into the dataset’s distribution, correlations, and patterns.

· Feature engineering: arguably the most important step, it involves selecting, creating, and scaling features to enhance the model’s predictive power while reducing dimensionality and complexity.

· Model selection: choosing appropriate algorithms, evaluation metrics, and cross-validation techniques to assess and compare model performance.

· Optimization via a loss function: training the selected model, fine-tuning its parameters through hyperparameter tuning, and validating its performance on a separate dataset to ensure generalization.

· Deployment and monitoring: deploying the trained model into production environments, monitoring its performance over time, and updating it as needed to adapt to changing data distributions.

Supervised Learning

Firstly, let us define supervised learning. Supervised learning is a type of machine learning where the algorithm learns from labelled data, which means that each input data point is paired with a corresponding target label. The goal of supervised learning is to learn a mapping from inputs to outputs by minimizing the difference between the predicted output and the actual target label. We utilize historical datasets containing both input features and corresponding output labels. By feeding this data into a machine learning model, the model learns to make predictions by comparing its output with the known historical labels. Through an iterative process known as training, the model adjusts its internal parameters to minimize the discrepancy between its predictions and the actual outputs. This iterative refinement enables the model to generalize patterns and relationships from the training data, ultimately improving its predictive accuracy on unseen or future data instances. Common tasks in supervised learning include classification (where the output is a category or class label) and regression (where the output is a continuous value).

Note To The Reader

Machine Learning consists of the following prerequisites:

· A basic grasp of Python Programming, mainly NumPy, Pandas and Matplotlib

· Linear Algebra

· Statistics

· Probability Theory

· Basic Calculus

Lastly, do not let the mathematics in this article scare you. While it is vital to know the mathematics behind ML algorithms, learning the fundamental theory of such models and how they work is much more important. With a wide array of libraries such as SK-Learn available, implementation is often as easy as four lines of code. May this article provide you with the insights and fundamentals needed to implement your own ML models.
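To illustrate just how short that can be, here is a minimal sketch using SK-Learn; the data is synthetic and made up purely for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data for illustration: y = 4 + 3x plus Gaussian noise
X = np.random.rand(100, 1)
y = 4 + 3 * X[:, 0] + np.random.randn(100)

model = LinearRegression()      # 1. choose a model
model.fit(X, y)                 # 2. train it on the data
print(model.predict([[0.5]]))   # 3. predict on new, unseen input
```

Nearly every estimator in SK-Learn follows this same fit/predict pattern, which is why it features so heavily throughout this article.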

How To Train Models

There are two general ways to train models. We can use a closed-form equation that directly computes the model parameters that best fit the training set, i.e., the parameters that minimize the cost function over the training set. Alternatively, we can use an iterative optimization approach known as Gradient Descent, which gradually tweaks the model parameters to minimize the cost function over the training set, eventually converging to the same set of parameters as the first method. This article will focus on the latter training method because its dynamic and robust approach carries over to other, more complex models.

Here is a non-technical example I found useful. Suppose you are lost in the mountains in a dense fog and can only feel the slope of the ground below your feet. Your objective is to get to ground level, so at every step you consider which direction is steepest and then step downhill in that direction. Just like you feel for the steepest direction, the Gradient Descent algorithm measures the local gradient of the cost function with respect to the parameter values and moves in the direction of the descending gradient. This step repeats until the gradient is zero, at which point the algorithm signals that the cost function has reached a minimum.

Note that one of the most important parameters in Gradient Descent is the learning rate, which in our mountain example is the size of the step you take. If the learning rate is too small, the algorithm will need many iterations to converge, which takes a very long time, and it may get stuck in a local minimum because the steps are too small to escape it. If the learning rate is too large, you may jump right over the minimum and end up higher than you were before, never converging or settling for a sub-optimal solution. Note also that cost functions may have multiple local minima, which makes the choice of learning rate even more important. Gradient Descent variants include Batch Gradient Descent, Stochastic Gradient Descent and Mini-Batch Gradient Descent. Usually, early stopping is implemented when the validation error stops decreasing and starts to go back up, indicating that the model has started to overfit the data. With early stopping in place, training is stopped as soon as the validation error reaches a minimum.

Gradient Descent Algorithm
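To make the analogy concrete, here is a toy sketch of Gradient Descent minimizing a simple one-dimensional function; the function, learning rates and iteration count are made up for illustration.

```python
# Minimize f(theta) = (theta - 3)**2, whose gradient is 2 * (theta - 3)
def gradient(theta):
    return 2 * (theta - 3)

for eta in (0.01, 0.1, 1.1):             # too small, reasonable, too large
    theta = 0.0                          # arbitrary starting point
    for _ in range(50):
        theta -= eta * gradient(theta)   # step in the direction of the descending gradient
    print(f"learning rate {eta}: theta = {theta:.3f}")
# 0.01 is still far from the minimum after 50 steps, 0.1 lands near 3, 1.1 diverges
```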

Regression Analysis

We will first dive into regression algorithms, which are used to predict continuous values, starting with the simplest model: Linear Regression.

A linear model makes a prediction by simply computing a weighted sum of the input features plus a constant called the bias term. Training a model means setting its parameters so that the model best fits the training set.

Linear Regression Equation in Linear Algebra (Vectorized)

A measurement criterion is needed. The most common performance measure for regression is the Mean Square Error (MSE) or Root Mean Square Error (RMSE). Thus, to train a Linear Regression model, we find the parameter values that minimise the MSE. Gradient Descent updates the parameters step by step, decreasing the MSE until the algorithm converges to a minimum.

Minimizing Linear Regression Model using Gradient Descent
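Below is a minimal NumPy sketch of Batch Gradient Descent fitting a linear model by minimizing the MSE; the synthetic data, learning rate and iteration count are arbitrary choices for illustration.

```python
import numpy as np

# Synthetic linear data: y = 4 + 3x + noise
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]    # add a column of 1s for the bias term

eta = 0.1                            # learning rate
n_iterations = 1000
m = len(X_b)
theta = np.random.randn(2, 1)        # random initialization of the parameters

for _ in range(n_iterations):
    # Gradient of the MSE cost function with respect to theta
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)
    theta = theta - eta * gradients  # move down the slope

print(theta)   # should end up close to [[4.], [3.]]
```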

Simple, now you have a calibrated ML model that can predict values when new data is fed into it! However, Linear Regression assumes that the dataset is relatively linear. If we encounter a non-linear dataset, its training and testing performance will suffer. Thankfully, there is a solution: we can make the model more complex. One way is Polynomial Regression, which adds powers of the features (and interactions between them) so the model can capture non-linear relationships. We realise that the higher the degree of the polynomial, the better our model fits the training set, yet it may perform poorly on the testing set. Why? We have just run into the problem of overfitting. This is illustrated in the Bias-Variance Trade-Off shown below. A way to tackle this is to use regularization (Lasso, Ridge or Elastic Net), which constrains the model, leaving it fewer degrees of freedom to overfit the data.

Bias-Variance Trade-Off

The bias is the in-sample difference between the model and the target, which can, in principle, be made arbitrarily small by using a more complex model. Doing so, however, will drive up the model variance, or generalization error, when switching from training data to previously unseen data. An important characteristic of a model is its complexity, i.e., the number of predictors or degrees of freedom. As a function of complexity, the bias decreases while the variance increases, so the model error is minimized at some optimal complexity level that is data and model dependent. Thus, increasing a model’s complexity will typically increase its variance and reduce its bias. Conversely, reducing a model’s complexity increases its bias and reduces its variance.
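As a rough sketch of this trade-off, the snippet below compares cross-validated errors of polynomial models of increasing complexity; the synthetic data and the degrees chosen are arbitrary and purely for illustration.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic quadratic data with noise
np.random.seed(0)
X = 6 * np.random.rand(100, 1) - 3
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + np.random.randn(100)

for degree in (1, 2, 30):            # underfit, about right, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=5)
    print(degree, -scores.mean())    # validation MSE typically worsens again at degree 30
```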

Regularized Linear Models

Regularization To Tackle Overfitting

Ridge Regression, also known as L2 regularization, adds a regularization term (or penalty term) to the cost function. This forces the learning algorithm to not only fit the data but also keep the model weights as small as possible. It introduces a hyperparameter called alpha which controls how much you want to regularize the model. If alpha is zero, it behaves like plain Linear Regression; if alpha is very large, all weights end up very close to zero and the result is a flat line going through the data’s mean.

Lasso Regression (Least Absolute Shrinkage and Selection Operator Regression) is another regularized version of Linear Regression. It also adds a regularization term to the cost function, but it uses the L1 norm of the weight vector rather than the L2 norm used by Ridge.

The big difference between the L1 and L2 penalties is that the L1 penalty tends to completely eliminate the weights of the least important features, automatically performing feature selection and outputting a sparse model. The L2 penalty, on the other hand, shrinks the weights close to zero but does not eliminate them completely.

Elastic Net is a middle ground between Ridge and Lasso Regression. The regularization term is a simple mix of both the Ridge and Lasso regularization terms, controlled by a mix ratio r. When r = 0, Elastic Net is equivalent to Ridge Regression, and when r = 1, it is equivalent to Lasso Regression.

So, what should you use? Ridge is a good default, but if you suspect that only a few features are useful, prefer Lasso or Elastic Net, since they tend to reduce the useless features’ weights down to zero. In general, Elastic Net is preferred over Lasso, since Lasso may behave erratically when the number of features is greater than the number of training instances or when several features are strongly correlated.
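In SK-Learn, the three regularized models share the same interface; here is a brief sketch with synthetic data and arbitrary alpha values, intended only to show how they might be fitted and compared.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# Synthetic data in which only the first of three features matters
np.random.seed(1)
X = np.random.rand(50, 3)
y = 2 * X[:, 0] + 0.1 * np.random.randn(50)

ridge = Ridge(alpha=1.0).fit(X, y)                    # L2 penalty: shrinks all weights
lasso = Lasso(alpha=0.1).fit(X, y)                    # L1 penalty: can zero out weights
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)  # mix of L1 and L2 (l1_ratio plays the role of r)

print(ridge.coef_)   # small but nonzero weights for the useless features
print(lasso.coef_)   # useless features' weights driven to (or very near) zero
print(enet.coef_)
```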

Binary and Multi-Labelled Classification

Regression algorithms can be used for classification as well.

Logistic Regression is commonly used to estimate the probability that an instance belongs to a particular class by using a threshold. For instance, if the estimated probability is above the threshold, then the model predicts that the instance belongs to that class, otherwise, it predicts that it does not. Logistic Regression is a Binary Classifier.

Just like a Linear Regression model, Logistic Regression computes a weighted sum of the input features plus a bias term, but instead of outputting that value directly like Linear Regression, it outputs the logistic (sigmoid) of the value, which is where the name comes from.

Logistic Regression (Sigmoid Function)
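As a small sketch, the snippet below defines the sigmoid and fits a binary classifier on the Iris dataset that ships with SK-Learn; the choice of feature and example input are arbitrary and for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

def sigmoid(t):
    """Logistic (sigmoid) function: squashes any score into the range (0, 1)."""
    return 1 / (1 + np.exp(-t))

# Binary task for illustration: is the flower an Iris virginica?
iris = load_iris()
X = iris.data[:, 3:]                   # petal width only
y = (iris.target == 2).astype(int)

clf = LogisticRegression()
clf.fit(X, y)
print(clf.predict_proba([[1.7]]))      # estimated probabilities for each class
print(clf.predict([[1.7]]))            # class chosen with the default 0.5 threshold
```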

Performance Metrics include:

Confusion Matrix (for a Binary Classification problem) — a 2 x 2 matrix consisting of the following:

· True Positives (TP): the actual value is Positive and the prediction is also Positive.

· True Negatives (TN): the actual value is Negative and the prediction is also Negative.

· False Positives (FP): the actual value is Negative but the prediction is Positive. Also known as a Type 1 error.

· False Negatives (FN): the actual value is Positive but the prediction is Negative. Also known as a Type 2 error.

A perfect Binary Classifier would have only true positives and true negatives, so its confusion matrix would have nonzero values only on its main diagonal (see image below).

Confusion Matrix

The confusion matrix gives you a lot of information, but sometimes you may prefer a more concise metric. Other performance measures of a Binary Classifier include:

- Precision

- Recall (TPR, Sensitivity)

- F1-Score

Precision is typically used together with recall, also known as sensitivity or the true positive rate. It is often convenient to combine precision and recall into a single metric called the F1-Score, which is the harmonic mean of precision and recall. While the regular mean treats all values equally, the harmonic mean gives much more weight to low values; as a result, the classifier will only get a high F1-Score if both recall and precision are high. However, we must note the Precision-Recall Trade-Off.
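A quick sketch of computing these metrics with SK-Learn, using made-up label vectors purely for illustration:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Hypothetical true labels and model predictions
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))   # [[TN, FP], [FN, TP]]
print(precision_score(y_true, y_pred))    # TP / (TP + FP)
print(recall_score(y_true, y_pred))       # TP / (TP + FN)
print(f1_score(y_true, y_pred))           # harmonic mean of precision and recall
```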

Precision-Recall Trade-Off

Improving precision often comes at the expense of recall, and vice versa. For instance, a model with high precision tends to make fewer positive predictions but is more accurate when it does, while a model with high recall captures a larger portion of positive instances but may also produce more false positives. Therefore, finding an optimal threshold is important: lowering the threshold increases recall and reduces precision, and vice versa.

One way is to plot precision and recall as functions of the threshold value and find the point where the two curves intersect. The right choice will vary depending on your project.

ROC Curve

Another performance tool is the receiver operating characteristic (ROC) curve. It is very similar to plotting precision versus recall, but the ROC curve plots the true positive rate (recall) against the false positive rate (FPR). The FPR is the ratio of negative instances that are incorrectly classified as positive; it is equal to one minus the true negative rate, known as specificity.

ROC Curve

The curve illustrates the trade-off between sensitivity and specificity, showing how well the model discriminates between the positive and negative classes. A model with perfect discrimination would have an ROC curve that hugs the top-left corner, with high sensitivity and low false positive rate across all thresholds. The dotted line represents the ROC curve of a purely random classifier, a good classifier stays as far away from that line as possible. The area under the ROC curve (AUC-ROC) quantifies the overall performance of the model, where a higher AUC indicates better discrimination ability.
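Here is a hedged sketch of how the ROC curve and AUC can be obtained from a classifier’s predicted probabilities in SK-Learn; the synthetic dataset and model are arbitrary stand-ins.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score

# Synthetic binary classification data, for illustration only
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = LogisticRegression().fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]          # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores)
print("AUC:", roc_auc_score(y_test, scores))

plt.plot(fpr, tpr, label="Logistic Regression")
plt.plot([0, 1], [0, 1], "k--", label="Random classifier")   # the dotted diagonal
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate (Recall)")
plt.legend()
plt.show()
```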

Training Binary Classifiers

The objective of training a Logistic Regression model is to set the parameter vector so that the model estimates high probabilities for positive instances (labelled 1) and low probabilities for negative instances (labelled 0). Intuitively, the cost for a single instance is large if the model estimates a probability close to 0 for a positive instance, and very large if it estimates a probability close to 1 for a negative instance. This is because -log(t) grows very large as t approaches 0, and vice versa. The cost function, known as the log-loss, averages this cost over all training instances. The cost function is convex, and thus Gradient Descent is guaranteed to find the global minimum.

Like other linear models, Logistic Regression can be regularized using L1, L2 or Elastic Net penalties.

However, for multi-class classification, Logistic Regression can be generalized to what is known as SoftMax Regression. The idea is that, given an instance, SoftMax Regression first computes a score for each class and then estimates the probability of each class by applying the SoftMax function.

Once a score has been computed for every class, the estimated probability that the instance belongs to each class is obtained by passing the scores through the SoftMax function (refer to the figure above), which computes the exponential of every score and then normalizes them. The raw scores are often called logits.

The SoftMax Regression classifier predicts the class with the highest estimated probability. The cost function that is minimized, the Cross Entropy, penalizes the model when it estimates a low probability for the target class; it is frequently used to measure how well a set of estimated class probabilities matches the target classes. Cross Entropy stems from Information Theory, where it measures the average number of bits you actually send per option when your coding scheme is based on assumed rather than true probabilities. If your assumed probabilities are exactly right, the Cross Entropy equals the entropy of the true distribution itself; otherwise, the gap between the two is measured by the Kullback-Leibler divergence.
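To make the SoftMax and cross-entropy ideas concrete, here is a small NumPy sketch; the scores and the one-hot target are made up purely for illustration.

```python
import numpy as np

def softmax(scores):
    """Turn raw class scores (logits) into probabilities that sum to 1."""
    exps = np.exp(scores - scores.max())   # subtract the max for numerical stability
    return exps / exps.sum()

def cross_entropy(p_true, p_pred):
    """Cross entropy between target and predicted distributions (in nats; use np.log2 for bits)."""
    return -np.sum(p_true * np.log(p_pred))

scores = np.array([2.0, 1.0, 0.1])     # hypothetical logits for three classes
p_pred = softmax(scores)
p_true = np.array([1.0, 0.0, 0.0])     # one-hot target: the true class is class 0

print(p_pred)                          # roughly [0.66, 0.24, 0.10]
print(cross_entropy(p_true, p_pred))   # small when the target class receives high probability
```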

Algorithms for Regression and Classification Analysis

Decision Trees

We have talked about algorithms specific to regression or classification; however, there are versatile ML algorithms that can perform both tasks. One of them is the Classification and Regression Trees (CART) algorithm.

In classification tasks, decision trees partition the feature space into regions and assign a class label to each region. The tree structure consists of nodes, where each node represents a decision based on a feature value, and branches, which represent the possible outcomes of the decision. At each node, the decision tree algorithm selects the feature that best splits the data into the purest subsets according to a metric such as Gini impurity or entropy. This process continues recursively until the tree reaches a stopping criterion, such as a maximum depth or minimum number of samples per leaf. To make predictions, a new data point traverses the decision tree from the root node to a leaf node, following the decisions at each node based on its feature values, and is assigned the majority class label of the training instances in that leaf node.

Decision Tree Overview

The performance metric known as Gini impurity is a measure of impurity or randomness used in decision trees, primarily for classification tasks. It quantifies the degree of inequality or disorder in a dataset by calculating the probability of misclassifying a randomly chosen element if it were randomly labelled according to the distribution of class labels in the data. The Gini impurity takes on a value between 0 and 1, where 0 indicates perfect purity (all elements belong to the same class) and higher values indicate greater impurity (elements are spread more evenly among different classes). Decision trees use the Gini impurity as a criterion to determine the optimal splits during the tree-building process. At each node, the algorithm selects the feature and threshold that minimize the weighted sum of the Gini impurities of the resulting child nodes, aiming to maximize the purity of the subsets generated by the split, weighted by their size. It keeps splitting the subsets using this logic until it cannot find a split that reduces impurity or until it reaches its maximum depth.

Impurity Criterion {Gini Index and Entropy}
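As a rough sketch of these impurity measures, the functions below compute them from class counts; the counts are made up for illustration.

```python
import numpy as np

def gini(class_counts):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    p = np.asarray(class_counts) / np.sum(class_counts)
    return 1.0 - np.sum(p ** 2)

def entropy(class_counts):
    """Entropy in bits: -sum(p * log2(p)), ignoring empty classes."""
    p = np.asarray(class_counts) / np.sum(class_counts)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(gini([50, 0]))       # 0.0 -> perfectly pure node
print(gini([25, 25]))      # 0.5 -> maximally impure split between two classes
print(entropy([25, 25]))   # 1.0 bit
```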

In regression tasks, decision trees predict a continuous target variable instead of discrete class labels. Similar to classification, the decision tree algorithm recursively partitions the feature space based on feature values, but instead of maximizing class purity, it minimizes the variance of the target variable within each partition. At each node, the algorithm selects the feature and corresponding threshold that minimizes the variance of the target variable across the resulting subsets. Performance metrics include MSE or RMSE. The process continues until a stopping criterion is met, such as a maximum depth or minimum number of samples per leaf. To make predictions, a new data point traverses the decision tree from the root node to a leaf node, following the decisions at each node based on its feature values, and is assigned the average target value of the training instances in the leaf node.

Both regression and classification trees work through a series of if-else decisions, choosing each split to minimize the relevant performance metric.

A big advantage of decision trees is that they are intuitive and their decisions are easy to interpret; such models are known as white box models. This is unlike Ensemble Methods or Neural Networks, which are considered black box models: they make great predictions, but it is difficult to explain why those predictions were made.

Another advantage is the insight they give into feature importance. Decision trees can not only be visualized to inspect the decision path for a given prediction, but can also summarize the contribution of each feature to the rules learned by the model. Feature importance captures how much the splits produced by each feature help optimize the metric used to evaluate split quality (e.g., Gini impurity). A feature’s importance is computed as the (normalized) total reduction of this metric and takes into account the number of samples affected by each split. Hence, features used earlier in the tree, where the nodes tend to contain more samples, are typically considered of higher importance.

However, CART is prone to overfitting because it is a greedy algorithm: it searches for the optimum split at the top level, then repeats the process at each subsequent level, without checking whether a split will lead to the lowest possible impurity several levels down. The tree structure therefore adapts itself to the training data, fitting it very closely, which leads to overly complex trees reflected in a large number of leaf nodes, i.e., a very fine partitioning of the feature space. The high variance of decision trees is tied to this ability to closely adapt to a training set: minor variations in the data can produce wide swings in the tree’s structure and, consequently, the model’s predictions.

Like Linear Regression, we can implement a form of regularization by constraining the tree’s hyperparameters: we can restrict the maximum depth of the tree, the maximum number of leaf nodes, and the minimum number of samples per leaf. Removing unnecessary nodes after training is known as pruning the tree. Statistical methods such as the Chi-Squared test can be used to estimate the probability that a node’s improvement is purely the result of chance; if the improvement is not statistically significant, the node is considered unnecessary and its children are deleted. Another way to overcome the high variance of CART is to use an ensemble of randomized decision trees that have low bias and produce uncorrelated prediction errors.
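In SK-Learn this amounts to setting a few hyperparameters on the tree; the values below are arbitrary and chosen only to illustrate the idea, using the bundled Iris dataset.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Restricting depth and leaf size constrains the tree and curbs overfitting
tree = DecisionTreeClassifier(
    max_depth=3,            # maximum depth of the tree
    max_leaf_nodes=8,       # maximum number of leaf nodes
    min_samples_leaf=5,     # minimum number of samples required at a leaf
    random_state=42,
)
tree.fit(iris.data, iris.target)

print(tree.get_depth(), tree.get_n_leaves())
print(tree.feature_importances_)   # normalized contribution of each feature to the splits
```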

CART is also sensitive to unbalanced class weights and may produce biased trees. Some practitioners therefore oversample the underrepresented classes or undersample the more frequent ones, though it may be better to use class weights and adjust the objective function directly.

One way to diagnose whether the training set size is sufficient is with learning curves. A learning curve is a useful tool that displays how the validation and training scores evolve as the number of training samples increases. The purpose of the learning curve is to find out whether, and by how much, the model would benefit from more data during training. It also helps to diagnose whether the model’s generalization error is more likely driven by bias or variance. For instance, if the training score meets performance expectations and the validation score improves significantly as the training sample grows, training over a longer lookback period or obtaining more data might add value. If, on the other hand, both the validation and training scores converge to a similarly poor value despite an increasing training set size, the error is more likely due to bias and additional training data is unlikely to help.
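SK-Learn provides a helper for this diagnostic; the sketch below plots a learning curve on synthetic data, with the model and train sizes chosen arbitrarily for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, for illustration only
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

train_sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(max_depth=5, random_state=0),
    X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 8),
)

plt.plot(train_sizes, train_scores.mean(axis=1), label="Training score")
plt.plot(train_sizes, val_scores.mean(axis=1), label="Validation score")
plt.xlabel("Number of training samples")
plt.ylabel("Accuracy")
plt.legend()
plt.show()
```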

Support Vector Machines

Another very powerful and versatile tool, capable of performing both linear and non-linear classification and regression, is the Support Vector Machine (SVM). SVMs excel in scenarios where there is a clear margin of separation between different classes in the data. In classification, SVMs aim to find the hyperplane that best separates the classes in the feature space, maximizing the margin, which is the distance between the hyperplane and the nearest data points (support vectors) from each class. Think of an SVM classifier as fitting the widest possible “street” between the classes. This is known as large margin classification. This hyperplane effectively divides the feature space into regions, one for each class.

Support Vector Machine Overview

If we strictly impose that all instances must be off the street and on the correct side, this is known as hard margin classification. However, it comes with limitations: it only works if the data is linearly separable and, like CART, it is very sensitive to outliers. A way to avoid these issues is to find a good balance between keeping the street as wide as possible and limiting margin violations, i.e., instances that end up in the middle of the street or on the wrong side. This is known as soft margin classification.

An advantage of SVMs is that they can handle non-linear datasets. One way is to add more features, such as polynomial features, until the data becomes linearly separable. To deal with even more complex datasets, you can use kernels. Kernels are functions used to transform input data into higher-dimensional spaces, where it may be easier to separate classes or capture nonlinear relationships between features. Kernels come in various types, including linear, polynomial, radial basis function (RBF), and sigmoid, each serving different purposes in transforming the input data. Linear kernels compute the dot product of input vectors, while polynomial kernels raise the dot product to a certain power, enabling SVMs to capture nonlinear relationships. RBF kernels transform input vectors into an infinite-dimensional space based on the Euclidean distance between them, useful for capturing complex relationships. Sigmoid kernels, similar to RBF kernels, use hyperbolic tangent functions and can be effective for learning non-linear decision boundaries. Kernels are essential for nonlinear mapping, enabling SVMs to capture complex relationships, and for efficient computation, allowing SVMs to work with data in higher-dimensional spaces without explicitly calculating the transformations.

With so many kernels to choose from, which one should you use? A general guideline is to start with the linear kernel first, especially if the training set is very large or has plenty of features; otherwise, a Gaussian RBF kernel should do just as well. There is no hard rule here.
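A brief sketch of fitting SVM classifiers with different kernels in SK-Learn, scaling the features first since SVMs are sensitive to feature scales; the toy dataset and hyperparameters are arbitrary choices for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC, SVC

# Non-linearly separable toy data
X, y = make_moons(n_samples=500, noise=0.15, random_state=42)

linear_clf = make_pipeline(StandardScaler(), LinearSVC(C=1.0))
poly_clf = make_pipeline(StandardScaler(), SVC(kernel="poly", degree=3, C=5.0))
rbf_clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma=0.5, C=1.0))

for name, clf in [("linear", linear_clf), ("poly", poly_clf), ("rbf", rbf_clf)]:
    clf.fit(X, y)
    print(name, clf.score(X, y))   # training accuracy, for a rough comparison only
```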

For regression tasks, SVMs aim to find the hyperplane that best fits the data points while minimizing the error. The objective is essentially reversed from classification: instead of finding the widest possible street that separates the classes, SVM regression tries to fit as many instances as possible on the street while limiting margin violations.

The math behind SVM is rather complex, as training involves solving a convex quadratic optimization problem with linear constraints to find the margins. Thankfully, libraries in Python such as SK-Learn handle that for you in a single line of code. If you are interested in the math, you can find it in my Google Drive notes under Note3 — Kernel Methods.

Conclusion

This article delved into the foundational concepts of supervised machine learning. We have covered essential aspects of machine learning that play a crucial role in model development and performance evaluation, and concepts such as Gradient Descent and the Bias-Variance Trade-Off are among the most foundational in the field and crucial to understand. While we have covered the modelling process in depth, it is important to recognize that this is only the beginning and an introduction to Machine Learning. Concepts like unsupervised learning, ensemble methods, and deep learning offer vast opportunities for further exploration and discovery. Stay tuned as I delve deeper into these topics. Thank you for joining me on this journey, and I hope you found this article as engaging and insightful to read as it was for me to write.

References

[1] Aurélien Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems.

Written by Jonathan Ho

A 20 year old who is serving National Service, passionate about Quantitative Finance, Systematic Trading and Machine Learning.
