Machine Learning: Unsupervised Learning

Jonathan Ho
9 min read · Apr 12, 2024


Photo by Billy Huynh on Unsplash

Data holds secrets waiting to be discovered. In contrast to the structured guidance of supervised learning, unsupervised learning algorithms aim to uncover hidden structures and relationships within data without the need for labelled examples: there are no predefined target variables to guide the learning process. Working with datasets that consist only of input features, unsupervised learning algorithms autonomously identify patterns, structures, or relationships within the data. Common tasks include clustering, where data points are grouped into clusters based on similarity, and dimensionality reduction, which aims to reduce the number of features while preserving the essential information. Without explicit guidance, these algorithms explore the inherent structure of the data and organize complex datasets into meaningful representations. Through this process, unsupervised learning enables data exploration, anomaly detection, and the discovery of valuable patterns that may not be apparent through manual inspection, equipping us to extract meaningful insights and make sense of complex datasets.

Welcome to the second article of my Machine Learning series. In this article, we will cover common algorithms in unsupervised learning. I highly recommend reading my first article on Machine Learning: Supervised Learning where I covered supervised learning and other key concepts. Without further ado, let us delve into the realm of unsupervised learning.

Dimensionality Reduction

One of the key concepts in unsupervised learning is reducing dimensionality. Principal Component Analysis (PCA) is by far the most popular dimensionality reduction algorithm. But why is it important? It helps tackle the curse of dimensionality, illustrated below.

The Curse of Dimensionality

As we start to use high-dimensional datasets, the available data becomes increasingly sparse, meaning that data points are likely to lie far away from one another. This can lead to difficulties in accurately estimating statistical quantities, such as densities or distances between data points. Moreover, the computational complexity of many algorithms grows rapidly with the number of dimensions, and predictions become less reliable since they are based on much larger extrapolations. This results in longer training times and higher memory requirements, and can render traditional methods impractical or infeasible for high-dimensional datasets.

Another consequence of the curse of dimensionality is the increased risk of overfitting. With a large number of dimensions, models become more prone to capturing noise or random fluctuations in the data rather than genuine patterns. This is particularly problematic when the number of features exceeds the number of observations in the dataset. Furthermore, visualizing high-dimensional data becomes increasingly challenging as the number of dimensions grows. While techniques like dimensionality reduction can help reduce the number of dimensions for visualization purposes, they may also lead to loss of information.
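To make the sparsity argument concrete, here is a small sketch of my own (not from any particular reference) that samples random points in a unit hypercube and measures their pairwise distances as the number of dimensions grows: the points drift farther apart, while their distances become almost indistinguishable from one another.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(42)

for d in (2, 10, 100, 1000):
    X = rng.random((500, d))     # 500 random points in the unit hypercube [0, 1]^d
    dists = pdist(X)             # all pairwise Euclidean distances
    print(f"d={d:5d}  mean distance={dists.mean():7.3f}  "
          f"relative spread={dists.std() / dists.mean():.3f}")
```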

Principal Component Analysis

PCA first identifies the hyperplane that lies closest to the data and then projects the data onto it. By choosing the right hyperplane, we project the training set onto a lower-dimensional hyperplane that preserves the maximum variance, so that it loses the least amount of information compared to other projections. Another way to justify this choice is that this axis minimizes the mean squared distance between the original data set and its projection onto that axis.

Principal Component Analysis by Casey Cheng | Towards Data Science

As illustrated in the image above, PCA identifies the axis that accounts for the largest amount of variance in the training set. It also finds a second axis, orthogonal to the first one, that accounts for the largest amount of the remaining variance. For higher-dimensional data sets, PCA would find a third axis, orthogonal to both the first and second axes, then a fourth, a fifth, and so on, as many axes as there are dimensions in the data set.

The technique to find the principal components of a training set is to use a standard matrix factorization called Singular Value Decomposition (SVD), which decomposes the training set matrix into the matrix multiplication of three matrices. The matrix V contains the unit vectors that define the principal components.

Singular Value Decomposition: X = U Σ Vᵀ
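As a rough illustration of how this works in practice, here is a minimal NumPy sketch of PCA via SVD on a made-up dataset; the variable names and the choice of keeping two components are mine, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # toy 5-D training set

X_centered = X - X.mean(axis=0)        # PCA assumes the data is centered
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

# The rows of Vt (i.e. the columns of V) are the principal component directions
W2 = Vt[:2].T                          # keep the first d = 2 components
X2D = X_centered @ W2                  # project onto the 2-D hyperplane

explained_variance_ratio = s**2 / np.sum(s**2)
print(explained_variance_ratio)        # share of variance captured by each component
```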

Once all principal components have been identified, the dimensionality of the data set can be reduced down to d dimensions by projecting it onto the hyperplane defined by the first d principal components. One way to choose the right number of dimensions is to plot the cumulative explained variance as a function of the number of dimensions and, using the elbow method, pick the point at which the explained variance stops increasing quickly. In my Supervised Learning article, I touched on kernels: a mapping to a high-dimensional space that enables non-linear classification and regression with Support Vector Machines, where a linear decision boundary in the high-dimensional space corresponds to a complex non-linear decision boundary in the original space. The same trick can be applied to PCA, where non-linear projections are used for dimensionality reduction of complex data sets (Kernel PCA).
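Below is a hedged scikit-learn sketch of both ideas, using the bundled digits dataset purely as an example; the 95% variance threshold and the RBF gamma value are arbitrary choices of mine for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA, KernelPCA

X, _ = load_digits(return_X_y=True)            # 64-dimensional toy dataset

pca = PCA().fit(X)                             # keep all components for now
cumulative = np.cumsum(pca.explained_variance_ratio_)
d = int(np.argmax(cumulative >= 0.95)) + 1     # smallest d preserving 95% of the variance
print(f"{d} of {X.shape[1]} dimensions preserve 95% of the variance")

X_reduced = PCA(n_components=d).fit_transform(X)

# Kernel PCA: the same idea with a non-linear (here RBF) projection
X_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.03).fit_transform(X)
```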

Other dimensionality reduction techniques include Manifold Learning, t-SNE, and Linear Discriminant Analysis, which will not be covered in this article as there is already a lot of information to digest. More often than not, these dimensionality reduction techniques are a preparation step for a supervised learning task.

Clustering

Clustering is the task of identifying similar instances and grouping them into clusters. Just like in classification, each instance gets assigned to a group, but instead of learning from labelled data, clustering algorithms find patterns in the data set and make use of all the features to identify the clusters.

Different algorithms capture different kinds of clusters. Some look for instances centered around a particular point called a centroid. Others look for continuous regions of densely packed instances, and some are hierarchical, looking for clusters within clusters.

K-Means Clustering

Suppose you were given the centroids: you could easily label all the instances in the data set by assigning each of them to the cluster whose centroid is closest, usually measured with the Euclidean distance. Conversely, if you were given all the instance labels, you could easily locate the centroids by computing the mean of the instances in each cluster. K-Means starts by placing the centroids randomly, typically by picking k instances at random and using their locations as the initial centroids. It then alternates between two steps: an assignment step, where each instance is assigned to the cluster of its closest centroid, and an update step, where each centroid is moved to the mean of all the instances assigned to its cluster. These two steps are repeated until the centroids stop moving, which is the convergence criterion. The algorithm is guaranteed to converge, though it may converge to a local optimum rather than the global one, depending on the centroid initialization.
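The following is a bare-bones NumPy sketch of that assign/update loop; it glosses over practical details such as empty clusters and the smarter k-means++ initialization that libraries like scikit-learn use by default.

```python
import numpy as np
from sklearn.datasets import make_blobs

def k_means(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: pick k instances at random and use their locations as centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: label each instance with its closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of the instances assigned to it
        # (assumes no cluster ends up empty, which real implementations must handle)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Convergence: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

X, _ = make_blobs(n_samples=500, centers=3, random_state=42)   # synthetic toy data
centroids, labels = k_means(X, k=3)
```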

The choice of the number of clusters, k, is a critical parameter in K-Means clustering and can greatly impact the results. To find a suitable number of clusters, we plot the inertia (the sum of squared distances between each instance and its closest centroid) as a function of k. The more clusters there are, the closer each instance will be to its closest centroid and, thus, the lower the inertia will be.

Selecting the number of clusters using the Elbow Rule

As illustrated above, using the elbow rule (similar to the one used for the explained variance plot in PCA), we pick the value of k at which the curve bends most sharply: any lower value and the inertia rises dramatically, while any higher value does not help much and we might be splitting clusters for no good reason. Since this is rather subjective, a more precise approach is to pick the k with the highest Silhouette Score, which is the mean Silhouette Coefficient over all the instances.
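A quick scikit-learn sketch of both diagnostics on synthetic data (the blob dataset and the range of k values are arbitrary choices for illustration): the inertia keeps falling as k grows, so we look for the elbow, while the silhouette score typically peaks at a sensible k.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(f"k={k}  inertia={km.inertia_:9.1f}  "
          f"silhouette={silhouette_score(X, km.labels_):.3f}")
```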

Other clustering algorithms include Agglomerative Clustering, Birch, Mean-Shift and Affinity Propagation.

Gaussian Mixtures

A Gaussian Mixture Model (GMM) is a probabilistic model that assumes the instances were generated from a mixture of several Gaussian distributions whose parameters are unknown. GMMs are particularly useful for modeling complex data distributions that may contain multiple overlapping clusters.

Gaussian Mixture Models Explained by Oscar Contreras Carrasco | Towards Data Science

The first step in training a GMM involves initializing the parameters of the model, including the means, covariances, and mixture coefficients. These parameters are typically initialized randomly or using techniques like K-means clustering.

The main algorithm used to train GMMs is the Expectation-Maximization (EM) algorithm, which iteratively maximizes the likelihood of the observed data with respect to the model parameters. The generative story behind a GMM is the following: for each instance, a cluster is picked at random among the k clusters, where the probability of choosing the jth cluster is given by that cluster's weight; if the jth cluster is chosen, the location of the instance is then sampled from the Gaussian distribution with that cluster's mean and covariance matrix. Training can be broken down into two steps: the Expectation (E) step and the Maximization (M) step.
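Before turning to those two steps, here is a tiny NumPy sketch of the generative story just described; the weights, means, and covariance matrices below are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters of a 2-component, 2-D mixture (made up for illustration)
weights = np.array([0.3, 0.7])                   # probability of picking each cluster
means = np.array([[0.0, 0.0], [5.0, 5.0]])
covariances = np.array([np.eye(2), [[2.0, 0.5], [0.5, 1.0]]])

n_samples = 1000
z = rng.choice(len(weights), size=n_samples, p=weights)   # pick a cluster per instance
X = np.array([rng.multivariate_normal(means[j], covariances[j]) for j in z])
```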

In the E-step, the algorithm computes the probability that each data point belongs to each Gaussian component, given the current parameter estimates. This is done using Bayes' theorem and is represented by the responsibility of each component for each data point. In the M-step, the algorithm updates the parameters of the Gaussian components to maximize the likelihood of the observed data: the means, covariances, and mixture coefficients are re-estimated based on the responsibilities computed in the E-step. The E-step and M-step are repeated iteratively until convergence, when the change in the log-likelihood of the data between iterations falls below a predefined threshold.

In the context of clustering, we can think of the EM algorithm as a generalization of K-Means that finds not only the cluster centers, but also their size, shape, and orientation, together with their respective weights. Unlike K-Means, EM uses soft cluster assignments rather than hard assignments: during the E-step, the algorithm estimates the probability that each instance belongs to each cluster (based on the current cluster parameters), and during the M-step, each cluster is updated using all the instances in the data set, with each instance weighted by the estimated probability that it belongs to that cluster.
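In practice, we rarely implement EM by hand. A minimal scikit-learn sketch on synthetic data of my own choosing looks something like this; predict_proba exposes the soft assignments discussed above.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=1000, centers=3, cluster_std=[1.0, 2.5, 0.5],
                  random_state=42)

gm = GaussianMixture(n_components=3, n_init=10, random_state=42)
gm.fit(X)                       # runs EM until the gain in log-likelihood falls below a threshold

print(gm.weights_)              # mixture coefficients (cluster weights)
print(gm.means_)                # cluster centers
hard = gm.predict(X)            # hard assignments: the most likely component per instance
soft = gm.predict_proba(X)      # soft assignments: probability of each component per instance
```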

Once the GMM is trained, it can be used to assign cluster labels to new data points by computing the probability that each point belongs to each Gaussian component. Additionally, the model’s performance can be evaluated using measures such as the log-likelihood of the data or through visualization of the data distribution.

Uses of GMMs include anomaly detection: once a density threshold is defined, any instance located in a low-density region can be considered an anomaly. Unlike with K-Means, we cannot use inertia or silhouette scores to select the appropriate number of clusters, as they are not reliable when the clusters are not spherical or have different sizes. Instead, a theoretical information criterion such as the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC) can be used: we pick the number of components that minimizes the BIC or the AIC. Essentially, both the BIC and the AIC penalize models that have more parameters to learn and reward models that fit the data well. Models selected by the BIC tend to be simpler, with fewer parameters, but tend not to fit the data quite as well as models selected by the AIC.
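Here is a hedged sketch of both uses with scikit-learn; the 2% density threshold and the range of candidate component counts are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=1000, centers=3, random_state=42)

# Anomaly detection: flag instances that sit in low-density regions
gm = GaussianMixture(n_components=3, n_init=10, random_state=42).fit(X)
log_densities = gm.score_samples(X)              # log density of the model at each instance
threshold = np.percentile(log_densities, 2)      # e.g. treat the lowest 2% as anomalies
anomalies = X[log_densities < threshold]

# Model selection: pick the number of components that minimizes the BIC (or AIC)
bics = [GaussianMixture(n_components=k, n_init=10, random_state=42).fit(X).bic(X)
        for k in range(1, 9)]
best_k = int(np.argmin(bics)) + 1
print(f"number of components minimizing the BIC: {best_k}")
```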

Conclusion

Unsupervised learning opens doors to uncovering hidden patterns, structures, and insights within unlabeled data, empowering us to make sense of complex datasets without explicit guidance. In this article, we've explored powerful techniques such as Principal Component Analysis (PCA), K-Means clustering, and Gaussian Mixture Models (GMMs), each offering unique perspectives and capabilities for extracting meaningful information from raw data. From dimensionality reduction with PCA to clustering and density estimation with K-Means and GMMs, these algorithms have proven invaluable across a myriad of domains.
