Techniques for Dimensionality Reduction

A guide for employing dimensionality reduction to improve an AI model’s efficacy

Matthew Connelly
Towards Data Science



Currently, we're on the edge of a wonderful revolution: Artificial Intelligence. In addition, the recent 'Big Bang' in large datasets across companies, organisations, and government departments has driven a large uptake of data mining techniques. So, what is data mining? Simply put, it's the process of discovering trends and insights in high-dimensionality datasets (those with thousands of columns). On the one hand, high-dimensionality datasets have enabled organisations to solve complex, real-world problems, such as reducing cancer patients' waiting times, predicting protein structures associated with COVID-19, and analysing MEG brain imaging scans. On the other hand, large datasets can sometimes contain columns with poor-quality data, which can lower the performance of a model: more isn't always better.

One way to preserve the structure of high-dimensional data in a low-dimensional space is to use a dimensionality reduction technique. So, what's the benefit of this? The answer is three-fold: first, it improves model accuracy by removing misleading data; second, the model trains faster because it has fewer dimensions; and finally, it makes the model simpler for researchers to interpret. There are three main categories of dimensionality reduction technique: (1) feature elimination and extraction, (2) linear algebra, and (3) manifold learning. Over the course of this article, we'll look at a strategy for implementing dimensionality reduction in your AI workflow, explore the different dimensionality reduction techniques, and work through a dimensionality reduction example.

How the best break away from the rest: a new strategy for AI modelling

The generation of simple experimentation with analytics is over, and most organisations know it. Therefore, companies should place advanced analytics at the heart of their organisation, providing the information and insights required to build an effective, efficient, and successful company. Furthermore, companies should strive to become insight-driven organisations, supporting an informed and capable workforce. Easy to say, difficult to master.

Currently, most organisations operate large business and operational reporting functions, which supply traditional, periodic reporting across the company. However, companies typically face challenges such as maintaining data quality, a single version of the truth, and consistency of assumptions. It's therefore vital that organisations address these challenges before attempting to implement a large-scale artificial intelligence function. In the short term, they could consider producing proofs of concept to create excitement amongst senior stakeholders about the benefits of advanced analytics, which would help the Chief Data Officer (CDO) push for greater funding to improve data literacy and quality across the organisation.

For the purpose of this article, we're going to assume that you have acceptable data quality for undertaking the more complex analytics techniques. There are three key stages an analyst should work through to produce an AI model: understanding the bigger picture, cleansing the data, and deploying the model. Dimensionality reduction belongs to the cleansing stage of the process. Note, however, that it's vital analysts understand the purpose of their analysis; otherwise, they may use their time inefficiently or, worse, produce a model that doesn't meet the needs of the stakeholders.

Therefore, to produce, monitor, and maintain a production-ready model, organisations should work through the following stages: (1) produce a set of user stories, (2) gather the data, (3) verify the data, (4) consider the ethics associated with deploying the model, (5) leverage a range of dimensionality reduction techniques, (6) model the data, (7) evaluate the model, and (8) deploy the model.

Most dimensionality reduction techniques fall into one of three categories: feature extraction and elimination, linear algebra, and manifold learning

Feature extraction and elimination

The first stage within the dimensionality reduction process is feature extraction and elimination, which is the process of selecting a subset of columns for use in the model. A few of the common feature extraction and elimination techniques include the following (a short pandas sketch of the first three filters follows the list):

· Missing values ratio. Columns with too many missing values are unlikely to add value to a machine learning model. Therefore, when a column exceeds a given threshold for missing values, it can be excluded from the training set.

· Low-variance filter. Columns with little variance are unlikely to add much value to a machine learning model. Thus, when a column falls below a given threshold for variance, it can be excluded from the training set.

· High-correlation filter. If multiple columns contain similar trends, then it's enough to feed the machine learning algorithm just one of them. To identify these columns, an analyst can use Pearson's product-moment correlation coefficient.

· Random forest. One way to eliminate features is to use a random forest, which builds decision trees against the target attribute and then uses each feature's usage statistics to identify the most informative subset of features.

· Backwards-feature elimination. Backwards-feature elimination, a top-down approach, starts with all the features in the dataset and progressively removes one feature at a time until the algorithm reaches the maximum tolerable error.

· Forward-feature construction. Forward-feature construction, unlike backwards-feature elimination, takes a bottom-up approach: it starts with one feature and progressively adds the feature that gives the highest increase in performance.
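As a quick illustration of the first three filters above, here is a minimal sketch in pandas; the threshold values are arbitrary assumptions chosen purely for demonstration, not recommendations.

```python
import pandas as pd

# Hypothetical thresholds, chosen only for illustration
MISSING_THRESHOLD = 0.4      # drop columns missing more than 40% of their values
VARIANCE_THRESHOLD = 0.01    # drop numeric columns with variance below 0.01
CORRELATION_THRESHOLD = 0.9  # drop one column of any pair correlated above 0.9

def apply_simple_filters(df: pd.DataFrame) -> pd.DataFrame:
    # Missing values ratio: exclude columns with too many missing values
    missing_ratio = df.isna().mean()
    df = df.loc[:, missing_ratio <= MISSING_THRESHOLD]

    # Low-variance filter: exclude numeric columns with very little variance
    numeric = df.select_dtypes("number")
    low_variance = numeric.columns[numeric.var() < VARIANCE_THRESHOLD]
    df = df.drop(columns=low_variance)

    # High-correlation filter: keep only one column of each highly correlated pair
    corr = df.select_dtypes("number").corr().abs()
    to_drop = set()
    for i, col_a in enumerate(corr.columns):
        for col_b in corr.columns[i + 1:]:
            if col_b not in to_drop and corr.loc[col_a, col_b] > CORRELATION_THRESHOLD:
                to_drop.add(col_b)
    return df.drop(columns=sorted(to_drop))
```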

Linear algebra methods

The most well-known dimensionality reduction techniques are those that apply a linear transformation, such as the following (a short scikit-learn sketch follows the list):

· Principal component analysis (PCA). PCA, an unsupervised machine learning algorithm, reduces the dimensions of a dataset whilst retaining as much information as possible. To do this, the algorithm creates a new set of features from the existing set of features. Note, however, that to prevent a feature with large values from dominating the results, all variables should be on the same scale. In Python's scikit-learn, you can use the 'StandardScaler' function to ensure all of the variables are on the same scale.

· Linear Discriminant Analysis (LDA). LDA, a supervised technique, seeks to retain as much of the discriminatory power for the dependent variable as possible. To do this, the LDA algorithm first computes the separability between classes; second, it computes the distance between each class's samples and the class mean; and lastly, it projects the dataset into a lower-dimensional space.

· Singular Value Decomposition (SVD). SVD extracts the most important features from the dataset. This method is particularly popular because it's based on simple, interpretable linear algebra.
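All three linear methods listed above are available in scikit-learn. A minimal sketch, using the Iris data and arbitrary component counts purely for illustration, might look like this:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

# Load a small example dataset and put all features on the same scale
X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# PCA (unsupervised): new features that retain as much variance as possible
X_pca = PCA(n_components=2).fit_transform(X_scaled)

# LDA (supervised): a projection that preserves class separability
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X_scaled, y)

# Truncated SVD: an SVD-based reduction to two components
X_svd = TruncatedSVD(n_components=2).fit_transform(X_scaled)
```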

Manifold learning

One approach to non-linear dimensionality reduction is manifold learning. So, what is manifold learning? Simply put, manifold learning uses geometric properties to project points into a lower-dimensional space whilst preserving the structure of the data. A few of the common manifold learning techniques include the following (a short scikit-learn sketch follows the list):

· Isomap embedding. Isomap attempts to preserve the relationships within the dataset by producing an embedded dataset. To achieve this, Isomap begins by producing a neighbourhood network. Next, it estimates the geodesic distance, the shortest path between two points on a curved surface, between all pairs of points. Lastly, using an eigenvalue decomposition of the geodesic distance matrix, it identifies a low-dimensional embedding of the dataset.

· Locally linear embedding (LLE). Like Isomap, LLE attempts to preserve the relationships within the dataset by producing an embedded dataset. To do this, first, it finds the k-nearest neighbours (kNN) of each point; second, it estimates each data vector as a weighted combination of its kNN; and lastly, it creates low-dimensional vectors that best reproduce these weights. There are two benefits of this algorithm: first, LLE is able to detect more features than the linear algebra methods; and second, it's more efficient in comparison to other algorithms.

· t-Distributed Stochastic Neighbour Embedding (t-SNE). t-SNE is particularly sensitive to local structure. This approach is one of the best for visualisation purposes, and it's helpful for understanding the theoretical properties of a dataset. Note, however, that it's one of the most computationally expensive approaches, and other techniques, such as the missing values ratio, should be applied first. Also, all the features should be scaled before applying this technique.
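Scikit-learn also ships implementations of these three manifold techniques. A minimal sketch, again on the Iris data with arbitrary neighbour counts and perplexity, could be:

```python
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE, Isomap, LocallyLinearEmbedding
from sklearn.preprocessing import StandardScaler

# Load the data and scale the features (t-SNE in particular expects scaled inputs)
X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Isomap: an embedding that preserves geodesic distances between points
X_isomap = Isomap(n_neighbors=10, n_components=2).fit_transform(X_scaled)

# LLE: an embedding that preserves local, linear relationships between neighbours
X_lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2).fit_transform(X_scaled)

# t-SNE: an embedding that is particularly sensitive to local structure
X_tsne = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X_scaled)
```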

No single dimensionality reduction technique consistently provides the 'best' results. Therefore, data analysts should explore a range of options and combinations of different dimensionality reduction techniques, so they move their model closer to the optimal solution.

In this worked example, we’ll explore how Principal Component Analysis (PCA) can be used to reduce the dimensions of a dataset, whilst retaining the important features

For the following example, we're going to use the well-known 'Iris' dataset (see Table 1), provided by the UCI machine learning repository. The dataset contains 150 flowers from three different species and has three unique classes: (1) Iris-setosa, (2) Iris-versicolour, and (3) Iris-virginica. It also has four unique features: (1) sepal length, (2) sepal width, (3) petal length, and (4) petal width. To get started, we're going to import the dataset using pandas and then drop the blank rows (see Figure 1).

Figure 1: importing the ‘Iris’ dataset
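The original figure is a screenshot of the import code and isn't reproduced here; a minimal reconstruction, assuming the raw 'iris.data' file from the UCI repository, might look like this:

```python
import pandas as pd

# Location of the raw Iris data file on the UCI machine learning repository
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
columns = ["sepal length", "sepal width", "petal length", "petal width", "class"]

# Load the dataset and drop any blank rows
df = pd.read_csv(url, names=columns)
df = df.dropna()
df.head()
```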

Once this is done, you should see the following data (see Table 1).

Table 1: top five rows of the ‘Iris’ dataset

After this, we're going to assign the first four feature columns (from left to right) to the 'X' variable, and then assign the class column, the right-most column, to the 'y' variable (see Figure 2).

Figure 2: assigning the X and y variables
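Again, the figure itself is a code screenshot; a plausible reconstruction of the assignment is:

```python
# Features: the first four columns; target: the right-most 'class' column
X = df.iloc[:, 0:4].values
y = df.iloc[:, 4].values
```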

Datasets frequently contain features measured in different units, such as height (m) and weight (kg). A machine learning algorithm would place additional emphasis on the weight feature rather than the height feature, because the weight values are numerically larger. However, we want the machine learning algorithm to treat each column with equal importance. So, how do we do this? One way is to scale the features using a technique called standardisation.

Therefore, in this example, we're going to apply the 'StandardScaler' function built into scikit-learn, so that equal importance is placed on each column (see Figure 3). The standardisation ensures that each feature has a mean of zero and a standard deviation of one.

Figure 3: applying a standard scaler to the ‘Iris’ dataset
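A minimal sketch of that scaling step, assuming the 'X' variable from the previous step:

```python
from sklearn.preprocessing import StandardScaler

# Standardise the features so each has a mean of zero and a standard deviation of one
X = StandardScaler().fit_transform(X)
```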

Then, we're going to import, initialise, and fit scikit-learn's built-in PCA algorithm to the 'X' variable we defined earlier (see Figure 4).

Figure 4: applying PCA to the ‘Iris’ dataset
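A minimal sketch of that step, keeping two components to match the two-dimensional plots later in the article:

```python
from sklearn.decomposition import PCA

# Initialise PCA to keep two components and fit it to the scaled features
pca = PCA(n_components=2)
principal_components = pca.fit_transform(X)
```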

Now, to get an understanding of how the classes are spread across the features, let’s produce a few histograms (see Figure 5).

Figure 5: plotting a histogram for each numerical column by iris class
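The plotting code behind Figure 5 isn't reproduced here; one way to draw the histograms, assuming matplotlib and the 'df' and 'columns' variables from the import step, is:

```python
import matplotlib.pyplot as plt

# One histogram per numerical column, split by iris class
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for ax, column in zip(axes.ravel(), columns[:4]):
    for species in df["class"].unique():
        ax.hist(df.loc[df["class"] == species, column], alpha=0.5, label=species)
    ax.set_title(column)
    ax.legend()
plt.tight_layout()
plt.show()
```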

After running the code above (see Figure 5), you should see a series of graphs similar to the set below (see Figure 6).

Figure 6: histogram of each numerical column by iris class

Next, we're going to take the principal components produced by scikit-learn's PCA function and visualise them via a scatterplot (see Figure 7).

Figure 7: scatterplot of the two principal components
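A minimal sketch of the scatterplot, assuming the 'principal_components' and 'y' variables defined above:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Build a DataFrame of the two principal components alongside the class labels
pca_df = pd.DataFrame(principal_components, columns=["PC1", "PC2"])
pca_df["class"] = y

# Scatterplot of the two principal components, coloured by iris class
fig, ax = plt.subplots(figsize=(8, 6))
for species in pca_df["class"].unique():
    subset = pca_df[pca_df["class"] == species]
    ax.scatter(subset["PC1"], subset["PC2"], label=species)
ax.set_xlabel("Principal component 1")
ax.set_ylabel("Principal component 2")
ax.legend()
plt.show()
```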

Lastly, we can see that the PCA algorithm has effectively and efficiently grouped our three unique classes across two principal components (see Figure 8); these two components capture most of the spread in the data. In the graph below we can see three distinct clusters, which are now in a format that is better suited for an AI algorithm to process.

Figure 8: scatter plot of the two principal components

Organisations should ensure dimensionality reduction is included within their AI workflow

In the era of advanced analytics, where more data is automatically considered better, we have rediscovered the value of removing outliers, low-quality data, and missing values from a dataset to improve a model's accuracy and time-to-train. Although there is no silver bullet for reducing the dimensions of a dataset, a data analyst should consider employing and experimenting with a combination of feature extraction and elimination, linear algebra, and manifold techniques in order to optimise the algorithm's efficacy.

