Dima's Blog

Dimensionality Reduction Learning Path. Part I: Linear Dimensionality Reduction

The world is complex, but that doesn't stop us from wanting to describe it. Even though we live in a 3D world and can easily grasp 2D or 3D patterns in a visualization, most objects of interest can rarely be described using just two or three properties. Take something as familiar as personal fitness: gender, age, height, blood pressure, you name it.

I've been curious about dimensionality reduction for a while, and I’d like to share my learning journey in case it helps others get up to speed more quickly than I did. Hopefully, this post will help you navigate the landscape of dimensionality reduction techniques more easily.

Note: I'm not a professional mathematician—I explore math for fun. I realize that someone with advanced knowledge in these areas might spot flaws or errors in my work. As an engineering practitioner and math enthusiast, my goal is simply to understand a concept and build a prototype to see if my high-level understanding covers all the key components.

Linear Dimensionality Reduction

No matter which method you choose to reduce the dimensionality of your data, remember that **you're starting with the assumption that patterns exist within that data**. Without any underlying pattern, there's nothing to leverage for dimensionality reduction.

When it comes to linear dimensionality reduction, you're taking that assumption a step further by believing that the pattern is not only present but also linear. For example, if you're working in 3D space, this means your data points lie on a 2D plane. More generally, your data lies on a hyperplane within a higher-dimensional space.
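To make the hyperplane idea concrete, here's a minimal NumPy sketch (the basis vectors and point count are made up for illustration): it generates 3D points that all lie on a 2D plane, so even though each point carries three coordinates, the data matrix only has rank 2 and two numbers per point would suffice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two basis vectors spanning a 2D plane embedded in 3D.
basis = np.array([[1.0, 0.0, 1.0],
                  [0.0, 1.0, 1.0]])

# 200 random points on that plane: each one is a linear
# combination of the two basis vectors.
coeffs = rng.normal(size=(200, 2))
points = coeffs @ basis          # shape (200, 3)

# Each point has 3 coordinates, but the data matrix has rank 2:
# the linear pattern is what dimensionality reduction exploits.
print(np.linalg.matrix_rank(points))  # → 2
```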

The most common linear dimensionality reduction technique—and the only one I've studied—is Principal Component Analysis, or PCA for short. To be practical, PCA needs a computational method. There are several available, but I've only explored the most common one by far: Singular Value Decomposition, or SVD.
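To show how the two fit together, here's a minimal sketch of PCA computed via SVD, using NumPy on made-up synthetic data (the data shape and noise level are my assumptions, not anything from a real dataset):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: 100 points in 3D that mostly vary along a
# 2D plane, plus a little noise in the remaining direction.
X = rng.normal(size=(100, 2)) @ np.array([[2.0, 0.0, 1.0],
                                          [0.0, 1.0, 1.0]])
X += 0.05 * rng.normal(size=X.shape)

# Step 1: center the data — PCA measures variance around the mean.
X_centered = X - X.mean(axis=0)

# Step 2: SVD of the centered data matrix.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Rows of Vt are the principal directions; squared singular
# values (divided by n - 1) give the variance each one captures.
explained_var = S**2 / (len(X) - 1)
print(explained_var / explained_var.sum())  # first two dominate

# Step 3: project onto the top two principal components.
X_reduced = X_centered @ Vt[:2].T   # shape (100, 2)
```

This is exactly what PCA implementations do under the hood: center, decompose, keep the directions with the largest singular values.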

I know, it can feel like a lot of concepts and abstractions all at once. But don't worry—you don't need to take a full statistics course or feel overwhelmed. Instead, let's take an hour or so to build a basic understanding of the concepts, then move on to the part we all enjoy most: applying it in practice.

Now, let's dive into using scikit-learn. Here’s the notebook you can follow, complete with clear text descriptions.
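The notebook itself isn't reproduced here, but a minimal sketch of the scikit-learn API might look like this—I'm using the library's built-in iris dataset (4 features) as a stand-in for whatever data the notebook uses:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Load the classic 4-feature iris dataset.
X = load_iris().data            # shape (150, 4)

# Fit PCA and project the data down to 2 dimensions.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)     # shape (150, 2)

# Fraction of the original variance the two components retain.
print(pca.explained_variance_ratio_.sum())
```

Note that `PCA.fit_transform` centers the data for you, so there's no need to subtract the mean by hand.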

Hopefully, you now feel confident using it on your own dataset. 😎