Loading and inspecting complex Excel data.
Handling missing data with imputation.
Applying feature scaling via StandardScaler to standardize features prior to PCA.
Performing Principal Component Analysis (PCA) to reduce dimensionality and identify key patterns in high-dimensional geochemical and mineralogical data.
Creating 2D and 3D scatterplots with color coding using matplotlib and seaborn.
Generating PCA variance explained bar plots, pairplots, cluster visualizations, boxplots, and correlation heatmaps to explore data structure and relationships.
Implementing KMeans clustering to identify groupings in PCA-reduced space.
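The steps above can be sketched end to end. This is a minimal illustration, not the project's actual script: it uses a tiny synthetic table of geochemical-style columns as a stand-in for the real simulant spreadsheet (which would be loaded with pd.read_excel), and the column names and cluster count are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Synthetic stand-in for the simulant spreadsheet; in practice:
# df = pd.read_excel("...")
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(30, 6)),
                  columns=["SiO2", "Al2O3", "FeO", "MgO", "CaO", "TiO2"])
df.iloc[2, 1] = np.nan                                # a deliberate gap

X = SimpleImputer(strategy="mean").fit_transform(df)  # fill gaps with column means
X = StandardScaler().fit_transform(X)                 # zero mean, unit variance per feature
pca = PCA(n_components=3)
pcs = pca.fit_transform(X)                            # scores on PC1-PC3
labels = KMeans(n_clusters=3, n_init=10,
                random_state=0).fit_predict(pcs)      # clusters in PCA-reduced space
print(pcs.shape, labels.shape)
```

Imputing before scaling matters: mean imputation on unscaled data keeps each filled value on that feature's original scale, and StandardScaler then treats imputed and observed values alike.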
I worked with a comprehensive dataset of regolith simulants, which are analog materials used to mimic the soil of different planetary bodies such as the Moon, Mars, and asteroids. The dataset includes mineralogical, chemical, and physical properties for various simulant samples.
My goal was to simplify this high-dimensional data using PCA to find patterns and relationships that differentiate simulants by their planetary origin. After carefully cleaning and imputing missing data, I standardized all numeric features and applied PCA to extract the principal components that capture most of the variance.
I then explored these components visually in both 2D and 3D, including cluster analyses, to interpret natural groupings and differences among simulants.
This figure presents the first two principal components of the regolith simulant dataset. Each point represents a simulant sample colored by its planetary body of origin, with acronyms annotated for sample identification. The plot highlights how different simulants group based on their mineralogical and chemical properties.
However, as the plot makes clear, there is significant overlap among these groups. To visualize the separation more clearly, I next plotted the data in 3D PCA space.
Figure 1: PCA of Regolith Simulants (2D Scatterplot)
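A plot like Figure 1 can be sketched as below: PC scores colored by planetary body, with each sample's acronym annotated. The scores, body labels, and acronyms here are all placeholder values, not the real dataset.

```python
import matplotlib
matplotlib.use("Agg")                     # headless backend for saving to file
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Placeholder PC scores and labels (not the real simulant data).
scores = pd.DataFrame({
    "PC1": [1.2, -0.5, 2.1, -1.8, 0.3],
    "PC2": [0.4, 1.1, -0.9, 0.2, -1.5],
    "Body": ["Moon", "Mars", "Moon", "Asteroid", "Mars"],
    "Acronym": ["S1", "S2", "S3", "S4", "S5"],
})

fig, ax = plt.subplots(figsize=(7, 5))
sns.scatterplot(data=scores, x="PC1", y="PC2", hue="Body", s=80, ax=ax)
for _, row in scores.iterrows():          # annotate each point with its acronym
    ax.annotate(row["Acronym"], (row["PC1"], row["PC2"]),
                textcoords="offset points", xytext=(5, 5), fontsize=8)
ax.set_title("PCA of Regolith Simulants")
fig.savefig("pca_2d.png", dpi=150)
```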
Figure 2: 3D PCA Scatterplot of Regolith Simulants
Displayed here is a three-dimensional scatterplot of the first three principal components, providing a more comprehensive visualization of variance and clustering in the dataset.
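A 3D version like Figure 2 can be sketched with matplotlib's 3D projection, one scatter call per planetary body for color coding. The scores here are synthetic stand-ins.

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
pcs = rng.normal(size=(30, 3))                 # stand-in PC1-PC3 scores
bodies = np.repeat(["Moon", "Mars", "Asteroid"], 10)

fig = plt.figure(figsize=(7, 6))
ax = fig.add_subplot(projection="3d")
for body in np.unique(bodies):                 # one color (and legend entry) per body
    m = bodies == body
    ax.scatter(pcs[m, 0], pcs[m, 1], pcs[m, 2], label=body, s=40)
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.set_zlabel("PC3")
ax.legend()
fig.savefig("pca_3d.png", dpi=150)
```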
Principal Component Analysis (PCA) finds PCs as eigenvectors of the covariance matrix of the data.
PC1 is the eigenvector with the largest eigenvalue (most variance).
PC2 is the eigenvector with the second largest eigenvalue, orthogonal to PC1.
PC3 is the eigenvector with the third largest eigenvalue, orthogonal to both PC1 and PC2.
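The eigenvector description above can be verified directly on synthetic data: sklearn's fitted components match the eigenvectors of the data's covariance matrix ordered by decreasing eigenvalue (each only up to an arbitrary sign flip), and the eigenvalues equal the explained variances.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                   # synthetic data, 5 features
Xc = X - X.mean(axis=0)                         # center the data

cov = np.cov(Xc, rowvar=False)                  # sample covariance matrix (ddof=1)
eigvals, eigvecs = np.linalg.eigh(cov)          # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]               # reorder to descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

pca = PCA().fit(X)
# PC_k is the eigenvector with the k-th largest eigenvalue (sign is arbitrary)
for k in range(3):
    assert np.allclose(np.abs(pca.components_[k]),
                       np.abs(eigvecs[:, k]), atol=1e-8)
# the eigenvalues are exactly sklearn's explained variances
assert np.allclose(eigvals, pca.explained_variance_)
print("eigendecomposition matches sklearn PCA")
```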
The table of PC1 and PC2 values can be found at the bottom of this page for reference.
Figure 3: Pairplot of the first three PCs. Together with the heatmap below, this illustrates how the distributions and correlations differ across the data.
Figure 4: The above plots show PC1 and PC2 and the amount of variance explained by each. As expected, by definition, PC1 accounts for more variance than PC2.
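A variance-explained bar plot like Figure 4 can be sketched from a fitted PCA's explained_variance_ratio_. The data here is synthetic, with deliberately unequal column variances so the descending order is visible.

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
# Scale columns unevenly so the components capture clearly different variances.
X = rng.normal(size=(50, 4)) @ np.diag([3.0, 2.0, 1.0, 0.5])
ratios = PCA().fit(X).explained_variance_ratio_   # sorted descending by definition

fig, ax = plt.subplots()
ax.bar([f"PC{i+1}" for i in range(len(ratios))], ratios)
ax.set_ylabel("Fraction of variance explained")
fig.savefig("variance.png", dpi=150)
```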
Figure 5: Correlation heatmap of the raw data features (blank cells indicate gaps in the data).
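A heatmap like Figure 5, blanks included, can be sketched as follows: pandas computes each pairwise correlation over the rows where both columns are present, so two features with no overlapping valid rows yield NaN, which seaborn renders as a blank cell. The data and gap pattern here are synthetic.

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(3)
df = pd.DataFrame(rng.normal(size=(40, 5)),
                  columns=["SiO2", "Al2O3", "FeO", "MgO", "CaO"])
df.loc[:19, "FeO"] = np.nan    # gap in the first half of one feature
df.loc[20:, "MgO"] = np.nan    # non-overlapping gap in another

corr = df.corr()               # pairwise-complete correlations; FeO-MgO is NaN
fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", ax=ax)
fig.savefig("corr_heatmap.png", dpi=150)
```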