Bridging Structure and Spectra: A Comparative K-Clustering Analysis of Metabolites from SMILES and SERS Data / Sparavigna, Amelia Carolina. - Electronic. - (2025). [10.5281/zenodo.17052624]
Bridging Structure and Spectra: A Comparative K-Clustering Analysis of Metabolites from SMILES and SERS Data
Sparavigna, Amelia Carolina
2025
Abstract
In a series of recent works, autoencoders have been successfully applied to the clustering of Surface-Enhanced Raman Spectroscopy (SERS) spectra. Several architectures have shown promise, but VAE- and GRU-based autoencoders have shown limitations because their structure is not well suited to Raman spectra. To validate the chemical logic of this approach, we developed an independent benchmark: a clustering analysis of the same metabolites based on their molecular structure, represented by SMILES fingerprints. In this work, we compare three autoencoder architectures (Conv1D, Dense, and Transformer) against this structural benchmark. All three models replicated key chemical groupings, such as the tryptophan family, but the most significant findings emerged from the divergences between the methods. These divergences highlight the vibrationally driven logic of the SERS analysis, which can identify relationships that are not apparent from a structural comparison alone. The Conv1D autoencoder consistently provided the most chemically intuitive and robust clustering: it produced clean, well-resolved clusters for chemical families and correctly identified unique spectral outliers, such as lipoamide, which the other models failed to isolate. Our findings indicate that, while the Dense and Transformer models provide valuable insights, the Conv1D model performs best for this application, striking the best balance between validating traditional chemical knowledge and revealing new, subtle relationships in SERS data.

File | Access | Type | License | Size | Format |
---|---|---|---|---|---|
metabolites.pdf | Open access | 1. Preprint / submitted version [pre-review] | Creative Commons | 505.17 kB | Adobe PDF |
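The abstract describes comparing cluster assignments from SERS autoencoders against a structure-based benchmark, with particular attention to where the two disagree. The record does not state how agreement was quantified, so the following is only a minimal sketch under stated assumptions: it uses the adjusted Rand index as one plausible agreement measure, and the function names (`adjusted_rand_index`, `divergent_pairs`), the placeholder metabolite names, and the toy labels are all hypothetical illustrations, not data or results from the paper.

```python
from collections import Counter
from itertools import combinations
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two partitions of the same items.

    Assumption: this metric is chosen for illustration only; the source
    does not say which agreement measure (if any) the author used.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    n = len(labels_a)
    if n < 2:
        return 1.0  # fewer than two items: no pairs to disagree on

    # Contingency counts: items in cluster i of A and cluster j of B.
    joint = Counter(zip(labels_a, labels_b))
    sum_joint = sum(comb(c, 2) for c in joint.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (sum_joint - expected) / (max_index - expected)


def divergent_pairs(names, structural, spectral):
    """List item pairs grouped together by one clustering but not the other."""
    pairs = []
    for i, j in combinations(range(len(names)), 2):
        same_structural = structural[i] == structural[j]
        same_spectral = spectral[i] == spectral[j]
        if same_structural != same_spectral:
            source = "structure only" if same_structural else "spectra only"
            pairs.append((names[i], names[j], source))
    return pairs


# Hypothetical toy labels (NOT results from the paper): six placeholder
# metabolites clustered once by structure and once by spectral latent space.
names = ["m1", "m2", "m3", "m4", "m5", "m6"]
structural_labels = [0, 0, 0, 1, 1, 2]
spectral_labels = [0, 0, 1, 1, 1, 2]

ari = adjusted_rand_index(structural_labels, spectral_labels)
disagreements = divergent_pairs(names, structural_labels, spectral_labels)
print(f"ARI = {ari:.3f}")
for first, second, source in disagreements:
    print(f"{first} and {second} are grouped together by {source}")
```

Listing the divergent pairs alongside a single agreement score reflects the abstract's emphasis: the pairs grouped by spectra but not by structure are the candidates for vibrationally driven relationships that a structural comparison would miss.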
https://hdl.handle.net/11583/3002787