
Maximum information extraction via clustering and minimization of Shannon entropy / Becchi, Matteo; Pavan, Giovanni M.. - In: MACHINE LEARNING: SCIENCE AND TECHNOLOGY. - ISSN 2632-2153. - (2025). [10.1088/2632-2153/ae2dbb]

Maximum information extraction via clustering and minimization of Shannon entropy

Becchi, Matteo;Pavan, Giovanni M.
2025

Abstract

In the analysis of any type of system, guaranteeing maximum information extraction from its data is non-trivial. Confidence that information has been extracted successfully typically rests on prior knowledge of the studied system or on the user's experience. However, a robust and objective criterion for ensuring maximum information extraction from data is difficult to define. Here, we introduce a data-driven approach that employs Shannon entropy as a transferable metric to assess and quantify Maximum Information Extraction (MInE) from data via their clustering into statistically relevant micro-domains. The method is general and can be applied to virtually any type of data or system. We demonstrate its efficiency by analyzing, as a first example, time-series data extracted from molecular dynamics simulations of water and ice coexisting at the solid/liquid transition temperature. The method allows quantifying the information contained in the data distributions (time-independent component) and the additional information gain attainable by analyzing the data as time-series (i.e., accounting for the information contained in the time-correlations of the data). The different micro-domains that can be effectively resolved and classified in the system are each characterized by their own entropy, and these entropies are found to be consistent with experimentally known thermodynamic parameters. A second test case demonstrates that the MInE approach is also effective for high-dimensional datasets and clearly shows that including poorly informative but noisy extra components/features in high-dimensional analyses may be not only useless, but even detrimental to maximum information extraction. This provides a robust, parameter-free approach and quantitative metrics for data analysis and for the study of any type of system from its data.
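
As a rough illustration of the idea summarized in the abstract, the sketch below scores a clustering by how much it lowers the Shannon entropy of the data. It is not the authors' MInE implementation. It assumes that the information gain is the entropy of the whole data distribution minus the population-weighted entropies of the individual clusters, that entropies are estimated from histograms on a shared set of bins, and that a simple threshold can stand in for a real clustering algorithm. The names `shannon_entropy` and `information_gain` and the toy data are hypothetical.

```python
import numpy as np


def shannon_entropy(values, bin_edges):
    """Shannon entropy (in bits) of the histogram of `values` on fixed bins."""
    counts, _ = np.histogram(values, bins=bin_edges)
    p = counts[counts > 0] / counts.sum()  # drop empty bins: 0 * log(0) := 0
    return float(-np.sum(p * np.log2(p)))


def information_gain(values, labels, n_bins=32):
    """Entropy of all data minus the population-weighted cluster entropies.

    Assumed definition, used here for illustration only.
    """
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    # Use the same bins for every cluster so the entropies are comparable.
    bin_edges = np.linspace(values.min(), values.max(), n_bins + 1)
    h_total = shannon_entropy(values, bin_edges)
    h_clustered = 0.0
    for k in np.unique(labels):
        members = values[labels == k]
        fraction = members.size / values.size
        h_clustered += fraction * shannon_entropy(members, bin_edges)
    return h_total - h_clustered


# Toy data: two well-separated populations (a stand-in for two micro-domains).
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0.0, 0.1, 5000), rng.normal(1.0, 0.1, 5000)])

good_labels = (data > 0.5).astype(int)              # separates the two populations
random_labels = rng.integers(0, 2, size=data.size)  # ignores the data entirely

gain_good = information_gain(data, good_labels)
gain_random = information_gain(data, random_labels)
print(f"gain (meaningful clustering): {gain_good:.3f} bits")
print(f"gain (random labels):         {gain_random:.3f} bits")
```

Under these assumptions, a clustering that separates the two populations recovers close to one bit, while random labels recover almost nothing, which is the sense in which an entropy reduction can rank competing clusterings of the same data.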
Files in this item:

File: accepted-paper.pdf
Access: open access
Type: 2. Post-print / Author's Accepted Manuscript
License: Creative Commons
Size: 2.52 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11583/3005939