Maximum information extraction via clustering and minimization of Shannon entropy / Becchi, Matteo; Pavan, Giovanni M. - In: Machine Learning: Science and Technology. - ISSN 2632-2153. - (2025). [10.1088/2632-2153/ae2dbb]
Maximum information extraction via clustering and minimization of Shannon entropy
Becchi, Matteo; Pavan, Giovanni M.
2025
Abstract
In the analysis of any type of system, guaranteeing maximum information extraction from its data is non-trivial. Confidence that information has been extracted successfully typically rests on prior knowledge of the studied system or on the user's experience. However, a robust and objective criterion for ensuring maximum information extraction from data is difficult to define. Here, we introduce a data-driven approach that employs Shannon entropy as a transferable metric to assess and quantify Maximum Information Extraction (MInE) from data via their clustering into statistically relevant micro-domains. The method is general and can be applied to virtually any type of data or system. We demonstrate its efficiency by analyzing, as a first example, time-series data extracted from molecular dynamics simulations of water and ice coexisting at the solid/liquid transition temperature. The method allows quantifying the information contained in the data distributions (time-independent component) and the additional information gain attainable by analyzing data as time-series (i.e., accounting for the information contained in data time-correlations). The different micro-domains that can be effectively resolved and classified in the system are each characterized by their own entropy, and these entropies are found to be consistent with experimentally known thermodynamic parameters. A second test case demonstrates that the MInE approach is also effective for high-dimensional datasets and clearly shows that including poorly informative, but noisy, extra components/features in high-dimensional analyses may be not only useless, but even detrimental to maximum information extraction. This provides a robust parameter-free approach and quantitative metrics for data analysis, and for the study of any type of system from its data.

| File | Size | Format |
|---|---|---|
| accepted-paper.pdf (open access). Type: 2. Post-print / Author's Accepted Manuscript. License: Creative Commons | 2.52 MB | Adobe PDF |
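
To make the entropy bookkeeping mentioned in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It assumes the data have already been discretized into bins, and it uses the textbook definition of information gain: the Shannon entropy of the whole dataset minus the population-weighted entropy inside each cluster. The abstract does not state the exact definition used in the paper, so the function names `shannon_entropy` and `information_gain` and the formula itself are illustrative assumptions.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy, in bits, of the empirical distribution of discrete labels."""
    n = len(labels)
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def information_gain(values, clusters):
    """Entropy of the binned data minus the weighted entropy within each cluster.

    Assumption: this is the standard information-gain formula, used here only
    as an illustration; it is not taken from the paper.
    """
    n = len(values)
    h_total = shannon_entropy(values)
    h_clustered = 0.0
    for k in set(clusters):
        # Collect the binned values assigned to cluster k.
        members = [v for v, c in zip(values, clusters) if c == k]
        h_clustered += (len(members) / n) * shannon_entropy(members)
    return h_total - h_clustered


# Toy data: eight binned values that fall into two well-separated groups.
values = [0, 0, 1, 1, 4, 4, 5, 5]
good_split = [0, 0, 0, 0, 1, 1, 1, 1]
no_split = [0] * 8

gain_good = information_gain(values, good_split)
gain_none = information_gain(values, no_split)
```

In this toy case the two-cluster assignment recovers part of the entropy of the data, while assigning everything to a single cluster recovers none, which is the kind of comparison an entropy-based criterion for choosing a clustering relies on.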
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/11583/3005939
