Biologically-Constrained CpG Panel for Breast Field Cancerisation Classification / Roviera, Elisabetta; Gambino Vincenzo, Sandro; Benso, Alfredo. - (In press). (IWBBIO26 - 13th International Work-Conference on Bioinformatics and Biomedical Engineering, Gran Canaria (ES), 14-17 July 2026).
Biologically-Constrained CpG Panel for Breast Field Cancerisation Classification
Roviera Elisabetta; Benso Alfredo
In press
Abstract
Breast field cancerisation reprograms the epigenome of histologically normal peritumoral tissue before any morphological change, offering a window for early cancer detection. Extracting reliable signals from DNA methylation data is complicated by cross-cohort signal inversion and by a feature selection problem spanning approximately 485,000 candidate CpGs. We present an end-to-end pipeline that addresses both challenges within a single framework. Signal inversion is resolved by transforming each sample to its deviation from a cohort-specific Normal reference, followed by NeuroCombat harmonisation. For feature selection, we propose a multi-stage cascade that combines variability filtering, stability ranking, and genomic diversification. The cascade is guided by a novel Field Progression Index (FPI), which scores each CpG by how far Adjacent tissue has shifted along the Normal–Tumour axis, and yields a biologically prioritised pool of 5,000 CpGs. A Mixed-Integer Linear Programme (MILP) then compresses this pool to a 30-CpG panel, maximising discriminative power whilst enforcing enrichment for COSMIC cancer driver genes. The panel achieves AUC = 0.981, retaining 99.6% of the performance of a 5,000-CpG baseline. Greedy and LASSO baselines matching this AUC recover zero driver genes, confirming that the MILP formulation is necessary for biological grounding. Fifteen of the 30 CpGs map to 11 driver genes, including ERBB2, BRCA2, and NOTCH1. Trained on Normal vs. Adjacent labels only, the panel transfers zero-shot to Normal vs. Tumour classification (AUC = 1.000).

| File | Size | Format |
|---|---|---|
| Paper_IWBBIO_2026.pdf (restricted access). Type: 2. Post-print / Author's Accepted Manuscript. Licence: Non-public, private/restricted access | 1.25 MB | Adobe PDF |
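The abstract describes two steps that a small example may make more concrete: expressing each sample as its deviation from a cohort-specific Normal reference, and scoring each CpG by how far Adjacent tissue has moved towards Tumour. The sketch below is illustrative only and is not the authors' implementation. The abstract does not give the FPI formula, so the ratio used here (mean Adjacent shift divided by mean Tumour shift) is an assumption, as are the function names, the record layout, and the toy beta values. The NeuroCombat harmonisation step is omitted.

```python
from statistics import mean

def normal_reference(samples):
    """Per-CpG mean beta value over the Normal samples of one cohort."""
    normals = [s["beta"] for s in samples if s["tissue"] == "Normal"]
    return [mean(column) for column in zip(*normals)]

def to_deviation(samples):
    """Re-express every sample as its deviation from the cohort's Normal reference."""
    ref = normal_reference(samples)
    return [
        {"tissue": s["tissue"], "delta": [b - r for b, r in zip(s["beta"], ref)]}
        for s in samples
    ]

def field_progression_index(deviations, eps=1e-6):
    """Hypothetical FPI (assumed form): mean Adjacent shift / mean Tumour shift per CpG.

    Values near 0 mean Adjacent tissue still resembles Normal;
    values near 1 mean it has moved most of the way towards Tumour.
    """
    def group_mean(tissue):
        rows = [d["delta"] for d in deviations if d["tissue"] == tissue]
        return [mean(column) for column in zip(*rows)]

    adjacent = group_mean("Adjacent")
    tumour = group_mean("Tumour")
    # Guard against CpGs with no Normal-to-Tumour separation.
    return [a / t if abs(t) > eps else 0.0 for a, t in zip(adjacent, tumour)]

# Toy cohort with two CpGs (made-up beta values, for illustration only).
cohort = [
    {"tissue": "Normal",   "beta": [0.20, 0.50]},
    {"tissue": "Normal",   "beta": [0.30, 0.50]},
    {"tissue": "Adjacent", "beta": [0.35, 0.50]},
    {"tissue": "Adjacent", "beta": [0.45, 0.50]},
    {"tissue": "Tumour",   "beta": [0.50, 0.60]},
    {"tissue": "Tumour",   "beta": [0.60, 0.60]},
]

deviations = to_deviation(cohort)
fpi = field_progression_index(deviations)
```

In this toy cohort the first CpG has shifted roughly halfway along the Normal–Tumour axis in Adjacent tissue, while the second has not moved. Because the reference is computed per cohort, the transform would be applied to each cohort separately before pooling.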
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/11583/3010467
