Biologically-Constrained CpG Panel for Breast Field Cancerisation Classification / Roviera, Elisabetta; Gambino Vincenzo, Sandro; Benso, Alfredo. - (In press). (IWBBIO26 - 13th International Work-Conference on Bioinformatics and Biomedical Engineering, Gran Canaria (ES), 14-17 July 2026).

Biologically-Constrained CpG Panel for Breast Field Cancerisation Classification

Roviera Elisabetta; Benso Alfredo
In press

Abstract

Breast field cancerisation reprograms the epigenome of histologically normal peritumoral tissue before any morphological change, offering a window for early cancer detection. Extracting reliable signals from DNA methylation data is complicated by cross-cohort signal inversion and by a feature selection problem spanning approximately 485,000 candidate CpGs. We present an end-to-end pipeline that addresses both challenges within a single framework. Signal inversion is resolved by transforming each sample to its deviation from a cohort-specific Normal reference, followed by NeuroCombat harmonisation. For feature selection, we propose a multi-stage cascade that combines variability filtering, stability ranking, and genomic diversification. The cascade is guided by a novel Field Progression Index (FPI), which scores each CpG by how far Adjacent tissue has shifted along the Normal–Tumour axis, and yields a biologically prioritised pool of 5,000 CpGs. A Mixed-Integer Linear Programme (MILP) then compresses this pool to a 30-CpG panel, maximising discriminative power whilst enforcing enrichment for COSMIC cancer driver genes. The panel achieves AUC = 0.981, retaining 99.6% of the performance of a 5,000-CpG baseline. Greedy and LASSO baselines matching this AUC recover zero driver genes, confirming that the MILP formulation is necessary for biological grounding. Fifteen of the 30 CpGs map to 11 driver genes, including ERBB2, BRCA2, and NOTCH1. Trained on Normal vs. Adjacent labels only, the panel transfers zero-shot to Normal vs. Tumour classification (AUC = 1.000).
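
The abstract does not give the formula for the FPI or for the deviation transform, so the following is only an illustrative sketch of how the two ideas might be expressed. It assumes that the cohort-specific Normal reference is the per-CpG mean of that cohort's Normal samples, and that the FPI is the fraction of the Normal-to-Tumour mean shift already present in Adjacent tissue. The function names, the `eps` threshold, and the toy beta values are hypothetical and are not taken from the paper; NeuroCombat harmonisation and the MILP step are not shown.

```python
from statistics import mean


def deviation_from_normal(samples, normal_samples):
    """Subtract the per-CpG mean of a cohort's Normal samples from every sample.

    Assumption: the 'cohort-specific Normal reference' is a simple mean profile.
    """
    n_cpgs = len(normal_samples[0])
    reference = [mean(s[j] for s in normal_samples) for j in range(n_cpgs)]
    return [[s[j] - reference[j] for j in range(n_cpgs)] for s in samples]


def field_progression_index(normal, adjacent, tumour, eps=1e-6):
    """Hypothetical FPI: share of the Normal-to-Tumour shift seen in Adjacent tissue.

    Roughly 0 means Adjacent looks like Normal; roughly 1 means it looks like Tumour.
    """
    scores = []
    for j in range(len(normal[0])):
        mu_n = mean(s[j] for s in normal)
        mu_a = mean(s[j] for s in adjacent)
        mu_t = mean(s[j] for s in tumour)
        axis = mu_t - mu_n
        # A CpG with no Normal-Tumour separation carries no progression signal.
        scores.append(0.0 if abs(axis) < eps else (mu_a - mu_n) / axis)
    return scores


# Toy beta values (invented): rows are samples, columns are two CpGs.
normal = [[0.25, 0.5], [0.25, 0.5]]
adjacent = [[0.375, 0.5], [0.375, 0.5]]
tumour = [[0.75, 0.625], [0.75, 0.625]]

fpi = field_progression_index(normal, adjacent, tumour)
deviations = deviation_from_normal(adjacent, normal)
```

In this toy example the first CpG has moved a quarter of the way from Normal to Tumour in Adjacent tissue, whereas the second has not moved at all. Under the stated assumption, a ranking by FPI would prioritise the first.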
Files in this item:
Paper_IWBBIO_2026.pdf (restricted access)
Type: 2. Post-print / Author's Accepted Manuscript
Licence: Not public - private/restricted access
Size: 1.25 MB
Format: Adobe PDF

Use this identifier to cite or link to this document: https://hdl.handle.net/11583/3010467