Medical datasets are usually affected by several problems, such as missing values, inconsistencies, and redundancies, that can influence the data mining process and the extraction of useful knowledge. For these reasons, a preprocessing phase should be performed to improve the overall quality of the data and, consequently, of the information that may be discovered from them. In this study we applied five data preprocessing steps to improve the quality of a large dataset derived from a multicenter clinical trial. Our dataset included 298 patients enrolled in a prospective, multicenter clinical trial, characterized by 22 input variables and one class variable (MIPI value). First, data coming from different medical centers were integrated to obtain a homogeneous dataset. The integrated dataset was then normalized to scale all variables into smaller and similar intervals. Next, all missing values were estimated by means of an imputation step. Finally, the complete dataset was discretized and reduced to remove redundant variables and decrease the amount of data to be managed. The improvement in data quality after each step was evaluated by means of the patients’ classification accuracy using the KNN classifier. Our results showed that the proposed pipeline increased classification performance by more than 20%. Moreover, the largest gain in accuracy was obtained after missing value imputation, whereas the discretization and feature selection steps allowed a significant reduction in the number of variables to be managed, without any deterioration of the information contained in the data.
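As an illustration of how such a pipeline can be evaluated step by step, the following is a minimal sketch of the normalization, imputation, and KNN evaluation stages on toy data. It is not the authors' implementation: the abstract does not state which normalization, imputation method, value of k, or validation scheme was used, so min-max scaling, column-mean imputation, k = 3, and leave-one-out validation are all assumptions, and the function names and data values are hypothetical.

```python
import math
from collections import Counter


def min_max_normalize(rows):
    """Scale each column to [0, 1], leaving missing values (None) untouched."""
    out = [list(r) for r in rows]
    for j in range(len(rows[0])):
        present = [r[j] for r in rows if r[j] is not None]
        lo, hi = min(present), max(present)
        span = (hi - lo) or 1.0  # guard against constant columns
        for r in out:
            if r[j] is not None:
                r[j] = (r[j] - lo) / span
    return out


def impute_mean(rows):
    """Replace each missing value with its column mean (assumed strategy)."""
    out = [list(r) for r in rows]
    for j in range(len(rows[0])):
        present = [r[j] for r in rows if r[j] is not None]
        mean = sum(present) / len(present)
        for r in out:
            if r[j] is None:
                r[j] = mean
    return out


def knn_loo_accuracy(rows, labels, k=3):
    """Leave-one-out accuracy of a k-nearest-neighbour majority vote."""
    correct = 0
    for i, x in enumerate(rows):
        # Distances from sample i to every other sample, nearest first.
        neighbours = sorted(
            (math.dist(x, y), labels[j])
            for j, y in enumerate(rows)
            if j != i
        )
        votes = Counter(label for _, label in neighbours[:k])
        if votes.most_common(1)[0][0] == labels[i]:
            correct += 1
    return correct / len(rows)


# Toy data (hypothetical): two variables on very different scales,
# with two missing entries in the second variable.
raw = [
    [1.0, 200.0], [2.0, None], [1.5, 220.0],
    [8.0, 900.0], [9.0, None], [8.5, 950.0],
]
labels = ["low", "low", "low", "high", "high", "high"]

normalized = min_max_normalize(raw)
complete = impute_mean(normalized)
accuracy = knn_loo_accuracy(complete, labels, k=3)
print(f"Leave-one-out KNN accuracy after imputation: {accuracy:.2f}")
```

Calling `knn_loo_accuracy` after each stage, on whatever subset of the data is usable at that point, gives a per-step accuracy figure comparable in spirit to the evaluation described in the abstract.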
Data Quality Improvement of a Multicenter Clinical Trial Dataset / Zaccaria, Gian Maria; Rosati, Samanta; Castagneri, Cristina; Ferrero, Simone; Ladetto, Marco; Boccadoro, Mario; Balestra, Gabriella. - Electronic. - (2017), pp. 1190-1193. (Paper presented at the 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC’17), held in Jeju, Korea, July 11-15, 2017) [10.1109/EMBC.2017.8037043].
Data Quality Improvement of a Multicenter Clinical Trial Dataset
ZACCARIA, GIAN MARIA;ROSATI, SAMANTA;CASTAGNERI, CRISTINA;BALESTRA, GABRIELLA
2017
File | Access | Type | License | Size | Format
---|---|---|---|---|---
Data Quality Improvement_EMBC.pdf | Open access | 2. Post-print / Author's Accepted Manuscript | Public - All rights reserved | 665.49 kB | Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/11583/2677587