ECCOMAS 2024

Application of Dimensionality Reduction Techniques for Classification in Multidimensional Data Sets

Souza Oliveira, Francisco Bruno (Universidade Estadual de Santa Cruz)
Nascimento Neves, Carla (Laboratório Nacional de Computação Científica)
Ambrósio, Paulo Eduardo (Universidade Estadual de Santa Cruz)

In session: MS180A - Predictive AI Modelling for Multi-Physics Problems: Methods, Algorithms and Challenges I


Multidimensional data sets frequently contain noise and are susceptible to the curse of dimensionality, which can lead to overfitting and less effective data analysis. Dimensionality reduction addresses these problems by transforming high-dimensional data into a lower-dimensional subspace, which mitigates the impact of noise and overfitting [1, 2]. This study investigates the performance of dimensionality reduction techniques on multidimensional databases, aiming to identify which methods are most effective for different data sets. Six data sets, each with distinct attribute types and sample sizes, were selected for analysis, and thirteen dimensionality reduction methods commonly used in the literature were applied to them. The raw data sets were first classified with the Random Forest algorithm to provide a baseline for classifier performance. The subsets generated by the dimensionality reduction methods were then evaluated with the same classifier. In the first database, KPCA produced a subset that increased accuracy from 79.6% to 90.9% and the F-measure from 79.5% to 91.0%. In the second database, CHI achieved a classification accuracy of 97.0% using only 76.7% of the original attributes. In the third database, SBS increased accuracy from 81.1% to 83.7% and the F-measure from 81.1% to 83.6%. In the fourth database, DFT yielded the best result, increasing accuracy from 70.1% to 74.8% and the F-measure from 70.0% to 74.4%. In the fifth database, SFS produced a subset that improved the accuracy and true-positive rate from 83.1% to 87.8% and reduced the false-positive rate from 16.6% to 13.0%. In the sixth database, both CHI and RFE achieved results similar to the original classification while using only 70.6% of the original attributes.
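
The evaluation protocol described above (a Random Forest baseline on the raw attributes, then the same classifier on each reduced subset) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' pipeline: it assumes scikit-learn, reads KPCA as kernel PCA and CHI as chi-square attribute selection, and uses the Wine data set as a stand-in because the abstract does not name its six databases. The train/test split, the hyperparameters, the number of components, and the number of selected attributes are likewise hypothetical choices.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import KernelPCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler


def forest():
    """Same classifier configuration for every variant (settings are assumptions)."""
    return RandomForestClassifier(n_estimators=100, random_state=0)


def evaluate(model, X_train, X_test, y_train, y_test):
    """Fit on the training split; return (accuracy, weighted F-measure) on the test split."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return accuracy_score(y_test, y_pred), f1_score(y_test, y_pred, average="weighted")


# Stand-in data (assumption): 13 numeric attributes, 3 classes.
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

models = {
    # Baseline: Random Forest on all original attributes.
    "baseline (all attributes)": forest(),
    # Feature extraction: project onto 5 kernel principal components (assumed value).
    "KPCA (feature extraction)": make_pipeline(
        StandardScaler(), KernelPCA(n_components=5, kernel="rbf"), forest()
    ),
    # Feature selection: keep the 10 attributes with the highest chi-square score.
    # chi2 needs non-negative inputs at fit time, hence the min-max scaling.
    "CHI (feature selection)": make_pipeline(
        MinMaxScaler(), SelectKBest(chi2, k=10), forest()
    ),
}

results = {
    name: evaluate(model, X_train, X_test, y_train, y_test)
    for name, model in models.items()
}

for name, (acc, f_measure) in results.items():
    print(f"{name}: accuracy={acc:.3f}, F-measure={f_measure:.3f}")
```

Wrapping each reduction step in a pipeline with the classifier means the reduction is fitted on the training split only, so the comparison against the baseline is not inflated by information from the test split. Whether a reduced subset beats the baseline depends on the data set, which is the point the abstract makes across its six databases.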