A New Synthetic Oversampling Method Using Ontology and Feature Selection in Order to Improve Imbalanced Textual Data Classification in Persian Texts
Abstract
The ever-growing volume of textual data has increased the need for text processing. Class imbalance is one of the factors that reduce the effectiveness of text classification. Various methods have been proposed to address the imbalance problem, including data-based, cost-based, algorithm-based, and feature selection methods. Recent studies have also considered ensemble methods. In this research, a new oversampling method is proposed. In the new method, the number of minority class samples is first increased using an ontology, and random oversampling is then performed for the minority class. Finally, appropriate features are selected using feature selection methods. The new ensemble method was tested on the Hamshahri collection. The results show that, despite the reduced number of features, the ensemble method improves classification results for multinomial Naïve Bayes and decision tree classifiers.
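The three steps summarized above (ontology-based sample generation, random oversampling of the minority class, and feature selection) can be illustrated with a minimal Python sketch. It is not the implementation evaluated in this research. The toy ontology, the function names, the single-term substitution rule, and the document-frequency score used for feature selection are all assumptions introduced here for illustration only.

```python
import random
from collections import Counter

# Hypothetical toy ontology (an assumption): term -> related terms.
TOY_ONTOLOGY = {"car": ["automobile", "vehicle"], "price": ["cost"]}

def ontology_variants(tokens, ontology):
    """Build new documents by swapping one token for a related ontology term."""
    variants = []
    for i, tok in enumerate(tokens):
        for related in ontology.get(tok, []):
            variants.append(tokens[:i] + [related] + tokens[i + 1:])
    return variants

def oversample_minority(docs, labels, ontology, seed=0):
    """Grow the minority class to the size of the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    minority = min(counts, key=counts.get)
    target = max(counts.values())
    minority_docs = [d for d, y in zip(docs, labels) if y == minority]

    # Step 1: synthetic minority samples derived from the ontology.
    synthetic = []
    for doc in minority_docs:
        synthetic.extend(ontology_variants(doc, ontology))
    synthetic = synthetic[: target - len(minority_docs)]

    # Step 2: random oversampling (duplication) to close any remaining gap.
    pool = minority_docs + synthetic
    extra = [list(rng.choice(pool)) for _ in range(target - len(pool))]

    added = synthetic + extra
    return docs + added, labels + [minority] * len(added)

def select_features(docs, labels, k):
    """Step 3 (stand-in scoring, an assumption): keep the k terms whose
    per-class document-frequency rate differs most between classes."""
    class_sizes = Counter(labels)
    doc_freq = {c: Counter() for c in class_sizes}
    for doc, y in zip(docs, labels):
        doc_freq[y].update(set(doc))
    vocab = set().union(*doc_freq.values())

    def score(term):
        rates = [doc_freq[c][term] / class_sizes[c] for c in class_sizes]
        return max(rates) - min(rates)

    # Sort by descending score, then alphabetically for a stable order.
    return sorted(vocab, key=lambda t: (-score(t), t))[:k]

docs = [["car", "price"], ["weather", "rain"], ["weather", "sun"], ["rain", "sun"]]
labels = ["econ", "met", "met", "met"]
balanced_docs, balanced_labels = oversample_minority(docs, labels, TOY_ONTOLOGY)
selected = select_features(balanced_docs, balanced_labels, k=2)
```

In this sketch the ontology step runs first so that random duplication is only used for whatever gap remains, which mirrors the order of steps described in the abstract. Any classifier, such as multinomial Naïve Bayes or a decision tree, could then be trained on the balanced documents restricted to the selected terms.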