Volume 40, Number 6, December 2022
Pages 1261 - 1268
10 February 2023
A feature extraction method for small sample data based on optimal ensemble random forest
School of Mechanical Engineering, Northwestern Polytechnical University, Xi'an 710072, China
High-dimensional small-sample data is a persistent difficulty in data mining. When the traditional random forest algorithm is used for feature selection on such data, the classification results tend to overfit, which makes the feature importance ranking unstable and inaccurate. To address the difficulties that random forests face in reducing the dimensionality of small-sample data, a feature extraction algorithm for small-sample data, OTE-GWRFFS, is proposed. First, the algorithm expands the samples with a generative adversarial network (GAN) to avoid the overfitting of the traditional random forest in small-sample classification. Then, on the expanded data, a weight-based ensemble-of-optimal-trees algorithm is adopted to reduce the impact of the distribution error of the generated data on feature extraction accuracy and to improve the overall stability of the decision tree ensemble. Finally, the feature importance measures of the individual decision trees are averaged, weighted by each tree's weight, to obtain the feature importance ranking, which addresses the low accuracy and poor stability of feature selection on small-sample data. The proposed algorithm is compared on UCI data sets with the traditional random forest algorithm and a weight-based random forest algorithm. In these experiments, OTE-GWRFFS shows higher stability and accuracy when processing high-dimensional small-sample data.
Keywords: high-dimensional small-sample data / ensemble of optimal trees / random forest / feature extraction / data expansion
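The last two steps of the abstract (keeping a weighted subset of trees, then averaging per-tree feature importances by tree weight) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not define how tree weights are computed or how the optimal trees are chosen, so the sketch assumes each tree already has a scalar weight (for example, a validation accuracy) and keeps a fixed top fraction of trees. The function names `select_top_trees`, `weighted_importance`, and `rank_features`, and the parameter `keep_fraction`, are hypothetical.

```python
from typing import List, Sequence


def select_top_trees(weights: Sequence[float], keep_fraction: float = 0.5) -> List[int]:
    """Return the indices of the highest-weight trees.

    Assumption: "optimal" trees are simply the top fraction by weight;
    the paper's actual selection rule is not given in the abstract.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    n_keep = max(1, int(len(weights) * keep_fraction))
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return sorted(ranked[:n_keep])


def weighted_importance(
    weights: Sequence[float],
    importances: Sequence[Sequence[float]],
    keep_fraction: float = 0.5,
) -> List[float]:
    """Weight-averaged feature importance over the retained trees.

    importances[i][j] is the importance of feature j in tree i.
    """
    kept = select_top_trees(weights, keep_fraction)
    total = sum(weights[i] for i in kept)
    if total <= 0:
        raise ValueError("retained tree weights must sum to a positive value")
    n_features = len(importances[0])
    return [
        sum(weights[i] * importances[i][j] for i in kept) / total
        for j in range(n_features)
    ]


def rank_features(scores: Sequence[float]) -> List[int]:
    """Feature indices ordered from most to least important."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
```

Because low-weight trees are dropped before averaging, a tree that fits the expanded data poorly contributes nothing to the final ranking, which is one plausible reading of how a weighted ensemble could stabilize the ranking.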
© 2022 Journal of Northwestern Polytechnical University. All rights reserved.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.