Malware Detection Using a Random Forest Method Trained on a Balanced Synthetic Dataset

Authors

N. O. Matsobane, S. Mokwena

DOI:

https://doi.org/10.54327/set2025/v5.i1.167

Keywords:

malware detection, accuracy, random forest, flow-based model, balanced dataset, synthetic dataset

Abstract

The accuracy of malware detection depends heavily on the available datasets, which are often small and imbalanced. To address these limitations, this study proposes a method that increases the size and class balance of the data by generating several synthetic malware datasets with a flow-based model, and then fits a random forest classifier on the augmented dataset. The study analyzes the generation of synthetic data with flow-based models and the impact of that data on the performance of a random forest for malware detection. A flow-based model was used to generate a balanced synthetic dataset based on the CICMalDroid2020 dataset, and the generated data was used for feature selection and engineering to optimize the random forest model. The experimental results demonstrate the effectiveness of the proposed approach. The flow-based model generated an additional 13,402 samples, more than doubling the original dataset of 11,598 entries. After training on the synthetically augmented dataset, the random forest model outperformed the evaluation on the original dataset, reaching a precision of 93%, a recall of 100%, a balanced precision of 96%, and an F1 score of 91%. The results indicate that synthetic data generated by a flow-based model can significantly enhance malware detection capabilities.
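The pipeline outlined in the abstract (balance the training data with synthetic samples, fit a random forest, report classification metrics) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. It uses a toy dataset from scikit-learn's `make_classification` rather than CICMalDroid2020. The hypothetical helper `balance_with_synthetic` draws from a per-class diagonal Gaussian as a simple stand-in for the flow-based generator. It reports scikit-learn's balanced accuracy, since the library has no metric named "balanced precision".

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (balanced_accuracy_score, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split


def balance_with_synthetic(X, y, rng):
    """Top up every class to the size of the largest class.

    Assumption: a diagonal Gaussian fitted per class stands in for the
    generator. This is NOT a flow-based model; it only marks the place
    where a trained flow-based sampler would be plugged in.
    """
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    X_parts, y_parts = [X], [y]
    for cls, count in zip(classes, counts):
        missing = int(target - count)
        if missing == 0:
            continue
        X_cls = X[y == cls]
        synthetic = rng.normal(
            loc=X_cls.mean(axis=0),
            scale=X_cls.std(axis=0) + 1e-9,  # avoid a zero scale
            size=(missing, X.shape[1]),
        )
        X_parts.append(synthetic)
        y_parts.append(np.full(missing, cls))
    return np.vstack(X_parts), np.concatenate(y_parts)


rng = np.random.default_rng(0)

# Toy imbalanced data (roughly 90% class 0, 10% class 1); not malware data.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Only the training split is augmented; the test split stays untouched.
X_bal, y_bal = balance_with_synthetic(X_train, y_train, rng)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_bal, y_bal)
y_pred = clf.predict(X_test)

metrics = {
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "balanced_accuracy": balanced_accuracy_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Keeping the synthetic samples out of the test split is a design choice of this sketch: it means the reported scores reflect performance on original data only.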

Published

28.03.2025

Data Availability Statement

The dataset is available at the following link: https://www.unb.ca/cic/datasets/maldroid-2020.html

Section

Research Article

How to Cite

[1] N. O. Matsobane and S. Mokwena, “Malware Detection Using a Random Forest Method Trained on a Balanced Synthetic Dataset”, Sci. Eng. Technol., vol. 5, no. 1, pp. 203–212, Mar. 2025, doi: 10.54327/set2025/v5.i1.167.
