Taiwan
This study applies natural language processing techniques to build a model that identifies malicious software. Two datasets were used, one of PE (Portable Executable) files and one of ELF (Executable and Linkable Format) files. Each contains both benign and malicious files, and the malicious samples span a diverse range of malware families. The files were disassembled and preprocessed, and the resulting assembly language files were treated as text data for training a model to distinguish benign from malicious programs. The study compared the performance of several models, including bag-of-words models, sequence models, and BERT, as well as different n-gram sizes.
The results indicate that the bag-of-words model performs best with multi-hot encoding, achieving an F1-score of 96.87% on the PE dataset. Among the sequence models, the transformer encoder with positional encoding yields the best results. Across n-gram sizes, the multi-hot bag-of-words model and the TF-IDF bag-of-words model achieve the highest F1-scores with 2-grams and 5-grams, respectively.
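
To make the two bag-of-words encodings concrete, the following is a minimal sketch of how a disassembled program might be turned into a multi-hot vector and a TF-IDF vector over n-grams. It is illustrative only: the toy opcode sequences, the function names, and the particular TF-IDF variant (relative term frequency times the log of inverse document frequency) are assumptions, not the tokenization or weighting actually used in the study.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return the contiguous n-grams (as tuples) of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_vocab(docs, n):
    """Assign a column index to every n-gram seen in the corpus."""
    vocab = {}
    for tokens in docs:
        for gram in ngrams(tokens, n):
            vocab.setdefault(gram, len(vocab))
    return vocab

def document_frequency(docs, n):
    """Count in how many documents each n-gram occurs at least once."""
    df = Counter()
    for tokens in docs:
        df.update(set(ngrams(tokens, n)))
    return df

def multi_hot(tokens, vocab, n):
    """1 if the n-gram occurs in the document, 0 otherwise (counts ignored)."""
    vec = [0] * len(vocab)
    for gram in ngrams(tokens, n):
        if gram in vocab:
            vec[vocab[gram]] = 1
    return vec

def tf_idf(tokens, vocab, n, doc_freq, num_docs):
    """Relative term frequency times log(N / df); one common TF-IDF variant."""
    counts = Counter(g for g in ngrams(tokens, n) if g in vocab)
    total = sum(counts.values())
    vec = [0.0] * len(vocab)
    for gram, count in counts.items():
        idf = math.log(num_docs / doc_freq[gram])
        vec[vocab[gram]] = (count / total) * idf
    return vec

# Hypothetical opcode sequences standing in for two disassembled files.
docs = [
    ["push", "mov", "call", "pop", "ret"],
    ["push", "mov", "xor", "call", "ret"],
]
n = 2  # 2-grams; other sizes such as 5 work the same way on longer inputs
vocab = build_vocab(docs, n)
doc_freq = document_frequency(docs, n)

hot_vec = multi_hot(docs[0], vocab, n)
tfidf_vec = tf_idf(docs[0], vocab, n, doc_freq, len(docs))
print(hot_vec)
print(tfidf_vec)
```

In this toy corpus the 2-gram ("push", "mov") appears in both files, so its TF-IDF weight is zero while its multi-hot entry is still 1. This is the main practical difference between the two encodings: multi-hot records only presence, whereas TF-IDF down-weights n-grams that are common across the corpus.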