How to Compute the Clusterization of a Very Large Dataset of Malware with Open Source Tools for Fun & Profit?

Botconf 2017
Wednesday
2023-04-27 | 10:30 – 11:10

Robert Erra 🗣 | Sébastien Larinier 🗣 | Alexandre Letois | Marwan Burelle

Malware is now developed at an industrial scale, and human analysts need automatic tools to help them.
We present the results of our experiments on a difficult problem: how to cluster a very large set of malware samples, using only static information, so that new malware can then be classified. To cluster a set of (numerical) objects is to group these objects into meaningful categories. We want objects in the same group to be closer (or more similar) to each other than to those in other groups. Such groups of similar objects are called clusters. When data are labeled, this problem is called supervised clustering. It is a difficult problem, but easier than the unsupervised clustering problem we face when data are not labeled.
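As an illustration of the definition above, here is a minimal sketch in plain Python of grouping numerical objects so that each one ends up nearest to the center of its own group, and then classifying a new object by its nearest center. It uses a naive k-means loop on made-up two-dimensional vectors. The helper names (`kmeans`, `assign`), the toy data, and the choice of k-means itself are assumptions made for illustration, not the method used in the talk.

```python
import math

def assign(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def kmeans(points, k, iterations=20):
    """Naive k-means; hypothetical helper for illustration only."""
    # Deterministic initialisation: pick k evenly spaced input points.
    step = len(points) // k
    centroids = [points[i * step] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: put every point in the group of its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            groups[assign(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its group.
        updated = []
        for i, group in enumerate(groups):
            if group:
                updated.append(tuple(sum(c) / len(group) for c in zip(*group)))
            else:
                updated.append(centroids[i])  # keep an empty group's centroid
        if updated == centroids:
            break  # converged
        centroids = updated
    return centroids

# Toy "feature vectors": two visually obvious groups (made-up values).
samples = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.25),
           (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]

centroids = kmeans(samples, k=2)
labels = [assign(p, centroids) for p in samples]

# A previously unseen sample is classified by its nearest cluster center.
new_label = assign((4.8, 5.3), centroids)
print(labels, new_label)
```

Real feature vectors extracted from static information would have many more dimensions, and the number of clusters would usually not be known in advance.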
All our experiments were done with code written in Python, mainly using scikit-learn, so you should be able to reproduce the work with your own feature vectors (well, we hope so!).
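For readers who want a starting point with their own feature vectors, the following is a minimal sketch of unsupervised clustering with scikit-learn. The abstract does not say which algorithm or features were used, so everything here is an assumption: `MiniBatchKMeans` is picked only because it scales to large inputs, the synthetic vectors from `make_blobs` stand in for real static features, and the number of clusters is chosen arbitrarily.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for static feature vectors (assumption: 8 numeric features).
X, _ = make_blobs(
    n_samples=600,
    n_features=8,
    centers=3,
    cluster_std=0.5,
    random_state=0,
)

# Scale features so that no single feature dominates the distance computation.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Assumption: 3 clusters; with real data this number has to be estimated.
model = MiniBatchKMeans(n_clusters=3, batch_size=256, n_init=3, random_state=0)
labels = model.fit_predict(X_scaled)

# Classify new samples by assigning them to the nearest learned cluster center.
X_new = X[:5] + 0.01
new_labels = model.predict(scaler.transform(X_new))

print(np.bincount(labels), new_labels)
```

Swapping in another scikit-learn estimator with the same `fit_predict` interface only changes the line that builds `model`.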

We will present some results on our dataset of two million malware samples. We will give some examples of what we have found and discuss interesting directions for future work (that is, problems still to be solved).