How to compute the clusterization of a very large dataset of malware with Open Source tools for Fun & Profit?

Malware is now developed at an industrial scale, and human analysts need automated
tools to help them.

We present the results of our experiments on a difficult problem: how to cluster a very large set of malware, using only static information, so that new malware can then be classified.
To cluster a set of (numerical) objects is to group these objects into meaningful
categories. We want objects in the same group to be closer (or more similar) to each other
than to those in other groups. Such groups of similar objects are called clusters. When
the data are labeled, the problem is called supervised clustering. It is a difficult problem, but
easier than the unsupervised clustering problem we face when the data are not labeled.
All our experiments were done with code written in Python, mainly using
scikit-learn, so you should be able to repeat the work with your own feature
vectors (well, we hope so!).
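
To give an idea of what this looks like in practice, here is a minimal sketch of unsupervised clustering with scikit-learn. It is only an illustration under our own assumptions: the feature vectors are synthetic stand-ins generated with make_blobs, and MiniBatchKMeans, the number of clusters and the batch size are picked for the example. It is not the algorithm or the settings used in our experiments.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for static feature vectors (hypothetical data,
# not the malware dataset discussed in the talk).
X, _ = make_blobs(n_samples=3000, n_features=8, centers=4,
                  cluster_std=0.5, random_state=0)

# Put every feature on a comparable scale before computing distances.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# MiniBatchKMeans updates the centroids from small batches, which keeps
# memory use bounded on large datasets. n_clusters=4 is an assumption
# chosen to match the synthetic data.
model = MiniBatchKMeans(n_clusters=4, batch_size=256, n_init=3,
                        random_state=0)
labels = model.fit_predict(X_scaled)

# Assign a new, unseen sample to the nearest existing cluster.
new_sample = X[:1] + 0.01
new_label = int(model.predict(scaler.transform(new_sample))[0])

print("cluster sizes:", np.bincount(labels, minlength=4))
print("new sample assigned to cluster", new_label)
```

The last step is the one that matters for analysts: once the clusters exist, a new sample is simply assigned to the closest one, with the same scaling applied first.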

We will present results on our dataset of two million malware samples, give some examples of what we found, and discuss
future work that could be interesting to do (that is, problems still to be solved).
