How to compute the clusterization of a very large dataset of malware with Open Source tools for Fun & Profit?

Malware is now developed at an industrial scale, and human analysts need automatic
tools to help them.

We present the results of our experiments on a difficult problem: how to cluster a very large set of malware samples, using only static information, so that new malware can be classified.
To cluster a set of (numerical) objects is to group these objects into meaningful
categories. We want objects in the same group to be closer (or more similar) to each other
than to those in other groups. Such groups of similar objects are called clusters. When
data are labeled, this problem is called supervised clustering. It is a difficult problem, but
easier than the unsupervised clustering problem we face when data are not labeled.
All our experiments were done with code written in Python, mainly using
scikit-learn, so you should be able to reproduce the work with your own feature
vectors (well, we hope so!).
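
To make the idea concrete, here is a minimal sketch of unsupervised clustering with scikit-learn. It is not the pipeline used in our experiments: the feature vectors are synthetic stand-ins, and the choice of MiniBatchKMeans (a k-means variant that fits on small batches and so scales to large datasets), the number of clusters, and all parameter values are assumptions made for illustration only.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for static feature vectors (hypothetical data):
# three well-separated groups of 200 samples, 8 numeric features each.
rng = np.random.default_rng(0)
centers = np.array([[0.0] * 8, [10.0] * 8, [-10.0] * 8])
X = np.vstack([c + rng.normal(scale=1.0, size=(200, 8)) for c in centers])

# Put all features on a comparable scale before computing distances.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit on mini-batches to keep memory use manageable on large datasets.
# n_clusters=3 is known here only because the data is synthetic.
model = MiniBatchKMeans(n_clusters=3, batch_size=256, n_init=3, random_state=0)
labels = model.fit_predict(X_scaled)

# A silhouette score close to 1 means samples are nearer to their own
# cluster than to the other clusters.
score = silhouette_score(X_scaled, labels)

# Assign a previously unseen sample to the nearest existing cluster.
new_sample = np.full((1, 8), 10.0) + rng.normal(scale=1.0, size=(1, 8))
new_label = model.predict(scaler.transform(new_sample))[0]

print(f"silhouette={score:.2f}, new sample assigned to cluster {new_label}")
```

With real malware features the number of clusters is not known in advance, so a value such as `n_clusters` would have to be chosen by comparing quality measures (the silhouette score is one option) across several runs.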

We will present some results on our dataset of two million malware samples. We will give some examples of what we found and discuss future work
that could be interesting (that is, problems still to be solved).

Sébastien Larinier

Security researcher and freelancer
@Sebdraven

OSINT, Python, Malware Analysis, Botnet Tracker, SIEM and IPS/IDS and Threats Expert / co-organizer #BotConf / co-creator of #FastIR