MalwarETL is an ETL (extract, transform, load) pipeline for malware. In other words, it’s trying to be an out-of-the-box malware collection and analysis lab: a way to collect lots of live malware for statistical analysis. You could also use the collected samples for reverse engineering practice and so on, but I’m aiming the system at large-scale statistics over collected malware, i.e. machine learning.
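To make the "ETL" part concrete, here is a minimal sketch of the extract/transform/load shape using only the Python standard library. It is not MalwarETL’s actual code: the function names, the choice of features (SHA-256, size, byte entropy), and the JSON-lines output are all assumptions made for illustration.

```python
import hashlib
import json
import math
from collections import Counter
from pathlib import Path


def extract(sample_dir):
    """Extract: yield (path, raw bytes) for every regular file in sample_dir."""
    for path in sorted(Path(sample_dir).iterdir()):
        if path.is_file():
            yield path, path.read_bytes()


def shannon_entropy(data):
    """Byte-level Shannon entropy in bits per byte (0.0 for empty input)."""
    if not data:
        return 0.0
    total = len(data)
    return sum((n / total) * math.log2(total / n) for n in Counter(data).values())


def transform(path, data):
    """Transform: reduce one sample to a small record of static features.

    The samples are only read as bytes here, never executed.
    """
    return {
        "name": path.name,
        "sha256": hashlib.sha256(data).hexdigest(),
        "size": len(data),
        "entropy": round(shannon_entropy(data), 4),
    }


def load(records, out_path):
    """Load: write one JSON object per line and return how many were written."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record) + "\n")
            count += 1
    return count


def run_pipeline(sample_dir, out_path):
    """Chain the three stages; out_path should live outside sample_dir."""
    records = (transform(path, data) for path, data in extract(sample_dir))
    return load(records, out_path)
```

Calling `run_pipeline("samples", "features.jsonl")` (both paths hypothetical) would produce one feature record per file, which is the kind of table you would then feed into statistics or a model.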


I did this because I wanted to do machine learning and statistical analysis on live malware. The EMBER and SOREL datasets are fantastic, and I don’t want to sound like I’m dissing them at all, but those datasets are deliberately defanged - meaning the executables in those datasets are modified so that they won’t run. If you’re just looking at data inside the dataset, that’s fine, but if you want to compare the malware in those datasets against any executables outside the datasets, you can’t really do it, since the defanging process will be an enormous bias (and an easily detected signal). That makes me a bit sad. So, while I think the EMBER and SOREL datasets are excellent tools, and I’m happy that Endgame security and Sophos released them, those datasets don’t really cover what I want to do with malware in the long run. So… I need to get my own stuff.

What do I need to use it?

You’ll need a couple things in order to run MalwarETL:


The instructions under the malwarETL-k8s repository should give you a full checklist of what you need to do to install a full MalwarETL setup. The only things left out of that are the hardware setup (installing Ubuntu on the VM servers, setting up the file shares) and the Prefect setup. The instructions for installing Prefect and getting it working are here.

Anything else I should know?

Yes. If you can, please donate to one or more of the data sources that MalwarETL uses, such as VXUG. These are usually personal projects that run on a shoestring budget, so if you’re getting value from their work, it only seems fair to give back to them.