lqdc13 | 8 years ago

Having worked on a machine learning-based AV for several years, I'd like to point out that the dataset choice here is extremely important, and theirs seems pretty small considering the number of possible variations and the choice of model.

What happens in the wild is that a single malware author releases a lot of very similar polymorphic or differently compiled samples, so identifying them ends up being trivial. For example, the model could have picked up on a small icon that is common to half of the malware, or on some internal library that is used in a large portion of it. Then a week later the nature of the malware changes and you would identify far fewer samples.

Another thing to consider is that in many cases, a tiny modification to a known good program can make it malicious, for example changing the update URI. I don't see how they could catch such malware using this method, so the 98% detection rate seems like a very unrealistic number.

Just to present an example:

One can train a simple logistic regression on a few metadata features, where all the training malware comes from one source, and correctly identify almost all of it, while failing to identify malware from most other sources.
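
A minimal sketch of that failure mode, on purely synthetic data. Everything here is an assumption made up for illustration: the two features (`has_shared_icon` and a noise value), the sample counts, and the probabilities. It has nothing to do with the paper's model or dataset. Malware from hypothetical "source A" almost always carries the shared icon, while malware from "source B" never does.

```python
import math
import random

random.seed(0)

def make_samples(n, icon_prob, label):
    """Synthetic samples with two hypothetical metadata features:
    [has_shared_icon, uninformative_noise]."""
    samples = []
    for _ in range(n):
        has_icon = 1.0 if random.random() < icon_prob else 0.0
        samples.append(([has_icon, random.random()], label))
    return samples

def score(weights, bias, x):
    # Linear score; a positive value means "flag as malware".
    return bias + sum(w * xi for w, xi in zip(weights, x))

def train_logreg(data, steps=500, lr=1.0):
    """Plain full-batch gradient descent on the logistic loss."""
    weights, bias = [0.0, 0.0], 0.0
    n = len(data)
    for _ in range(steps):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for x, y in data:
            p = 1.0 / (1.0 + math.exp(-score(weights, bias, x)))
            err = p - y
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def detection_rate(model, samples):
    """Fraction of the given samples that the model flags as malware."""
    weights, bias = model
    flagged = sum(1 for x, _ in samples if score(weights, bias, x) > 0)
    return flagged / len(samples)

# Training set: source A malware (icon in ~95% of samples) vs. benign
# files (icon in ~2% of samples).
train = make_samples(200, 0.95, 1) + make_samples(200, 0.02, 0)
model = train_logreg(train)

# Fresh malware from the same source vs. malware from a source that
# never uses the icon.
rate_a = detection_rate(model, make_samples(200, 0.95, 1))
rate_b = detection_rate(model, make_samples(200, 0.0, 1))
print(f"source A detection rate: {rate_a:.2f}")
print(f"source B detection rate: {rate_b:.2f}")
```

The classifier ends up keying almost entirely on the icon feature, so a headline detection rate measured on source A says very little about source B.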

Having said that, it's a pretty cool novel approach and I'd love to try it.

EdwardRaff|8 years ago

Hi, paper author here!

The dataset is small by AV standards, but we aren't an AV company. We can only use as much as real AV companies are willing to share with us. If you'd like to share more, we would be happy to take it :)

The model is fairly robust to new data, and we tested it with malware from a completely separate source than our training data, so there shouldn't be any shared items like icons between the training set and the 2nd testing set. However, we aren't arguing that it is of AV quality today. The main purpose of this research was to get a neural network to train on this kind of data at all, as that is nontrivial and common tools (like batch-norm) didn't translate to this problem space.

We are looking at the modification issue! I can't share any results yet since we have to go through pre-publication review, but the issue isn't unknown to us!

tedivm|8 years ago

Did you talk with VirusTotal? That's probably the largest dataset out there that isn't controlled by an AV company.

rbanffy|8 years ago

What is the size of the ClamAV dataset?

lqdc13|8 years ago

Awesome. I hope I'm wrong and I'm looking forward to trying out your approach!