(Submitted on 1 Aug 2019)

Abstract: N-grams have been a common tool for information retrieval and machine
learning applications for decades. In nearly all previous works, only a few
values of $n$ are tested, with $n > 6$ being exceedingly rare. Larger values of
$n$ are not tested due to computational burden or the fear of overfitting. In
this work, we present a method to find the top-$k$ most frequent $n$-grams that
is $60\times$ faster for small $n$, and can tackle large $n \geq 1024$. Despite
the unprecedented size of $n$ considered, we show how these features still have
predictive ability for malware classification tasks. More importantly, large
$n$-grams provide benefits in producing features that are interpretable by
malware analysts, and can be used to create general-purpose signatures
compatible with industry-standard tools like Yara. Furthermore, the counts of
common $n$-grams in a file may be added as features to publicly available
human-engineered features, rivaling the efficacy of professionally developed
features when used to train gradient-boosted decision tree models on the EMBER
dataset.
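To make the top-$k$ frequent $n$-gram problem concrete, here is a minimal exact-counting sketch over byte strings. This is only a naive baseline for illustration, not the paper's method: the function name `top_k_ngrams` and the toy corpus are invented here, and an exact `Counter` over all $n$-grams is precisely the memory-heavy approach that the paper's faster hashing-based technique is designed to avoid at large $n$.

```python
from collections import Counter

def top_k_ngrams(files, n, k):
    """Count byte n-grams across files and return the k most frequent.

    Naive exact baseline: stores every distinct n-gram, which becomes
    infeasible for large n and large corpora.
    """
    counts = Counter()
    for data in files:
        # Slide a window of length n over the raw bytes of each file.
        for i in range(len(data) - n + 1):
            counts[data[i:i + n]] += 1
    return counts.most_common(k)

# Toy example: the 4-gram b"abcd" appears three times across two strings.
corpus = [b"abcdabcd", b"abcdxyz"]
print(top_k_ngrams(corpus, n=4, k=1))  # -> [(b'abcd', 3)]
```

The surviving top-$k$ byte strings can then serve directly as candidate signature content (e.g., string patterns in a Yara rule) or as count-valued features for a classifier.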

Submission history

From: Edward Raff [view email]

[v1]
Thu, 1 Aug 2019 03:58:11 UTC (592 KB)
