With this article, we are starting a new series on the Engineering blog: Know Your YARA Rules. In this series, we would like to share tips and tricks we have learned from using YARA daily. We aim to pick lesser-known facts about YARA and how it works, so you can write even better YARA rules that are fast and precise in detection, because no one likes slow rules with false positives. This first post looks at how YARA works internally and how those internals lead to common misconceptions, slow rules, and problems with too many matches.
Motivation
Creating a good rule that correctly detects the described family and does not cause false positives requires skill and experience. Even after all the necessary static and behavioral information is extracted and described in the rule, another problem can arise: scanning performance. YARA has several mechanisms for detecting potentially slow scanning, and it generates warnings such as:
# Running YARA with a rule over the input directory (recursively)
$ ./yara -r rule.yar input_directory
warning: rule "family_XYZ" in rule.yar(x): string "$re" may slow down scanning
These warnings are based on heuristics that evaluate the quality of the rules. They tell us that a rule could be written more efficiently, but they often do not say what change is needed to improve performance. To make matters worse, rules that produce warnings cannot be used in some systems, such as VirusTotal Hunting (https://www.virustotal.com/gui/hunting-overview).
For these reasons, it is a good idea to understand what the warnings and errors are trying to tell us. However, it is not always so simple. Sometimes, even experienced YARA rule writers wonder what is wrong with their rules and how they can fix them.
In this series, we will go through some problematic cases that we have encountered at Gen, and we will provide the solutions we came up with.
YARA Internals
To get the most out of the tips and tricks in this and the following posts, we need to look under the hood of YARA and walk through how a rule is evaluated. For this purpose, let us consider a simple YARA rule:
rule example_rule
{
    meta:
        author = "Dominika Regeciova"
    strings:
        $str = "Hello World!" fullword nocase
        $re = /abcd[x-z]/
        $hex = { 63 62 61 }
    condition:
        $hex at 0 or
        $re or
        $str
}
In this rule, the condition consists of three expressions, and at least one of them must be true for a sample to match. Either the hexadecimal string is found at offset 0 (the beginning of the file), or the regular expression is found anywhere in the sample, or the text string is found delimited by non-alphanumeric characters (ignoring the case of the characters). If a sample meets the condition, YARA reports it as a match.
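To make the meaning of the three expressions concrete, here is a minimal Python sketch that mimics what example_rule accepts. It is only an approximation built on Python's re module, not how YARA evaluates the rule, and the function name example_rule_matches is our own.

```python
import re

# $str: "Hello World!" with fullword (no alphanumeric neighbor) and nocase
STR = re.compile(rb"(?<![A-Za-z0-9])hello world!(?![A-Za-z0-9])", re.IGNORECASE)
# $re: the regular expression, searched anywhere in the data
RE = re.compile(rb"abcd[x-z]")
# $hex: the bytes 63 62 61, which must sit at offset 0
HEX = bytes.fromhex("636261")


def example_rule_matches(data: bytes) -> bool:
    """Approximate the condition: $hex at 0 or $re or $str."""
    return (
        data.startswith(HEX)
        or RE.search(data) is not None
        or STR.search(data) is not None
    )
```

For instance, a buffer starting with cba matches through $hex, while xHello World! does not match through $str, because the x breaks the fullword requirement.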
The evaluation of this rule can be split into four steps: selection of atoms from strings, creation of the Aho-Corasick automaton, the bytecode engine run, and evaluation of conditions.
Atoms Selection from Strings
Atoms are selected from all the strings of all rules. Atoms are substrings with lengths from zero to four bytes. YARA has several heuristics for choosing the most unique and, thus, most effective atoms. In our example, from the regular expression /abcd[x-z]/, YARA will select the part abcd.
The heuristics, however, have their limits, and for strings that are too general, such as /\w*/, even zero-length atoms can be chosen. This is problematic because all the matching is then left to a later, much slower stage, and the input is searched byte by byte. This leads to a warning about slow scanning and limits the rule's usability in systems like VirusTotal Hunting.
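The idea of picking the best short substring can be illustrated with a toy sketch. The scoring below is purely our assumption for illustration and is not YARA's actual heuristic; the names atom_quality, pick_atom, and COMMON_BYTES are hypothetical. The input is the fixed (literal) part of a string, so a pattern with no fixed bytes yields an empty literal.

```python
MAX_ATOM_LENGTH = 4  # atoms are at most four bytes long
COMMON_BYTES = {0x00, 0x20, 0xFF}  # assumption: bytes considered low-value


def atom_quality(atom: bytes) -> int:
    """Toy score: common bytes earn little, distinct bytes earn a bonus."""
    score = sum(1 if b in COMMON_BYTES else 4 for b in atom)
    score += 2 * len(set(atom))
    return score


def pick_atom(literal: bytes) -> bytes:
    """Return the best-scoring window of up to MAX_ATOM_LENGTH bytes."""
    if len(literal) <= MAX_ATOM_LENGTH:
        return literal  # may be empty: a zero-length atom
    windows = [
        literal[i:i + MAX_ATOM_LENGTH]
        for i in range(len(literal) - MAX_ATOM_LENGTH + 1)
    ]
    return max(windows, key=atom_quality)
```

With this scoring, pick_atom(b"abcd") returns abcd unchanged, and an empty literal produces the zero-length atom that forces the slow byte-by-byte search described above.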
Aho-Corasick Automaton
All selected atoms are used to build an automaton based on a prefix tree, called the Aho-Corasick automaton. This automaton works as a sieve: it quickly scans through every input and finds all potential matches for the atoms. This is a crucial step that determines how fast or slow the scanning is. If we have too many general atoms, many positions, possibly every single byte, are flagged as potential matches and must be inspected further. We want to limit the set of potential matches so that the rest of the evaluation is much faster. Below is an example of the Aho-Corasick automaton for atom abcd.
The basic idea behind the Aho-Corasick automaton is that it efficiently finds every occurrence of multiple strings in a single pass over the input, even when the matches overlap. Starting in the root state, the automaton reads one symbol from the input at a time and changes states accordingly. An atom is found in the input whenever a final state is reached.
Overlapping matches can be found thanks to the so-called failure function. The failure function is used when there is no transition for the combination of the current state and the input symbol. The default action is to return to the root state and try again. However, if the symbols read so far end with a prefix of another string, we can continue from that prefix instead. In our example, all failure transitions point to the root because all characters of abcd are different, but we can choose another group of strings to demonstrate this situation.
Let us say we are looking for the words she and his in the text shisa (Chinese guardian lions, https://en.wikipedia.org/wiki/Shisa). We start by matching sh from the keyword she. Next we need e, but we read i instead. We could return to the root, but because we know we have already read h, we can continue matching his instead. And indeed, we find his in the rest of the text.
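The shisa example can be reproduced with a compact textbook version of the algorithm. This is a minimal sketch of Aho-Corasick in Python, not YARA's implementation; the function names build_automaton and find_all are our own.

```python
from collections import deque


def build_automaton(keywords):
    """Build the prefix tree, then add failure links breadth-first."""
    goto = [{}]    # goto[state] maps a symbol to the next state; 0 is the root
    fail = [0]     # fail[state] is the state to fall back to
    output = [[]]  # keywords recognized when a state is reached
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(word)
    queue = deque(goto[0].values())  # children of the root fail to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # inherit keywords that end at the fallback state
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def find_all(text, keywords):
    """Return (start_offset, keyword) for every occurrence in text."""
    goto, fail, output = build_automaton(keywords)
    state, matches = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]  # follow failure links instead of restarting
        state = goto[state].get(ch, 0)
        for word in output[state]:
            matches.append((i - len(word) + 1, word))
    return matches


print(find_all("shisa", ["she", "his"]))  # [(1, 'his')]
```

After reading sh, the symbol i has no transition in the she branch, so the failure link moves the automaton to the state for h in his, and the match is completed without rereading the input.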
Bytecode engine
The bytecode engine gets the list of potential matches and verifies which of them are true matches based on the full definition of the strings (including modifiers like wide, nocase, and others). This process takes time, so, as we discussed before, keeping the list of potential matches as accurate as possible pays off. YARA also reports the position in the input where a match is found. In our example, there is a match at position 0x0.
0x0: abcdx => match
0x7: abcdf => not a match
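The two-stage idea, a cheap atom search followed by a full verification, can be sketched in a few lines of Python. This is only an illustration using the re module instead of YARA's bytecode engine; the input buffer is a hypothetical one chosen so that the atom abcd appears at offsets 0x0 and 0x7, and verify_candidates is our own helper name.

```python
import re

FULL_PATTERN = re.compile(rb"abcd[x-z]")  # full definition of $re


def verify_candidates(data: bytes, candidates):
    """Keep only the offsets where the full pattern really matches."""
    return [
        offset for offset in candidates
        if FULL_PATTERN.match(data, offset)
    ]


# Hypothetical input: the atom "abcd" occurs at 0x0 and 0x7
data = b"abcdx__abcdf"
# Stage 1: every occurrence of the atom is a potential match
candidates = [m.start() for m in re.finditer(rb"abcd", data)]
# Stage 2: only offsets where the whole string matches survive
print([hex(offset) for offset in verify_candidates(data, candidates)])  # ['0x0']
```

The more candidates the first stage produces, the more often the expensive second stage has to run, which is why good atoms matter.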
Evaluation of Conditions
Evaluating the condition for the whole rule is the last step. It is important to remember that conditions are evaluated after the strings are matched. If you limit the size of the files you want to match with the filesize keyword, it will not prevent the previous steps from being executed when your rules contain strings to match. The good news is that conditions use short-circuit evaluation, so changing the order of expressions in the condition can positively impact performance.
In our example, we found the match at position 0x0, and the condition evaluates to true.
Rulesets
Very often, we work not with a single rule but with several rules at once. YARA handles them efficiently and scans with all of them in one pass rather than one rule at a time. However, this approach has a disadvantage. Scanning with a ruleset works the same way as scanning with a single rule: the atoms from all strings of all rules go through the same steps. This means that if you have a set of fast rules and add one slow string to one rule, the whole ruleset will be slowed down.
The filesize doesn’t rule them all
Now, let us apply our knowledge of YARA evaluation to a practical example. It shows a widespread mistake that can make your rules really slow. Let's say you want to use the following rule:
rule slow_rule
{
    strings:
        $re = /very slow RE/ // RE like \w*, ...
        $str = "some basic string"
    condition:
        filesize < 512 and ($str or $re)
}
As we explained in the previous section, the condition part is evaluated AFTER the strings.
It would be nice if YARA evaluated filesize first and searched for the text string and the regular expression only afterward. However, that is not the case. There is an ongoing discussion with the main author of YARA (https://github.com/VirusTotal/yara/discussions/1781#discussioncomment-3522238) about a new approach to this problem, so there is hope for a future version of YARA, but only time will tell.
Currently, in this rule, both the regular expression and the text string are processed first. Atoms are selected from them, and the Aho-Corasick automaton is created and run, returning all potential matches. That means the slow regular expression will be searched for in every file, even if the filesize part of the condition is false. One slow string can slow down a rule or even a whole ruleset, so be careful about what you are matching. In the next post, we will bring more recommendations on how to optimize the matching of specific strings.
This does not mean that filesize cannot improve your rules. If you have more complex conditions that work with slower modules (cuckoo, magic, …), putting the filesize check first in the condition is a good idea. If the filesize part of the condition is false, short-circuit evaluation kicks in, and the rest of the condition is not evaluated.
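The difference between what filesize can and cannot skip can be modeled with a small Python sketch. It is a simplified model of the evaluation order described in this post, not YARA code; scan_strings, slow_module_check, and the calls counter are hypothetical names used to count how often each stage runs.

```python
calls = {"string_scan": 0, "slow_module": 0}


def scan_strings(data: bytes) -> bool:
    """Stand-in for the string matching stage."""
    calls["string_scan"] += 1
    return b"some basic string" in data


def slow_module_check(data: bytes) -> bool:
    """Stand-in for an expensive module call inside the condition."""
    calls["slow_module"] += 1
    return True


def scan(data: bytes, max_size: int = 512) -> bool:
    # Step 1: string matching always runs, whatever the condition says.
    str_found = scan_strings(data)
    # Step 2: the condition is evaluated with short-circuiting, so a
    # failed size check skips the expensive module call.
    return len(data) < max_size and slow_module_check(data) and str_found


scan(b"A" * 1024)  # too large: strings were scanned, module call skipped
print(calls)       # {'string_scan': 1, 'slow_module': 0}
```

In this model, the size check saves the module call for the oversized input, but the string scan has already been paid for by the time the condition is evaluated.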
Conclusion
In this post, we explained how YARA works internally and described one of the most common misunderstandings about rule evaluation.
The main takeaways are:
- The condition part of a rule is evaluated after the strings are matched.
- One slow string can slow down a whole rule or even a ruleset.
- Filtering input data using filesize only helps you skip evaluating the rest of the condition. It will not skip the string matching part.
And that is all for today! We wish you happy YARA rules writing!