Know Your YARA Rules Series: #6 We Present GenRex – A Generator of Regular Expressions 

Following our guide to regular expressions, we present a new tool that can help you create such expressions, mainly those used in the YARA Cuckoo module.

To fully understand the benefits of our new open-source project, we first expand our knowledge about regular expressions in the Cuckoo module, share resources that can come in handy, and explain how to use GenRex for the best results.   

Cuckoo module 

Dynamic analysis is an essential part of our toolkit for malware detection. For that reason, we often use the Cuckoo module and contribute to it, as well as to the tools that support its practical use. If you have never used the Cuckoo module, it can be confusing at first because it works a little differently from basic YARA rules, but it is an efficient module and worth learning.

Let us look at a simple example of a YARA rule using the Cuckoo module:

import "cuckoo" 

rule evil_doer 
{ 
    strings: 
        $some_string = { 01 02 03 04 05 06 } 
    condition: 
        $some_string and 
        cuckoo.network.http_request(/http:\/\/someone\.doingevil\.com/) 
}

To use this rule, we need not only the samples we want to scan but also the dynamic analysis reports for these samples, stored as JSON files (an example of such a report is available here):

./yara -x cuckoo=behavior_report_file rules_file sample_file 

YARA does not run the samples; it uses the data already collected about the sample's behavior. By combining cuckoo functions, strings, and other modules, we can describe multiple characteristics of malware and capture it more precisely.
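To make the data flow concrete, the following Python sketch approximates what a check such as cuckoo.network.http_request does conceptually: it reads the behavior report and tests the recorded request URIs against a regular expression. It is only an illustration, not YARA's implementation. The report layout (a network.http list whose entries have uri and method fields) and the helper name http_request_matches are assumptions made for this example.

```python
import json
import re


def http_request_matches(report: dict, pattern: str) -> bool:
    """Return True if any recorded GET/POST request URI matches the pattern."""
    # Assumed layout: {"network": {"http": [{"uri": ..., "method": ...}, ...]}}
    requests = report.get("network", {}).get("http", [])
    regex = re.compile(pattern)
    return any(
        entry.get("method", "").upper() in {"GET", "POST"}
        and regex.search(entry.get("uri", ""))
        for entry in requests
    )


# A tiny hand-written report standing in for a real behavior report file.
report = json.loads("""
{
  "network": {
    "http": [
      {"uri": "http://someone.doingevil.com/gate.php", "method": "POST"},
      {"uri": "http://example.org/index.html", "method": "GET"}
    ]
  }
}
""")

print(http_request_matches(report, r"http://someone\.doingevil\.com"))
```

The sample is never executed here either: the function only inspects data that the sandbox recorded earlier, which is the same division of work as between the sandbox and YARA.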

You can find more information about other functions in this module in the official documentation. We are also planning a separate blog post about the modules to provide even more tips on using them effectively. Stay tuned for more information. 

Dataset 

For experiments, we recommend checking out our dataset of CAPEv2 reports from labeled malware and cleanware. The malware families include trojans, worms, spyware, and bots. The latest version can be found on the GitHub page.

We also provide a list of clean named objects that can be used for filtering and an extension of the YARA Cuckoo module to detect API calls, atoms, resolved APIs, and semaphores. On top of that, you can compare the number of matches found in the report against a threshold, as in the following example:

import "cuckoo" 

rule test_cuckoo_extended 
{ 
    condition: 
        cuckoo.genrex.api_call(/WNetGetProviderNameW/) >= 3 or 
        cuckoo.genrex.atom(/rOBDoI/) >= 3 or 
        cuckoo.filesystem.file_access(/(^|\\)C:\\Users\\[^\\]+/) >= 12 or 
        cuckoo.registry.key_access(/(^|\\)Software\\Downloader/) or 
        cuckoo.sync.mutex(/kzyyjqyi/) >= 1 or 
        cuckoo.genrex.resolved_api(/iertutil.dll!#16/) >= 3 or 
        cuckoo.genrex.semaphore(/LJpExtC8rffiNYPa94/) >= 2 
}

GenRex is a Python library that uses Poetry for package management. To run GenRex, you need Python 3.10 or newer.

You can either use GenRex directly from the code available on the GitHub page or install it with pip:

pip install genrex-py 

On the main page of our repository, there are two elementary examples of how to use GenRex. You can use GenRex as part of your pipeline, but the most straightforward way to get results is the following: 

import genrex

# Map each sample name to the list of strings collected from that sample.
results = genrex.generate(source={"sample1": [strings,...], "sample2": [...]})
print("Results:")

for result in results:
    print(result)

For each regular expression created from the input (accessible as result["regex"]), GenRex also generates several additional characteristics, such as statistics on how prevalent the strings behind the regular expression were in the samples. Based on that information, you can write rules that specify, for example, that the family creates a set of 3 atoms covered by a specific regular expression.

import genrex

results = genrex.generate(
    input_type=genrex.InputType.MUTEX,
    source={
        "source1": [
            "aabcmalware7992",
            "adeemalware3022",
            "aefdmalware1896"],
        "source2": [
            "bfbcmalware5996",
            "bbcamalware4508"],
    })

print("Results:")
for result in results:
    print(result)

Results: 
Regex: (^|\\)[0-9a-f]{4}malware[0-9]{4}$ 
Ngram: malware 
Unique: 5 
Min: 2 
Max: 3 
Average: 2.5 
Resources: ['aabcmalware7992', 'adeemalware3022', 'aefdmalware1896',
'bbcamalware4508', 'bfbcmalware5996']
Originals: [] 
Named object type: mutex 
Hashes: ['source1', 'source2']
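One way to act on these statistics is to turn them into a rule condition. The sketch below builds the text of a YARA rule that requires at least Min (here 2) mutexes matching the generated expression, using the counting comparison from the extended Cuckoo module shown earlier. The helper build_mutex_rule and the rule name are hypothetical, and the values are copied by hand from the output above rather than read from the GenRex result object.

```python
def build_mutex_rule(rule_name: str, regex: str, min_count: int) -> str:
    """Return YARA rule text requiring min_count mutexes matching regex."""
    # Forward slashes delimit YARA regular expressions, so escape them.
    # Assumes the input regex contains no slashes that are already escaped.
    escaped = regex.replace("/", "\\/")
    return (
        'import "cuckoo"\n'
        "\n"
        f"rule {rule_name}\n"
        "{\n"
        "    condition:\n"
        f"        cuckoo.sync.mutex(/{escaped}/) >= {min_count}\n"
        "}\n"
    )


# Values copied from the "Regex" and "Min" fields of the output above.
rule_text = build_mutex_rule(
    "malware_family_mutexes",  # hypothetical rule name
    r"(^|\\)[0-9a-f]{4}malware[0-9]{4}$",
    2,
)
print(rule_text)
```

Using Min as the threshold is a conservative choice: every source in the input had at least that many matching strings, so the rule should still fire on the samples the expression was generated from.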

Even though GenRex is still in development, it has been successfully integrated into our systems and has produced good results. Thanks to this experience, we have a few tips that can help you achieve the best possible results with GenRex:

  • Specify the input type when possible. GenRex has several heuristics for specific input types, such as mutexes, file paths, etc. If you provide this information, GenRex can generate better results because it recognizes the patterns typical of that named object category.
  • An empty result list is also a valid result. GenRex should not create regular expressions at all costs but rather notice interesting characteristics in the input set. If GenRex does not provide sufficient results, try inspecting the input: it may be too random, making common aspects hard to detect. Filtering the input (for example, with the set of clean named objects) can help.
  • You can join all input strings into one list; however, you usually get better results when you tell GenRex which source each string came from. You also get better statistical information that can be used in YARA rules.
  • The goal of GenRex is to generate good-quality regular expressions. However, we still recommend checking them against clean samples to make sure they do not cause false positives.
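The filtering and false-positive tips can be sketched with the standard library alone. The example below first drops strings found in a clean set from the input and then tests a generated expression against the clean names. The clean names and helper functions are made-up placeholders for illustration, not entries from the published list, and Python's re module stands in for YARA's regular expression engine, which may differ in details.

```python
import re

# Hypothetical clean named objects; a real list would be much longer.
# "beefmalware0001" is contrived so that it happens to fit the pattern.
CLEAN_NAMES = {"Local\\ExampleCleanMutex", "beefmalware0001"}


def filter_clean(source: dict[str, list[str]]) -> dict[str, list[str]]:
    """Remove strings that also appear in the clean set, per source."""
    return {
        name: [s for s in strings if s not in CLEAN_NAMES]
        for name, strings in source.items()
    }


def false_positives(regex: str, clean_names: set[str]) -> list[str]:
    """Return the clean names that the regular expression would match."""
    compiled = re.compile(regex)
    return sorted(n for n in clean_names if compiled.search(n))


source = {
    "source1": ["aabcmalware7992", "Local\\ExampleCleanMutex"],
    "source2": ["bfbcmalware5996"],
}
print(filter_clean(source))
print(false_positives(r"(^|\\)[0-9a-f]{4}malware[0-9]{4}$", CLEAN_NAMES))
```

The filtered dictionary keeps the per-source grouping, so it could be passed on as the source argument, and a non-empty list from false_positives is a signal to review the expression before using it in a rule.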

Conclusion 

We are excited to open-source and share this new project with the YARA community. We hope you find GenRex interesting and useful. We look forward to your feedback, and if you have any ideas or encounter any issues, don’t hesitate to contact us. Also, feel free to open issues or pull requests — we appreciate it.  

If you are interested in GenRex and want to know more about it and the research behind it, the paper GenRex: Leveraging Regular Expressions for Dynamic Malware Detection will be published in the coming months as part of the IEEE TrustCom 2023 proceedings. You can also request a preprint through ResearchGate or contact Dominika Regéciová on Twitter.

As a very fresh update while writing this post, we are thrilled to announce that our proposal for the Botconf conference, titled GenRex Demonstration: Level Up Your Regex Game, has been accepted. This exciting opportunity allows us to showcase practical examples and real-world applications of our GenRex tool. Join us from the 24th to the 26th of April in Nice, France at the conference, where we’ll delve into the intricacies of GenRex and demonstrate how it can significantly enhance your approach to regular expressions. We look forward to sharing insights, exchanging ideas, and engaging with the vibrant community at Botconf.  

To conclude this blog post, the main takeaways are:    

  • The Cuckoo module is a convenient tool for behavioral analysis, and it implements additional features for regular expression matching.    
  • GenRex is an open-source project that can help you generate regular expressions for the Cuckoo module and other use cases.  
  • The source of the project can be found on the GitHub page: https://github.com/avast/genrex  

And that is all for today! We wish you happy YARA rules writing!