LABScon Replay | Star-Gazing: Using a Full Galaxy of YARA Methods to Pursue an Apex Actor

This must-see talk discusses a highly-regarded but rarely publicly investigated threat actor, malware similarity, and YARA. Publicly available data yields just a generic AV signature with the actor’s name, leaving a void for malware analysts looking to understand the overlaps between different malware families attributed to the same actor.

Greg Lesnewich explores how analysts can use YARA as an analyzer with the console output, leveraging some simple Python scripting, to develop a malware similarity methodology. With a little – but not too much! – effort, analysts can easily build their own custom malware analysis toolkits using nothing other than freely available open source projects.

Greg’s presentation highlights just how well YARA can be used to pursue an apex predator and contains plenty of examples and links to all the tools used in the talk. Greg also shares the custom tooling he built as he analyzed a notorious threat actor, which can easily be adopted or adapted by other analysts to suit their own purposes.

Star-Gazing | Using a Full Galaxy of YARA Methods to Pursue an Apex Actor | By Greg Lesnewich (Proofpoint): Audio automatically transcribed by Sonix

Star-Gazing | Using a Full Galaxy of YARA Methods to Pursue an Apex Actor | By Greg Lesnewich (Proofpoint): this mp4 audio file was automatically transcribed by Sonix with the best speech-to-text algorithms. This transcript may contain errors.

Greg Lesnewich:
Hello, everyone. Thank you. To the lab’s organizers, to Ryan, to JAGS, everyone at S1, all the event staff for this amazing event. I think I’m not the only one that’s been enjoying a week here so far.

Greg Lesnewich:
My name is Greg. I work at an email company called Proofpoint. That is, my job is primarily doing what Victor does following me, chasing the L word out of our email data. And today’s talk is nothing about that. So before we start, this talk does discuss a bit of a taboo actor, which I track as Bright Constellation. But there are a litany of disclaimers that my wife and our company mandated that I say I do not discuss the incident responses. We do not actively pursue this actor. There are no leaked documents herein; this is personal research and although the actor is something that was a little bit attention grabbing previously, it was mostly sort of a interesting piece of data to explore developing a malware similarity via YARA.

Greg Lesnewich:
So there are going to be some musical references scattered throughout here that link to the naming of the malware families themselves. If you can figure them out, you can take a shot after the talk with me. So first, I think that YARA and a lot of parts of this conference only happen from learning through one another and being people being open and willing to share and teaching others. And so the list of humans and robots far exceeds this slide that have helped me to really learn and understand and develop some better ideas for detection ideas.

Greg Lesnewich:
A few that I want to call out today are Connor McLaughlin, Arielle and Costin from Kaspersky Xorex and of course, our pal Steve Miller. And so getting to the elephant in the room, our subject today is the Lamberts. Those I think everybody here probably knows who they are because Juan knows who they are. And at least in my time in the Threat Intel space, they have been maybe the highest regarded actor that I’m aware of. Juan has talked on and on and on about their amazing multi framework toolkit and their incredible operational security and their awesome tradecraft. And so I’m building on a lot of the work that Symantec and Kaspersky and previously FireEye had published about. But my interest in them is basically only because I knew that if I submitted about them, Juan was likely to accept my talk.

Greg Lesnewich:
So the Lamberts present a little bit of an interesting problem for us as an industry. Kaspersky had this amazing Kaspersky and Symantec really had this amazing series of very interesting actors with white papers getting published about them, like Equation like Project Sauron, like Stuxnet, Dooku, Name one. And they had all these really rich papers discussing the malware and doing all these sorts of deep technical analysis that you could walk away with an understanding of what was happening.

Greg Lesnewich:
And Kaspersky has no reason to like, I’m not putting throwing shade on them, but their paper about the Lamberts was noticeably shorter. There weren’t a lot of hashes published with it, but they did have this cool chart showing the constellation of the Lamberts toolkit that, you know, there wasn’t a white paper to sort of support the linkages or highlight what was going on there, which to me presented a pretty interesting opportunity because if you go on VirusTotal, there is a detection across ESET and Kaspersky that just says Lamberts, but it unfortunately is not linked to any of the colors listed there. So it presented kind of a fun black box for us to play with.

Greg Lesnewich:
And so I think like most other threat intel analysts, this is a familiar sight. After another vendor publishes a report, you have a list of files that if they didn’t publish a YARA rule or some other form of detection, you just sort of have to figure out detection in your own environment. And so, yeah, this is our starting point, I think like a lot of other investigations.

Greg Lesnewich:
And so the initial methodology and what we’re going to walk through a few different steps that I took that I thought was decently valuable. I’m going to take a macro view of all of the 50 samples that were available on VirusTotal at the time that I started this.

Greg Lesnewich:
And what we’re going to do is we’re going to rely really heavily on a couple of tools like Yara, particularly its console module. For those of you that aren’t familiar with it, it’s like a console, like anything else, like Python, whatever else. A script that Steve Miller built to sort of wrap the console module called Ronnie, which is a Ronnie Coleman reference that I think one person in this room gets. And then we’re going to use another tool called Binary Refinery to sort of show the evidence of some of the data that we’re working with. And given knowing the crowd here at Labs Con, I’m going to use that as an excuse to really roll really quickly through the first section of the content.

Greg Lesnewich:
So initially. Like most other analysts, you’re looking at samples in bulk. We’re going to look for overlaps across the import hash hashes of the sections, the resources, and then more like developer fingerprints. The PDB path, the DLL name, and then sort of looking at the general geometry of all these files. And so if you take this initial surface area, even for this elite, highly apex actor, we can already start to see some overlaps with these DLL names up here at the top and then some import hashes mixed with DLL .dll there at the bottom.

Greg Lesnewich:
And so one of the things that I want to really highlight in this talk is the codification of what you can do with a local YARA instance, like on an analyst machine and just plug your ideas into console output rules.

Greg Lesnewich:
And so you can have it burp out things like, say, the rich header hash and then use sort and unique to burp out overlaps. And as you work through this and look at at least in this actor in particular, and I think this applies to a lot of them, you can start to start. You can start to see a number of weird overlaps like these DLLs mixed with the A PDB paths. And if you iterate and iterate and iterate and you look at things like the resource and section hashes, ignoring that very obvious empty hash there at the top, you do eventually get to start clustering some of the families, notably the PDB path, the export names and. More like general hashing was really good for us to start to cluster some of these families together. And we actually have our first linkage across the malware families to each other with rationalist and cutting ties, sharing this weird smartcard helper string resource. Still don’t know what it means, but it’s sort of a weak link to point these families together. So after this first round, I think of methods and techniques that we’re all familiar with. We’ve had we have 13 families clustered, but we still had 30 to 40 files outside of those folders. So we still had more to do.

Greg Lesnewich:
And it sort of becomes immediately obvious that as you’re doing these things in bulk, using just features doesn’t like really highlight how the samples are related to each other and it’s pretty brittle. So an import hash can change. They can decide to change the name of an export. And so we want to do something a little bit more resilient. And so one of the themes of this talk is going to be, can we do more? Can we do better? And so let’s keep digging in and try and answer that.

Greg Lesnewich:
And I think the golden goose of all of YARA stuff is finding shared code, not from shared features. And the benefit that we have is that we can use YARA’s console output without necessarily needing to use something like a disassembler or a hex editor for every single file. Especially as for more traditional threat intel analysts, you’ll get the files in bulk, not one by one by one. And so if we want to sort of hone in on at least where code is, the PE file format dictates where it is, so we can look for sections just as a first example that are marked as containing code or as memory executable by the file format itself. And then instead of hashing the full thing where there might be padding, there might be differences in data at the end of it.

Greg Lesnewich:
What if we just hash the first 100 hex bytes and call that a sector hash and throwing this at the wall? There was already an easy win there with the eight sector hashes marked at the top compared to the rest of the seven numbered section hashes. So immediately we had something stick. And surprisingly, you know, we can see the data that gets hashed here. There are a lot of these three instructions that might not be the most interesting or unique code, but its position and its positioning, clustering together ends up being unique to a family that we track as rationalist.

Greg Lesnewich:
But can we do better than just blindly hashing data at the start of a section? I hope that the answer is yes. And so there are a couple of other places that the P tells us. There is code mostly at the entry point and the export functions. And so what if we did something silly like using a console rule to hash the first 20 bytes from the entry point on forward? And the other thing that you can end up doing with the console, instead of just putting in like a string, you can put almost the entirety of a YARA rule whenever you’re having sort of these. Maybe your AC failed over the summer and you’re having some weird ideas about how to find malware similarity. You can codify that sort of in the moment that you’re thinking about what you want to do and have it live on on your analyst machine forever and sort of codify that.

Greg Lesnewich:
And so it can sort of be YARA automation, maybe not perfectly, but in a way that you control. And so this, this ends up actually working and allowed us to cluster a few new families. In this instance, a family we tracked as Marianas Trench. You can see that these hashing that first 20 bytes got catches, a lot of conditional jumps and decrement instructions, which it turns out was really useful because the export name changed over a bunch of the samples. But hashing those first 20 bytes with those particular instructions was unique across not only that sample among the other Lamberts and Bright Constellation samples, but across all of VT and my own very small malware repository. So we had some wins from that and we were able to cluster some additional families using some of these sector hasher sector hashing entry point and export hashing methods, namely invisible enemy bloodletter and existence. But there are a lot more functions inside of these PE files, as many of you know, and so coming back to the question, can we do better? Most of those exports and entries entry point functions do call other functions. So how do we get to those?

Greg Lesnewich:
This actually ends up becoming a little bit of a math problem, which took me an embarrassingly long time to sort of figure out. But YARA can loop over a certain set of bytes inside of a file. And so if you pass it, something to look for the entry point and the first 25 bytes after it and look for any relative call instruction, you can modify the bytes that come after that and sort of follow that into the next function and then hash that. So you get a little bit of this idea of provenance, of something getting called from an export or an entry point and then the code that is inside of it.

And so in this in this example, this allows this allowed us to cluster a family that we tracked as escape artist, where YARA iterates over the first 25 bytes of this export and follows both of those functions and hashes them to see if they match that hash. And the second one they do. And what that data ends up being is the first 14 bytes down to that push, 200 instruction. Again, maybe a little surprisingly, this was a completely unique feature to just escape artist. It is code. It might not be like perfect code overlap, but it’s only clustered among these three samples of escape artists and nothing else out there on or in my malware collection. So once it becomes a math problem, you can sort of get into this idea of like tertiary or whatever comes after tertiary function hashing, which you know.

Greg Lesnewich:
If Yara is cooking. Some people like VT have a vested interest to keep their restaurant running smoothly. Doing stuff like this is like brewing beer in your basement as like a personal experiment. So don’t write rules like this and put them on on VT or your own internal tooling because I think that it can be useful for really exploring your own knowledge of where things are coming from and where to find overlaps across a very small set of files. But it may be more of a last resort thing. And trying other things like conditional jumps or absolute calls were pretty useless. But it did get us another family to cluster in.

Greg Lesnewich:
And by this point we have all these families clustered together. There are two that stand out here level or impairment that only have one file in them each, but from that they didn’t fit into any other buckets, so they sort of got deemed to be their own family. But we don’t really have any idea how they relate to each other. And so we’ve sort of reached the limit, in my opinion, of what we can do with just the console. And so we have to sort of expand our tooling and go look in a different direction. And like Philippe, I have us staring at the abyss titled Slide. And so it comes down to the fact that we have to disassemble.

Greg Lesnewich:
And in that disassembly, we also have to and, you know, sort of disassembling meaning going down to the function level of the P. And then we also have to account for changes in the file like to addresses that get called so that way you can wildcard them out and avoid them. But using those functions is kind of a pain in the ass. So in the previous escape artist example, there are 678 functions inside of it. And how do you pick among those which to focus on? Do you take those that have a high cyclomatic complexity? Do you pick those that have a ton of cross references? Making thresholds for those is really difficult because you don’t necessarily have the best idea across all the files of what a large number of cross references is. And so how do you pick which functions to hone in on and sort of thinking about this over the summer, the answer was gifted to me in the form of a guy called Willy Mellenthin and a tool called Floss that I think some of you would be familiar with. There’s a Mandiant tool, so thank you to William Moritz for building it. That does a lot of cool stuff that doesn’t really get talked about. It uses this engine that Visy built called Vivisick to emulate, which is how it follows functions like these ones that’s shown here in the screenshot and then burps out the decoded strings.

Greg Lesnewich:
It turns out that if you use the X flag in the previous version of Floss or the V flag in the current one, you can get not only the offsets of those strings to write a rule on, but you can also then get the likely decoding function. So those end up being, at least in my opinion, decently high fidelity. And so over the summer where Willy Valentine stepped in was that there was they upgraded floss to version 2.0, which exposed it as a Python library. And you can write a function here like Willy kindly did for me when I asked him a question. And his solution was build a tool for me instead of just answering the question. And we can use those as a feeder for disassembly.

Greg Lesnewich:
And so we can. I don’t have an idle license and Risen was was a really good option for us to sort of walk through and disassemble all the files, particularly because it does this Zignature masking which allows you to basically mask out the address and wild card it instead of just just taking the bytes out of each function individually. And so what you get is the golden goose of a decently interesting code base YARA rule. We’re open sourcing this today. The link is going to be up here in a little bit. We’re calling it Floss2YAR because we are not very creative, but this was sort of my solution to looking at the Lambertz toolkit and figuring out how to link the different disparate families together.

Greg Lesnewich:
Like anything else, it has limitations, but we put it out for free. And so if it sucks, I wrote it. So you get what you pay for. There are a bunch of other tools out here that do this too, but I didn’t have a great understanding of how they were doing things like your yard-signator and Binlex are awesome, but there wasn’t really like a direct answer of like. Okay. You know, this is rare, but what’s it doing if you’re going to go through the time of disassembling something like, you might as well have an idea of what it’s doing. And so the benefit of honing in on likely decoding functions is that you get things like this. This is a slide that I blatantly stole from Costin that linked a bunch of these different Lambert families together. And so if you looking for decoding functions gets you things like this with all of these weird XOR move instructions.

Greg Lesnewich:
And so what happens if you keep writing, running this over batches and batches of these files and very quickly failing and finding things that are sort of just generic windows functions. You can start to link these nodes not only with the sort of idea that you have this understanding that the code is similar, but that the functions are actually shared.

Greg Lesnewich:
And so you can iterate and iterate and iterate. Also occasionally running on export functions and you end up landing on this constellation of Lambert’s tools, which looks a little bit cooler if you color in what I suspect the families actually are. Ariel in the back is going to grade my effort here at the end because Kaspersky knows way more about it than I do. But this was sort of my best guess for what these families were. If some were updated versions and sort of mapping to their color coding.

Greg Lesnewich:
So looking at how we did using this method, we were able to link 14 out of the 21 families. There are six families that we left out to dry. So if you subscribe to the D’s or C’s get degrees, you could call it a win. I do. So I am calling it a win. And you know, looking at all of these files in aggregate, a couple of things do end up standing out like they really like running as Windows Services. There’s a lot of interesting functions that build out Windows services that have string names for that sort of spoof advertising corporations, sort of similar things in their C c2’s And it sort of became apparent as I was exploring them that they’re really keen on hiding from a systems administrator that knows what they’re doing and the Windows operating system, sort of general logging and telemetry.

Greg Lesnewich:
There wasn’t a lot of like user evasion where none of their files had like a PDF icon to entice someone to click, nor was there any sort of like direct AV evasion, at least in my analysis of like I guess, 80 files at the end of the day after retro hunting.

Greg Lesnewich:
There are some shortcomings. With this. I’m probably missing a whole litany of files that Kaspersky has sort of discussed in open source, as well as Juan during. He can’t give a conference talk without mentioning them, but I’m probably missing some things. So there are a lot of gaps to be filled in. I also didn’t really reverse any of these. I’m the hashes and all the rules are going to get shared. So if you want to dive in and contribute to sort of filling in some of the gaps, that would be really cool.

Greg Lesnewich:
And in looking at the sort of assessing like what we did, I think that the type of tooling that is getting found could definitely create a bias in the data set in that if something is running in plain text in memory, that’s much more likely to get clipped and thrown into VirusTotal rather than something that is maybe encrypting big chunks of itself in memory and only decoding them at specific call time. I also could be completely overthinking this and all of those collections and connections that I had in that previous slide, those could all just be like modules of two families and I could be completely overblowing what’s happening and without really doing the the incident response or knowing how the samples interact, we’re at a limit of how we can link them to each other.

Greg Lesnewich:
So I’ll leave this slide up. The tool is here. I think that the link works on GitHub, all of the rules and in comments there are the hashes for the on pastebin, the hashes and the rules, both the sort of wonky ones as well as the code based ones are up there and then the console rule set for like automatically burping out like the import hash or the like tertiary called function hash stuff is all there and just I’m very willing to share the slide deck with people because there are a lot of slides in the appendix about what the families actually look like and some oddities in them. But you have to be a real human and come talk to me to be able to do that. And I’m not just going to tweet it out because I don’t want to get disappeared.

Greg Lesnewich:
And so I think that the main takeaway that I had from doing this research is that YARA is good enough and flexible enough to sort of if it’s good enough to track Bright Constellation or the Lamberts, it’s probably good enough for a lot of the other actors that we’re facing.

Greg Lesnewich:
I will say there is an additional bias in there that these samples were definitely not bloated by certain they’re not written in Delphi, so there isn’t a ton of additional data in there. They’re not stuffing OpenSSL or zlib like full libraries in there either. So that was iterating over them was a little bit of an easier job. But really the thing that I learned most from this doing this research is that if you’re an analyst and you have an idea it is worth your time to learn enough Python or enough Go or whatever language you want to subject yourself to to build it because no one is going to have the same vision that you have and no one is going to know the same outcome that you’re going to want.

Greg Lesnewich:
And so. In that, I don’t know, two and a half years it took me to really feel comfortable writing Python ut sort of enabled this to happen. So if you’re an analyst, the idea is worth it. We’re a better community if you put it out into the world. So yeah, if it doesn’t exist yet, build it. And with that, I’m disappointed that Juan and Kim missed a whole talk about the Lamberts, but I will be taking questions. Thank you. There was no there was no new information. There weren’t any docs. You didn’t miss anything. Ariel. How did I do?

Speaker2:
Awesome. Thank you.

Greg Lesnewich:
Cool. Thank you, everyone. Sabrina Yeah.

Speaker3:
All right.

Speaker2:
Throwing up applause for. For Greg. Awesome.

Sonix is the world’s most advanced automated transcription, translation, and subtitling platform. Fast, accurate, and affordable.

Automatically convert your mp4 files to text (txt file), Microsoft Word (docx file), and SubRip Subtitle (srt file) in minutes.

Sonix has many features that you’d love including advanced search, share transcripts, transcribe multiple languages, collaboration tools, and easily transcribe your Zoom meetings. Try Sonix for free today.

About the Presenter

Greg Lesnewich is senior threat researcher at Proofpoint, working on tracking malicious activity linked to the DPRK (North Korea). Greg has a background in threat intelligence, incident response, and managed detection, and previously built a threat intelligence program for a Fortune 50 financial organization.

About LABScon

This presentation was featured live at LABScon 2022, an immersive 3-day conference bringing together the world’s top cybersecurity minds, hosted by SentinelOne’s research arm, SentinelLabs.

Want to join us for LABScon 2023? The Call for Papers is now open!