LABScon Replay | Quiver – Using Cutting Edge ML to Detect Interesting Command Lines for Hunters

What do GPT3, DALL-E2, and Copilot have in common? By grasping the structure and nature of language, these projects can generate text, images, and code that provide added value to a user.  Now, they even understand command lines!

Quiver – QUick Verifier for Threat HuntER – is an application aimed at understanding command lines and performing tasks like attribution, classification, anomaly detection, and many others.

DALL-E2 is known to take an input prompt in human language and draw a stunning image with impressively matching results; GPT3 and similar projects can create an infinite amount of text seemingly written by a real person, while GitHub's Copilot can generate entire functions from a comment string.

Command lines are a language in themselves, and they can be taught and learned the same way other languages can. And the application can be as versatile as we want. Imagine giving a command line to an input prompt and getting back the probability that it is a reverse shell, run by an Iranian actor, or perhaps used for cybercrime. A single prompt on its own may not help much, but with the power of language-model algorithms, a threat hunter can have millions of answers in a matter of minutes, shedding light on the most important or urgent activities within the network.

In this session, Dean and Gal demonstrate how they developed such a model, along with real-world examples of how the model is used in applications like anomaly detection, attribution, and classification.

Quiver – Using Cutting Edge ML to detect interesting command lines for Hunters: Audio automatically transcribed by Sonix


Dean Langsam:
So first of all, I need to say that our code is in Jupyter Notebooks and PyTorch, so if any one of you wants to see the code, just use your exploits and we'll be good. Okay, so this is Quiver; Gal and I built it together. Let's begin. These three logos are the logos of three fairly new tools, although they're pretty famous: the first one is DALL-E 2, the second one is GPT-3, and the third is GitHub Copilot. Let's start with some examples.

Dean Langsam:
So DALL-E 2 can create an image from text. In this example, we can see "a cybersecurity researcher sitting on a beanbag in front of a pool in the desert in a fancy hotel, trying to reverse engineer a nation-state malware, working on a presentation, in a realistic style". So that's you guys; maybe you can connect with that one. As you can see, it's not very good with text, but you are all cybersecurity researchers.

Dean Langsam:
GPT-3 is a model that can generate text. This slide is about its applications in cybersecurity; you don't really need to read it. What you need to know is that I've written only the gray part, and GPT-3 created the rest.

Dean Langsam:
In the same manner, GitHub Copilot. This is code that I actually use, just some authentication stuff. When I wrote it, I was just starting to use GitHub Copilot, and the gray parts are the only parts that I actually typed; GitHub Copilot did the rest for me. You can even see the function where I made a typo in the name, and it still understood that I meant to anonymize the password.

Dean Langsam:
Okay, so what's common to all those models? All those models understand language. They share common language features between users and between applications, and part of the learning process is unsupervised, a term that we'll speak about later. The question is, can we do the same for the language of command lines? And the answer is yes, but also no. So right now you're probably thinking: what am I doing here? I came to a cybersecurity conference, and these two are here to talk about deep learning. Gal and I are not, first and foremost, cybersecurity people. We come from the field of machine learning and deep learning, and we tried to get a free trip to Phoenix. We managed to.

Dean Langsam:
We're going to talk about the problems we had with command lines before, then about what changed that made this possible, then about our package Quiver, for which, as you've seen, the acronym came first. And eventually we'll show the big reveal of what we've got. This is Gal.

Gal Braun:
So I'm Gal, a staff data scientist at SentinelOne for the last six years, a father of two. And Breaking Bad is the best show ever.

Dean Langsam:
And we are mostly the same person. I'm Dean. I'm a staff data scientist at SentinelOne, for three years now; Gal actually got me into the company. I'm a father of one, and Breaking Bad is the best show ever. Except maybe The Wire.

Dean Langsam:
So because we're not at a deep learning conference, let's do a few-minute intro to machine learning and deep learning. What you see here are cats and dogs, and those are called samples. We want to create an algorithm that can distinguish between cats and dogs.

Dean Langsam:
One way people tried to do this before was with hand-crafted algorithms: maybe if the ears are like this and the tail is like that, it's a cat, or maybe it's a dog. It was a very hard problem. Even a person couldn't tell you why they see a cat or a dog in a picture; when you know, you know.

Dean Langsam:
In deep learning, instead, we just show the computer, the algorithm, many examples of cats and dogs. This is called tagging, or labeling. You can go to Google and just type "give me pictures of dogs" (those would be the green ones), then "give me pictures of cats" (those would be the red ones). You show the algorithm enough samples, and it creates a model through what we call training.

Dean Langsam:
Then, when you give it a new sample, the gray one, you don't tell the algorithm which one it is; you put it in, and the algorithm spits out: well, this is a cat. In the same fashion it says: this is a dog. Now, that was a pretty easy problem, because you could search for it on Google; enough people have tagged cats and dogs in the history of time.

Dean Langsam:
But as my friend John Naisbitt said (I know he's not actually my friend, but he's a very famous person): "We are drowning in information but starved for knowledge." All of us have a lot of stuff: pictures of things, command lines, language, many things. So what do we have? We have many command lines at SentinelOne. The thing we don't have is tagged data, labeled data. The people who could actually do that tagging, saying whether a command line is good or bad (the green ones are good, the red ones are bad), most of the people who could actually label that data for us are in this room.

Dean Langsam:
So I could ask you guys: instead of listening to the talk, give me ten minutes of your time and start tagging data for me. But that is a very manual process, and it would not scale.

Dean Langsam:
So what changed? Well, in the old days: meet Mimi. Mimi Katz. She's Jewish, like us, and she has a task. She gets many papers, and we tell her to separate those papers between stuff about cybersecurity and stuff about machine learning. Even if she doesn't know the two concepts, maybe she can try to distinguish between them. The problem is that the papers are in Hebrew, and she doesn't know Hebrew. If you give her thousands of examples, maybe she can try to understand the hieroglyphs of Hebrew and work out which hieroglyphs are machine learning and which hieroglyphs are cybersecurity, but that, again, would not scale.

Dean Langsam:
So instead we can introduce a baby. This is Wanna, or Wanna Cry. Wanna also doesn't speak Hebrew; he doesn't speak any language, he's a baby. But what he does have is time, because he's a baby, and people are speaking Hebrew and English next to him all the time. Where does this meet us? Well, this is the old way.

Dean Langsam:
We used to do things like this: give one student task one, to distinguish between two things, then give another student its own task, to distinguish between two other things. A baby can do something else. We can give it books: first understand language, understand Hebrew, understand the relationships between words. Just understand the language. Then, when we give it tasks, we can give it a lot less data to learn each task, instead of giving it the whole history of data for each different task. You're probably starting to understand where we're going with this.

Dean Langsam:
This brings us back to Quiver: Quiver is the baby. Again, at SentinelOne we don't have a lot of labeled data about command lines, but we do have a lot of command lines. So we can just tell Quiver: start reading those command lines and start to understand the language of command lines. Of course, it's not quite that simple, since we have many command-line languages and so on, but basically you can just tell it: start reading command lines.

Dean Langsam:
The way we do this is with what is called a masked language model. Basically, we give it a sentence, then hide one word, or a few of the words, and tell it: based on this sentence with the hidden word, try to predict that word. That's the way the model learns. This is how we virtually create labeled data for the task of learning the language.
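The masking step Dean describes can be sketched in a few lines; everything here (the whitespace tokenization, the [MASK] placeholder, the masking rate) is an illustrative stand-in, not Quiver's actual code:

```python
import random

MASK = "[MASK]"

def make_mlm_example(tokens, mask_prob=0.15, seed=0):
    """Build one masked-language-model training pair: some tokens are
    replaced by [MASK], and `targets` records what the model should
    predict at each masked position."""
    rng = random.Random(seed)
    masked, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked[i], targets[i] = MASK, tok
    if not targets:  # make sure at least one token is hidden
        i = rng.randrange(len(tokens))
        masked[i], targets[i] = MASK, tokens[i]
    return masked, targets

tokens = "powershell -NoProfile -EncodedCommand aGVsbG8=".split()
masked, targets = make_mlm_example(tokens)
```

During training, the model sees `masked` and is scored on recovering every entry in `targets`; repeated over millions of command lines, this is the "virtually created" labeled data Dean mentions.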

Dean Langsam:
Now, once we've learned the language, we can deploy it on different tasks: classifying between different executables, anomaly detection, and of course trying to distinguish between malicious and benign command lines, and so on and so forth.

Dean Langsam:
Of course, we have a saying in the data community that, given infinite time and infinite data, the model will learn everything. Unfortunately, we don't have infinite time or infinite data, so we try to help our models. In our specific case, we take command-line domain knowledge and deploy some regex rules on the data. You can see that we mask different directory paths, we recognize when we are seeing a local IP or a public IP, we detect base64 strings, and so on: all kinds of rules that we've created to help our model.
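A minimal sketch of that kind of normalization; the patterns and placeholder names below are illustrative guesses at the idea, not the actual rules used in Quiver:

```python
import re

# Order matters: the local-IP rule must run before the generic IP rule.
RULES = [
    (re.compile(r"[A-Za-z]:\\(?:[\w.-]+\\)*[\w.-]+"), "<PATH>"),
    (re.compile(r"\b(?:10|127|192\.168|172\.(?:1[6-9]|2\d|3[01]))(?:\.\d{1,3}){2,3}\b"), "<LOCAL_IP>"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "<PUBLIC_IP>"),
    (re.compile(r"\b[A-Za-z0-9+/]{20,}={0,2}\b"), "<BASE64>"),
]

def normalize(cmdline: str) -> str:
    """Replace volatile substrings (paths, IPs, base64 blobs) with
    placeholder tokens so the language model sees stable patterns."""
    for pattern, placeholder in RULES:
        cmdline = pattern.sub(placeholder, cmdline)
    return cmdline
```

With rules like these, two pings to different internal hosts collapse into the same training example, `ping <LOCAL_IP>`, instead of being two unrelated strings.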

Gal Braun:
So we have this dataset of command lines that we preprocessed, and we want to feed it to the model. As we mentioned before, the model ultimately receives numbers: it needs to somehow translate these strings into vectors of numbers that it can process. The building blocks of language are what, in our domain, we call tokens. Let's see how we can extract them.

Gal Braun:
There are several approaches. The main one would be to dissect these strings into words using separators like slashes or whitespace, which is great if you want to keep the high-level entities; for example, an argument name stays intact. But it makes our lives a little difficult when we tackle new strings: if we see a new command line with a new argument name, we need to handle it somehow, because it doesn't appear in our vocabulary.

Gal Braun:
A different approach would be to just split the whole command line into single characters, the smallest possible chunks. That mitigates the unknown-data issue we just described, but it makes it harder to capture the higher-level entities, and it takes the model a lot more time to learn.

Gal Braun:
So there is a middle ground, a cool concept that popped up several years ago called subwords. I won't go too deep into the details of how it works, but it allows us to dissect the text into generic blocks.

Gal Braun:
You can see the double hashtags on some of the tokens, which mark a piece that continues a word. It gives us the good parts of both worlds.
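A toy greedy longest-match tokenizer in the WordPiece style Gal is describing, where the "##" prefix marks a piece that continues a word; the tiny vocabulary is made up for the demo, since real models learn it from data:

```python
def wordpiece_tokenize(word, vocab):
    """Split a word into subword pieces, longest match first;
    continuation pieces carry the '##' prefix."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:  # character not covered by the vocabulary
            return ["[UNK]"]
        tokens.append(piece)
        start = end
    return tokens

vocab = {"power", "##shell", "no", "##profile"}
```

Here `wordpiece_tokenize("powershell", vocab)` yields `["power", "##shell"]`: high-level pieces survive, and a word never seen as a whole can still be composed from known pieces.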

Gal Braun:
What we can extract with these models is this: we feed them text, for example a single token or a whole command line, and we get back a vector of numbers that we can use for different tasks. As mentioned before, we take these command lines, feed them to a model that learns the general semantics of command lines, and then fine-tune it for specific tasks. During this learning phase it optimizes what are called weights, numbers inside the model, which will be different for each kind of task, so we can extract command-line representations tailored to the specific tasks we are interested in.

Gal Braun:
Okay, that was an intro to the core concepts of this model and how it works. Let's see some examples of the results we got. So here's a nice blob. We took millions of command lines, fed them to a model, and let it just learn the semantics of command lines. Each one of the dots you see here is a single token that the model extracted from the text.

Gal Braun:
Now we can take a look inside these tokens and see whether the model understands some semantics about the command lines. Each one of the dots is a vector, and this is a two-dimensional reduction of the results. For example, here you can see the -noprofile token, which is a known PowerShell argument. On the left side is a zoom-in to the location of -noprofile within this token-representation space. The green dots are the ones that are mathematically closest to it, and the small table on the right lists the five closest tokens to this specific token.

Gal Braun:
As you can see, the top three, the closest ones, are different PowerShell arguments or syntax, which is awesome, because the model really understands something about tokens from PowerShell command lines. The bottom two are not directly related to PowerShell, but they are arguments too; the second from the bottom, for example, is a Java argument, which again shows that the model learned something about arguments to executables, which is nice.
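"Closest" here typically means cosine similarity between the learned vectors. A small sketch with made-up toy embeddings (real vectors come out of the trained model and have hundreds of dimensions):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_tokens(query, embeddings, k=5):
    """Rank all other vocabulary tokens by cosine similarity to the
    query token's vector, like the five-nearest-tokens table."""
    q = embeddings[query]
    others = [t for t in embeddings if t != query]
    return sorted(others, key=lambda t: cosine(embeddings[t], q), reverse=True)[:k]

# Toy vectors: PowerShell-style flags deliberately placed close together.
emb = {
    "-noprofile":       [0.9, 0.1, 0.0],
    "-executionpolicy": [0.8, 0.2, 0.1],
    "-windowstyle":     [0.85, 0.15, 0.05],
    "c:\\windows":      [0.0, 0.9, 0.4],
}
```

With these toy vectors, the nearest neighbors of "-noprofile" are the other flags, and the path-like token ranks last, mirroring the behavior shown on the slide.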

Gal Braun:
A second example is a different token: a double hashtag followed by .vbs and a quote, which marks the end of a file path inside an argument value. In a similar way, the top three closest tokens are different VBS tokens, but the rest follow exactly the same pattern with different file extensions.

Gal Braun:
So it's .js, .bat, .pl, .jar, and so on. The model really understands that these patterns, these tokens, are related; it places them in the same region of the space and gives them similar vectors. That eventually led us to the conclusion: okay, we have something, it's not totally random, and we can try to take this model and fine-tune it to a task that we want.

Gal Braun:
The most obvious thing we could think of was trying to teach the model whether a specific command line is malicious or benign. So what we did: okay, we have this baseline language model that learned the general semantics, but we want to fine-tune it to this specific task. First, we need some labels. SentinelOne has an MDR service called Vigilance, which basically goes through different cases, different threats happening on our customers' computers, and decides whether a specific case is malicious or benign. We used these cases to extract command lines that we know are malicious, and vice versa.
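Fine-tuning of this kind typically keeps the pretrained encoder frozen and trains a small classification head on the labeled examples. A toy sketch where fixed vectors stand in for the encoder's command-line embeddings; all numbers and names are illustrative, not the production model:

```python
import math

def train_head(X, y, lr=0.5, steps=500):
    """Train a tiny logistic-regression head on frozen embeddings:
    rows of X stand in for command-line vectors, y holds the
    malicious(1)/benign(0) labels from the Vigilance-style cases."""
    dim = len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # sigmoid
            g = p - yi                      # gradient of BCE w.r.t. the logit
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def is_malicious(x, w, b):
    """Positive logit means the head predicts 'malicious'."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b > 0.0

# Toy "embeddings": malicious samples lean on the first dimension.
X = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
y = [1, 1, 0, 0]
w, b = train_head(X, y)
```

Because only the small head is trained, far fewer labeled command lines are needed than training a model from scratch, which is the whole point of the baby-learns-the-language analogy.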

Gal Braun:
Here you can see a PowerShell command line from a specific malicious threat, and the model actually flagged it as malicious, which is cool. But these kinds of models let you extract something even more fruitful: for each one of the tokens, you can extract how much it contributed to the decision of whether the command line was malicious or benign.

Gal Braun:
So, for example, you can see here the different parts that led the model to this classification. The Invoke-WebRequest inside this PowerShell command and some parts of the URL caused it to think this command line is malicious.
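One simple way to get per-token contributions like this highlighting is occlusion: hide each token in turn and watch the score move. A sketch where a keyword counter stands in for the real classifier (the keyword list is an assumption for the demo):

```python
def token_attributions(tokens, score):
    """Occlusion attribution: mask each token and record how much the
    maliciousness score drops. Positive values mean the token pushed
    the prediction toward 'malicious'."""
    base = score(tokens)
    attrib = {}
    for i, tok in enumerate(tokens):
        masked = tokens[:i] + ["[MASK]"] + tokens[i + 1:]
        attrib[tok] = base - score(masked)
    return attrib

# Toy scorer standing in for the fine-tuned model.
SUSPICIOUS = {"invoke-webrequest", "-encodedcommand"}

def toy_score(tokens):
    hits = sum(t.lower() in SUSPICIOUS for t in tokens)
    return hits / len(tokens)

cmd = ["powershell", "Invoke-WebRequest", "http://x/payload"]
attrib = token_attributions(cmd, toy_score)
```

Coloring each token by its attribution (red for positive, green for negative) reproduces the kind of highlighted command lines shown on the slides.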

Gal Braun:
In a similar way, here are another two examples. The middle one is another malicious PowerShell command line that the model decided was malicious, and you can see the areas it focuses on: for example, the -NonInteractive token, or, although it's a little faded, the sleep call at the end of the PowerShell command line, which the model learned from the data we fed it might indicate a malicious command line.

Gal Braun:
The third example is an entirely benign command line: it's just the winword.exe executable given some file path. Sorry, I didn't explain earlier: the red parts pushed the model toward malicious, and the green ones led it to think the command line is benign. You can see that the fact that winword is the name of the executable, plus some parts of the file name, caused the model to think it's a benign command line.

Gal Braun:
So what can we do with this model besides predicting on a single command line? First, we can take this model, even if it's not 100% accurate, and run every command line from a customer environment through it. It might make mistakes, but it can help us as hunters, for example, find our blind spots and shrink the areas we might miss, because there is a huge amount of information flowing through our customers' environments.

Gal Braun:
We have to focus somehow, and this tool can help hunters focus on the areas they might be missing. From another angle, these kinds of explanations, understanding what causes a command line to look more malicious or more benign, can help us understand our customers' data and draw conclusions. We can even, for example, write a YARA rule that specifically fits the patterns we see on malicious command lines, or on command lines the model usually considers malicious.

Gal Braun:
So that was one example. The second one we wanted to talk about is executable classification. We took our millions of command lines and split each one into its arguments and its executable, and we fine-tuned the model so that, given a set of arguments, it tells us which executable they belong to.
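The training pairs for that task can be built by splitting each command line into its executable (the label to predict) and its arguments (the model input); a rough sketch that ignores quoting and path edge cases:

```python
import shlex

def split_exe_args(cmdline):
    """Turn a raw command line into an (executable, arguments) training
    pair: the argument list is the model input and the executable name
    is the label to predict."""
    parts = shlex.split(cmdline, posix=False)  # keep Windows backslashes intact
    exe = parts[0].strip('"').lower().rsplit("\\", 1)[-1]
    return exe, parts[1:]
```

Grouping millions of such pairs by label yields the fine-tuning dataset, and command lines whose arguments land in the "wrong" executable's cluster are exactly the interesting outliers discussed next.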

Gal Braun:
Another piece of art: on the right side, each one of these dots is, again, a dimensionality reduction, this time of a set of arguments, and the color is the executable. As you can see, this representation is actually very, very good: most of the clusters are very uniform, which means the model really learned which arguments are relevant to which executable. Even more interestingly, there are clusters that are not uniform, which makes us ask: what are these clusters, and what are these interesting command lines that look like they belong to different executables?

Gal Braun:
Here are some more practical examples. You can see some of the clusters of the main executables, like cmd, and a cool byproduct at the top: three different browsers came out in different clusters but sit around the same area in this n-dimensional space. You can extract some cool information, some intent, from these clusters. For example, here is a cluster built mostly from communication executables, and here is a cluster where most of the argument sets were Java arguments, plus one cmd; if you print that cmd command line, it was actually an execution of Java, which makes sense. So this tool can be used to tag and understand the intent of a specific command line without even looking at it: if a new command line falls inside one of these clusters, you can predict, okay, this cmd.exe did something that we know is probably executing Java.

Gal Braun:
And the last example: you can see this big giant cluster is full of different PDF readers, and at the bottom there are two examples, cmd and msedge, that also opened PDF files. Again, we can tag these representations, this cluster, with a nice intent, and use it to predict the intent of a specific command line.

Gal Braun:
I'm sure there is at least one person in this audience who thinks you can solve this with regex: sit down and write sophisticated patterns. But the awesome part of these models is that you just feed them a big load of data; you don't need to really hand-tune them for the specific task you want. And as was mentioned, I think on the first day, there are more and more attack vectors through third-party executables, and if you keep feeding this thing more and more data, it will understand the semantics of command lines better and can easily be fine-tuned to the task we want. And if the results won't be good, we still have a saved spot in art school. And that's it. Thank you. Any questions?

Speaker3:
Yeah. Have you found any openly available databases with tons and tons of data points relevant to this community that we could use for our own play with machine learning?

Gal Braun:
Do you mean whether, given these representations that were created, we found something that we can publish for the community to use?

Speaker3:
More like: say I don't have the entire SentinelOne database to work against, but as a threat researcher I do want something to test against. Is there anything, any direction you would push me toward?

Dean Langsam:
Yeah. So this is currently only in the research phase, but it's the same way you can use DALL-E 2 even though you're not an artist (probably; we've never met), and you're not a poet, but you can use GPT-3. Once we have a working model, it should understand even new things in that domain. So if we trained it well and you give it a new command line, it could say the things that we've taught it to say. If we prove it successful and actually good, then yeah, of course we can do it.

Dean Langsam:
And one of the things that is fairly new in our world is that DALL-E 2 is one specific implementation of a bigger academic idea called CLIP. The most special thing DALL-E 2 had is the data itself. The model architecture is open source, so if you say, "I have more data," you can start from that model and train it on your own. It would probably take you a lot of time and many GPUs, but it's available to you. It's just a question of time and money, not of proprietary stuff.

Gal Braun:
So it depends on what exactly you want to achieve. Overfitting sounds like the worst nightmare of every data scientist, but it might be good for you if you specifically want to find abnormal activity at a specific customer, if you want the model to be fine-tuned for a specific customer and extract information. It depends on the application. But yes, exactly.

I think one of the reasons we thought about normalizing paths or local IPs or base64, for example, was to ease the training, but also: let's not fine-tune to a specific IP or specific directory names. So the road is still long before we get to something mature enough to publish publicly. And beyond that there's PII: for example, let's not give an attacker the option to type "my IP is" and have the model complete it with some DNS server or whatever, something that matters to the customer. But yeah, things to think about.

Dean Langsam:
We're not product people, so once we show it to the PMs, if they like it... As we've shown, the part with the green and red highlights is very cool to us. Will customers find it useful? That's not up to us, I think. I think it would be cool to ship it, but again, the PMs will decide.

Thank you, guys.


About the Presenters

Gal Braun is a data scientist at SentinelOne, working on Data Science & Machine learning focused on explainability, representation learning, and visualizations.

Dean Langsam is a data scientist at SentinelOne, working on the intersection of data science, machine learning, deep learning, language models, Python scientific programming, data visualizations, and Bayesian modeling.

About LABScon

This presentation was featured live at LABScon 2022, an immersive 3-day conference bringing together the world’s top cybersecurity minds, hosted by SentinelOne’s research arm, SentinelLabs.

Keep up with all the latest on LABScon 2023 here.