What can you get from 100 billion DNS queries, each day, in real time?
How Data Science and Machine Learning Help to Discover Botnets with
DNS: Moving Beyond Honeypots and Reverse Engineering
Hongliang Liu, Senior Staff Data Scientist, Nominum Inc.
Wooyun Summit 2016, Beijing, China
“You might say that DNS is in our DNA. Nominum invented DNS, has written 90% of
the world’s DNS code, and was the first to scale, secure, and leverage DNS to deliver
a whole new set of services. We are passionate about great Internet experience, high
quality code, and straightforward approaches to solving complex provider
challenges. Now we have harnessed DNS to allow providers to deliver extraordinary
value to their subscribers.”
Data Science team at Nominum
Hongliang, Thanh, Alexey, Mikael, Ali, Paul and Yuriy, Yohai
About me
• A PhD in physics, plus two bachelor's degrees, in physics
and computer science.
• Not a big deal.
• No degree in security research.
• Not a big deal either.
• Building machine learning on security
• Have detected and blocked multiple botnets using data
science.
• Follow me @phunter_lau on weibo.
Photo courtesy: Danny Dong Photography
Outline
• When we talk about threat intelligence, what do we talk about?
• When we see a threat, what should we do?
• Too much, that is too much information, meow~
• Intelligent machines are here to help
• Intelligent Anomaly Detection, Correlation map, DRS… with real life
examples
• So your machines want to replace humans? Or do we work together?
“Should be an amazing talk!” “Can’t wait!”
My sister @dudulee’s two cats, Pogo and Niba
When we talk about Threat Intelligence, what does it mean?
What I think I do | What senior white hats think I do | What I really do
Source: Movie “The Matrix”
Threat Intelligence != visualization
• “Hadouken” on a world map: pew, pew, pew~
• You know who I am talking about.
• Just don’t spend much of your time on making a
“hadouken” system
• even if your boss really loves it
• even if your boss thinks it is the future of the company.
• Your data are much more valuable than that!
Threat Intelligence != Just see it
• Every company has/wants an anomaly
detection system.
• Yes, I believe you see these too, but so what?
what can you do about them?
• You said “Big data”? please, no bullsh*t!
• “Big data” stays in 2012!
• “Big data” is “garbage in garbage out”
[Figures: DNS traffic for one second; DNS traffic for another second]
What is the problem with Threat Intelligence?
• Finding new threats is too hard, and humans
do not scale.
• Too much information, too little time
• Too few security researchers
• Can’t copy-and-paste humans
• Far too underpaid (well, this talk
can’t solve that problem)
Is a new threat a needle in a haystack?
• Nope!
• Because you look too close.
“The closer you look…the less you will see.” — “Now you see me” the movie
Threat Intelligence is too general.
Let’s take DNS traffic for example
What is in this talk today
Threat Intelligence
In a higher dimension world
• If we have … 100 Billion DNS queries per day in real
time?
• If we know … some domain names are new and had
never been seen before?
• If we know … the correlation between any two domain
names?
• If we know … the connection among any client IPs,
name servers, server IPs at any time?
The core idea of this talk is how to do a …
Dimension reduction attack
Data, all about data
• At Nominum, we receive and process, in real time, 100 billion DNS
records per day from our global partner Internet Service Providers.
• From Nominum Vantio CacheServe, anonymized
• Sampled from trillion DNS queries served per day
• Data volume: advantage or disadvantage?
• Advantage: no need to wait for a honeypot
• Disadvantage: if we have 100 billion records per day (~ 1 M per
second), what should we do? Just hire more security analysts?
• Or an intelligent way?
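As a rough check of the "~ 1 M per second" figure, assuming (unrealistically) that traffic is spread evenly over the 86,400 seconds of a day:

```latex
\frac{100 \times 10^{9}\ \text{records}}{86{,}400\ \text{s}} \approx 1.16 \times 10^{6}\ \text{records per second}
```

Peak-hour rates would be higher than this average.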
Intelligent anomaly detection
• Anomaly detection (AD): an intelligent machine that tells, in real
time, whether a domain name looks abnormal
• Never existed before?
• Too many weird queries?
• Too many subdomains?
• ……
• Engineering challenge: how to tell, in real time and
deterministically, whether a domain never existed in the previous
months, when each day had 100 B queries?
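A minimal sketch of the "never existed before" check, assuming an exact in-memory map rather than a probabilistic filter, so the answer is deterministic. The class name `FirstSeenTracker`, the 90-day window and the example domain are hypothetical. At 100 B queries per day a real system would need a sharded, persistent store, which this sketch does not attempt to show.

```python
from datetime import date, timedelta


class FirstSeenTracker:
    """Exact (non-probabilistic) record of when each domain was last queried."""

    def __init__(self, window_days=90):
        self.window = timedelta(days=window_days)
        self.last_seen = {}  # normalized domain -> date of most recent query

    def observe(self, domain, day):
        """Return True if `domain` was not seen within the window before `day`."""
        key = domain.lower().rstrip(".")  # DNS names are case-insensitive
        previous = self.last_seen.get(key)
        self.last_seen[key] = day
        return previous is None or day - previous > self.window


tracker = FirstSeenTracker(window_days=90)
first = tracker.observe("example.com", date(2016, 6, 1))    # never seen: new
again = tracker.observe("Example.com.", date(2016, 6, 2))   # seen yesterday
stale = tracker.observe("example.com", date(2016, 12, 1))   # silent > 90 days
```

A plain dictionary gives an exact yes/no answer per name, at the cost of memory that grows with the number of distinct names.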
Anomaly Detection Web Interface screenshot
Correlation map: a learning machine for
any domain names
• Representation learning by an artificial neural network from query sequences
• A technique similar to the one in Google’s AlphaGo (sorry for the marketing term)
• Introduce algebra to domain names for pair-wise similarity
• . “" x “" =
• for any two domains, even with just two queries!
• No a priori knowledge needed, nor name similarity!
• No features of the domain name itself needed, such as randomness.
• Pair-wise similarity is the key to the new world.
Correlation map interface screenshot
Domain Reputation System (DRS)
• A giant dynamic graph of IP, name server, CNAME, registration
email, security list, correlated names etc
• Each entity is no longer individual but a graph connecting
everything
• Dynamic update.
• Not just a graph interface or “link analysis tool” or “eye-
candy”
• It is a backend system serving the decision engine.
• Hundreds of millions of nodes, tens of billions of edges.
DRS interface screenshot
Shut up! Talk is cheap…
Botnet
Case study 1: rediscover known botnet C&C
• Identifying C&C clusters for dyre, suppobox, necurs,
qakbot, gozi… in real time and with high precision, without a
priori knowledge (DGA algorithms are not needed).
• Necurs and Bedep, for example
• Necurs: silent in early June, reappeared later
• “old but not obsolete.”
• Bedep: a complicated DGA using currency exchange rates
• Hard to replay with a known DGA algorithm
Tech details for case 1: Necurs didn’t die!
• The Necurs DGA renews every 4 days, starting June 1st.
• 2048 C&C names per round per variant, for
recovering its p2p connections.
• 5 different variants are seen in the correlation map
clusters for the full day.
• Also, Locky the ransomware reappeared on June
22nd.
• We know the exact moment you came back,
welcome :-)
[Chart time axis: June 1st, June 4th, June 8th, June 12th]
All unique anomaly domain names count, not just Necurs
[Cluster map labels: Necurs variants 1 to 5, Gozi variants 1 and 2, Bedep, Suppobox new variant]
• 5 variants of Necurs and Bedep appeared at the same time.
• Sizes of Necurs clusters depend on traffic
• Bedep has around 100-200 domains in each cluster
• all .com domains
• Uses currency exchange rates as DGA seeds
Case study 2: expanding threat list
• Only a few names appear on security lists?
• No problem with DRS!
• Label propagation algorithm
• Fake software update sites, for example.
• Only a few names are on the security list
• e.g. the sunbelt border patrol list
• The IPs behind these names also link to new
suspicious names.
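The idea of growing a short security list through shared IPs can be sketched as below. This is a simplified propagation over a domain/IP graph, not the algorithm used in DRS. The function name `propagate_labels`, the damping factor, the number of rounds and all domain names and IPs are assumptions for illustration.

```python
from collections import defaultdict


def propagate_labels(edges, seeds, rounds=3, damping=0.5):
    """Spread suspicion from listed seed domains to domains sharing their IPs.

    edges: iterable of (domain, ip) pairs; seeds: set of listed domains.
    Returns {domain: score in [0, 1]}; seeds stay at 1.0.
    """
    ips_of = defaultdict(set)
    domains_of = defaultdict(set)
    for domain, ip in edges:
        ips_of[domain].add(ip)
        domains_of[ip].add(domain)

    score = {d: (1.0 if d in seeds else 0.0) for d in ips_of}
    for _ in range(rounds):
        # An IP is as suspicious as the average of the domains it hosts.
        ip_score = {ip: sum(score[d] for d in ds) / len(ds)
                    for ip, ds in domains_of.items()}
        new_score = {}
        for d, ips in ips_of.items():
            if d in seeds:
                new_score[d] = 1.0
            else:
                # A domain inherits a damped share of its worst IP's score.
                worst = max(ip_score[ip] for ip in ips)
                new_score[d] = max(score[d], damping * worst)
        score = new_score
    return score


edges = [
    ("fake-update-a.example", "192.0.2.1"),
    ("fake-update-b.example", "192.0.2.1"),
    ("fake-update-b.example", "192.0.2.2"),
    ("unrelated.example", "198.51.100.7"),
]
scores = propagate_labels(edges, seeds={"fake-update-a.example"})
```

Here the unlisted `fake-update-b.example` picks up a non-zero score because it shares an IP with the listed name, while `unrelated.example` stays at zero.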
Case study 3: evolution of a botnet
• A fast-evolving botnet that gets new domains every hour
• Like bacteria: old domains expire when new domains are
added to the same cluster, within hours!
• Waiting for human reverse engineering: are you sure you can
do it within one hour, 24 hours per day?
• Gozi daily variant for example.
This is an animation!
Tech details for case 3: evolution of a botnet within hours
[Snapshots at 05:00, 06:00, 11:00 and 12:00]
• Red: newly added names
• Aqua: old names from the last snapshot
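The red/aqua colouring amounts to a set difference between consecutive snapshots of one cluster. A minimal sketch, with a hypothetical `diff_snapshots` helper and made-up domain names:

```python
def diff_snapshots(previous, current):
    """Compare two hourly snapshots of one cluster's domain names."""
    previous, current = set(previous), set(current)
    return {
        "added": sorted(current - previous),    # shown in red
        "kept": sorted(current & previous),     # shown in aqua
        "expired": sorted(previous - current),  # dropped since last snapshot
    }


snap_0500 = {"a.example", "b.example", "c.example"}
snap_0600 = {"b.example", "c.example", "d.example", "e.example"}
changes = diff_snapshots(snap_0500, snap_0600)
```

Tracking the `expired` set as well shows how quickly old names rotate out of the cluster.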
Case study 4: you think fast flux is
smart?
• Fast flux: C&C domains rotating through a set of C&C IPs
• A so-called “smart way” of trying to fool security
researchers.
• A fast flux botnet in DRS looks like…
• A dual subgraph, detectable by graph algorithms
• Freedom for security researchers!
• DRS knows all the history in the graph.
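One way to look for such a structure, sketched under the assumption that it shows up as a large, densely linked component in the domain/IP graph. The function name `find_flux_candidates`, the thresholds and the example data are hypothetical; the graph algorithm actually used in DRS is not described in the talk.

```python
from collections import defaultdict


def find_flux_candidates(edges, min_domains=3, min_ips=3, min_density=0.5):
    """Find connected domain/IP components that are large and densely linked."""
    neighbours = defaultdict(set)
    for domain, ip in edges:
        neighbours[("domain", domain)].add(("ip", ip))
        neighbours[("ip", ip)].add(("domain", domain))

    seen, candidates = set(), []
    for start in neighbours:
        if start in seen:
            continue
        # Depth-first search collects one connected component.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component

        domains = {name for kind, name in component if kind == "domain"}
        ips = {name for kind, name in component if kind == "ip"}
        links = sum(len(neighbours[("domain", d)]) for d in domains)
        # Density 1.0 means every domain resolved to every IP in the set.
        density = links / (len(domains) * len(ips))
        if (len(domains) >= min_domains and len(ips) >= min_ips
                and density >= min_density):
            candidates.append((sorted(domains), sorted(ips), density))
    return candidates


flux = [(f"cc{i}.example", f"203.0.113.{j}") for i in range(3) for j in range(3)]
normal = [("shop.example", "198.51.100.10")]
found = find_flux_candidates(flux + normal)
```

An ordinary site with one domain and one IP never reaches the size thresholds, so only the rotating set is reported.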
Case study 5: do you trust VirusTotal about “com—"?
VirusTotal report for “.” as of 2016-06-15
DRS report for “.” as of 2016-06-15
“I am virus”, such an honest phishing domain!
Case study 5: do you trust VirusTotal about “"
• Go get all subdomains
• em, —. more
interesting
• Go dig further?
Case study 5: do you trust VirusTotal about “"
• not just —
•
•
•
• Surely, one can check registration email from
WHOIS!
• That is how a white hat can research
phishing domains with DRS
• I can play with it for hours!
Case study 5: do you trust VirusTotal about “"
• All about timing!
• This name later appeared on VirusTotal on
June 20th, 2016, 5 days later, flagged by BitDefender.
• You know what 5 days mean in security.
• “5 days” can change an original new
discovery to “meh, me too”.
• DRS can see it at the first moment.
VirusTotal report for “.” as of 2016-06-20
Case study 6: DRS + correlation map
• Correlation map only contains the “client IP - domains” graph
• With DRS, a more comprehensive structure of the full botnet
can be observed.
• Too much to visualize in a single graph for human
consumption
• We use a community detection algorithm to find it and
block this group.
• This system is primarily for fully automatic machine detection.
• It is Necurs, by the way.
Now you see me
• It is the time to review.
• Workflow from 100 Billion DNS queries to
emerging threats
• How a fully automated process
discovers threats in real time.
[Workflow diagram. Components: Raw RT Data (Kafka), RT Scanner, AD, Feature Pool, Black & White List, NomRank, LSD Clustering, Correlation Extraction, Correlation Models, Corr AD List, Raw Data Storage, Aggregated Historical Info, Model Data Prep., Corr RT RSD Detection, Data Mining, DRS, Customer policies, Product API. Output: actionable items for customer, with a confidence column.]
High level review of key tech: intelligent anomaly detection
• Great, intelligent anomaly detection
needs:
• Many features
• Smart scoring/ranking functions
• A priori knowledge injection
• To be very sensitive, even to a
single DNS query.
• Patent App. No. 14/937,678
[Figure panels:]
MX query spike gives spam bots, yeah!
ANY query spike is an amplification attack
A new tricky amplification attack using TXT queries
DNS query type as an important feature for detecting multiple threats
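Using query type as a feature can be illustrated with a per-type spike check. This is a minimal sketch with a plain z-score against each type's own history; the function name `qtype_spikes`, the threshold and the counts are made up, and the scoring functions in the real system are not described in the talk.

```python
from statistics import mean, pstdev


def qtype_spikes(history, current, threshold=3.0):
    """Flag query types whose current count is far above their own history.

    history: {qtype: [counts per past interval]}; current: {qtype: count}.
    Returns {qtype: z-score} for the types exceeding `threshold`.
    """
    spikes = {}
    for qtype, count in current.items():
        past = history.get(qtype, [])
        if len(past) < 2:
            continue  # not enough history to judge
        mu, sigma = mean(past), pstdev(past)
        z = (count - mu) / (sigma or 1.0)  # guard against a flat history
        if z > threshold:
            spikes[qtype] = round(z, 2)
    return spikes


history = {"A": [1000, 1020, 980, 1010], "MX": [10, 12, 9, 11], "ANY": [2, 3, 2, 3]}
current = {"A": 1005, "MX": 400, "ANY": 3}
spikes = qtype_spikes(history, current)
```

With these toy counts only the MX type is flagged, which matches the spam bot pattern in the first panel.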
High level review of key tech: correlation map
• A machine learning technique for learning the distributed representation
of each domain name.
• Learning to represent in the “machine - domain” bipartite graph
• Inspired by language embedding, e.g. word2vec (see reference)
• Trained and served on a machine with 2x 14-core Intel Xeon CPUs and 2x NVIDIA
Titan X GPUs.
• We are a small team, you know.
• Written in C and Python.
• Patent App. No. 14/937,616
Alert: some equations in the next two pages, apologies in advance.
How to represent the correlation from a sentence?
[Figure: example sentence, Huffman encoding, word vectors]
Optimization target:
$$p(c \mid w) = \sum_{i} p(w_i \mid w)$$
Loss function with negative sampling:
$$\log \sigma\left(v_{w_c}^{T} \cdot v_{w}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(f)}\left[\log \sigma\left(-v_{w_i}^{T} \cdot v_{w}\right)\right]$$
How to represent the correlation from DNS sequence?
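[Figure: example DNS query sequence, Huffman encoding, domain vectors]
Optimization target:
$$p(c \mid w) = \sum_{i} p(w_i \mid w)$$
Loss function with negative sampling:
$$\log \sigma\left(v_{w_c}^{T} \cdot v_{w}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(f)}\left[\log \sigma\left(-v_{w_i}^{T} \cdot v_{w}\right)\right]$$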
“” x “” =
[ =
How to calculate correlation?
The algebra is as simple as dot product
Why learn a representation for correlation?
• Representation learning converts graph
traversal into simple algebra: the dot product,
cosine(theta)
• “” x “” =
• Yeah, all domain names become calculable!
• even if only 2 queries!
• No string feature needed!
• All your botnet C&C names belong to me!
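Once every name has a vector, pair-wise correlation is a dot product of normalized vectors. A minimal sketch with toy 3-dimensional embeddings; the vectors, the domain names and the helper `most_correlated` are invented for illustration, and real vectors would come from the trained model.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy embeddings; the values are made up.
embedding = {
    "cc1.example": [0.9, 0.1, 0.0],
    "cc2.example": [0.8, 0.2, 0.1],
    "news.example": [0.0, 0.1, 0.9],
}


def most_correlated(name, top=2):
    """Rank all other names by cosine similarity to `name`."""
    ranked = [(cosine(embedding[name], vec), other)
              for other, vec in embedding.items() if other != name]
    return sorted(ranked, reverse=True)[:top]


best = most_correlated("cc1.example", top=1)
```

No string feature of the name is used anywhere: the ranking depends only on the learned vectors.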
[Figure panels: predicted Locky C&C domain names on May 16th; the most correlated names on the same day]
• Once we have pair-wise correlations for
all names, clustering the names flagged by
anomaly detection gives us the
botnets!
• Necurs is here!
• Bedep, yeah!
• More are welcome!
• Density peak clustering + manifold
learning (SNE) for visualizing in 2D.
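A compact sketch of density peak clustering, run on toy 2D points instead of real domain embeddings. The function name `density_peaks`, the cutoff, the tie-breaking rule and the data are assumptions. The SNE step used for 2D visualization is not shown.

```python
import math


def density_peaks(points, cutoff, n_clusters):
    """Minimal density peak clustering.

    rho   = number of other points within `cutoff` of a point
    delta = distance to the nearest point of higher density
    Centres are the points with the largest rho * delta.
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < cutoff)
           for i in range(n)]

    # Densest first; on ties the lower index counts as "denser".
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    rank_of = {i: r for r, i in enumerate(order)}
    delta = [0.0] * n
    parent = [-1] * n
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = max(dist[i])  # densest point: use its farthest distance
        else:
            parent[i] = min(order[:rank], key=lambda j: dist[i][j])
            delta[i] = dist[i][parent[i]]

    centres = sorted(range(n),
                     key=lambda i: (-rho[i] * delta[i], rank_of[i]))[:n_clusters]
    labels = [-1] * n
    for cluster_id, i in enumerate(centres):
        labels[i] = cluster_id
    for i in order:  # a parent is always labelled before its children
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
labels = density_peaks(points, cutoff=2.0, n_clusters=2)
```

The two well-separated groups of toy points end up in two clusters. With real data the number of centres would usually be chosen by inspecting the rho/delta plot rather than fixed up front.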
High level review of key tech: DRS
• A dynamic graph model
• Hundreds of millions of nodes, tens of billions of edges
• A giant graph database with caching layer
• Machine + human consumption
• Backend connecting correlation map, AD and other decision
engines, supporting automated decision making.
• Automatic detection and blocking.
• Front-end visualization as a convenient tool for human analysts
• Patent App. No. 14/937,699
A list of many other things from DNS data
• NomRank: ranking every domain name, going beyond Alexa
Rank
• IoT, machine to machine, botnet traffic etc
• Discovering DNS tunneling
• Can be associated with possible APT attacks
• Recategorizing pornography websites: parental control
• Many many others
The sky is not your limit. Your imagination is.
Bonus: Locky the ransomware
• Locky ransomware’s unique C&C signatures
• 12 names every two days
• a list of known TLDs
• We can guess the Locky DGA seed from the
clusters.
• Locky’s DGA algorithm has a flaw
• Predict future C&C names and block all of them
• Good lesson for Locky’s creator: don’t skip math
courses in college, otherwise you will learn it the
hard way.
Locky was detected by DRS too
10/10 star malicious
Domain reputation system backend output screenshot
What do we do after detecting a botnet? Just sinkhole? No…
• We block them all. Yes, we can!
• Deploy to Nominum’s N2 ThreatAvert GIX block list in near real time
• ISPs/companies using ThreatAvert can have full protection
• Queries to C&C names return NXDOMAIN
• Can inform subscribers/end users about their infection using N2 Reach
• To know more, please go to
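The "return NXDOMAIN for C&C names" policy can be sketched as a suffix lookup against a block list. This is an illustration only, not how ThreatAvert is implemented; the function names and domains are hypothetical. RCODE 3 is the standard DNS response code for NXDOMAIN.

```python
NOERROR = 0   # DNS RCODE 0
NXDOMAIN = 3  # DNS RCODE 3 (Name Error)


def is_blocked(qname, blocklist):
    """True if qname or any of its parent domains is on the block list."""
    labels = qname.lower().rstrip(".").split(".")
    return any(".".join(labels[i:]) in blocklist for i in range(len(labels)))


def policy_rcode(qname, blocklist):
    """Response code a resolver policy would give for this query name."""
    return NXDOMAIN if is_blocked(qname, blocklist) else NOERROR


blocklist = {"cc1.example"}
blocked = policy_rcode("www.cc1.example.", blocklist)
allowed = policy_rcode("example.org", blocklist)
```

Matching on parent domains means a single listed C&C name also covers every subdomain queried under it.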
So you want to replace security researchers?
• Your “intelligent machines” are so cool, so no more jobs for us
human security researchers, right?
• Automation: Friend, not Foe.
• Intelligent machines are yet another tool for fighting
botnets and malware.
How it works with honeypots and reverse engineering
• Information exchange with honeypot
• honeypots as seeds/ground truth for correlation map/DRS input
• correlation map/DRS request specific honeypots
• A security researcher has 100 binaries to reverse engineer, which
one to start first?
• From DNS data, we see a large cluster with many client IPs and
a huge global impact: let’s start with that one.
Honeypot
Reverse engineering
Machine intelligence
How do we collaborate?
• You don’t want to build another anomaly detection, or DRS or
correlation map for yourself
• It is a lot of work, with too many pitfalls.
• We might later on have API/webUI for our white hat friends.
• AD, DRS, correlation map etc
• We might build models for you if you share data with us.
• Please find our contact information in a later slide.
• Or ask me for a business card.
Recap: what does “Threat Intelligence” mean to us?
• Threat Intelligence is not just fancy visualization.
• Threat Intelligence doesn’t stop at anomaly detection.
• Gear up Threat Intelligence with machine automation
• Intelligent machines can discover the underlying
rules for us humans.
• As an instance of Information Retrieval (IR): more
automation, more condensed information for human
consumption.
NomRank
DRS AD CorrMap
Model-K Model-N
Takeaways: 3 things
• Threat Intelligence needs intelligent machines!
• Data science with DNS data for example
• Cool new tools with DNS data and machine learning
• Beyond honeypot and reverse engineering
• Gear up and add another dimension for fighting against
malware, with machine intelligence
• Cross fire with honeypot and reverse engineering
• Possible sharing of the data, model and API
• Information exchange with honeypot and reverse engineering.
Twitter: @AndrewYNg
Patents mentioned in this talk
• Analyzing DNS Requests for Anomaly Detection. Patent App. No. 14/937,678
• DNS-Based Ranking of Domain Names. Patent App. No. 14/937,656
• System for Domain Reputation Scoring. Patent App. No. 14/937,699
• System for Correlation of Domain Names. Patent App. No. 14/937,616
Reference
• M. Antonakakis, R. Perdisci, D. Dagon, W. Lee, and N. Feamster. “Building a Dynamic Reputation System for DNS”. In 19th
Usenix Security Symposium, 2010
• Manos Antonakakis, Roberto Perdisci, Yacin Nadji, Nikolaos Vasiloglou, Saeed Abu-Nimeh, Wenke Lee and David Dagon,
“From Throw-Away Traffic to Bots: Detecting the Rise of DGA-Based Malware”. In 21st Usenix Security Symposium, 2012
• Leyla Bilge, Engin Kirda, Christopher Kruegel, and Marco Balduzzi. “EXPOSURE: Finding Malicious Domains Using
Passive DNS Analysis”. In TISSEC, 2014
• Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean. “Distributed Representations of Words and Phrases
and their Compositionality”, In NIPS 2013
• Grégoire Mesnil, Xiaodong He, Li Deng and Yoshua Bengio. “Investigation of Recurrent-Neural-Network Architectures and
Learning Methods for Spoken Language Understanding”, Interspeech, 2013.