
Machine learning for intrusion detection

Pavel Laskov
Fraunhofer FIRST.IDA Berlin, Germany

NATO ASI on MMDSS September 13, 2007

A short programming quiz


What does the following program do?
A. prints out the stack end address
B. reads a file and displays its content
C. opens a new shell
D. switches the user to root
E. none of the above

#include <unistd.h>
#include <stdio.h>

#define BUFFER_SIZE 4096
#define TIME_DELAY 5

int main(int argc, char *argv[])
{
    char buffer[BUFFER_SIZE];
    int i;
    for (i = 0; i < TIME_DELAY; i++) {
        printf("sleeping for 1 second...\n");
        sleep(1);
    }
    printf("buffer begin: %p\n", buffer);
    printf("buffer end:   %p\n", buffer + BUFFER_SIZE);
    FILE *f = fopen("tmp.txt", "r");
    if (f != NULL)
        fread((void *) buffer, 2 * BUFFER_SIZE, 1, f);
    else
        printf("file 'tmp.txt' does not exist\n");
    printf("%s", buffer);
    return 0;
}


Exploit auctions: an example


What can we learn from these examples?

System complexity is our greatest enemy.
New attacks can be automated and distributed very fast.
Reaction time matters.
Hackers are extremely industrious.

Intrusion detection systems

An intrusion detection system (IDS) is a hardware or software system that monitors event streams for evidence of attacks.

Coarse IDS taxonomy

By location: network, host, application.
By operation principle: signature matching, behavior analysis.

Signature-based IDS

Sensor: packet capture, defragmentation, TCP stream reassembly.
Feature extractor: application protocol parsing, event stream generation.
Signature engine: matching of data against a signature database.
Logger: storage and administration of alerts.
Reaction unit: e.g. blocking of connections from a malicious host via a firewall.

[Figure: pipeline Sensor -> Feature extractor -> Signature engine (with signature database) -> Logger (with alert database) -> Reaction unit (with firewall).]
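The matching stage of such a system can be sketched as a toy signature engine. The signature names and byte patterns below are hypothetical examples, and real engines such as Snort use multi-pattern matchers (e.g. Aho-Corasick) rather than a per-signature scan:

```python
# Toy signature engine: scan a payload against a database of byte-string
# signatures. Signature names and patterns are hypothetical examples.
SIGNATURES = {
    "iis-unicode": b"%%35c",            # Nimda-style IIS unicode escape
    "nop-sled": b"\x90\x90\x90\x90",    # fragment of an x86 NOP sled
}

def match_signatures(payload: bytes) -> list[str]:
    """Return the names of all signatures occurring in the payload."""
    return [name for name, pattern in SIGNATURES.items() if pattern in payload]

alerts = match_signatures(b"GET /scripts/..%%35c../winnt/cmd.exe HTTP/1.0")
```

The linear scan makes the cost of the performance problem discussed later obvious: every packet is compared against every signature.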

Problems of signature-based IDS

Detection of new attacks:
- It usually takes < 24 hours before a signature for a new exploit is available.
- A flash worm may infect all vulnerable servers on the Internet within 15 minutes!

False alarm rates:
- After 10 false alarms per day, a system administrator will shut down an IDS.
- On a high-speed link, e.g. 200 MB/s, a false alarm rate of ca. 0.0001% suffices to produce 10 false alarms per day!

Performance:
- Scanning each packet for ca. 10,000 signatures is costly.
- Cooling down such an IDS is even more costly!

Why machine learning?

Understanding of statistical foundations of empirical inference: better accuracy, small false alarm rates.
Learned detectors use a more compact representation: possible performance improvements.
Ability to detect novelty: protection against zero-day exploits, faster incident response times.

[Figure: generalization error and empirical error as functions of function class complexity.]

Why not machine learning?

Non-vectorial and structured data: embedding of network data in metric spaces.
Difficulty of getting labels: unsupervised learning techniques.
Malicious noise: a need for robust learning methods.
Huge data volumes: importance of efficient algorithms.

Conceptual structure of a learning-based IDS

Sensors: standard packet capture and reassembly libraries (libpcap and libnids).
Feature extractors:
- Packet header and window features.
- Sequential packet payload features: n-grams, words, subsequences.
- Structural packet payload features: parse trees, key/value pairs.
Detectors: anomaly detection, classification, clustering.
Support infrastructure: synchronisation platform, GUI, logging facility.

[Figure: pipeline Sensor -> Feature extractor -> Detector, supported by a logging facility, a monitor and a synchronisation platform.]

Feature extraction from network packets

A packet consists of a header and a payload. The header carries the layer 3/4 protocols, which have a fixed structure, so flat features can be extracted from it directly. The payload carries the layer 5/6/7 protocols, which have a variable structure; features are derived from raw bytes or from parsed input, and then flattened.

Embedding of sequences

Embedding into a high-dimensional feature space; dimensions correspond to feature frequencies. Example: 4-grams per HTTP connection.

A sequence x over an alphabet Sigma of cardinality N: x in Sigma*.
A language L of pre-defined words: L = {w | w in Sigma*}.
An embedding function defined over the language:
  phi_w(x) = frequency count (or binary flag) of w in x, for w in L.

[Figure: 4-gram frequency histograms of HTTP connections; prominent 4-grams include "GET " and "%%35".]
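The frequency-count variant of the embedding phi_w can be sketched for n-grams as follows; the normalization to relative frequencies is an assumption (the slide only shows frequency axes):

```python
from collections import Counter

def ngram_embedding(x: bytes, n: int = 4) -> dict[bytes, float]:
    """Map a byte sequence to the relative frequencies of its n-grams."""
    grams = [x[i:i + n] for i in range(len(x) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

phi = ngram_embedding(b"GET /index.html HTTP/1.1", n=4)
```

The resulting dictionary is a sparse representation: only n-grams actually occurring in x appear as keys, which matters for the similarity computations below.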

Similarity measures for sequences

Kernels, e.g. linear:
  k(x, y) = Sum_{w in L} phi_w(x) phi_w(y)

Distances, e.g.
  Euclidean:  sqrt( Sum_{w in L} (phi_w(x) - phi_w(y))^2 )
  Manhattan:  Sum_{w in L} |phi_w(x) - phi_w(y)|
  Chebyshev:  max_{w in L} |phi_w(x) - phi_w(y)|
  Canberra, Hamming, Minkowski, Jensen...

Non-metric similarity coefficients, e.g. Jaccard, Dice, Kulczynski, Sokal-Sneath.

Abstract framework for similarity measures

Unified similarity measure:
  s(x, y) = Sum_{w in L} m(phi_w(x), phi_w(y))

Reformulations of specific similarity measures in terms of m(p, q):

  Similarity measure    m(p, q)
  Linear kernel         p q
  Euclidean distance    (p - q)^2
  Manhattan distance    |p - q|
  Chebyshev distance    p - q  (aggregated with max instead of the sum)
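A minimal sketch of the unified measure: iterate over the union of the sparse dimensions and aggregate a per-dimension function m(p, q). The sum aggregation covers kernels and the Manhattan and squared-Euclidean distances; Chebyshev would replace the sum by a max:

```python
def unified_similarity(phi_x, phi_y, m):
    """Compute s(x, y) = sum over w of m(phi_w(x), phi_w(y)) on sparse maps."""
    words = set(phi_x) | set(phi_y)          # union of non-zero dimensions
    return sum(m(phi_x.get(w, 0.0), phi_y.get(w, 0.0)) for w in words)

linear = lambda p, q: p * q                  # linear kernel term
manhattan = lambda p, q: abs(p - q)          # Manhattan distance term

x, y = {"ab": 2.0, "bc": 1.0}, {"ab": 1.0, "cd": 3.0}
k = unified_similarity(x, y, linear)         # 2*1 + 1*0 + 0*3 = 2.0
d = unified_similarity(x, y, manhattan)      # 1 + 1 + 3 = 5.0
```

Swapping the term function m while keeping the traversal fixed is exactly the design that the specialized data structures below exploit.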

Computation of similarity measures

Design factors:
- Run-time linear in sequence length.
- Efficient indexing of sparse dimensions.
- Low run-time and space constants.

Data structures:
- Hash table (Rieck, Laskov & Müller, DAGM 2006)
- Trie (Rieck, Laskov & Müller, DAGM 2006)
- Generalized suffix tree (Rieck, Laskov & Sonnenburg, NIPS 2006)
- 64-bit array (Rieck & Laskov, submitted to JMLR)
- Compacted trie (Rieck & Laskov, submitted to JMLR)
- Suffix and LCP array (Rieck & Laskov, submitted to JMLR)

Trie data structure

Trie = re-trie-val, prefix or dictionary trie (Fredkin, 1960; Knuth, 1973): an edge-labeled N-ary tree for storing sequences over Sigma. Nodes are augmented with insertion and leaf counts.

Example: X = {bar(2), car(3), card(1)}, Y = {band(3), bank(1), card(4)}.

[Figure: tries for X (root annotated 6,3) and Y (root annotated 8,3); each node is annotated with (# insertions, # leaves).]
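A count-augmented trie of this kind can be sketched in a few lines: every node along a word's path accumulates the word's insertion count and one leaf per distinct word. This is a minimal sketch, not the paper's implementation:

```python
class TrieNode:
    """Trie node annotated with (# insertions, # leaves), as on the slide."""
    def __init__(self):
        self.children = {}
        self.insertions = 0   # total multiplicity of words passing through
        self.leaves = 0       # number of distinct words passing through

def insert(root, word, count):
    """Insert a distinct word with multiplicity `count`."""
    for node in _path(root, word):
        node.insertions += count
        node.leaves += 1

def _path(root, word):
    """Yield the nodes on the path spelling `word`, creating them as needed."""
    node = root
    yield node
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
        yield node

X = TrieNode()
for word, count in [("bar", 2), ("car", 3), ("card", 1)]:
    insert(X, word, count)
# Root is annotated (6, 3); the node for "car" is annotated (4, 2),
# matching the example counts on the slide.
```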

Trie matching algorithm

Parallel traversal of tries:

[Figure: animation of a parallel traversal of the tries for X and Y, descending only the branches present in both.]

Linear runtime complexity: O(nk log |Sigma|) ~ O(n).
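The parallel traversal can be sketched with nested-dict tries: descend both tries simultaneously, visiting only branches present in both, and accumulate a linear kernel over the stored word counts. This is a sketch; the '$' end-marker convention is an assumption:

```python
def make_trie(word_counts):
    """Build a nested-dict trie; '$' marks a word end and stores its count."""
    root = {}
    for word, count in word_counts.items():
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = count
    return root

def kernel(tx, ty):
    """Linear kernel sum_w count_x(w) * count_y(w) by parallel traversal."""
    k = tx.get("$", 0) * ty.get("$", 0)
    for ch in tx.keys() & ty.keys():     # only branches shared by both tries
        if ch != "$":
            k += kernel(tx[ch], ty[ch])
    return k

X = make_trie({"bar": 2, "car": 3, "card": 1})
Y = make_trie({"band": 3, "bank": 1, "card": 4})
# Only "card" is shared, so kernel(X, Y) = 1 * 4 = 4.
```

Because the recursion never leaves the shared branches, the work is proportional to the size of the common prefix structure rather than to the full tries.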

Anomaly detection algorithms

Center of mass (quarter-sphere SVM, QSSVM):
  s(x_i) = (K - 1_n K - K 1_n + 1_n K 1_n)_ii

Linkage clustering (Qoppa):
  R(x_i) = 1 / card(C_i)

Zeta (local density):
  zeta_k(x) = (1/k) Sum_{i=1..k} d(x, nn_i(x)) - (1/k^2) Sum_{i,j=1..k} d(nn_i(x), nn_j(x))

[Figure: score contours of the three detectors (QSSVM, Qoppa, Zeta) on a two-dimensional sample.]

Incremental centering in feature space

Any centered element at the k-th iteration can be expressed using auxiliary quantities F (the grand mean of K) and G (the row means):

  K~_ij^(k) = K_ij - G_i^(k) - G_j^(k) + F^(k)

The auxiliary quantities can be recursively updated:

  F^(k)   = ((k-1)^2 / k^2) F^(k-1) + (2/k^2) Sum_{i=1..k-1} K_ki + (1/k^2) K_kk
  G_i^(k) = ((k-1)/k) G_i^(k-1) + (1/k) K_ik

[Figure: QSSVM score contours on a two-dimensional sample; s(x_i) = (K - 1_n K - K 1_n + 1_n K 1_n)_ii.]

Run time: O(n) per incoming example instead of O(n^3) with explicit matrix multiplication.
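The recursions above can be sketched directly in plain Python (1-based iteration index k; K is the kernel matrix seen so far):

```python
def center_update(F, G, K, k):
    """Incorporate the k-th example (1-based) into the auxiliary quantities.

    F is the grand mean of the k x k kernel matrix, G the list of row means;
    a centered entry is then K[i][j] - G[i] - G[j] + F.
    """
    off_diag = sum(K[k - 1][i] for i in range(k - 1))
    F = ((k - 1) ** 2 * F + 2 * off_diag + K[k - 1][k - 1]) / k ** 2
    G = [((k - 1) * G[i] + K[i][k - 1]) / k for i in range(k - 1)]
    G.append(sum(K[k - 1][:k]) / k)          # row mean for the new example
    return F, G

K = [[1.0, 2.0], [2.0, 4.0]]
F, G = 0.0, []
for k in (1, 2):
    F, G = center_update(F, G, K, k)
# F is the grand mean 2.25, G the row means [1.5, 3.0]; the centered
# entry (0, 0) is 1 - 1.5 - 1.5 + 2.25 = 0.25.
```

Each update touches only one new row of K, i.e. O(k) work per example, which is where the claimed O(n) behaviour comes from.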

Incremental linkage clustering

Compare an incoming example with all previously stored examples: O(n). For those within the R-distance from the incoming example, recompute the score: O(n).

  R(x_i) = 1 / card(C_i)

[Figure: Qoppa score contours on a two-dimensional sample.]

Incremental Zeta algorithm

Find the k nearest neighbors of an incoming example: O(nk). For each previous example, compare the incoming example with the end of the heap, insert if closer: O(n log k). If the heap changed, update the score; first term: O(1), second term: O(k).

  zeta_k(x) = (1/k) Sum_{i=1..k} d(x, nn_i(x)) - (1/k^2) Sum_{i,j=1..k} d(nn_i(x), nn_j(x))

[Figure: Zeta score contours on a two-dimensional sample.]
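One common reading of the two-term score can be sketched as follows: the mean distance from a point to its k nearest neighbors, minus the mean pairwise distance among those neighbors. The 1/k and 1/k^2 normalizations are an assumption, and a brute-force neighbor search stands in for the heap for clarity:

```python
def zeta(x, data, k, d):
    """Zeta-style anomaly score: large when x lies far outside an
    otherwise tight neighborhood (brute-force sketch)."""
    nn = sorted(data, key=lambda y: d(x, y))[:k]
    to_neighbors = sum(d(x, y) for y in nn) / k
    among_neighbors = sum(d(y, z) for y in nn for z in nn) / k ** 2
    return to_neighbors - among_neighbors

dist = lambda a, b: abs(a - b)
points = [0.0, 0.1, 0.2, 5.0]
score_outlier = zeta(10.0, points, 3, dist)   # far from a tight cluster: large
score_inlier = zeta(0.15, points, 3, dist)    # inside the cluster: near zero
```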

Evaluation objectives
- Detection accuracy on unknown attacks.
- Best features and algorithms.
- Comparison with a signature-based IDS.

Datasets
- PESIM 2005: network traffic collected in a virtual environment at Fraunhofer FIRST; attacks carried out by a penetration testing expert.
- DARPA 1999: the only publicly available dataset for IDS evaluation.

Impact of various algorithms

Experimental protocol:
- Evaluation criterion: AUC_0.01
- Average over n-grams from 1 to 8

Results:
  Protocol  Best algorithm      AUC_0.01
  HTTP      Center-of-mass      0.781
  FTP       Local density       0.746
  SMTP      Linkage clustering  0.756

Conclusion: different protocols exhibit different notions of normality and abnormality.

ROC curves: PESIM 2005 dataset

[Figure: ROC curves (true positive rate vs. false positive rate up to 0.01) for HTTP, FTP and SMTP traffic; curves shown for best combined features, words, and Snort.]

Accuracy:
- Zeta: 80-92% at 0% false positive rate
- Snort: 35-90% at 0% false positive rate

ROC curves: DARPA 1999 dataset

[Figure: ROC curves (true positive rate vs. false positive rate up to 0.01) for HTTP, FTP and SMTP traffic; curves shown for best combined features, words, and Snort.]

Accuracy:
- Zeta: 50-90% at 0% false positive rate
- Snort: 30-80% at 0% false positive rate

Conclusions and outlook

Machine learning is a viable tool for intrusion detection:
- Complementary to signature analysis.
- Efficient for detection of unknown attacks.
- High detection accuracy with low false positives.

Further applications:
- Detection of compromised computers.
- Detection and analysis of novel malware.
- Spam detection for VoIP (SPIT).
- ???

NIPS 2007 Workshop

CALL FOR ABSTRACTS

NIPS 2007 Workshop on Machine Learning in Adversarial Environments for Computer Security
7 or 8 December, 2007, Whistler, British Columbia, Canada

Organizers: Pavel Laskov (Fraunhofer Institute FIRST, Germany), Richard Lippmann (MIT Lincoln Laboratory, USA)
Deadlines: Extended abstract submission: October 19, 2007. Notification of acceptance: November 2, 2007.
Email: mls-nips07@first.fraunhofer.de
Web page: mls-nips07.first.fraunhofer.de

Computer and network security has become an important research area due to the alarming recent increase in hacker activity motivated by profit and both ideological and national conflicts. Increases in spam, botnets, viruses, malware, key loggers, software vulnerabilities, zero-day exploits and other threats contribute to growing concerns about security. In the past few years, many researchers have begun to apply machine learning techniques to these and other security problems. Security, however, is a difficult area because adversaries actively manipulate training data and vary attack techniques to defeat new systems. A main purpose of this workshop is to examine adversarial machine learning problems across different security applications to see if there are common problems, effective solutions, and theoretical results to guide future research, and to determine if machine learning can ...

Thank you!

Questions?


From detection to signatures

[Figure: pipeline from a data source through an anomaly detector to positive and negative examples, and via frequency differences to positive and negative signatures.]

Objectives:
- Extract characteristic features of novel attacks.
- Use available label information for improvement of detection accuracy.

1-gram frequency difference: HTTP tunnel

[Figure: positive 1-gram frequency differences for an HTTP tunnel connection; the dominant 1-grams are \x00 and \x02.]

0x00: SSH padding during initial handshake.
0x02: beginning of a data chunk in an HTTP tunnel.

3-gram frequency difference: IIS unicode attack (Nimda worm)

[Figure: positive 3-gram frequency differences for an IIS unicode attack; dominant 3-grams include "c..", "5c.", "35c", "/..", "..%" and "%35".]

%%35c is converted by a vulnerable IIS server to %5c (ASCII code 0x35 corresponds to '5') and finally interpreted as a backslash (ASCII code 0x5c).

6-gram frequency difference: WU-FTPd exploit

[Figure: positive 6-gram frequency differences for a WU-FTPd heap overflow; dominant 6-grams include \xff\xff\xff\xff\xff\xff, \x18\x90\x90\x90\x90\xeb and \x90\x90\xeb\x18\x90\x90.]

0xfffffff0: jump back into previous heap chunk.
0x9090eb18: NOP sled: 2 NOPs + jump 24 bytes (forces execution of malicious shell code).
