Silvio Cesare
School of Information Technology Deakin University Burwood, Victoria 3125, Australia
<silvio.cesare@gmail.com>
ABSTRACT
Malware continues to be a significant problem facing computer use in today's world. Historically, Antivirus software has employed static signatures to detect instances of known malware. Signature-based detection has fallen out of favour with many, and detection techniques based on identifying malicious program behaviour are now part of the Antivirus toolkit. However, static approaches to malware detection have been heavily researched and can employ modern fingerprints that significantly improve on the simple string signatures used in the past. Instance-based learning can allow the detection of an entire family of malware variants based on a single signature of static features. Statistical machine learning can turn the extracted features into a predictive Antivirus system able to detect novel and previously unseen malware samples. This paper surveys the approaches and techniques used in static malware detection.
Keywords
Malware classification.
1. INTRODUCTION
Malicious software is a significant problem that threatens the security of users on the internet. Today, malware is created by criminal gangs for financial gain. These criminals employ malware to steal credit card information in order to commit fraud, or to obtain illegal use of a computer to launch spam campaigns. A simple approach often used by criminals is to have unsuspecting users open a malicious email attachment. To protect users from malware, the threat must be detected before it is allowed to execute its malicious intent. Behaviour blocking is a useful approach, but relying solely on the dynamic behaviour of a program may allow unwanted actions to be performed before the malware is detected. Running a program in a virtual machine or isolated sandbox to detect its intent is not always effective. Dynamic analysis can never reason about all potential behaviours. If the malware performs differently while being analysed, or can detect the analysis itself, then the malware has a high probability of escaping detection. Static analysis and detection provides a possible solution in the arsenal of defences.
Static signature based detection has been a dominant feature in Antivirus. Because of performance constraints, the most widely used signature is a string containing patterns of the raw file content [1, 2]. This allows for a string search [3] to quickly identify patterns associated with known malware. However, these patterns can easily be invalidated because minor changes to the malware source code have significant effects on the malware's raw content. Thus, traditional signatures can prove ineffective when dealing with unknown variants. Modern approaches to signature generation involve less fragile and more versatile fingerprints. Program features are extracted that enable a more robust representation to detect an entire family of malware variants. Machine learning and statistical classification using those same program features can allow the detection of novel and unknown malware not belonging to previously identified families.
Static program analysis is undecidable for many problems concerning binaries, and a transformation of a compiled program known as code packing is often used by malware authors to hide the intent of the malware and make analysis more difficult. The packing process encrypts, compresses, or obfuscates the malware. The original unobfuscated code is restored at run time or, in the case of instruction virtualization, a byte code representing the original code is executed. In most cases, unpacking is a requirement for effective static malware classification and use of signatures. Automated unpacking has been partly successful, but in those cases where it cannot be achieved, it is sometimes better to mark the programs as likely to be malicious. Thus, even with packed samples, static detection of malware can still be an effective tool.
    8d 4c 24 04             lea    0x4(%esp),%ecx
    83 e4 f0                and    $0xfffffff0,%esp
    ff 71 fc                pushl  -0x4(%ecx)
    55                      push   %ebp
    89 e5                   mov    %esp,%ebp
    51                      push   %ecx
    83 ec 24                sub    $0x24,%esp
    e8 6a 00 00 00          call   4011b0 <___main>
    c7 45 f8 00 00 00 00    movl   $0x0,-0x8(%ebp)
    eb 10                   jmp    40115f <_main+0x2f>
    c7 04 24 a0 20 40 00    movl   $0x4020a0,(%esp)
    e8 5d 00 00 00          call   4011b8 <_puts>
    83 45 f8 01             addl   $0x1,-0x8(%ebp)
    83 7d f8 09             cmpl   $0x9,-0x8(%ebp)
    7e ea                   jle    40114f <_main+0x1f>
    83 c4 24                add    $0x24,%esp
    59                      pop    %ecx
    5d                      pop    %ebp
    8d 61 fc                lea    -0x4(%ecx),%esp
    c3                      ret
2.2 Bytes
One of the simplest features that can be extracted from a program is the raw byte level content of the malware executable file [4]. An alternative source of content comes from the individual program sections in the binary, including the code and data segments.
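As a minimal sketch of byte level feature extraction, the following Python counts overlapping byte n-grams in a buffer. The n-gram representation, the helper name byte_ngrams, the choice of n = 2, and the sample bytes are all assumptions made for illustration; they are not a method prescribed by this survey.

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int = 2) -> Counter:
    """Count overlapping n-byte sequences in data (hypothetical helper)."""
    if n <= 0:
        raise ValueError("n must be positive")
    # Slide a window of n bytes over the buffer, one byte at a time.
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


# Illustrative x86 code bytes; in practice the input would be the whole
# file or a single section such as the code segment.
sample = bytes.fromhex("8d4c240483e4f0ff71fc55")
features = byte_ngrams(sample, n=2)
```

The resulting counts can serve as a feature vector. Because the features are taken directly from raw bytes, no disassembly or structural analysis of the program is needed.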
is a directed graph representing the inter-procedural control flow. Like the control flow graph, alternative or abstracted representations are possible such as dominator trees.
2.3 Instructions
An executable program consists of code and data. The code can be represented as assembly language, and the process of extracting the assembly from the binary is known as disassembly. The instruction level content of a program can be a more resilient feature than the byte level content if the instructions are considered by their type or mnemonic representation [5].
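The resilience of mnemonics can be illustrated with a short sketch. It assumes a disassembly listing with one instruction per line and the mnemonic as the first token; the helper name mnemonics and the two code fragments are hypothetical.

```python
from collections import Counter
from typing import List


def mnemonics(listing: str) -> List[str]:
    """Return the mnemonic (first token) of each non-empty listing line."""
    return [line.split()[0] for line in listing.splitlines() if line.strip()]


# Two hypothetical variants that differ only in operands, and therefore
# also in their raw bytes.
variant_a = """
mov  %esp,%ebp
sub  $0x24,%esp
call 4011b0
"""
variant_b = """
mov  %esp,%ebp
sub  $0x38,%esp
call 401200
"""

same_sequence = mnemonics(variant_a) == mnemonics(variant_b)
histogram = Counter(mnemonics(variant_a))
```

Here same_sequence is true even though the operands differ, and histogram gives a simple opcode frequency feature.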
[Figure 2: control flow graph of the example program's basic blocks (left) and call graph over the procedures Proc_0 to Proc_4 (right).]
Figure 2. Control flow graph (left) and call graph (right).
byte and instruction stream may change when minor semantic alterations are made to the malware source code. The advantage of byte level content as a program feature is that it does not depend on accurate static analysis of the program's semantics or structure. If the instruction stream is used, additional challenges arise because perfect disassembly of an unknown image is known to be undecidable on the x86 platform [12].
To avoid the problems of syntactic polymorphism, higher level abstractions of the program can be used. The control flow features, including control flow graphs and call graphs, are considered more invariant in polymorphic malware than byte and instruction level content [8]. However, opaque predicates, which are conditions that always evaluate to the same result but are hard to determine statically, may result in these features being altered. The detection of opaque predicates has been investigated, but it is not evident that the results are entirely satisfactory, and a sound method of detection against all unknown predicates is not possible. For example, some algorithms used to construct predicates rely on strong conjectures, rather than proven results, that the predicate always evaluates to the same result. This implies that automatically proving such a predicate constant is hard. The presence of pointers and indirection in assembly language also presents problems to static analyses, which may lack the precision required to construct a control flow graph or call graph accurate enough for malware classification. For all its disadvantages, control flow has been shown to be an effective feature that is invariant in most current malware.
The use of API calls is another approach to solving the syntactic polymorphism problem. This approach has problems with malware that obscures the use of those calls, as is the case with the stolen bytes technique [13] introduced by code packing tools.
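The opaque predicates discussed above can be illustrated with a small sketch. It uses a simple arithmetic identity: the product of two consecutive integers is always even. The function names are hypothetical, and this easily proven predicate is chosen only for clarity; the conjecture-based predicates mentioned above are much harder to resolve statically.

```python
def opaquely_true(x: int) -> bool:
    """Always True: x * (x + 1) is a product of consecutive integers,
    so one factor is even and the product is divisible by 2."""
    return (x * (x + 1)) % 2 == 0


def guarded(x: int) -> str:
    # A static analysis that cannot prove the predicate constant must
    # keep both branches, so the control flow graph gains a bogus edge.
    if opaquely_true(x):
        return "real"
    return "decoy"  # never executed at run time
```

A control flow graph built without resolving the predicate contains the decoy branch, so the graph of an obfuscated variant can differ from that of the original even though their run time behaviour is identical.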
Data flow analysis is another high level abstraction, but in the presence of pointers it inherits the precision problems that all static analyses must face. The procedure and system dependence graphs have similar problems with pointers and indirection, even when data dependencies of pointers are ignored. The dependence graphs also depend on accurate modelling of the instruction sequence. Representing the data dependencies as a graph avoids problems such as register reassignment. However, the modelled instructions used in the data dependencies may themselves be polymorphic and vary between samples. Polymorphism is not handled effectively in this situation, although code normalization may help.
classified by identifying a high similarity to a known instance of malware in the training set. Traditional Antivirus utilises this approach when it performs signature based detection. The key component to perform classification using instance-based learning is a distance or similarity function between the objects representing samples and queries. For a distance function to be effective between objects, the objects must be modeled by a limited set of features that capture the invariant characteristics of the malicious and benign programs. In some cases, the distance function is replaced with a test for equality. However, testing only for equality reduces the effectiveness of the classification process when dealing with malware variants. Instance-based learning can additionally identify high similarity to benign or white-listed samples, depending on the aims of the classification.
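The role of the similarity function can be sketched as follows. The sketch assumes each program is modelled as a set of string features and uses Jaccard similarity with a nearest neighbour search; the feature names, family names, the 0.5 threshold, and the helper names are hypothetical choices for illustration, not a specific system from the literature.

```python
from typing import Dict, FrozenSet, Optional, Tuple


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Set similarity in [0, 1]; 1.0 means identical feature sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def classify(query: FrozenSet[str],
             database: Dict[str, FrozenSet[str]],
             threshold: float = 0.5) -> Optional[Tuple[str, float]]:
    """Return (name, similarity) of the nearest known instance, or None
    when nothing in the database is similar enough."""
    best = max(((name, jaccard(query, feats))
                for name, feats in database.items()),
               key=lambda pair: pair[1], default=None)
    if best is not None and best[1] >= threshold:
        return best
    return None


# Hypothetical signatures of two known malware families.
database = {
    "Family.A": frozenset({"f1", "f2", "f3", "f4", "f5"}),
    "Family.B": frozenset({"g1", "g2", "g3"}),
}
variant = frozenset({"f1", "f2", "f3", "f4", "x9"})  # one feature changed
unrelated = frozenset({"h1", "h2"})

match = classify(variant, database)
no_match = classify(unrelated, database)
```

In this sketch the variant matches Family.A with a similarity of 4/6, whereas a test for equality against the stored signature would miss it.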
removes the redundancy of the original code and improves the terseness, resulting in a normalized representation. An approach using term rewriting was proposed in [19] where rewrite rules were constructed to model the malware transformations that occur during polymorphic and metamorphic mutation. From these, a normalizing rule-set was constructed that could rewrite the malware to a canonical or near canonical representation.
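A normalizing rule-set of this kind can be sketched with two toy rewrite rules applied until no rule fires. The rules (dropping nop, and folding push a; pop b into mov a, b) and the tuple encoding of instructions are assumptions for illustration; they are not the rule-set of [19].

```python
from typing import List, Tuple

Instr = Tuple[str, ...]  # (mnemonic, operand, ...), a hypothetical encoding


def rewrite_once(code: List[Instr]) -> List[Instr]:
    """Apply each rewrite rule at most once per position, left to right."""
    out: List[Instr] = []
    i = 0
    while i < len(code):
        ins = code[i]
        nxt = code[i + 1] if i + 1 < len(code) else None
        if ins == ("nop",):
            i += 1  # rule 1: delete no-ops
        elif ins[0] == "push" and nxt is not None and nxt[0] == "pop":
            out.append(("mov", ins[1], nxt[1]))  # rule 2: push a; pop b => mov a, b
            i += 2
        else:
            out.append(ins)
            i += 1
    return out


def normalize(code: List[Instr]) -> List[Instr]:
    """Rewrite until a fixed point is reached. Every rule shortens the
    code, so the loop terminates."""
    while True:
        rewritten = rewrite_once(code)
        if rewritten == code:
            return rewritten
        code = rewritten


variant_1 = [("push", "%eax"), ("nop",), ("pop", "%ebx"), ("ret",)]
variant_2 = [("nop",), ("mov", "%eax", "%ebx"), ("nop",), ("ret",)]
```

Both variants normalize to the same two instructions, mov %eax,%ebx followed by ret, so a signature taken over the normalized form covers both.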
The main disadvantage with this approach is that minor changes to the malware source code can result in significant changes to almost all basic blocks. Changes in compiler configuration and optimisations can equally result in large changes.
An alternative to using vectors to represent API call sequences was proposed in the IMDS malware detection system [10]. IMDS employed the data mining technique known as association mining, which associates sequences of API calls with a class label so that query samples can be classified as benign or malicious.
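The idea behind association mining can be sketched by computing the support and confidence of a single candidate rule of the form "sample uses these API calls, therefore malicious". This is a simplified illustration that treats API usage as unordered sets and evaluates one rule at a time; it is not the mining algorithm used by IMDS, and the training samples and the helper name rule_stats are hypothetical.

```python
from typing import FrozenSet, List, Tuple

Sample = Tuple[FrozenSet[str], bool]  # (API calls used, is_malicious)


def rule_stats(samples: List[Sample],
               apis: FrozenSet[str]) -> Tuple[float, float]:
    """Return (support, confidence) of the rule: apis => malicious."""
    covered = [label for calls, label in samples if apis <= calls]
    if not covered:
        return 0.0, 0.0
    hits = sum(covered)  # malicious samples that contain all the APIs
    return hits / len(samples), hits / len(covered)


# Hypothetical labelled training set.
training = [
    (frozenset({"OpenProcess", "WriteProcessMemory", "CreateRemoteThread"}), True),
    (frozenset({"OpenProcess", "WriteProcessMemory", "CreateRemoteThread",
                "RegSetValueExA"}), True),
    (frozenset({"OpenProcess", "ReadFile"}), False),
    (frozenset({"CreateFileA", "ReadFile"}), False),
]

strong = rule_stats(training, frozenset({"WriteProcessMemory", "CreateRemoteThread"}))
weak = rule_stats(training, frozenset({"OpenProcess"}))
```

A classifier built this way would keep only rules whose support and confidence exceed chosen thresholds and apply them to query samples.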
6. TRENDS
Malware obfuscation has been increasingly addressed by researchers, and deobfuscation will continue to be developed and incorporated into malware detection systems. These deobfuscation techniques have increasingly borrowed from formal program analyses in an attempt to make sound analyses possible within their given constraints. Malware classification has employed statistical techniques to detect unknown malware. We believe research will continue using this approach and that new features will be developed that can more accurately characterize malware. Instance-based learning will also be developed, with particular research opportunity in working with large scale datasets. Static program features have been extracted at increasing levels of abstraction, and we expect this to continue in future research. Abstraction has the benefit of being resistant to lower level polymorphic changes. The performance of these research systems has not been fully investigated, and we expect that future research opportunity lies in making classification systems practical for industrial and widespread use.
7. CONCLUSION
Detecting malware before it is allowed to execute is an important feature of Antivirus and system security. Static analysis techniques extract program features that allow machine learning to identify both variants of known malware and novel samples. Malware packing, which hides the code from analysis, remains the main sticking point for static detection, and it can be hard to reverse all packers automatically. If unpacking is achievable, malware detection using static analysis is quite feasible, and we expect the accuracy and efficiency of such systems to continue to improve as research continues.
REFERENCES
[1] K. Griffin, S. Schneider, X. Hu, and T. Chiueh, "Automatic Generation of String Signatures for Malware Detection," in Recent Advances in Intrusion Detection: 12th International Symposium, RAID 2009, Saint-Malo, France, 2009.
[2] J. O. Kephart and W. C. Arnold, "Automatic extraction of computer virus signatures," in 4th Virus Bulletin International Conference, 1994, pp. 178-184.
[3] A. V. Aho and M. J. Corasick, "Efficient string matching: an aid to bibliographic search," Communications of the ACM, vol. 18, p. 340, 1975.
[4] J. Z. Kolter and M. A. Maloof, "Learning to detect malicious executables in the wild," in International Conference on Knowledge Discovery and Data Mining, 2004, pp. 470-478.
[5] D. Bilar, "Opcodes as predictor for malware," International Journal of Electronic Security and Digital Forensics, vol. 1, pp. 156-168, 2007.
[6] M. Gheorghescu, "An automated virus classification system," in Virus Bulletin Conference, 2005, pp. 294-300.
[7] A. V. Aho, R. Sethi, and J. D. Ullman, Compilers: principles, techniques, and tools. Reading, MA: Addison-Wesley, 1986.
[8] C. Kruegel, E. Kirda, D. Mutz, W. Robertson, and G. Vigna, "Polymorphic worm detection using structural information of executables," Lecture Notes in Computer Science, vol. 3858, p. 207, 2006.
[9] E. Carrera and G. Erdélyi, "Digital genome mapping: advanced binary malware analysis," in Virus Bulletin Conference, 2004, pp. 187-197.
[10] Y. Ye, D. Wang, T. Li, and D. Ye, "IMDS: intelligent malware detection system," in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2007.
[11] M. Christodorescu, S. Jha, S. A. Seshia, D. Song, and R. E. Bryant, "Semantics-aware malware detection," in Proceedings of the 2005 IEEE Symposium on Security and Privacy (S&P 2005), Oakland, California, USA, 2005.
[12] R. N. Horspool and N. Marovac, "An approach to the problem of detranslation of computer programs," The Computer Journal, vol. 23, pp. 223-229, 1979.
[13] L. Boehne, "Pandora's Bochs: Automatic Unpacking of Malware," University of Mannheim, 2008.
[14] G. Wicherski, "peHash: A Novel Approach to Fast Malware Clustering," in Usenix Workshop on Large-Scale Exploits and Emergent Threats (LEET'09), Boston, MA, USA, 2009.
[15] S. Wehner, "Analyzing worms and network traffic using compression," Journal of Computer Security, vol. 15, pp. 303-320, 2007.
[16] Y. Zhou and W. M. Inge, "Malware detection using adaptive data compression," in Proceedings of the 1st ACM Workshop on AISec (AISec '08), 2008, pp. 53-60.
[17] M. Christodorescu, J. Kinder, S. Jha, S. Katzenbeisser, and H. Veith, "Malware normalization," University of Wisconsin, Madison, Wisconsin, USA, Technical Report #1539, 2005.
[18] D. Bruschi, L. Martignoni, and M. Monga, "Using code normalization for fighting self-mutating malware," in Proceedings of the International Symposium on Secure Software Engineering, 2006.
[19] A. Walenstein, R. Mathur, M. R. Chouchane, and A. Lakhotia, "Normalizing Metamorphic Malware Using Term Rewriting," in Proceedings of the Sixth IEEE International Workshop on Source Code Analysis and Manipulation, 2006.
[20] M. E. Karim, A. Walenstein, A. Lakhotia, and L. Parida, "Malware phylogeny generation using permutations of code," Journal in Computer Virology, vol. 1, pp. 13-23, 2005.
[21] R. Perdisci, A. Lanzi, and W. Lee, "McBoost: Boosting Scalability in Malware Collection and Analysis Using Statistical Classification of Executables," in Proceedings of the 2008 Annual Computer Security Applications Conference, 2008, pp. 301-310.
[22] G. Bonfante, M. Kaczmarek, and J. Y. Marion, "Morphological Detection of Malware," in International Conference on Malicious and Unwanted Software, IEEE, Alexandria, VA, USA, 2008, pp. 1-8.
[23] S. Cesare and Y. Xiang, "Classification of Malware Using Structured Control Flow," in 8th Australasian Symposium on Parallel and Distributed Computing (AusPDC 2010), 2010.
[24] S. Cesare and Y. Xiang, "Malware Variant Detection Using Similarity Search over Sets of Control Flow Graphs," in IEEE Trustcom, 2011.
[25] G. R. Thompson and L. A. Flynn, "Polymorphic malware detection and identification via context-free grammar homomorphism," Bell Labs Technical Journal, vol. 12, pp. 139-147, 2007.
[26] T. Dullien and R. Rolles, "Graph-based comparison of Executable Objects (English Version)," in SSTIC, 2005.
[27] X. Hu, T. Chiueh, and K. G. Shin, "Large-Scale Malware Indexing Using Function-Call Graphs," in Computer and Communications Security, Chicago, Illinois, USA, pp. 611-620.
[28] A. H. Sung, J. Xu, P. Chavez, and S. Mukkamala, "Static analyzer of vicious executables (SAVE)," 2004, pp. 326-334.
[29] G. Salton and M. J. McGill, Introduction to modern information retrieval. New York: McGraw-Hill, 1983.
[30] M. Christodorescu and S. Jha, "Static analysis of executables to detect malicious patterns," in Proceedings of the 12th USENIX Security Symposium, 2003.
[31] F. Leder, B. Steinbock, and P. Martini, "Classification and Detection of Metamorphic Malware using Value Set Analysis," in Proc. of 4th International Conference on Malicious and Unwanted Software (Malware 2009), Montreal, Canada, 2009.