Digitala Vetenskapliga Arkivet

Change search
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf
Scalable Dynamic Analysis of Binary Code
Linköping University, Department of Computer and Information Science, Database and information techniques. Linköping University, Faculty of Science & Engineering.
2019 (English)Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

In recent years, binary code analysis, i.e., applying program analysis directly at the machine code level, has become an increasingly important topic of study. This is driven to a large extent by the information security community, where security auditing of closed-source software and analysis of malware are important applications. Since most of the high-level semantics of the original source code are lost upon compilation to executable code, static analysis is intractable for, e.g., fine-grained information flow analysis of binary code. Dynamic analysis, however, does not suffer in the same way from reduced accuracy in the absence of high-level semantics, and is therefore also more readily applicable to binary code. Since fine-grained dynamic analysis often requires recording detailed information about every instruction execution, scalability can become a significant challenge. In this thesis, we address the scalability challenges of two powerful dynamic analysis methods whose widespread use has, so far, been impeded by their lack of scalability: dynamic slicing and instruction trace alignment. Dynamic slicing provides fine-grained information about dependencies between individual instructions, and can be used both as a powerful debugging aid and as a foundation for other dynamic analysis techniques. Instruction trace alignment provides a means for comparing executions of two similar programs and has important applications in, e.g., malware analysis, security auditing, and plagiarism detection. We also apply our work on scalable dynamic analysis in two novel approaches to improve fuzzing — a popular random testing technique that is widely used in industry to discover security vulnerabilities.

To use dynamic slicing, detailed information about a program execution must first be recorded. Since the amount of information is often too large to fit in main memory, existing dynamic slicing methods apply various time-versus-space trade-offs to reduce memory requirements. However, these trade-offs result in very high time overheads, limiting the usefulness of dynamic slicing in practice. In this thesis, we show that the speed of dynamic slicing can be greatly improved by carefully designing data structures and algorithms to exploit temporal locality of programs. This allows avoidance of the expensive trade-offs used in earlier methods by accessing recorded runtime information directly from secondary storage without significant random-access overhead. In addition to being a standalone contribution, scalable dynamic slicing also forms integral parts of our contributions to fuzzing. Our first contribution uses dynamic slicing and binary code mutation to automatically turn an existing executable into a test generator. In our experiments, this new approach to fuzzing achieved about an order of magnitude better code coverage than traditional mutational fuzzing and found several bugs in popular Linux software. The second work on fuzzing presented in this thesis uses dynamic slicing to accelerate the state-of-the-art fuzzer AFL by focusing the fuzzing effort on previously unexplored parts of the input space.

For the second dynamic analysis technique whose scalability we sought to improve — instruction trace alignment — we employed techniques used in speech recognition and information retrieval to design what is, to the best of our knowledge, the first general approach to aligning realistically long program traces. We show in our experiments that this method is capable of producing meaningful alignments even in the presence of significant syntactic differences stemming from, for example, the use of different compilers or optimization levels.

Place, publisher, year, edition, pages
Linköping: Linköping University Electronic Press, 2019. , p. 73
Series
Linköping Studies in Science and Technology. Dissertations, ISSN 0345-7524 ; 1993
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:liu:diva-157626DOI: 10.3384/diss.diva-157626ISBN: 9789176850497 (print)OAI: oai:DiVA.org:liu-157626DiVA, id: diva2:1343472
Public defence
2019-09-25, Planck, Hus F, Campus Valla, Linköping, 13:15 (English)
Opponent
Supervisors
Funder
CUGS (National Graduate School in Computer Science)ELLIIT - The Linköping‐Lund Initiative on IT and Mobile CommunicationsAvailable from: 2019-08-22 Created: 2019-08-16 Last updated: 2019-08-22Bibliographically approved
List of papers
1. Efficient Utilization of Secondary Storage for Scalable Dynamic Slicing
Open this publication in new window or tab >>Efficient Utilization of Secondary Storage for Scalable Dynamic Slicing
2014 (English)In: Proceedings of the 2014 IEEE 14th International Working Conference on Source Code Analysis and Manipulation / [ed] Randall Bilof, IEEE , 2014, p. 155-164Conference paper, Published paper (Refereed)
Abstract [en]

Dynamic program slicing is widely recognized as a powerful aid for e.g. Program comprehension during debugging. However, its widespread use has been impeded in part by scalability issues that occur when constructing the dynamic dependence graph necessary to compute dynamic slices. A few seconds of execution time on a modern CPU can easily yield dynamic dependence graphs on the order of tens of gigabytes in size. Existing methods either produce imprecise slices, incur large time overheads during slice computation, or run out of memory for long program executions. By carefully designing our method to take advantage of locality, we are able to efficiently use secondary storage for dynamic dependence graphs, thus allowing our method to scale to long program executions. Our prototype implementation runs directly on x86 executables, eliminating problems with e.g. Binary-only libraries. We show in our experiments that graphs can be constructed for program runs with billions of executed instructions, at slowdowns ranging from 62x to 173x. Our optimized format also allows graphs to be traversed at speeds of several million dependence edges per second.

Place, publisher, year, edition, pages
IEEE, 2014
Keywords
binary analysis, debugging, dynamic dependence graph, dynamic slicing, x86
National Category
Computer Sciences
Identifiers
urn:nbn:se:liu:diva-117289 (URN)10.1109/SCAM.2014.24 (DOI)000358876700020 ()978-0-7695-5304-7 (ISBN)
Conference
14th IEEE International Working Conference on Source Code Analysis and Manipulation, Victoria, British Columbia, Canada, September 28-29, 2014
Available from: 2015-04-22 Created: 2015-04-22 Last updated: 2019-08-16
2. Towards Robust Instruction-Level Trace Alignment of Binary Code
Open this publication in new window or tab >>Towards Robust Instruction-Level Trace Alignment of Binary Code
2017 (English)In: PROCEEDINGS OF THE 2017 32ND IEEE/ACM INTERNATIONAL CONFERENCE ON AUTOMATED SOFTWARE ENGINEERING (ASE17), IEEE , 2017, p. 342-352Conference paper, Published paper (Refereed)
Abstract [en]

Program trace alignment is the process of establishing a correspondence between dynamic instruction instances in executions of two semantically similar but syntactically different programs. In this paper we present what is, to the best of our knowledge, the first method capable of aligning realistically long execution traces of real programs. To maximize generality, our method works entirely on the machine code level, i.e. it does not require access to source code. Moreover, the method is based entirely on dynamic analysis, which avoids the many challenges associated with static analysis of binary code, and which additionally makes our approach inherently resilient to e.g. static code obfuscation. Therefore, we believe that our trace alignment method could prove to be a useful aid in many program analysis tasks, such as debugging, reverse-engineering, investigating plagiarism, and malware analysis. We empirically evaluate our method on 11 popular Linux programs, and show that it is capable of producing meaningful alignments in the presence of various code transformations such as optimization or obfuscation, and that it easily scales to traces with tens of millions of instructions.

Place, publisher, year, edition, pages
IEEE, 2017
Series
IEEE ACM International Conference on Automated Software Engineering, ISSN 1527-1366
National Category
Computer Sciences
Identifiers
urn:nbn:se:liu:diva-143959 (URN)10.1109/ASE.2017.8115647 (DOI)000417469700038 ()978-1-5386-2684-9 (ISBN)978-1-5386-3976-4 (ISBN)
Conference
32nd IEEE/ACM International Conference on Automated Software Engineering (ASE)
Available from: 2017-12-29 Created: 2017-12-29 Last updated: 2019-08-16
3. Turning programs against each other: high coverage fuzz-testing using binary-code mutation and dynamic slicing
Open this publication in new window or tab >>Turning programs against each other: high coverage fuzz-testing using binary-code mutation and dynamic slicing
2015 (English)In: 2015 10TH JOINT MEETING OF THE EUROPEAN SOFTWARE ENGINEERING CONFERENCE AND THE ACM SIGSOFT SYMPOSIUM ON THE FOUNDATIONS OF SOFTWARE ENGINEERING (ESEC/FSE 2015) PROCEEDINGS, New York, NY, USA: Association for Computing Machinery (ACM), 2015, p. 782-792Conference paper, Published paper (Refereed)
Abstract [en]

Mutation-based fuzzing is a popular and widely employed black-box testing technique for finding security and robustness bugs in software. It owes much of its success to its simplicity; a well-formed seed input is mutated, e.g. through random bit-flipping, to produce test inputs. While reducing the need for human effort, and enabling security testing even of closed-source programs with undocumented input formats, the simplicity of mutation-based fuzzing comes at the cost of poor code coverage. Often millions of iterations are needed, and the results are highly dependent on configuration parameters and the choice of seed inputs. In this paper we propose a novel method for automated generation of high-coverage test cases for robustness testing. Our method is based on the observation that, even for closed-source programs with proprietary input formats, an implementation that can generate well-formed inputs to the program is typically available. By systematically mutating the program code of such generating programs, we leverage information about the input format encoded in the generating program to produce high-coverage test inputs, capable of reaching deep states in the program under test. Our method works entirely at the machine-code level, enabling use-cases similar to traditional black-box fuzzing. We have implemented the method in our tool MutaGen, and evaluated it on 7 popular Linux programs. We found that, for most programs, our method improves code coverage by one order of magnitude or more, compared to two well-known mutation-based fuzzers. We also found a total of 8 unique bugs.

Place, publisher, year, edition, pages
New York, NY, USA: Association for Computing Machinery (ACM), 2015
Keywords
Fuzz testing, fuzzing, black-box, dynamic slicing, program mutation
National Category
Computer Sciences
Identifiers
urn:nbn:se:liu:diva-128810 (URN)10.1145/2786805.2786844 (DOI)000382568700067 ()978-1-4503-3675-8 (ISBN)
Conference
10th Joint Meeting on Foundations of Software Engineering
Available from: 2016-05-31 Created: 2016-05-31 Last updated: 2019-08-16
4. Speeding Up Bug Finding using Focused Fuzzing
Open this publication in new window or tab >>Speeding Up Bug Finding using Focused Fuzzing
2019 (English)In: Proceedings of the 13th International Conference on Availability, Reliability and Security, ACM Digital Library, 2019, article id 7Conference paper, Published paper (Refereed)
Abstract [en]

Greybox fuzzing has recently emerged as a scalable and practical approach to finding security bugs in software. For example, AFL — the current state-of-the-art greybox fuzzer — has found hundreds of vulnerabilities in popular software since its release in 2013. The combination of lightweight coverage instrumentation and a simple evolutionary algorithm allows AFL to quickly generate inputs that exercise new code. AFL also obviates the need to manually set ad-hoc fuzzing ratios, which has been a major limitation of classical black-box fuzzers. Instead, AFL's first fuzzing pass exhaustively applies a set of mutations to every byte of a program input. While this approach allows for more thorough exploration of the input space, and therefore improves the chances of finding complex bugs, it also drastically slows down the fuzzing progress for "heavyweight" programs, or programs that take large inputs. This makes AFL less suitable for fuzzing input formats with large size overhead, such as various document formats. In this paper, we propose focused fuzzing as a practical trade-off between thoroughness and speed, for fuzzers that employ input mutation. We extend the notion of code coverage to individual bytes of input, and show how forward dynamic slicing can be used to efficiently determine the set of program instructions that are affected by a particular input byte. This information can then be used to restrict expensive mutations to a small subset of input bytes. We implement focused fuzzing on top of AFL, and evaluate it on four "real-life" Linux programs. Our evaluation shows that focused fuzzing noticeably improves bug discovery, compared to vanilla AFL.

Place, publisher, year, edition, pages
ACM Digital Library, 2019
Keywords
fuzzing, AFL, dynamic slicing, focused fuzzing
National Category
Computer Sciences
Identifiers
urn:nbn:se:liu:diva-152737 (URN)10.1145/3230833.3230867 (DOI)000477981800013 ()978-1-4503-6448-5 (ISBN)
Conference
13th International Conference on Availability, Reliability and Security, Hamburg, Germany, August 27 - 30, 2018
Available from: 2018-11-19 Created: 2018-11-19 Last updated: 2019-08-19

Open Access in DiVA

fulltext(3527 kB)1499 downloads
File information
File name FULLTEXT01.pdfFile size 3527 kBChecksum SHA-512
5823e5348685950fa3e92a69e7225e426eb8dbc425cd625a7493a797b4934d4b3b25aa14d04243f54926793c71f4dc0d74a1d805a598684f186951bcd82cd468
Type fulltextMimetype application/pdf
Order online >>

Other links

Publisher's full text

Search in DiVA

By author/editor
Kargén, Ulf
By organisation
Database and information techniquesFaculty of Science & Engineering
Computer Sciences

Search outside of DiVA

GoogleGoogle Scholar
Total: 1499 downloads
The number of downloads is the sum of all downloads of full texts. It may include eg previous versions that are now no longer available

doi
isbn
urn-nbn

Altmetric score

doi
isbn
urn-nbn
Total: 5579 hits
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf