
Improving Software Assurance through Static Analysis Tool Expositions

Published in Journal of Cyber Security and Information Systems
Volume: 5 Number: 3 - Tools & Testing Techniques for Assured Software – DoD Software Assurance Community of Practice: Volume 2

Authors: Terry S. Cohen, Damien Cupif, Aurelien Delaitre, Charles D. De Oliveira, Elizabeth Fong and Vadim Okun
Posted: 11/02/2017

The National Institute of Standards and Technology Software Assurance Metrics and Tool Evaluation team conducts research in static analysis tools that find security-relevant weaknesses in source code. This article discusses our experiences with Static Analysis Tool Expositions (SATEs) and how we are using that experience to plan SATE VI. Specifically, we address challenges in the development of adequate test cases, the metrics to evaluate tool performance, and the interplay between the test cases and the metrics. SATE V used three types of test cases directed towards realism, statistical significance, and ground truth. SATE VI will use a different approach for producing test cases to get us closer to our goals.

I. Introduction

Software assurance is a set of methods and processes to prevent, mitigate or remove vulnerabilities and ensure that the software functions as intended. Multiple techniques and tools should be used for software assurance [1]. One technique that has grown in acceptance is static analysis, which examines software for weaknesses without executing it [2]. The National Institute of Standards and Technology (NIST) Software Assurance Metrics and Tool Evaluation (SAMATE) project has organized five Static Analysis Tool Expositions (SATEs), designed to advance research in static analysis tools that find security-relevant weaknesses in source code. An analysis of SATE V in preparation for the upcoming SATE VI is reported here.

We first discuss our experiences with SATE V, including the selection of test cases, how to analyze the warnings from static analysis tools, and our results. Three selection criteria for the test cases were used: 1) code realism, 2) statistical significance, and 3) knowledge of the weakness locations in code (ground truth). SATE V used test cases satisfying any two out of the three criteria: 1) production test cases with real code and statistical significance, 2) CVE-selected test cases, with real code and ground truth, and 3) synthetic test cases with ground truth and statistical significance. We describe metrics that can be used for evaluating tool effectiveness. Metrics, such as precision, recall, discrimination, coverage and overlap, are discussed in the context of the three types of test cases.

Although our results from the different types of test cases in SATE V bring different perspectives on static analysis tool performance, this article shows that combining such perspectives does not adequately describe real-world use of such tools. Therefore, in SATE VI, we plan to produce test cases incorporating all three criteria, so the results will better reflect real-world use of tools. We discuss the approach we will use: injecting a large number of known, realistic vulnerabilities into real production software. Thus, we will have statistical significance, real code, and ground truth.

Background

Providing metrics and large amounts of test material to help address the need for static analysis tool evaluation is a goal of the National Institute of Standards and Technology (NIST) Software Assurance Metrics and Tool Evaluation (SAMATE) project’s Static Analysis Tool Exposition (SATE). Starting in 2008, we have conducted five SATEs.

SATE, as well as this article, is focused on static analysis tools that find security-relevant weaknesses in source code. These weaknesses, unless avoided or removed early, could lead to security vulnerabilities in the executable software.

SATE is designed for sharing, rather than competing, to advance research in static analysis tools. Briefly, a team led by NIST researchers provides a test set to toolmakers and invites them to run their tools and return the tool outputs to us. We then perform a partial analysis of the tool outputs. Participating toolmakers and organizers share their experiences and observations at a workshop.

The first SATE used open source, production programs as test cases. We learned that not knowing the locations of weaknesses in the programs complicates the analysis task. Over the years, we added other types of test cases.

One type, CVE-selected test cases, is based on the Common Vulnerabilities and Exposures (CVE) [3], a database of publicly reported security vulnerabilities. The CVE-selected test cases are pairs of programs: an older bad version with publicly reported vulnerabilities (CVEs) and a good version, that is, a newer version in which the CVEs were fixed. For the CVE-selected test cases, we focused on tool warnings that correspond to the CVEs.

A different approach is computer-assisted generation of test cases. In SATE IV and V, we used the Juliet test suite [4], which contains tens of thousands of synthetic test cases with precisely characterized weaknesses. This makes tool warnings amenable to mechanical analysis. As with the CVE-selected test cases, each test case has both a bad version (code that should contain a weakness) and a good version (code that should not contain any weakness).

Initially, we had two language tracks: C/C++ and Java. We added the PHP track for SATE IV. In SATE V, we introduced the Ockham Criteria [5] to recognize sound static analysis tools. Table 1 presents toolmaker participation over the years. The PHP track and the Ockham Criteria each had one participant in SATE V. Note that because SATE analyses grew in complexity and length, we changed from yearly SATEs (2008, 2009, and 2010) to the current nomenclature (IV, V, and VI).

Table 1: Number of tools participating per track over SATEs

SATE   Total   C/C++   Java
2008   9       4       7
2009   8       5       5
2010   10      8       4
IV     8       7       3

Related Work

Software weaknesses can lead to vulnerabilities, which can be exploited by hackers. Definition and classification of security weaknesses in software is necessary to communicate and analyze tool findings. While many classifications have been proposed, Common Weakness Enumeration (CWE) is the most prominent effort [6, 7]. The Common Vulnerabilities and Exposures (CVE) database, comprised of publicly reported security vulnerabilities, was discussed in the Background section. While the CVE database includes specific vulnerabilities in production software, the CWE classification system lists software weakness types, providing a common nomenclature for describing the type and functionality of CVEs to the IT and security communities.

For example, CVE-2009-2559 is a buffer overflow vulnerability in Wireshark, which can be used by hackers to cause denial of service (DoS) [8]. CVE-2009-2559 is associated with two CWEs: CWE-126: Buffer Over-read [9], which is caused by CWE-834: Excessive Iteration [10]. The NIST National Vulnerability Database (NVD) described it using CWE-119: Improper Restriction of Operations within the Bounds of a Memory Buffer [11, 12], which is a parent of CWE-126. We describe our use of CVEs and CWEs in our Methodology section.

Researchers have collected test suites and evaluated static analysis tools. Far from attempting a comprehensive review, we list some of the relevant studies here.

Kratkiewicz and Lippmann developed a comprehensive taxonomy of buffer overflows and created 291 test cases – small C programs – to evaluate tools for detecting buffer overflows [13]. Each test case has three vulnerable versions with buffer overflows just outside, moderately outside, and far outside the buffer, and a fourth, fixed, version. Their taxonomy lists different attributes, or code complexities, including aliasing, control flow, and loops, which may complicate analysis by the tools.

The largest synthetic test suite in the NIST Software Assurance Reference Dataset (SARD) [14] was created by the U.S. National Security Agency’s (NSA) Center for Assured Software (CAS). Juliet 1.0 consists of about 60 000 synthetic test cases, covering 177 CWEs and a wide range of code complexities [4]. CAS ran nine tools on the test suite and found that static analysis tools differed significantly with respect to precision and recall. Also, tools’ precision and recall ordering varied for different weaknesses. CAS concluded that sophisticated use of multiple tools would increase the rate of finding weaknesses and decrease the false positive rate. A newer version of the test suite, Juliet 1.2, correcting several errors and covering a wider range of CWEs and code constructs, was used in SATE V.

Rutar et al. ran five static analysis tools on five open source Java programs, including Apache Tomcat, of varying size and functionality [15]. Due to many tool warnings, they did not categorize every false positive and false negative reported by the tools. Instead, the tool outputs were cross-checked with each other. Additionally, a subset of warnings was examined manually. One of the conclusions by Rutar et al. was that there was little overlap among warnings from different tools. Another conclusion was that a meta-tool combining and cross-referencing output from multiple tools could be used to prioritize warnings [15].

Kupsch and Miller evaluated the effectiveness of static analysis tools by comparing their results with the results of an in-depth manual vulnerability assessment [16]. Of the vulnerabilities found by manual assessment, the tools found simple implementation bugs, but did not find any of the vulnerabilities requiring a deep understanding of the code or design.

Developing test cases is difficult, and there have been many approaches. Zhen Li et al. developed VulPecker, an automated vulnerability detection system based on code similarity analysis [17]. Their recent study focused on the creation of a Vulnerability Patch Database (VPD), comprising over 1700 CVEs from nineteen C/C++ open source products. The CVE-IDs are mapped to diff hunks, small files that track the location of a given weakness and the changes in source code across versions.

Instead of extracting CVEs from programs, some studies have looked at injecting vulnerabilities for static analysis tool studies. The Intelligence Advanced Research Projects Activity (IARPA) developed the Securely Taking On New Executable Software of Uncertain Provenance (STONESOUP) program [18] to inject realistic bugs into production software. The injected vulnerabilities were embedded in real control flow and data flow [19]. These seeded vulnerabilities were snippets of code showcasing a specific vulnerability. However, these embedded snippets were unrelated to the original source program, limiting realism in injected weaknesses. These test cases can be downloaded from the SARD [14].

In preparation for SATE VI, the SATE team looked extensively at related approaches. One important project was from the MIT Lincoln Laboratory, which developed a large-scale automated vulnerability (LAVA) technique to automatically inject bugs into real programs [20]. The program uses a “taint analysis-based technique” to dynamically identify sites that can potentially hold a bug, and user-controlled data that can be used at those vulnerable locations to trigger the weakness. Thus, the triggering input and the vulnerability are both known. LAVA can inject thousands of bugs in minutes. However, the tool alters the program data flow and only supports a small subset of CWE classes related to buffer overflow, therefore, limiting the realism of the injected weaknesses.

Another automated bug insertion technique is EvilCoder, developed by the Horst Görtz Institut, Germany [21]. Using a static approach, EvilCoder computes code property graphs from C/C++ programs to create a graph database, containing information about types, control flows and data flows. The program identifies paths that could be vulnerable, but are currently safe. Bug insertion is accomplished by breaking or removing security checks, making a path insecure. The limitation of this static analysis-based approach is that it does not produce triggering inputs to demonstrate the injected bugs.

II. Test Cases

Tool users want to understand how effective tools are in finding weaknesses in source code. Based on our SATE experiences, a perfect test case satisfies three criteria.

First, for tool results to be generally applicable, test cases should be representative of real, existing software. In other words, they should be similar in complexity to real software.

Second, for tool results to be statistically significant, the test cases must contain many different weakness instances of various weakness types. Since CWE has hundreds of weakness classes and the weaknesses can occur in a wide variety of code constructs, large numbers of test cases are needed.

Finally, to recognize tools’ blind spots, we need the ground truth – knowledge of all weakness locations in the software. In other words, without the ground truth we cannot know which weaknesses remain undetected by tools. Additionally, it greatly simplifies analysis of tool outputs by enabling mechanical matching, based on code locations and weakness types.

In summary, the three selection criteria for test cases are 1) realistic, existing code, 2) large amounts of test data to yield statistical significance, and 3) ground truth. Figure 1 illustrates these criteria. So far, we do not have test cases that satisfy all three criteria simultaneously. For SATE V, we have produced test cases satisfying any two out of the three criteria (Figure 1). We chose the following three types of test cases:

First, production software large enough for statistical significance and, by definition, representative of real software. However, the weaknesses in it are at best only partially known.

Second, a set of test cases (i.e., a test suite) mechanically generated, so that each test case contains one weakness instance embedded in a set of code complexities. We used the Juliet test suite, a diverse set of clearly identified weakness instances, for this set. This approach has ground truth and produces statistically significant results. However, the synthetic test cases may not be representative of real code.

Finally, CVE-selected test cases that contain vulnerabilities that were deemed important to be included in the CVE database. These test cases are real software and have ground truth. However, the determination of CVE locations in code is a time-consuming task, which makes it hard to achieve statistical significance.

Figure 1: Types of test cases

III. Metrics

To measure the value of static analysis tools, we need to decide which attributes and characteristics to consider and define metrics for them. For SATE analyses, we established a uniform way of measuring the tools’ output objectively. The following metrics address several questions about tool performance.

First, what types of weaknesses can a tool find? Coverage is measured by the number of unique weakness types reported over the total number of weakness types included in the test set.

Second, what proportion of weaknesses can a tool find? Recall is calculated by dividing the number of correct findings (true positives) by the total number of weaknesses present in the test set, i.e., the sum of the number of true positives (TP) and the number of false negatives (FN). Recall = TP / (TP + FN)

Third, what proportion of covered flaws can a tool find? Applicable recall (App.Recall) is recall restricted to the types of weaknesses a tool can find. It is calculated by dividing the number of true positives (TP) by the number of weaknesses in the test set that are of types covered by the tool. In other words, a tool’s performance is not penalized if it does not report weaknesses that it does not look for (App.FN). App.Recall = TP / (TP + App.FN)

Fourth, how much can I trust a tool? Precision is the proportion of correct warnings produced by a tool and is calculated by dividing the number of true positives by the total number of warnings. The total number of warnings is the sum of the number of true positives (TP) and the number of false positives (FP). Precision = TP / (TP + FP)

Fifth, how smart is a tool? Bad and good code often look similar. It is useful to determine whether the tools can differentiate between the two. Although precision captures that aspect of tool efficiency, it is relevant only when good sites are prevalent over bad sites. When there is parity in the number of good and bad sites, e.g., in some synthetic test suites, a tool could indiscriminately flag both good and bad test cases as having a weakness and still achieve a precision of 50 %. Discrimination, however, recognizes a true positive on a particular bad test case only if a tool did not report a false positive on the corresponding good test case. A tool that flags every test case as flawed would achieve a discrimination rate of 0 %.

Finally, can tool findings be confirmed by other tools? Overlap represents the proportion of weaknesses found by more than one tool. The use of independent tools would find more weaknesses (higher recall), whereas the use of similar tools would provide a better confidence in the common warnings’ accuracy.
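
To make these definitions concrete, here is a minimal sketch that computes precision, recall, applicable recall, and discrimination from raw counts (overlap is a cross-tool measure and is omitted). The Tally record and the example numbers are purely illustrative; they are not SATE data or part of the SATE tooling.

    from dataclasses import dataclass

    @dataclass
    class Tally:
        """Raw counts for one tool on one test set (illustrative values only)."""
        tp: int             # true positives: correct warnings
        fp: int             # false positives: incorrect warnings
        fn: int             # false negatives: weaknesses in the test set the tool missed
        app_fn: int         # false negatives restricted to weakness types the tool covers
        discriminated: int  # bad cases flagged without a false positive on the paired good case
        bad_cases: int      # total number of bad test cases

    def precision(t: Tally) -> float:
        return t.tp / (t.tp + t.fp)

    def recall(t: Tally) -> float:
        return t.tp / (t.tp + t.fn)

    def applicable_recall(t: Tally) -> float:
        # Recall restricted to the weakness types the tool claims to cover.
        return t.tp / (t.tp + t.app_fn)

    def discrimination(t: Tally) -> float:
        # Fraction of bad cases the tool flags without also flagging the paired good case.
        return t.discriminated / t.bad_cases

    # Example with made-up numbers.
    t = Tally(tp=40, fp=10, fn=160, app_fn=60, discriminated=35, bad_cases=200)
    print(f"precision={precision(t):.2f} recall={recall(t):.2f} "
          f"app_recall={applicable_recall(t):.2f} discrimination={discrimination(t):.2f}")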

Table 2 summarizes the applicability of the metrics on the three types of test cases.

Table 2: Mapping metrics to test case types

Metric           Production Software   Software w/ CVEs   Synthetic Test Cases
Coverage         Limited               Limited            Applicable
Recall           N/A                   Applicable         Applicable
Precision        Applicable            N/A                Applicable
Discrimination   N/A                   Limited            Applicable
Overlap          Applicable            Applicable         Applicable

Figure 1 summarizes the types of test cases. The mapping of their metrics is clearly delineated in Table 2. Production software has realism and statistical significance, but no ground truth. CVE-selected test cases have realism and ground truth, but no statistical significance. Synthetic test cases have statistical significance and ground truth, but no realism.

Precision and overlap can be calculated for production software test cases. However, due to the lack of ground truths, recall and discrimination cannot be determined, and only limited results for coverage can be obtained. In contrast, because the CVE-selected test cases are real software with ground truth, both recall and overlap can be calculated. However, because locating vulnerabilities is both difficult and time-consuming, precision cannot be determined, and limited results can be obtained for coverage and discrimination. Although these metrics are applicable to synthetic test cases (i.e., can be calculated), these cases may not generalize to real-world software.

IV. Test Case Results

Methodology

This section focuses on SATE V test case results from the C/C++ track. For this track, we selected two common open source programs for the production software analyses: Asterisk version 10.2.0, an IP PBX platform, and Wireshark version 1.8.0, a network traffic analyzer. Asterisk comprises over 500,000 lines of code; Wireshark contains more than 2 million lines of code. These test cases can be downloaded from the NIST Software Assurance Reference Dataset (SARD) [14]. For the CVE-selected test cases, we also asked toolmakers to run their tools on later, fixed versions: Asterisk version 10.12.2 and Wireshark version 1.8.7. We used the NSA CAS Juliet test set for the synthetic test cases [4].

Different methods were used to evaluate tool warnings depending upon the type of test case. As we discussed in Section II, synthetic test cases contain precisely characterized weaknesses. The metadata includes the locations where vulnerabilities occur, the good and bad blocks of code, and the CWEs. Consequently, the analysis of all warnings generated by tools is possible. For each test case, we counted a tool finding as a match if its CWE fell within the corresponding test case’s CWE group.
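
As an illustration of the mechanical matching this metadata enables, the sketch below accepts a tool finding only if it points into the test case’s flawed code and its CWE falls in the same CWE group. The record layouts and the grouping table are hypothetical stand-ins, not the actual SATE harness or the official CWE groupings.

    # Hypothetical CWE groups; the groupings actually used in SATE were more extensive.
    CWE_GROUPS = {
        "buffer": {119, 121, 122, 124, 125, 126, 127},
        "injection": {78, 89, 90},
    }

    def same_group(cwe_a: int, cwe_b: int) -> bool:
        return any(cwe_a in group and cwe_b in group for group in CWE_GROUPS.values())

    def matches(finding: dict, test_case: dict) -> bool:
        """finding   = {'file': str, 'line': int, 'cwe': int}   (from a tool report)
        test_case    = {'file': str, 'bad_lines': range, 'cwe': int}  (from the metadata)"""
        return (finding["file"] == test_case["file"]
                and finding["line"] in test_case["bad_lines"]
                and same_group(finding["cwe"], test_case["cwe"]))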

As pointed out in Section II, finding the locations of CVEs in the pairs of good and bad code was a time-consuming process. The resulting metadata is rich enough to determine automatically whether a tool found a CVE. However, because the CVEs were few in number and tools did not report vulnerabilities uniformly, we also conducted manual analyses. For each CVE, we selected the tool findings reported at the corresponding lines of code, considering a finding only if its CWE and the CVE’s CWE belonged to the same CWE group. Once a match was found, an expert confirmed whether the automated analysis was correct. In addition to matching CVEs this way, our experts also manually checked the code for matches missed by the algorithm. The experts rated each CVE as having been precisely identified or coincidentally (indirectly) identified.

The analysis of production test cases was different. Analyses of tool warnings and reporting were often labor-intensive and required a high level of expertise. A simple binary true/false positive verdict on tool warnings did not provide adequate resolution to communicate the relationship of the warning to the underlying weakness [22]. Because of the large number of tool warnings and the lack of ground truth, we randomly selected warnings from each tool report, based on the weakness category and the security rating. After sampling 879 warnings and manually reviewing their correctness, we assigned each warning to a warning category. A security warning was related to an exploitable security vulnerability. A quality warning was not directly related to security, but it required a software developer’s attention. An insignificant warning was true but reported an issue of little consequence. A false warning corresponded to a false positive, and an unknown warning was one whose correctness could not be determined.
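
The sampling step could be scripted along the lines of the sketch below, which stratifies warnings by weakness category and security rating and draws a few from each stratum for manual review; the field names and the per-stratum quota are assumptions for illustration, not the scheme SATE actually used.

    import random
    from collections import defaultdict

    def sample_warnings(warnings, per_stratum=3, seed=0):
        """Stratify warnings by (weakness category, security rating) and draw
        up to per_stratum warnings from each stratum for manual review."""
        rng = random.Random(seed)
        strata = defaultdict(list)
        for w in warnings:
            strata[(w["category"], w["severity"])].append(w)
        sample = []
        for group in strata.values():
            sample.extend(rng.sample(group, min(per_stratum, len(group))))
        return sample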

Results

SATE is not a competition. To avoid the appearance of endorsing particular toolmakers, we anonymized the data. The results for Tools A through H are reported here.

Figure 2 shows the precision vs. discrimination tool results for the synthetic test cases. The precision results are similar across all tools, whereas discrimination results are not. This is because the number of buggy sites is similar to the number of safe sites, as is the case for synthetic and CVE-selected test cases. Thus, discrimination is a better metric to differentiate tools. Note that for real software, most sites are safe and only a small proportion of sites are buggy, so precision would be very low if a tool reports a warning for every site, flawed or not.

Figure 2: Precision vs. discrimination tool results for the Synthetic test cases – Source: Author(s)

The synthetic test cases offer an excellent demonstration of tool efficiency. Table 3 combines metric results from testing of the Juliet synthetic test suite. Tool F demonstrated the highest applicable recall and discrimination, but displayed the lowest coverage. Tool B, on the other hand, exhibited the broadest coverage and lower discrimination than that of Tool F.

Table 3: Applicable recall, coverage, and discrimination for the Synthetic test cases – Source: Author(s)

Tool     App. Recall   Coverage   Discrimination
Tool A   21%           29%        74%
Tool B   25%           42%        86%
Tool C   18%           22%        70%
Tool D   8%            19%        47%
Tool E   19%           15%        92%
Tool F   56%           9%         93%
Tool G   2%            35%        45%
Tool H   25%           31%        64%

Figures 4 to 6 display the results for two metrics: recall and precision. The figures on the left compare synthetic and CVE-selected test cases; the figures on the right compare synthetic and production test cases. As examples, we use Tools B, H, and A to demonstrate the discrepancies between the results on different types of test cases. Recall was generally higher on the synthetic test cases than on the CVE-selected test cases; however, Tool A performed better with respect to CVEs. Similarly, a comparison of the precision results indicates that the tools generated fewer false positives on the synthetic test cases than on the production test cases, leading to higher precision. Lower code complexity may account for the better recall and precision on the synthetic test cases compared to the CVE-selected and production test cases.

Figures 4–6: Recall for Synthetic vs. CVE test cases and precision for Synthetic vs. Production test cases – Source: Author(s)

Our examples illustrate the differences between the three types of test cases, making generalization challenging. For the production test cases, there was no ground truth, so tool recall could not be determined. Tools mostly reported different defects, so there was low overlap. Also, the results from synthetic cases may not generalize to real-world software. Clearly, characterizing a large set of CVE-selected test cases is very time-consuming, so not enough test data was collected for statistical significance. We discuss a different approach in the context of our next SATE, SATE VI.

V. Future SATE VI Plans

The lack of vulnerability corpora has always hampered researchers’ work in software assurance, because high-quality test data is essential for meaningful studies applicable to real-world software development. The real challenge does not lie solely in having test cases at our disposal, but rather in having them satisfy specific criteria: ground truth, bug realism, and statistical significance.

Our main goal for SATE VI is to improve the quality of our test suites by producing test cases that satisfy these three criteria. Time is a critical factor in the development or selection of new test cases, their use by toolmakers, and the subsequent analysis and reporting of results. CVE extraction yields real bugs; however, there are too few CVEs to showcase numerous bugs in a single version of software. Having to run tools on multiple versions of large test cases is time-consuming and can be problematic for SATE.

Manual bug injection enables a greater number and diversity of realistic bugs, but it also takes time and effort. To prepare test cases for SATE VI, our team is using a semi-automated process. For each class of weaknesses that we want to insert, the first step is to automatically identify sites that are currently safe but could become vulnerable with a manual transformation, as in EvilCoder [21]. A site is a conceptual place in a program where an operation is performed and a weakness might occur. For example, for C programs, every buffer access is a site where a buffer overflow might occur.

The next step is to find execution paths leading to those sites. We will use guided fuzzing techniques to produce user inputs. Then, we will perform manual source code transformations, where the injected (or seeded) vulnerabilities will use the data flow and control flow of the original program. Finally, we will implement triggering and regression tests to demonstrate the injected bugs and check for conflicts between different injected bugs.

It is essential to understand that finding safe sites is much easier than finding vulnerable sites. Missing a safe site only represents the loss of one potential injected bug. To identify those sites, we must analyze our program the way a compiler does. To achieve this, we are analyzing the abstract syntax tree (AST) and extracting specific patterns. Ultimately, we want to use those sites to guide manual bug injection.
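
As a sketch of this kind of AST pattern extraction, the example below lists every array-subscript expression in a preprocessed C file as a candidate buffer-access site. The article does not name a particular parser, so pycparser is assumed here purely for illustration.

    # Illustrative only: every a[i] expression is treated as a candidate buffer-access site.
    from pycparser import parse_file, c_ast

    class BufferAccessFinder(c_ast.NodeVisitor):
        def __init__(self):
            self.sites = []

        def visit_ArrayRef(self, node):
            self.sites.append(node.coord)  # file, line, column of the access
            self.generic_visit(node)       # keep walking for nested subscripts

    # parse_file runs the C preprocessor, so the file must be self-contained or the
    # include paths must be supplied via cpp_args.
    ast = parse_file("target.c", use_cpp=True)
    finder = BufferAccessFinder()
    finder.visit(ast)
    for coord in finder.sites:
        print(coord)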

Identifying a site does not provide the input leading to it. We plan to use fuzzing tools to determine such input.
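
To illustrate the idea of searching for such an input, here is a deliberately naive mutational fuzzer; it stands in for the guided fuzzing tools we plan to use, which are far more sophisticated, and it assumes the instrumented program prints a marker when the candidate site executes.

    import random
    import subprocess

    def reaches_site(binary: str, data: bytes, marker: bytes = b"SITE-REACHED") -> bool:
        """Assumes the instrumented program prints `marker` when the site executes."""
        proc = subprocess.run([binary], input=data, capture_output=True, timeout=5)
        return marker in proc.stdout

    def fuzz_for_site(binary: str, seed: bytes, rounds: int = 10000):
        """Randomly mutate one byte of the seed at a time until an input reaches the site."""
        rng = random.Random(0)
        data = bytearray(seed)
        for _ in range(rounds):
            mutated = bytearray(data)
            mutated[rng.randrange(len(mutated))] = rng.randrange(256)
            if reaches_site(binary, bytes(mutated)):
                return bytes(mutated)
        return None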

Our team will gather a set of CVEs and extract real-world insecure patterns to mimic production software vulnerabilities. Source transformations will be performed manually to reproduce common industry practices and yield realistic injected bugs. To achieve this, we will verify that the seeded vulnerabilities do not significantly alter the original data flow and control flow of the target program.

We must demonstrate that a given input leads to a real vulnerability. Manual bug injection requires much effort and high-level analysis to produce exploits; in fact, demonstrating exploitability is very challenging for static analyzers. Therefore, it is sufficient to demonstrate that our program exhibits abnormal behavior due to the injected bugs. Consider, for example, that an off-by-one buffer overflow will not always crash a program; however, it can be validated using an assert statement.
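
A triggering test of this kind could be as simple as the following sketch, which runs the seeded program on the crafted input and reports whether it terminated abnormally (for example, via a failed assert raising SIGABRT). The binary name and the input are placeholders, not artifacts from the SATE test cases.

    import signal
    import subprocess

    def bug_is_triggered(binary: str = "./seeded_program",
                         trigger_input: bytes = b"crafted input") -> bool:
        """Run the seeded program on the triggering input and report whether it
        died from a signal such as SIGABRT (failed assert) or SIGSEGV."""
        proc = subprocess.run([binary], input=trigger_input, capture_output=True, timeout=30)
        # On POSIX, a negative return code means the process was killed by that signal.
        return proc.returncode < 0 and -proc.returncode in (signal.SIGABRT, signal.SIGSEGV)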

VI. Conclusion

In this article, we have discussed our experiences with SATE that can be useful for the software assurance community. Specifically, the article focused on the selection of test cases and how to analyze the output warnings from tools. We described metrics that could be used for evaluating tool effectiveness. Because tools report different weaknesses, there is little overlap in results.

SATE V covered three types of test cases: 1) production test cases, which had real code and statistical significance, 2) CVE-selected test cases, which had real code and ground truth, and 3) synthetic test cases, which had both ground truth and statistical significance. Although synthetic test cases cover a broad range of weaknesses, results on such test cases may not generalize to real-world software the way results on production cases do. CVE extraction yields real bugs in production software, but it is time-consuming and yields too few cases for statistical significance. Finally, static analysis tools can identify a large number of warnings in production software, which is real code; however, we do not know the locations of all vulnerabilities, i.e., the ground truth. Therefore, we require a better test suite, one that satisfies all three criteria.

Our main goal for future SATEs is to improve the quality of our analyses by producing test cases satisfying all three criteria. We believe inserting security-relevant vulnerabilities into real-world software can help us achieve this goal.

We learned through the study of three sophisticated and fully automated injection techniques that the injected bugs are either insufficiently realistic [18, 20] or lack triggering inputs [21]. Purely manual injection has the benefit of yielding more realistic bugs; however, it is time-consuming. Our team is considering a semi-automated process that speeds the discovery of potential sites, so that we can perform the source code transformations manually. In particular, we want to make sure that the seeded vulnerabilities do not significantly alter the data flow and control flow of the original program and that the programming follows common development practices. Since demonstrating the injected bugs is essential, we will ensure that the injected bugs trigger abnormal program behavior.


References

  1. Larsen, G., Fong, E. K. H., Wheeler, D. A., & Moorthy, R. S. (2014, July). State-of-the-art resources (SOAR) for software vulnerability detection, test, and evaluation. Institute for Defense Analyses IDA Paper P-5061. Retrieved from http://www.acq.osd.mil/se/docs/P-5061-software-soar-mobility-Final-Full-Doc-20140716.pdf
  2. SAMATE. (2017). Source code security analyzers (SAMATE list of static analysis tools). Retrieved from https://samate.nist.gov/index.php/Source_Code_Security_Analyzers.html
  3. MITRE. (2017, July 20). Common vulnerabilities and exposures. Retrieved from https://cve.mitre.org/
  4. Center for Assured Software, U.S. National Security Agency. (2011, December). CAS static analysis tool study - Methodology. Retrieved from http://samate.nist.gov/docs/CAS_2011_SA_Tool_Method.pdf
  5. Black, P. E., & Ribeiro, A. (2016, March). SATE V Ockham sound analysis criteria. NISTIR 8113. https://dx.doi.org/10.6028/NIST.IR.8113. Retrieved from http://nvlpubs.nist.gov/nistpubs/ir/2016/NIST.IR.8113.pdf
  6. MITRE. (2017, June 6). Common weakness enumeration: Process: Approach. Retrieved from https://cwe.mitre.org/about/process.html#approach
  7. MITRE. (2017, June 7). Common weakness enumeration: About CWE. Retrieved from https://cwe.mitre.org/about/index.html
  8. MITRE. (2017). CVE-2009-2559. Retrieved from http://cve.mitre.org/cgi-bin/cvename.cgi?name=cve-2009-2559
  9. MITRE. (2017, May 5). CWE-126: Buffer over-read. Retrieved from http://cwe.mitre.org/data/definitions/126.html
  10. MITRE. (2017, May 5). CWE-834: Excessive iteration. Retrieved from http://cwe.mitre.org/data/definitions/834.html
  11. MITRE. (2017). CWE-119: Improper Restriction of Operations within the Bounds of a Memory Buffer. Retrieved from http://cwe.mitre.org/data/definitions/119.html
  12. National Vulnerability Database, National Institute of Standards and Technology. (2010, August 21). CVE-2009-2559 Detail. Retrieved from https://nvd.nist.gov/vuln/detail/CVE-2009-2559
  13. Kratkiewicz, K., & Lippmann, R. (2005). Using a diagnostic corpus of C programs to evaluate buffer overflow detection by static analysis tools. Proceedings of the Workshop on the Evaluation of Software Defect Detection Tools, 2005. Retrieved from https://www.ll.mit.edu/mission/cybersec/publications/publication-files/full_papers/050610_Kratkiewicz.pdf
  14. SAMATE, National Institute of Standards and Technology. (2017). Software Assurance Reference Dataset. Retrieved from https://samate.nist.gov/SARD/
  15. Rutar, N., Almazan, C. B., & Foster, J. S. (2004). A comparison of bug finding tools for Java. Proceedings of the 15th IEEE International Symposium on Software Reliability Engineering (ISSRE’04), France, November 2004. https://dx.doi.org/10.1109/ISSRE.2004.1
  16. Kupsch, J. A., & Miller, B. P. (2009). Manual vs. automated vulnerability assessment: A case study. In Proceedings of the 1st International Workshop on Managing Insider Security Threats (MIST-2009), Purdue University, West Lafayette, IN, June 15-19, 2009.
  17. Li, Z., Zou, D., Xu, S., Jin, H., Qi, H., & Hu, J. (2016). VulPecker: An automated vulnerability detection system based on code similarity analysis. In Proceedings of the 32nd Annual Conference on Computer Security Applications, pp. 201-213. https://dx.doi.org/10.1145/2991079.2991102
  18. De Oliveira, C., & Boland, F. (2015). Real world software assurance test suite: STONESOUP (Presentation). IEEE 27th Software Technology Conference (STC ‘2015) October 12-15, 2015.
  19. De Oliveira, C. D., Fong, E., & Black, P. E. (2017, February). Impact of code complexity on software analysis. NISTIR 8165. https://dx.doi.org/10.6028/NIST.IR.8165. Retrieved from http://nvlpubs.nist.gov/nistpubs/ir/2017/NIST.IR.8165.pdf
  20. Dolan-Gavitt, B., Hulin, P., Kirda, E., Leek, T., Mambretti, A., Robertson, W., Ulrich, F., & Whelan, R. (2016). LAVA: Large-scale automated vulnerability addition. In Proceedings of the 2016 IEEE Symposium on Security and Privacy, pp. 110-121. https://dx.doi.org/10.1109/SP.2016.15
  21. Pewny J., & Holz, T. (2016). EvilCoder: Automated bug insertion. In Proceedings of the 32nd Annual Conference on Computer Security Applications (ACSAC’16), pp. 214-255. https://dx.doi.org/10.1145/2991079.2991103
  22. Black, P. E. (2012). Static analyzers: Seat belts for your code. IEEE Security & Privacy, 10(2), 48-52. https://dx.doi.org/10.1109/MSP.2012.2

Authors

Terry S. Cohen
National Institute of Standards and Technology, Gaithersburg, MD
Damien Cupif
National Institute of Standards and Technology, Gaithersburg, MD
Aurelien Delaitre
National Institute of Standards and Technology, Gaithersburg, MD
Charles D. De Oliveira
National Institute of Standards and Technology, Gaithersburg, MD
Elizabeth Fong
National Institute of Standards and Technology, Gaithersburg, MD
Vadim Okun
Computer Scientist, National Institute of Standards and Technology, Gaithersburg, MD
