One way to understand the strengths and limitations of software assurance tools is to use a corpus of programs with known bugs. The software developer can run a candidate tool on programs in the corpus to get an idea of the kinds of bugs that the tool finds (and does not find) and the false positive rate. The Software Assurance Reference Dataset (SARD) at the National Institute of Standards and Technology (NIST) is a public repository of hundreds of thousands of programs with known bugs. This article describes the content of SARD, how to find specific material, and ways to use it.
SARD has over 170,000 programs in C, C++, Java, PHP, and C# covering more than 150 classes of weaknesses. Most of the test cases are synthetic programs of a page or two of code, but there are over 7,000 full size applications derived from a dozen base applications. Although not every vulnerability is indicated in every program, the vast majority of weaknesses are noted in metadata, which can be processed automatically. Users can search for test cases by language, weakness type, and many other criteria and can then browse, select, and download them.
The term “bug” is ambiguous. “A vulnerability is a property of system security requirements, design, implementation, or operation that could be accidentally triggered or intentionally exploited and result in a security failure. A vulnerability is the result of one or more weaknesses in requirements, design, implementation, or operation.” [14, page 4] In isolation, a piece of code may have a buffer overflow or command injection weakness, but because the input is filtered or only comes from a trusted source, it may not constitute an exploitable vulnerability. In fact, it may be difficult to determine whether a particular piece of code is reachable at all. It may, in practice, even be dead code. Hence, we usually talk about weaknesses and leave larger, system level concerns for another discussion.
We first explain the goals and organization of SARD, then describe the very diverse content. After that, we give advice on how to find and use SARD cases, related work and collections, and future plans for SARD.
SARD Philosophy and Organization
The SARD consists of test cases, which are individual programs. Each test case has “metadata” to label and describe it. Many test cases are organized into test suites. Some test cases share common files with other cases.
The code is of typical quality. It is not necessarily pristine or exemplary, nor is it horrible. SARD is not a compiler test. For now, we ignore the question of language version, e.g., C99 vs. C11.
Users can search for test cases by programming language, weakness type, size, and several other criteria and can then browse, select, and download them. Users can access test suites, which are collections of test cases. We explain more in the section explaining how to use SARD content.
Many synthetic programs represent thousands of variations for different weakness classes. In theory only the code pertaining to the weakness need be examined to determine that it is, indeed, a weakness. However, analysis tools must handle an unbounded amount of surrounding code to find sources or sinks, determine conditions when the piece of code may be executed, etc. Many sets of synthetic programs have the same base weakness wrapped in different code complexities. For instance, an uninitialized variable may be declared in one function and a reference passed to another function, where it is used. Other code complexities are when the weakness is wrapped in various types of loops or conditionals or uses different data types.
Test cases are labeled “good,” “bad,” or “mixed.” A “bad” test case contains one or more specific weaknesses. A “good” test case is associated with bad cases, but the weaknesses are fixed. Good cases can be used to check false positives. A “mixed” test case has both, for instance, code with a weakness and the same code with the weakness fixed. Weaknesses are classified using the Common Weakness Enumeration (CWE) ID and name. We plan to also list their Bugs Framework (BF) class. Fig. 1 shows the result of searching for test cases. Clicking on 199265 displays that case, as shown in Fig. 2.
SARD is archival. That is, once a test case is deposited, it will not be removed or changed. That way, if research uses a test case, later researchers can access that exact test case, byte for byte, to replicate the results. This is important to determine if, say, a new technique is more powerful than a previous technique.
If problems are later found with a test case, a new version may be added. For instance, if an extraneous error is found in a test case or the test case uses an obsolete language feature, an alternate may be added that corrects the problem. The original can still be accessed, but its status is deprecated.
A test case is deprecated if it should not be used for new work. If a test case does not yet meet our documentation, correctness, and quality requirements, its status is candidate. When it does, its status is accepted. A user can expect that the documentation of an accepted test case contains:
- A description of the purpose of the test case.
- An indication that it is good (false alarm), bad (true positive), or mixed.
- Links to any associated test cases, e.g. the other half of a bad/good test case pair.
- The source code language, e.g. C, Java, or PHP.
- Instructions to compile, analyze, or execute the test, if needed. This may include compiler name/version, compiler directives, environment variable definitions, execution instructions, or other test context information.
- The weakness(es) class(es).
- If this is a “bad” or “mixed” test, the location of known weaknesses, e.g. file name and line number.
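The documentation items above can be pictured as a small record type. The following is a minimal sketch only: the class names, field names, and the `documentation_problems` helper are hypothetical illustrations, not SARD's actual schema or tooling.

```python
from dataclasses import dataclass, field


@dataclass
class FlawLocation:
    """Where a known weakness sits (hypothetical field names)."""
    file_name: str
    line: int


@dataclass
class CaseDoc:
    """Illustrative record mirroring the documentation items listed above."""
    case_id: int
    description: str
    kind: str                      # "good", "bad", or "mixed"
    language: str                  # e.g. "C", "Java", or "PHP"
    status: str = "candidate"      # "candidate", "accepted", or "deprecated"
    weakness_classes: list[str] = field(default_factory=list)
    associated_cases: list[int] = field(default_factory=list)
    instructions: str = ""         # compile, analyze, or execute notes, if needed
    flaws: list[FlawLocation] = field(default_factory=list)


def documentation_problems(doc: CaseDoc) -> list[str]:
    """Return reasons the record falls short of the documentation list."""
    problems = []
    if not doc.description:
        problems.append("missing description")
    if doc.kind not in ("good", "bad", "mixed"):
        problems.append("kind must be good, bad, or mixed")
    if not doc.language:
        problems.append("missing language")
    if doc.kind in ("bad", "mixed") and not doc.flaws:
        problems.append("bad or mixed case lists no weakness location")
    return problems
```

A check of this kind would flag, for example, a “bad” record that names no file and line for its weakness.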
Source code for an accepted test case will:
- Compile (for compilable languages).
- Run without fatal error, other than those expected for an incomplete program.
- Not generate any warnings, unless the warnings are expected as part of the test.
- Contain the documented weakness if the test case is a bad or mixed test case.
- Contain no weaknesses at all if it is a good case.
Figure 1. A screen shot of test cases found for a search. It shows that cases have “metadata,” that is, information such as test case ID, source code language, status, description, weaknesses included, and an indication of whether the case has weaknesses (bad), has no weaknesses (good), or is mixed.
Figure 2. Screen shot of code from case 199625. The NULL pointer dereference weakness shows up on lines 101, 103, and other lines. Each case has such “metadata” available for automatic processing.
We have permission to publicly furnish the SARD test cases. In fact, many test cases are in the public domain. We are working to attach explicit usage rights to each case and each suite.
SARD was designed to support almost one billion test cases. Cases are organized in directories of one thousand. For instance, when downloaded, the path to case 1320984 is testcases/001/320/984. That subdirectory may contain a single file, or it may contain many files and subdirectories for a large, complex case.
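The directory layout just described can be sketched as a small helper. This is an illustration only: it assumes that every ID is zero-padded to nine digits and split into three groups of three, which matches the single example given (case 1320984) but is not otherwise confirmed here. The function name `case_path` is hypothetical.

```python
from pathlib import PurePosixPath


def case_path(case_id: int, root: str = "testcases") -> PurePosixPath:
    """Map a test case ID to its assumed directory in a download.

    Assumption: IDs are padded to nine digits and split into three
    three-digit directory levels, e.g. 1320984 -> 001/320/984.
    """
    if not 0 < case_id < 1_000_000_000:
        raise ValueError("case ID must be between 1 and 999,999,999")
    digits = f"{case_id:09d}"
    return PurePosixPath(root, digits[0:3], digits[3:6], digits[6:9])


print(case_path(1320984))  # testcases/001/320/984
```

Three levels of one thousand entries each give just under one billion possible cases, which is consistent with the stated design capacity.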
Since it is not clear what the perfect test suite would be (or if there is one!), we gathered many different test case collections from many sources. This section describes the collections to provide an idea of the kinds of cases that are currently available. First we describe the large collections of synthetic cases generated by programs. Next we describe the collections of cases written by hand. Finally we describe cases from production code. Table 1 gives a very general idea of all SARD cases listing the number of cases in each language. (The counts in the table do not include deprecated cases.) Fig. 3 gives a better idea of the quantity, size, and source of cases in each language.
Table 1. Number of SARD test cases in each programming language as of 30 July 2017.
|Language|Number of Cases|
Figure 3. Number of test cases by language, clustered by lines of code. The Y axis is the mean lines of code of test cases in the cluster. Synthetic cases (SYN), in which all weaknesses are known, are yellow circles. Cases with weaknesses injected (INJ) into production code are orange triangles. Production code cases (PRO), which have some weaknesses identified, are green squares. The size of each circle, triangle, or square is the logarithm of the number of test cases in that cluster; a larger symbol means more test cases.
By far the largest number of test cases are synthetic. One of our first collections came from MIT Lincoln Laboratory. They developed a taxonomy of code complexities and 291 basic C programs representing this taxonomy to investigate static analysis and dynamic detection methods for buffer overflows. Each program has four versions: a “good” version, accessing within bounds, and three “bad” versions, accessing just outside, moderately outside, and far outside the boundary of the buffer. These 1164 cases are explained in Kratkiewicz and Lippmann and are designated test suite 89.
In 2011, the National Security Agency’s Center for Assured Software (CAS) generated thousands of test cases in C/C++ and Java covering over 100 CWEs, called Juliet 1.0. (This was the tenth major SARD contribution and was named for the tenth letter of the International Radiotelephony Spelling Alphabet, which is “Juliet.”) They can be compiled individually, in groups, or all together. Each case is one or two pages of code. They are grouped by language, then by CWE. In each CWE, base programs, using versions of printf() or different data types, are elaborated with up to 30 variants having complexities added. The following year they extended the collection with version 1.1, described in Boland and Black. The latest version is Juliet 1.2, which comprises 61 387 C/C++ programs and 25 477 Java programs for almost two hundred weakness classes. They are test suites 86 (C/C++) and 87 (Java). The Juliet 1.0 and 1.2 suites are further described in documents at https://samate.nist.gov/SARD/around.php.
Following an architecture developed by NIST personnel and under their direction, a team of students at TELECOM Nancy, a computer engineering school of the Université de Lorraine, Nancy, France, implemented a generator that created many PHP cases. After that, other students rewrote the generator to be more modular and extensible, under guidance of members of the NIST Software Assurance Metrics And Tool Evaluation (SAMATE) team. They created a suite of 42 212 test cases in PHP covering the most common security weakness categories, including XSS, SQL injection, URL redirection, etc. These are suite 103 and are documented in Stivalet and Fong. In 2016, SAMATE members oversaw additional work, again by TELECOM Nancy students, who created a suite of 32 003 cases in C#. These cases are suite 105.
Manually Written Cases
Many companies donated synthetic benchmarks that they developed manually. Fortify Software Inc., now HP Fortify, contributed a collection of C programs that manifest various software security flaws. They updated the collection as ABM 1.0.1. These 112 cases cover various software security flaws, along with associated “good” versions. These are test suite 6. In 2006, Klocwork Inc. shared 41 C and C++ cases from their regression suite. These are all a few lines of code to demonstrate use after free, memory leak, use of uninitialized variables, etc. Toyota InfoTechnology Center (ITC), U.S.A. Inc. created a benchmark in C and C++ for undefined behavior and concurrency weaknesses. The test suite, 104, has 100 test cases containing a total of 685 pairs of weaknesses. Each pair has a version of a function with a weakness and a fixed version of the function. For more details see . The test cases are © 2012-2014 Toyota InfoTechnology Center, U.S.A. Inc., distributed under the “BSD License,” and added to SARD by permission. The SAMATE team noted coincidental weaknesses.
SARD also includes 329 cases from our static analyzer test suites. These have suites for weaknesses, false positives, and weakness suppression in C (test suites 100 and 101), C++ (57, 58, and 59), and Java (63, 64, and 65).
SARD includes many small collections of synthetic test cases from various sources. Frédéric Michaud and Frédéric Painchaud, Defence R&D Canada, created and shared 25 C++ test cases. These test cases cover string and allocation problems, memory leaks, divide by zero, infinite loop, incorrect use of iterator, etc. These are test suite 62. Robert C. Seacord contributed 69 examples from “Secure Coding in C and C++”. John Viega wrote “The CLASP Application Security Process” as a training reference for the Comprehensive, Lightweight Application Security Process (CLASP) of Secure Software, Inc. SARD initially included 36 cases with examples of software vulnerabilities from use of hard-coded password and unchecked error condition to race conditions and buffer overflow. Many of the original cases have been improved and replaced. Hamda Hasan contributed 15 cases in C#, including ASP.NET, with XSS, SQL injection, command injection, and hard-coded password weaknesses.
Cases From Production Software
All the cases described to this point were a few pages of code at most and were written specifically to serve as focused tests. Small synthetic cases may not show if a technique scales or if an algorithm can handle production code with complicated, interconnected data structures over thousands of files and variables. To fill this gap, SARD has cases that came from operational code.
MIT Lincoln Laboratory extracted 14 program slices from popular Internet applications (BIND, Sendmail, WU-FTP, etc.) with known, exploitable buffer overflows. That is, they removed all but a relatively few functions, data structures, files, etc. so the remaining code (“the slice”) has the overflow. They also made “good” (patched) versions of each slice. These 28 test cases are in SARD as test suite 88.
The Intelligence Advanced Research Projects Activity (IARPA) Securely Taking On New Executable Software Of Uncertain Provenance (STONESOUP) program created test suites in three phases. The goal of STONESOUP was to fuse static analysis, dynamic analysis, execution monitoring, and other techniques to achieve orders of magnitude greater assurance. For Phase 1 they developed five collections of small C and Java programs covering five vulnerabilities: memory corruption and null pointer dereference for C, and injection, numeric handling, and tainted data for Java. Each collection may be downloaded from the SARD Test Suites page and includes directions on how to compile and execute them and inputs that trigger the vulnerability. The test cases for Phase 2 were not particularly different from the Phase 1 cases.
For Phase 3, STONESOUP injected thousands of weakness variants into 16 widely-used web applications, resulting in 3188 Java cases and 4582 C cases. The weaknesses covered 25 classes such as integer overflow, tainted data, command injection, buffer overflow, and null pointer. Each case is accompanied by inputs triggering the vulnerability, as well as “safe” inputs. Because the cases represent thousands of copies of full-sized applications, IARPA STONESOUP Phase 3 is distributed as a virtual machine with a complete testing environment: the base applications, all libraries needed to compile them, difference (delta) files with flaws, and a Test and Evaluation eXecution and Analysis System (TEXAS) to compile an executable from a difference file and the base app, binaries to monitor the execution, triggering and safe inputs, and expected outputs. This material, as well as results of STONESOUP, is described in documents available at https://samate.nist.gov/SARD/around.php.
The STONESOUP base applications are significant enough by themselves that we describe them here. They are available from the Test Suite page as Standalone apps. These 15 apps are GNU grep, OpenSSL, PostgreSQL, Tree (a directory listing command), Wireshark, Coffee MUD (Multi-User Dungeon game), Elastic Search, Apache Subversion, Apache Jena, Apache JMeter, Apache Lucene, POI (Apache Java libraries for reading and writing files in Microsoft Office formats), FFmpeg (a program to record, convert, and stream audio and video), and GIMP (GNU Image Manipulation Program). Application ID 16, JTree, is different from the others. It is a smaller form of STONESOUP, that is, a single base case injected with weaknesses. The base case is a Java version of Tree. When processed with unzip, ID 16 produces 34 subdirectories, each with difference files to create a version of JTree with an injected weakness. Running the included generate_application_testcases.py creates different versions of JTree, along with test material.
For Static Analysis Tool Expositions (SATE), SAMATE members tracked vulnerabilities reported through Common Vulnerabilities and Exposures (CVE) to source code changes. This resulted in 228 CVEs in WordPress, Openfire, JSPWiki, Jetty, Apache Tomcat, Wireshark (1.2 and 1.8), Dovecot, Chrome, and Asterisk. Each of these programs has its own test suite. Each CVE has one test case that contains the file or files with the vulnerabilities. The first test case in each suite has all the CVEs, files, and identified vulnerabilities for that program. These 10 test suites represent hundreds of reported, known vulnerabilities and the corresponding source code.
How to Use SARD Test Cases and Test Suites
The first step is to decide what test case properties are important to your situation. Programming language is the most obvious characteristic. Clicking on the “Search” tab, you may search SARD by many criteria, such as programming language, type of weakness, words in the description, type (bad, good, or mixed), status, and test case IDs, as shown in Fig. 4. Test case IDs may be ranges or lists. The type of weakness is matched to CWE descriptions as you type. You can search for and select those weaknesses that are most crucial in your situation.
Figure 4. SARD search page. Users may search for test cases meeting any combination of these criteria. Source: https://samate.nist.gov/SARD/index.php
Matching test cases are displayed as in Fig. 1. You may browse, select, and download any or all of the resulting cases. The download is a zip file containing a manifest (an XML listing of test cases and weakness locations) and the cases in a hierarchical directory structure of thousands described earlier.
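Because the manifest is XML, weakness locations can be pulled out with a few lines of code. The sketch below is illustrative only: the element and attribute names (`container`, `testcase`, `file`, `flaw`, `path`, `line`, `name`) and the sample content are assumptions made up for this example, so check them against a manifest you have actually downloaded before relying on them.

```python
import xml.etree.ElementTree as ET

# Made-up sample; the real manifest's element and attribute names may differ.
SAMPLE_MANIFEST = """\
<container>
  <testcase id="1" type="bad" language="C">
    <file path="testcases/000/000/001/example.c">
      <flaw line="101" name="CWE-476: NULL Pointer Dereference"/>
      <flaw line="103" name="CWE-476: NULL Pointer Dereference"/>
    </file>
  </testcase>
  <testcase id="2" type="good" language="C">
    <file path="testcases/000/000/002/example.c"/>
  </testcase>
</container>
"""


def flaw_locations(manifest_xml: str) -> dict[tuple[str, int], str]:
    """Map (file path, line number) to the weakness name recorded there."""
    root = ET.fromstring(manifest_xml)
    locations = {}
    for file_el in root.iter("file"):
        for flaw in file_el.iter("flaw"):
            key = (file_el.get("path"), int(flaw.get("line")))
            locations[key] = flaw.get("name")
    return locations


for (path, line), name in sorted(flaw_locations(SAMPLE_MANIFEST).items()):
    print(f"{path}:{line}: {name}")
```

A lookup table keyed by file and line makes it straightforward to compare tool warnings against the known weakness locations.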
In the File Search page, you can search for cases having files with certain names, sizes, or numbers of files, as shown in Fig. 5. This kind of search is useful if you are trying to find, say, very large test cases. We search by file name to find where test files come from or to find related cases, which often have files with similar names.
The SARD Test Suites page lists stand-alone suites, which are very large; test suites, which are collections of test cases; and web and mobile applications. The web and mobile sections are for large (full-sized) applications that we will host in those domains. The Standalone apps section currently consists of the STONESOUP base cases, described previously. The test suites page also has links to old collections that have been superseded.
Figure 5. SARD file search page. Users may search for test cases having files with particular names, files of particular sizes (minimum, maximum, or both), and particular numbers of files. Users may give a regular expression to match file names, but a regex search is far slower. Source: https://samate.nist.gov/SARD/index.php
Paraphrasing Boland et al., many test suites, such as Juliet, are structured so that all the small test cases can be analyzed or compiled as a single, large program. This helps assess how a software-assurance tool performs on larger programs. Because of the number of files and size of code, some tools might not be able to analyze all these test cases as a single program. Another use is to analyze separate test cases individually or in groups.
Because the manifest indicates where flaws occur, users can evaluate tool reports semiautomatically. When users run a source code analysis tool on a test case, the desired result is for the tool to report one or more flaws of the target type. A report of this type might be considered a true positive. If the tool doesn’t report a flaw of the target type in a bad method, it might be considered a false negative. Ideally, the tool won’t report flaws of the target type in a “good” test case or function; a report of this type might be considered a false positive. Because flawed and similar unflawed code might be in infeasible or “dead” code, users’ policies on warnings about infeasible code must be taken into account.
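The bookkeeping described above can be sketched as set arithmetic. This is a simplified illustration, not a SARD tool: the `Finding` type and `score` function are hypothetical, and the sketch assumes a tool report counts only when it matches a known flaw's file, line, and CWE exactly. Real evaluations often match at the function level or allow nearby lines, and must also apply the user's policy on infeasible code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """A weakness location, either known from the manifest or reported by a tool."""
    path: str
    line: int
    cwe: str


def score(known: set[Finding], reported: set[Finding], target_cwe: str) -> dict[str, int]:
    """Count matches for one target weakness type, using exact-location matching."""
    expected = {f for f in known if f.cwe == target_cwe}
    claimed = {f for f in reported if f.cwe == target_cwe}
    return {
        "true_positive": len(expected & claimed),   # known flaw, reported
        "false_negative": len(expected - claimed),  # known flaw, missed
        "false_positive": len(claimed - expected),  # reported, no known flaw there
    }


known = {Finding("a.c", 10, "CWE-121"), Finding("a.c", 20, "CWE-121")}
reported = {
    Finding("a.c", 10, "CWE-121"),
    Finding("b.c", 5, "CWE-121"),
    Finding("a.c", 30, "CWE-476"),  # ignored: not the target type
}
print(score(known, reported, "CWE-121"))
```

Reports of other weakness types are set aside rather than counted as false positives, since the test case is only documented for its target weakness.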
As an illustration of using SARD, we offer our development of a small number of cases to show that a tool is effective at finding stack-based buffer overflows. First, we downloaded all buffer overflow test cases from SARD. To expedite analysis, we split every Juliet test case into two cases: one with only the bad code and one with only the good code. We also removed some unreachable code and conditional compilation commands. This resulted in 7338 test cases. We ran five tools on those cases. We compared, discussed, and grouped results until we came up with seven principles for selecting test cases. This would have been far harder without the resources of SARD.
We know of several other fixed collections of software assurance test cases. Some include tools to run experiments and compute results. After we itemize those collections, we list work to generate sets as needed.
The Software-artifact Infrastructure Repository (SIR) is “meant to support rigorous controlled experimentation with program analysis and software testing techniques, and education in controlled experimentation.” It provides Java, C, C++, and C# programs in multiple versions, along with testing tools, documentation, and other material. We found 85 objects, the most recent updated in 2015.
FaultBench “is a set of real, subject Java programs for comparison and evaluation of actionable alert identification techniques (AAITs) that supplement automated static analysis.” FaultBench has 780 bugs across six programs.
The OWASP Benchmark for Security Automation “is a free and open test suite designed to evaluate the speed, coverage, and accuracy of automated software vulnerability detection tools and services”. It has 2740 small test cases, both with weaknesses and without weaknesses, in Java. It includes programs and scripts to run a tool and compute some results.
The Software Assurance Marketplace (SWAMP) provides more than 270 packages, in addition to the Juliet test suite, which is described above. “Packages are collections of files containing code to be assessed along with information about how to build the software package, if necessary.” There are packages in Java, Python, Ruby, C, C++, and web scripting languages. Each package may have multiple versions.
SARD approaches the problem of test cases by collecting a static set. An alternative is to generate sets of cases on demand. In theory, one could specify the language, weaknesses, code complexities, and other facets, and get a potentially unique set as needed. Generating sets on demand would be one way to address the concern that tool makers might add bits of code just to get a high score on a static benchmark. Generated cases could be automatically obfuscated, too. The disadvantage is that each generated set would have to be examined to be sure that the cases serve their purpose (or else the generator itself would have to be qualified, which is much harder). In practice, code generators and bug injectors are enormously difficult to build. Nevertheless, there is some work.
Large-Scale Automated Vulnerability Addition (LAVA) creates corpora by injecting large numbers of bugs into existing source code. EvilCoder also injects bugs into existing code, although it injects bugs by locating guard or checking code and selectively disabling it.
The test generators implemented by TELECOM Nancy students are the source of large SARD test suites in PHP and C#. The Department of Homeland Security’s Static Tool Analysis Modernization Project (STAMP) recently awarded a contract to GrammaTech that includes development of a test case generator, Bug Injector. KDM Analytics Inc. is enhancing their test case generator, TCG, for CAS. The latest version, TCG 3.2, produces both “bad” (flawed) and “good” (false positive) cases in C, C++, Java, and C# with control, data, and scope complexities. TCG can generate millions of cases covering some three dozen Software Fault Pattern (SFP) clusters and many CWEs for Linux, Microsoft Windows, or both. Generated cases don’t have a main() function, which allows cases to be compiled individually or as one large program.
Future of SARD
SARD began in 2006 in order to collect test cases for the NIST SAMATE project. We had planned to collect artifacts from all phases of the software development life cycle, including designs, source code, and binaries, in order to evaluate assurance tools for all of those. We still leave that option open, but have not yet found many tools or pressing needs for other phases.
We invite developers and researchers to donate their collections to SARD. It is a loss to the community when someone puts a lot of effort into developing a collection, then, after several years and project changes, the collection is lost.
Currently weaknesses are labeled by CWE. We will add labels of Bugs Framework (BF) classes when the framework is more complete.
Analysts, users, and developers have cut months off the time needed to evaluate a tool or technique using test cases from SARD. SARD has been used by tool implementers, software testers, security analysts, and four SATEs to expand awareness of static analysis tools. Educators can refer students to SARD to find examples of weaknesses. Having a reliable and growing body of code with known weaknesses helps the community improve software assurance tools themselves and encourage their appropriate use.
We thank David Flater for making the chart in Fig. 3 and Charles de Oliveira for elucidating much of the SARD content.