BinPool: Unlock Deeper Vulnerability Discovery in Binaries with This New Dataset

BinPool is a dataset consisting of vulnerable and patched binaries derived from historical Debian packages, compiled using four different optimization levels. It can be used for vulnerability discovery tasks through various methods, including machine learning and static analysis.

Features

BinPool provides the following features:

  • Contains 603 unique CVEs spanning 89 CWEs.
  • Includes the fix version of the corresponding Debian package for each CVE.
  • Covers various programming languages (C, C++, Java, Python, PHP).
  • Provides function and module names present in both patch and binary versions.

Measurement                   Value
Number of Unique CVEs         603
Number of CWEs                89
Number of Debian Files        824
Total Number of Binaries      6144
Number of Debian Packages     162
Number of Source Modules      768
Number of Source Functions    910
Number of Binary Functions    7280
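
The statistics above are internally consistent with the build setup: each source module and each source function appears once per version (vulnerable and patch) and once per optimization level. The short check below works through that arithmetic. The reading of the counts as "source items × 2 versions × 4 optimization levels" is an inference from the numbers, not something the dataset description states, and `variants_per_source_item` is a hypothetical helper name.

```python
# Counts copied from the statistics table above.
SOURCE_MODULES = 768
SOURCE_FUNCTIONS = 910
TOTAL_BINARIES = 6144
BINARY_FUNCTIONS = 7280

VERSIONS = 2     # vulnerable and patch
OPT_LEVELS = 4   # four optimization levels (opt0 through opt3)


def variants_per_source_item() -> int:
    """Number of compiled variants assumed for each source module or function."""
    return VERSIONS * OPT_LEVELS


# Assumption being checked: every source item is built for every
# version/optimization-level combination.
assert SOURCE_MODULES * variants_per_source_item() == TOTAL_BINARIES
assert SOURCE_FUNCTIONS * variants_per_source_item() == BINARY_FUNCTIONS
```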

Below is a list of the most frequent CWEs in BinPool:

CWE        CWE Name                             Count
CWE-787    Out-of-bounds Write                  71
CWE-476    NULL Pointer Dereference             61
CWE-125    Out-of-bounds Read                   54
CWE-190    Integer Overflow or Wraparound       34
CWE-20     Improper Input Validation            28
CWE-416    Use After Free                       27
CWE-400    Uncontrolled Resource Consumption    20

You can download the dataset from Zenodo.

After downloading the data, the structure will be as follows:

CVE-ID/
├── vulnerable/   # Directory containing vulnerable versions
│   ├── opt0/     # Optimization level 0 for vulnerable version
│   ├── opt1/     # Optimization level 1 for vulnerable version
│   ├── opt2/     # Optimization level 2 for vulnerable version
│   └── opt3/     # Optimization level 3 for vulnerable version
└── patch/        # Directory containing patched versions
    ├── opt0/     # Optimization level 0 for patched version
    ├── opt1/     # Optimization level 1 for patched version
    ├── opt2/     # Optimization level 2 for patched version
    └── opt3/     # Optimization level 3 for patched version
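
The layout above can be traversed with a few lines of Python. The following is a minimal sketch, not part of BinPool's own tooling: it assumes the extracted dataset root contains one directory per CVE, and it makes no assumption about what the files inside each `optN` directory are called. The names `iter_variants` and `VariantFiles` are hypothetical.

```python
from pathlib import Path
from typing import Iterator, NamedTuple, Tuple

VERSIONS = ("vulnerable", "patch")
OPT_LEVELS = ("opt0", "opt1", "opt2", "opt3")


class VariantFiles(NamedTuple):
    cve_id: str               # name of the CVE directory
    version: str              # "vulnerable" or "patch"
    opt_level: str            # "opt0" .. "opt3"
    files: Tuple[Path, ...]   # every regular file found under that directory


def iter_variants(root: Path) -> Iterator[VariantFiles]:
    """Yield the files stored for each CVE / version / optimization level."""
    # Assumption: each immediate subdirectory of root is one CVE-ID directory.
    for cve_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for version in VERSIONS:
            for opt in OPT_LEVELS:
                variant_dir = cve_dir / version / opt
                if not variant_dir.is_dir():
                    continue  # tolerate incomplete entries
                files = tuple(sorted(f for f in variant_dir.rglob("*") if f.is_file()))
                yield VariantFiles(cve_dir.name, version, opt, files)
```

For example, iterating over `iter_variants(Path("BinPool"))`, where `BinPool` is a placeholder for wherever the archive was extracted, gives one record per version and optimization level. Grouping those records by `cve_id` and `opt_level` is one way to line up vulnerable and patched builds for comparison.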

Install & Use