Awesome Empirical Software Engineering
      
    
    
      A curated repository of data sets and tools that can be used for
      conducting evidence-based, data-driven research on software systems. This
      research approach is often termed experimental or empirical software
      engineering. Many of the data sets can also be useful in research using
      search-based software engineering
      methods. The repository is named after the
      Mining Software Repositories (MSR)
      conference series. For examples of such work see the MSR conference’s
      Hall of Fame.
    
    
      - 
        This list requires your input for its continuous improvement. Read the
        contribution guide for instructions on how
        you can contribute. Alternatively, you can send me an
        email if you find the process too
        cumbersome or confusing.
      
 
      - 
        For more awesome lists, see
        awesome.
      
 
    
    Repositories
    
      - 
        SIR -
        Software-artifact infrastructure repository; Java, C, C++, and C#
        software together with test suites and fault data.
      
 
      - 
        PROMISE
        - About 20 datasets related to software engineering research.
      
 
      - 
        FLOSSmole -
        Collaborative collection and analysis of free/libre/open source project
        data.
      
 
      - 
        Zenodo - Software data collections in
        CERN’s open-access repository.
        
      
 
    
    Data Sets
    
      - 
        AndroidTimeMachine -
        Graph-based dataset of commit history of 8,431 real-world Android apps.
      
 
      - 
        AndroZoo - Collection of Android
        Applications.
      
 
      - 
        Bug Prediction Dataset -
        Collection of models and metrics from Eclipse JDT Core, PDE UI, Equinox
        Framework, Lucene, Mylyn, and their histories.
      
 
      - 
        Code Reviews -
        Code reviews of OpenStack, LibreOffice, AOSP, Qt, Eclipse.
      
 
      - 
        CoREBench
        - Collection of 70 realistically Complex Regression Errors that were
        systematically extracted from the repositories and bug reports of four
        open-source software projects: Make, Grep, Findutils, and Coreutils.
      
 
      - 
        Cryptocurrency GitHub Activity and Market Cap Dataset
        - Activity such as commits, stars, prices, and market cap of over 200
        cryptocurrency projects on GitHub over time. Raw, historic data is also
        available.
      
 
      - 
        Defects4J - Collection
        of 395 reproducible bugs collected with the goal of advancing software
        testing research.
      
 
      - 
        Eclipse AERI stacktraces
        - Collection of stack traces of exceptions encountered by users of the
        Eclipse IDE, as retrieved by the AERI reporting system.
      
 
      - 
        Enron Spreadsheets and Emails
        - All the spreadsheets and emails used in the paper ‘Enron’s
        Spreadsheets and Related Emails: A Dataset and Analysis’.
      
 
      - 
        Findbugs-maven
        - Set of FindBugs reports for the Java projects of the
        Maven repository.
      
 
      - 
        GHTorrent - Scalable, queryable,
        offline mirror of data offered through the GitHub REST API.
      
 
      - 
        GitHub Bug Dataset
        - Bug Dataset of 15 Java open-source projects characterized by static
        source code metrics.
      
 
      - 
        GitHub on Google BigQuery
        - GitHub data accessible through Google’s BigQuery platform.
      
 
      - 
        Grammar Zoo - Collection of
        grammars of DSLs and GPLs, some extracted from metamodels and document
        schemata.
      
 
      - 
        KaVE - Developer tool
        interaction data.
      
 
      - 
        Linux Kernel 4.21 Call Graphs
        - The Linux Kernel 4.21 Call Graphs produced using
        CScout.
      
 
      - 
        Maven metrics -
        Collection of software complexity & sizing metrics for the
        Maven Repository.
      
 
      - 
        Maven Dependency Graph -
        Snapshot of the whole Maven Central taken on September 6, 2018, stored
        in a graph database.
      
 
      - 
        mzdata - Multi-extract
        and multi-level dataset of Mozilla issue tracking history.
      
 
      - 
        npm-miner
        - Analysis results of five open source software quality tools
        (eslint, escomplex, nsp, jsinspect, and sonarjs) for 2000
        popular (in terms of stars and downloads) npm packages.
      
 
      - 
        OCL Expressions on GitHub
        - Data set of 9188 OCL expressions originating from 504 EMF meta-models
        in 245 systematically selected GitHub repositories.
      
 
      - 
        RepoReapers Data Set - Data
        set containing a collection of
        engineered software projects from GHTorrent.
      
 
      - 
        Software Heritage Graph Dataset
        - Graph of the development history and file metadata of more than 80
        million software projects from various forges (GitHub, GitLab, Debian,
        PyPI, Google Code, etc.) in a deduplicated and unified representation.
      
 
      - 
        STAMINA - (STAte
        Machine INference Approaches) data are used to benchmark techniques for
        learning deterministic finite state machines (FSMs).
      
 
      - 
        Stack Exchange -
        Anonymized dump of all user-contributed content on the Stack Exchange
        network.
      
 
      - 
        TravisTorrent -
        Provides free and easy-to-use Travis CI build analyses.
      
 
      - 
        Ultimate Debian Database (UDD)
        - Data about various aspects of Debian (e.g. packages, bugs, maintainers)
        in the same SQL database.
      
 
      - 
        Unified Bug Dataset
        - Static source code based datasets, including the Bugcatchers Bug
        Dataset, the
        Bug Prediction Dataset,
        the
        Eclipse Bug Dataset, the
        GitHub Bug Dataset, and some datasets from the
        PROMISE
        repository.
      
 
      - 
        Unix history
        - Git repository covering 46 years of Unix evolution.
      
 
    
    Tools
    
      - 
        astminer -
        Library and tool for mining of path-based representations of code and
        other data derived from ASTs.
      
 
      - 
        Boa - Domain-specific language
        and infrastructure that eases mining software repositories.
      
 
      - 
        buckwheat
        - Multi-language tokenizer for extracting identifiers from source code.
      
 
      - 
        ckjm - Chidamber and
        Kemerer Java Metrics.
      
 
      - 
        Coming - A Java
        framework for analyzing code changes and mining instances of change
        patterns from Git repositories.
      
 
      - 
        CryptOSS - Mine
        GitHub activity and market cap data for cryptocurrency projects.
      
 
      - 
        DbDeo - Extract
        embedded SQL statements and detect database schema smells.
      
 
      - 
        Designite - Compute source
        code metrics and detect a variety of implementation, design, and
        architecture smells for C#.
      
 
      - 
        DesigniteJava
        - Compute source code metrics and detect a variety of implementation and
        design smells for Java.
      
 
      - 
        Diggit - Agile Ruby
        Tool to analyze Git repositories.
      
 
      - 
        GrimoireLab -
        Free/Libre/Open Source tools for Software Development Analytics.
      
 
      - 
        MetricMiner
        - Lean Java DSL to mine and extract data (e.g. commits, developers,
        modifications, diffs) from Git and SVN repositories.
      
 
      - 
        Maven-miner
        - Java tools and infrastructure to resolve the whole Maven dependency
        graph, hosted in Maven Central, in the form of a
        Neo4j Graph.
      
 
      - 
        Perceval -
        Fetch repository data from tens of back-ends.
      
 
      - 
        Puppeteer -
        Detect configuration smells in Puppet code.
      
 
      - 
        PyDriller - Python
        Framework to analyse Git repositories.
      
 
      - 
        qmcalc - Calculate
        quality metrics from C source code.
      
 
      - 
        reaper - Python tool
        to compute a score for a repository from GHTorrent. The score quantifies
        the extent to which the project contained within the repository is
        engineered.
      
 
      - 
        RefactoringMiner
        - Library/API for detection of refactorings in changes of Java code.
      
 
      - 
        VulData7 - Java
        framework enabling the automated collection of commits fixing
        vulnerabilities that are reported in NVD (links NVD with Git).
      
 
    
    Research Outlets
    
      - 
        Outlets exclusively devoted to empirical software engineering research
        
      
 
      - 
        Outlets that publish empirical software engineering research
        
      
 
    
    License
    
      
    
    
      To the extent possible under law,
      Diomidis Spinellis has waived all
      copyright and related or neighboring rights to this work.