Awesome Data Engineering
    
      A curated list of data engineering tools for software developers
      
    
    Table of contents
    
      - Databases
 
      - Ingestion
 
      - File System
 
      - Serialization format
 
      - Stream Processing
 
      - Batch Processing
 
      - Charts and Dashboards
 
      - Workflow
 
      - Data Lake Management
 
      - 
        ELK Elastic Logstash Kibana
      
 
      - Docker
 
      - Datasets
 
      - Monitoring
 
      - Community
 
    
    Databases
    
      - 
        Relational
        
          - 
            RQLite Replicated
            SQLite using the Raft consensus protocol
          
 
          - 
            MySQL The world’s most popular
            open source database.
            
              - 
                TiDB TiDB is a
                distributed NewSQL database compatible with MySQL protocol
              
 
              - 
                Percona XtraBackup
                Percona XtraBackup is a free, open source, complete online
                backup solution for all versions of Percona Server, MySQL® and
                MariaDB®
              
 
              - 
                mysql_utils
                Pinterest MySQL Management Tools
              
 
            
           
          - 
            MariaDB An enhanced, drop-in
            replacement for MySQL.
          
 
          - 
            PostgreSQL The world’s
            most advanced open source database.
          
 
          - 
            Amazon RDS Amazon RDS
            makes it easy to set up, operate, and scale a relational database in
            the cloud.
          
 
          - 
            Crate.IO Scalable SQL database with
            the NoSQL goodies.
          
 
        
       
      - 
        Key-Value
        
          - 
            Redis An open source, BSD licensed,
            advanced key-value cache and store.
          
 
          - 
            Riak A distributed
            database designed to deliver maximum data availability by
            distributing data across multiple servers.
          
 
          - 
            AWS DynamoDB A fast
            and flexible NoSQL database service for all applications that need
            consistent, single-digit millisecond latency at any scale.
          
 
          - 
            HyperDex HyperDex
            is a scalable, searchable key-value store. Deprecated.
          
 
          - 
            SSDB A high performance NoSQL database
            supporting many data structures, an alternative to Redis
          
 
          - 
            Kyoto Tycoon Kyoto
            Tycoon is a lightweight network server on top of the Kyoto Cabinet
            key-value database, built for high-performance and concurrency
          
 
          - 
            IonDB A
            key-value store for microcontroller and IoT applications
          
 
        
       
      - 
        Column
        
          - 
            Cassandra The right
            choice when you need scalability and high availability without
            compromising performance.
            
              - 
                Cassandra Calculator
                This simple form allows you to try out different values for your
                Apache Cassandra cluster and see what the impact is for your
                application.
              
 
              - 
                CCM A script to
                easily create and destroy an Apache Cassandra cluster on
                localhost
              
 
              - 
                ScyllaDB NoSQL
                data store using the Seastar framework, compatible with Apache
                Cassandra. https://www.scylladb.com/
              
 
            
           
          - 
            HBase The Hadoop database, a
            distributed, scalable, big data store.
          
 
          - 
            AWS Redshift A fast,
            fully managed, petabyte-scale data warehouse that makes it simple
            and cost-effective to analyze all your data using your existing
            business intelligence tools.
          
 
          - 
            FiloDB Distributed.
            Columnar. Versioned. Streaming. SQL.
          
 
          - 
            Vertica Distributed, MPP
            columnar database with extensive analytics SQL.
          
 
          - 
            ClickHouse Distributed
            columnar DBMS for OLAP. SQL.
          
 
        
       
      - 
        Document
        
          - 
            MongoDB An open-source,
            document database designed for ease of development and scaling.
            
              - 
                Percona Server for MongoDB
                Percona Server for MongoDB® is a free, enhanced, fully
                compatible, open source, drop-in replacement for the MongoDB®
                Community Edition that includes enterprise-grade features and
                functionality.
              
 
              - 
                MemDB
                Distributed Transactional In-Memory Database (based on MongoDB)
              
 
            
           
          - 
            Elasticsearch Search &
            Analyze Data in Real Time.
          
 
          - 
            Couchbase The highest
            performing NoSQL distributed database.
          
 
          - 
            RethinkDB The open-source
            database for the realtime web.
          
 
          - 
            RavenDB Fully Transactional NoSQL
            Document Database.
          
 
        
       
      - 
        Graph
        
          - 
            Neo4j The world’s leading graph
            database.
          
 
          - 
            OrientDB 2nd Generation
            Distributed Graph Database with the flexibility of Documents in one
            product with an Open Source commercial friendly license.
          
 
          - 
            ArangoDB A distributed free
            and open-source database with a flexible data model for documents,
            graphs, and key-values.
          
 
          - 
            Titan A scalable graph
            database optimized for storing and querying graphs containing
            hundreds of billions of vertices and edges distributed across a
            multi-machine cluster.
          
 
          - 
            FlockDB A
            distributed, fault-tolerant graph database by Twitter. Deprecated.
          
 
        
       
      - 
        Distributed
        
          - 
            Datomic The fully
            transactional, cloud-ready, distributed database.
          
 
          - 
            Apache Geode An open source,
            distributed, in-memory database for scale-out applications.
          
 
          - 
            Gaffer A large-scale
            graph database
          
 
        
       
      - 
        Timeseries
        
          - 
            InfluxDB
            Scalable datastore for metrics, events, and real-time analytics.
          
 
          - 
            OpenTSDB A
            scalable, distributed Time Series Database.
          
 
          - 
            QuestDB A relational
            column-oriented database designed for real-time analytics on time
            series and event data.
          
 
          - 
            KairosDB Fast,
            scalable time series database.
          
 
          - 
            Heroic A scalable
            time series database based on Cassandra and Elasticsearch, by
            Spotify
          
 
          - 
            Druid Column
            oriented distributed data store ideal for powering interactive
            applications
          
 
          - 
            Riak-TS Riak TS is
            the only enterprise-grade NoSQL time series database optimized
            specifically for IoT and Time Series data
          
 
          - 
            Akumuli Akumuli is
            a numeric time-series database. It can be used to capture, store and
            process time-series data in real time. The word “akumuli” can be
            translated from Esperanto as “accumulate”.
          
 
          - 
            Rhombus A
            time-series object store for Cassandra that handles all the
            complexity of building wide row indexes.
          
 
          - 
            Dalmatiner DB
            Fast distributed metrics database
          
 
          - 
            Blueflood A
            distributed system designed to ingest and process time series data
          
 
          - 
            Timely
            Timely is a time series database application that provides secure
            access to time series data based on Accumulo and Grafana.
          
 
        
       
      - 
        Other
        
          - 
            Tarantool
            Tarantool is an in-memory database and application server.
          
 
          - 
            Greenplum The
            Greenplum Database (GPDB) is an advanced, fully featured, open
            source data warehouse. It provides powerful and rapid analytics on
            petabyte scale data volumes.
          
 
          - 
            Cayley An
            open-source graph database by Google.
          
 
          - 
            Snappydata
            SnappyData: OLTP + OLAP Database built on Apache Spark
          
 
          - 
            TimescaleDB Built as an
            extension on top of PostgreSQL, TimescaleDB is a time-series SQL
            database providing fast analytics and scalability, with automated
            data management on a proven storage engine.
          
 
        
       
    
    Data Ingestion
    
      - 
        Kafka Publish-subscribe
        messaging rethought as a distributed commit log.
        
          - 
            BottledWater
            Change data capture from PostgreSQL into Kafka. Deprecated.
          
 
          - 
            kafkat Simplified
            command-line administration for Kafka brokers
          
 
          - 
            kafkacat Generic
            command line non-JVM Apache Kafka producer and consumer
          
 
          - 
            pg-kafka A
            PostgreSQL extension to produce messages to Apache Kafka
          
 
          - 
            librdkafka The
            Apache Kafka C/C++ library
          
 
          - 
            kafka-docker
            Kafka in Docker
          
 
          - 
            kafka-manager A
            tool for managing Apache Kafka
          
 
          - 
            kafka-node
            Node.js client for Apache Kafka 0.8
          
 
          - 
            Secor Pinterest’s
            Kafka to S3 distributed consumer
          
 
          - 
            Kafka-logger
            Kafka-winston logger for Node.js from Uber
          
 
        
       
      - 
        AWS Kinesis A fully
        managed, cloud-based service for real-time data processing over large,
        distributed data streams.
      
 
      - 
        RabbitMQ Robust messaging for
        applications.
      
 
      - 
        FluentD An open source data
        collector for unified logging layer.
      
 
      - 
        Embulk An open source bulk data
        loader that helps data transfer between various databases, storages,
        file formats, and cloud services.
      
 
      - 
        Apache Sqoop A tool designed for
        efficiently transferring bulk data between Apache Hadoop and structured
        datastores such as relational databases.
      
 
      - 
        Heka Data
        Acquisition and Processing Made Easy. Deprecated.
      
 
      - 
        Gobblin
        Universal data ingestion framework for Hadoop from LinkedIn
      
 
      - 
        Nakadi Nakadi is an open source event
        messaging platform that provides a REST API on top of Kafka-like queues.
      
 
      - 
        Pravega Pravega provides a new
        storage abstraction - a stream - for continuous and unbounded data.
      
 
      - 
        Apache Pulsar Apache Pulsar is
        an open-source distributed pub-sub messaging system.
      
 
      - 
        AWS Data Wrangler
        Utility belt to handle data on AWS.
      
 
    
    File System
    
      - 
        HDFS
        
      
 
      - 
        AWS S3
        
          - 
            smart_open
            Utils for streaming large files (S3, HDFS, gzip, bz2)
          
 
        
       
      - 
        Alluxio Alluxio is a
        memory-centric distributed storage system enabling reliable data sharing
        at memory-speed across cluster frameworks, such as Spark and MapReduce
      
 
      - 
        Ceph Ceph is a unified, distributed
        storage system designed for excellent performance, reliability and
        scalability
      
 
      - 
        OrangeFS Orange File System is a
        branch of the Parallel Virtual File System
      
 
      - 
        SnackFS
        SnackFS is our bite-sized, lightweight HDFS compatible FileSystem built
        over Cassandra
      
 
      - 
        GlusterFS Gluster Filesystem
      
 
      - 
        XtreemFS Fault-tolerant
        distributed file system for all storage needs
      
 
      - 
        SeaweedFS
        SeaweedFS is a simple and highly scalable distributed file system.
        It has two objectives: to store billions of files and to serve the
        files fast. Instead of supporting full POSIX file system semantics,
        SeaweedFS chooses to implement only a key~file mapping. By analogy
        with “NoSQL”, you can call it “NoFS”.
      
 
      - 
        S3QL S3QL is a file system
        that stores all its data online using storage services like Google
        Storage, Amazon S3, or OpenStack.
      
 
      - 
        LizardFS LizardFS Software Defined
        Storage is a distributed, parallel, scalable, fault-tolerant,
        geo-redundant and highly available file system.
      
 
    
    Serialization format
    
      - 
        Apache Avro Apache Avro™ is a data
        serialization system
      
 
      - 
        Apache Parquet Apache Parquet
        is a columnar storage format available to any project in the Hadoop
        ecosystem, regardless of the choice of data processing framework, data
        model or programming language.
        
          - 
            Snappy A fast
            compressor/decompressor. Used with Parquet
          
 
          - 
            PigZ A parallel implementation
            of gzip for modern multi-processor, multi-core machines
          
 
        
       
      - 
        Apache ORC The smallest, fastest
        columnar storage for Hadoop workloads
      
 
      - 
        Apache Thrift The Apache Thrift
        software framework, for scalable cross-language services development
      
 
      - 
        ProtoBuf
        Protocol Buffers - Google’s data interchange format
      
 
      - 
        SequenceFile
        SequenceFile is a flat file consisting of binary key/value pairs. It is
        extensively used in MapReduce as input/output formats
      
 
      - 
        Kryo Kryo is a
        fast and efficient object graph serialization framework for Java
      
 
    
    Stream Processing
    
      - 
        Apache Beam Apache Beam is a
        unified programming model that implements both batch and streaming data
        processing jobs that run on many execution engines.
      
 
      - 
        Spark Streaming Spark
        Streaming makes it easy to build scalable fault-tolerant streaming
        applications.
      
 
      - 
        Apache Flink Apache Flink is a
        streaming dataflow engine that provides data distribution,
        communication, and fault tolerance for distributed computations over
        data streams.
      
 
      - 
        Apache Storm Apache Storm is a
        free and open source distributed realtime computation system
      
 
      - 
        Apache Samza Apache Samza is a
        distributed stream processing framework
      
 
      - 
        Apache NiFi is an easy to use,
        powerful, and reliable system to process and distribute data
      
 
      - 
        Apache Hudi Apache Hudi is an
        open source framework for managing storage for real-time processing;
        one of its most interesting features is upsert support
      
 
      - 
        VoltDB VoltDB is an ACID-compliant
        RDBMS that uses a
        shared-nothing architecture.
      
 
      - 
        PipelineDB The
        Streaming SQL Database
      
 
      - 
        Spring Cloud Dataflow
        Streaming and tasks execution between Spring Boot apps
      
 
      - 
        Bonobo Bonobo is a
        data-processing toolkit for Python 3.5+
      
 
      - 
        Robinhood’s Faust
        Forever scalable event processing & in-memory durable K/V store as a
        library with asyncio & static typing.
      
 
      - 
        HStreamDB The
        streaming database built for IoT data storage and real-time processing.
      
 
      - 
        Kuiper Lightweight IoT edge
        data analytics/streaming software implemented in Go that can run on
        all kinds of resource-constrained edge devices.
      
 
    
    Batch Processing
    
      - 
        Hadoop MapReduce
        Hadoop MapReduce is a software framework for easily writing applications
        which process vast amounts of data (multi-terabyte data-sets)
        in-parallel on large clusters (thousands of nodes) of commodity hardware
        in a reliable, fault-tolerant manner
      
 
      - 
        Spark
        
      
 
      - 
        AWS EMR A web service that
        makes it easy to quickly and cost-effectively process vast amounts of
        data.
      
 
      - 
        Tez An application framework which
        allows for a complex directed-acyclic-graph of tasks for processing
        data.
      
 
      - 
        Bistro is a
        light-weight engine for general-purpose data processing including both
        batch and stream analytics. It is based on a novel data model that
        represents data via functions and processes data via column
        operations, as opposed to having only set operations as in
        conventional approaches like MapReduce or SQL.
      
 
      - 
        Batch ML
        
          - 
            H2O Fast scalable machine learning
            API for smarter applications.
          
 
          - 
            Mahout An environment for
            quickly creating scalable performant machine learning applications.
          
 
          - 
            Spark MLlib
            Spark’s scalable machine learning library consisting of common
            learning algorithms and utilities, including classification,
            regression, clustering, collaborative filtering, dimensionality
            reduction, as well as underlying optimization primitives.
          
 
        
       
      - 
        Batch Graph
        
          - 
            GraphLab Create
            A machine learning platform that enables data scientists and app
            developers to easily create intelligent apps at scale.
          
 
          - 
            Giraph An iterative graph
            processing system built for high scalability.
          
 
          - 
            Spark GraphX Apache
            Spark’s API for graphs and graph-parallel computation.
          
 
        
       
      - 
        Batch SQL
        
          - 
            Presto
            A distributed SQL query engine designed to query large data sets
            distributed over one or more heterogeneous data sources.
          
 
          - 
            Hive Data warehouse software
            that facilitates querying and managing large datasets residing in
            distributed storage.
            
              - 
                Hivemall
                Scalable machine learning library for Hive/Hadoop.
              
 
              - 
                PyHive Python
                interface to Hive and Presto.
              
 
            
           
          - 
            Drill Schema-free SQL Query
            Engine for Hadoop, NoSQL and Cloud Storage.
          
 
        
       
    
    Charts and Dashboards
    
      - 
        Highcharts A charting library
        written in pure JavaScript, offering an easy way of adding interactive
        charts to your web site or web application.
      
 
      - 
        ZingChart Fast JavaScript
        charts for any data set.
      
 
      - 
        C3.js D3-based reusable chart library.
      
 
      - 
        D3.js A JavaScript library for
        manipulating documents based on data.
        
          - 
            D3Plus D3’s simpler, easier-to-use
            cousin. Mostly predefined templates that you can just plug data into.
          
 
        
       
      - 
        SmoothieCharts A JavaScript
        Charting Library for Streaming Data.
      
 
      - 
        PyXley Python helpers
        for building dashboards using Flask and React
      
 
      - 
        Plotly Flask, JS, and CSS
        boilerplate for interactive, web-based visualization apps in Python
      
 
      - 
        Apache Superset
        Apache Superset (incubating) is a modern, enterprise-ready business
        intelligence web application
      
 
      - 
        Redash Make Your Company Data Driven.
        Connect to any data source, easily visualize and share your data.
      
 
      - 
        Metabase Metabase is
        the easy, open source way for everyone in your company to ask questions
        and learn from data.
      
 
      - 
        PyQtGraph PyQtGraph is a
        pure-Python graphics and GUI library built on PyQt4 / PySide and NumPy.
        It is intended for use in mathematics / scientific / engineering
        applications.
      
 
    
    Workflow
    
      - 
        Luigi Luigi is a Python
        module that helps you build complex pipelines of batch jobs.
        
          - 
            CronQ An application
            cron-like system.
            Used
            with Luigi. Deprecated.
          
 
        
       
      - 
        Cascading Java based
        application development platform.
      
 
      - 
        Airflow Airflow is a
        system to programmatically author, schedule and monitor data pipelines.
      
 
      - 
        Azkaban Azkaban is a batch
        workflow job scheduler created at LinkedIn to run Hadoop jobs. Azkaban
        resolves the ordering through job dependencies and provides an easy to
        use web user interface to maintain and track your workflows.
      
 
      - 
        Oozie Oozie is a workflow
        scheduler system to manage Apache Hadoop jobs
      
 
      - 
        Pinball DAG-based
        workflow manager. Job flows are defined programmatically in Python.
        Supports output passing between jobs.
      
 
      - 
        Dagster Dagster is
        an open-source Python library for building data applications.
      
 
      - 
        Dataform is an open-source framework
        and web based IDE to manage datasets and their dependencies. SQLX
        extends your existing SQL warehouse dialect to add features that support
        dependency management, testing, documentation and more.
      
 
      - 
        Census is a reverse-ETL tool that
        lets you sync data from your cloud data warehouse to SaaS applications
        like Salesforce, Marketo, HubSpot, Zendesk, etc. No engineering favors
        required—just SQL.
      
 
      - 
        dbt is a command line tool that
        enables data analysts and engineers to transform data in their
        warehouses more effectively.
      
 
    
    Data Lake Management
    
      - 
        lakeFS lakeFS is an
        open source platform that delivers resilience and manageability to
        object-storage based data lakes.
      
 
    
    ELK Elastic Logstash Kibana
    
      - 
        docker-logstash
        A highly configurable Logstash (1.4.4) Docker image running
        Elasticsearch (1.7.0) and Kibana (3.1.2).
      
 
      - 
        elasticsearch-jdbc
        JDBC importer for Elasticsearch
      
 
      - 
        ZomboDB Postgres
        Extension that allows creating an index backed by Elasticsearch
      
 
    
    Docker
    
      - 
        Gockerize Package
        Go services into minimal Docker containers
      
 
      - 
        Flocker Easily manage
        Docker containers & their data
      
 
      - 
        Rancher RancherOS is a
        20 MB Linux distro that runs the entire OS as Docker containers
      
 
      - 
        Kontena Application Containers for
        Masses
      
 
      - 
        Weave Weaving Docker
        containers into applications
      
 
      - 
        Zodiac A
        lightweight tool for easy deployment and rollback of dockerized
        applications
      
 
      - 
        cAdvisor Analyzes
        resource usage and performance characteristics of running containers
      
 
      - 
        Micro S3 persistence
        Docker microservice for saving/restoring volume data to S3
      
 
      - 
        Rocker-compose
        Docker composition tool with idempotency features for deploying apps
        composed of multiple containers. Deprecated.
      
 
      - 
        Nomad Nomad is a
        cluster manager, designed for both long lived services and short lived
        batch processing workloads
      
 
      - 
        ImageLayers Visualize Docker
        images and the layers that compose them
      
 
    
    Datasets
    Realtime
    
      - 
        Twitter Realtime
        The Streaming APIs give developers low latency access to Twitter’s
        global stream of Tweet data.
      
 
      - 
        Eventsim Event data
        simulator. Generates a stream of pseudo-random events from a set of
        users, designed to simulate web traffic.
      
 
      - 
        Reddit
        Real-time data is available, including comments, submissions and links
        posted to Reddit
      
 
    
    Data Dumps
    
      - 
        GitHub Archive GitHub’s public
        timeline since 2011, updated every hour
      
 
      - 
        Common Crawl Open source
        repository of web crawl data
      
 
      - 
        Wikipedia
        Wikipedia’s complete copy of all wikis, in the form of wikitext source
        and metadata embedded in XML. A number of raw database tables in SQL
        form are also available.
      
 
    
    Monitoring
    Prometheus
    
      - 
        Prometheus.io An
        open-source service monitoring system and time series database
      
 
      - 
        HAProxy Exporter
        Simple server that scrapes HAProxy stats and exports them via HTTP for
        Prometheus consumption
      
 
    
    Community
    
    Forums
    
    Conferences
    
      - 
        Data Council Data Council
        is the first technical conference that bridges the gap between data
        scientists, data engineers and data analysts.
      
 
    
    Podcasts
    
    
      Cheers to
      The Data Engineering Ecosystem: An Interactive Map
    
    
      Inspired by the
      awesome list.
      Created by
      Insight Data Engineering
      fellows.
    
    License
    
      
    
    
      To the extent possible under law,
      Igor Barinov has waived all
      copyright and related or neighboring rights to this work.