Awesome Hadoop
      
    
    
      A curated list of amazingly awesome Hadoop and Hadoop ecosystem resources.
      Inspired by
      Awesome PHP,
      Awesome Python and
      Awesome Sysadmin
    
    
    Hadoop
    
      - 
        Apache Hadoop - Apache Hadoop
      
 
      - 
        Apache Hadoop Ozone - An
        Object Store for Apache Hadoop
      
 
      - 
        Apache Tez - A Framework for
        YARN-based, Data Processing Applications In Hadoop
      
 
      - 
        SpatialHadoop -
        SpatialHadoop is a MapReduce extension to Apache Hadoop designed
        specially to work with spatial data.
      
 
      - 
        GIS Tools for Hadoop
        - Big Data Spatial Analytics for the Hadoop Framework
      
 
      - 
        Elasticsearch Hadoop
        - Elasticsearch real-time search and analytics natively integrated with
        Hadoop. Supports Map/Reduce, Cascading, Apache Hive and Apache Pig.
      
 
      - 
        hadoopy - Python
        MapReduce library written in Cython.
      
 
      - 
        mrjob - mrjob is a Python
        2.5+ package that helps you write and run Hadoop Streaming jobs.
      
 
      - 
        pydoop - Pydoop is a
        package that provides a Python API for Hadoop.
      
 
      - 
        hdfs-du - HDFS-DU is an
        interactive visualization of the Hadoop distributed file system.
      
 
      - 
        White Elephant
        - Hadoop log aggregator and dashboard
      
 
      - 
        Genie - Genie provides
        REST-ful APIs to run Hadoop, Hive and Pig jobs, and to manage multiple
        Hadoop resources and perform job submissions across them.
      
 
      - 
        Apache Kylin - Apache
        Kylin is an open source Distributed Analytics Engine from eBay Inc. that
        provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop
        supporting extremely large datasets
      
 
      - 
        Crunch - Go-based toolkit
        for ETL and feature extraction on Hadoop
      
 
      - 
        Apache Ignite - Distributed
        in-memory platform
      
 
    
    YARN
    
      - 
        Apache Slider - Apache
        Slider is a project in incubation at the Apache Software Foundation with
        the goal of making it possible and easy to deploy existing applications
        onto a YARN cluster.
      
 
      - 
        Apache Twill - Apache
        Twill is an abstraction over Apache Hadoop® YARN that reduces the
        complexity of developing distributed applications, allowing developers
        to focus more on their application logic.
      
 
      - 
        mpich2-yarn -
        Running MPICH2 on Yarn
      
 
    
    NoSQL
    
      Next-generation databases, mostly addressing some of these points: being
        non-relational, distributed, open-source and horizontally scalable.
    
    
      - Apache HBase - Apache HBase
 
      - 
        Apache Phoenix - A SQL skin
        over HBase supporting secondary indices
      
 
      - 
        happybase - A
        developer-friendly Python library to interact with Apache HBase.
      
 
      - 
        Hannibal - Hannibal is a
        tool to help monitor and maintain HBase clusters that are configured for
        manual splitting.
      
 
      - 
        Haeinsa - Haeinsa is a
        linearly scalable multi-row, multi-table transaction library for HBase
      
 
      - 
        hindex - Secondary
        Index for HBase
      
 
      - 
        Apache Accumulo - The Apache
        Accumulo™ sorted, distributed key/value store is a robust, scalable,
        high performance data storage and retrieval system.
      
 
      - 
        OpenTSDB - The Scalable Time Series
        Database
      
 
      - Apache Cassandra
 
    
    SQL on Hadoop
    
      - 
        Apache Hive - The Apache Hive data
        warehouse software facilitates reading, writing, and managing large
        datasets residing in distributed storage using SQL
      
 
      - 
        Apache Phoenix - A SQL skin over
        HBase supporting secondary indices
      
 
      - 
        Apache HAWQ (incubating)
        - Apache HAWQ is a Hadoop native SQL query engine that combines the key
        technological advantages of an MPP database with the scalability and
        convenience of Hadoop
      
 
      - 
        Lingual - SQL
        interface for Cascading (MR/Tez job generator)
      
 
      - 
        Apache Impala - Apache Impala
        is an open source massively parallel processing (MPP) SQL query engine
        for data stored in a computer cluster running Apache Hadoop. Impala has
        been described as the open-source equivalent of Google F1, which
        inspired its development in 2012.
      
 
      - 
        Presto - Distributed SQL Query Engine
        for Big Data. Open sourced by Facebook.
      
 
      - 
        Apache Tajo - Data warehouse
        system for Apache Hadoop
      
 
      - 
        Apache Drill - Schema-free SQL
        Query Engine
      
 
      - Apache Trafodion
 
    
    Data Management
    
      - 
        Apache Calcite - A Dynamic Data
        Management Framework
      
 
      - 
        Apache Atlas - Metadata
        tagging & lineage capture supporting complex business data taxonomies
      
 
      - 
        Apache Kudu - Kudu provides a
        combination of fast inserts/updates and efficient columnar scans to
        enable multiple real-time analytic workloads across a single storage
        layer, complementing HDFS and Apache HBase.
      
 
      - 
        Confluent Schema Registry for Kafka
        - Schema Registry provides a serving layer for your metadata. It
        provides a RESTful interface for storing and retrieving Avro schemas.
      
 
      - 
        Hortonworks Schema Registry
        - Schema Registry is a framework to build metadata repositories.
      
 
    
    
      Workflow, Lifecycle and Governance
    
    
      - Apache Oozie - Apache Oozie
 
      - Azkaban
 
      - 
        Apache Falcon - Data management
        and processing platform
      
 
      - 
        Apache NiFi - A dataflow system
      
 
      - 
        Apache Airflow
        - Airflow is a workflow automation and scheduling system that can be
        used to author and manage data pipelines
      
 
      - 
        Luigi - Python
        package that helps you build complex pipelines of batch jobs
      
 
    
    Data Ingestion and Integration
    
    DSL
    
      - Apache Pig - Apache Pig
 
      - 
        Apache DataFu - A
        collection of libraries for working with large-scale data in Hadoop
      
 
      - 
        varaha - Machine
        learning and natural language processing with Apache Pig
      
 
      - 
        packetpig - Open
        Source Big Data Security Analytics
      
 
      - 
        akela - Mozilla’s
        utility library for Hadoop, HBase, Pig, etc.
      
 
      - 
        seqpig - Simple and
        scalable scripting for large sequencing data sets (e.g., bioinformatics) in
        Hadoop
      
 
      - 
        Lipstick - Pig
        workflow visualization tool.
        Introducing Lipstick on A(pache) Pig
      
 
      - 
        PigPen - PigPen is
        map-reduce for Clojure, or distributed Clojure. It compiles to Apache
        Pig, but you don’t need to know much about Pig to use it.
      
 
    
    
    
    Realtime Data Processing
    
      - Apache Storm
 
      - Apache Samza
 
      - Apache Spark
 
      - 
        Apache Flink
        - Apache Flink is a platform for efficient, distributed, general-purpose
        data processing. It supports exactly-once stream processing.
      
 
      - 
        Apache Pulsar (incubating)
        - Apache Pulsar (incubating) is a highly scalable, low latency messaging
        platform running on commodity hardware. It provides simple pub-sub
        semantics over topics, guaranteed at-least-once delivery of messages,
        automatic cursor management for subscribers, and cross-datacenter
        replication.
      
 
      - 
        Apache Druid (incubating)
        - A high-performance, column-oriented, distributed data store.
      
 
    
    
      Distributed Computing and Programming
    
    
      - Apache Spark
 
      - 
        Spark Packages - A community
        index of packages for Apache Spark
      
 
      - 
        SparkHub - A community
        site for Apache Spark
      
 
      - Apache Crunch
 
      - 
        Cascading - Cascading is the
        proven application development platform for building data applications
        on Hadoop.
      
 
      - 
        Apache Flink - Apache Flink is a
        platform for efficient, distributed, general-purpose data processing.
      
 
      - 
        Apache Apex (incubating)
        - Enterprise-grade unified stream and batch processing engine.
      
 
      - 
        Apache Livy (incubating)
        - Apache Livy (incubating) is a web service that exposes a REST interface
        for managing long-running Apache Spark contexts in your cluster. With
        Livy, new applications can be built on top of Apache Spark that require
        fine-grained interaction with many Spark contexts.
      
 
    
    
      Packaging, Provisioning and Monitoring
    
    
      - 
        Apache Bigtop - Apache Bigtop:
        Packaging and tests of the Apache Hadoop ecosystem
      
 
      - 
        Apache Ambari - Apache Ambari
      
 
      - 
        Ganglia Monitoring System
      
 
      - 
        ankush - A
        big data cluster management tool that creates and manages clusters of
        different technologies.
      
 
      - 
        Apache ZooKeeper - Apache
        ZooKeeper
      
 
      - 
        Apache Curator - ZooKeeper
        client wrapper and rich ZooKeeper framework
      
 
      - 
        inviso - Inviso is a
        lightweight tool that provides the ability to search for Hadoop jobs,
        visualize the performance, and view cluster utilization.
      
 
    
    Search
    
      - Elasticsearch
 
      - 
        Apache Solr - Apache Solr
        is an open source search platform built upon a Java library called
        Lucene.
      
 
      - 
        Banana - Kibana port
        for Apache Solr
      
 
    
    Search Engine Framework
    
      - 
        Apache Nutch - Apache Nutch is a
        highly extensible and scalable open source web crawler software project.
      
 
    
    Security
    
      - 
        Apache Ranger - Ranger
        is a framework to enable, monitor and manage comprehensive data security
        across the Hadoop platform.
      
 
      - 
        Apache Sentry - An
        authorization module for Hadoop
      
 
      - 
        Apache Knox Gateway - A REST API
        Gateway for interacting with Hadoop clusters.
      
 
      - 
        Project Rhino
        - Intel’s open source effort to enhance the existing data protection
        capabilities of the Hadoop ecosystem to address security and compliance
        challenges, and contribute the code back to Apache.
      
 
    
    Benchmark
    
      - 
        Big Data Benchmark
      
 
      - HiBench
 
      - 
        Big-Bench
      
 
      - 
        YCSB - The Yahoo!
        Cloud Serving Benchmark (YCSB) is an open-source specification and
        program suite for evaluating retrieval and maintenance capabilities of
        computer programs. It is often used to compare relative performance of
        NoSQL database management systems.
      
 
    
    
      Machine learning and Big Data analytics
    
    
      - Apache Mahout
 
      - 
        Oryx 2 - Lambda
        architecture on Spark, Kafka for real-time large scale machine learning
      
 
      - 
        MLlib - MLlib is Apache
        Spark’s scalable machine learning library.
      
 
      - 
        R - R is a free software
        environment for statistical computing and graphics.
      
 
      - 
        RHadoop
        including RHDFS, RHBase, RMR2, plyrmr
      
 
      - Apache Lens
 
      - 
        Apache SINGA (incubating)
        - SINGA is a general distributed deep learning platform for training big
        deep learning models over large datasets
      
 
      - 
        BigDL - BigDL is a
        distributed deep learning library for Apache Spark; with BigDL, users
        can write their deep learning applications as standard Spark programs,
        which can directly run on top of existing Spark or Hadoop clusters.
      
 
      - 
        Apache Hivemall (incubating)
        - Apache Hivemall is a scalable machine learning library that runs on
        Apache Hive, Spark and Pig.
      
 
    
    Misc.
    
      - Hive Plugins
 
      - 
        UDF
        
          - https://github.com/edwardcapriolo/hive_cassandra_udfs
 
          - https://github.com/livingsocial/HiveSwarm
 
          - 
            https://github.com/ThinkBigAnalytics/Hive-Extensions-from-Think-Big-Analytics
          
 
          - https://github.com/twitter/elephant-bird - Twitter
 
          - https://github.com/lovelysystems/ls-hive
 
          - https://github.com/klout/brickhouse
 
        
       
      - 
        Storage Handler
        
          - https://github.com/dvasilen/Hive-Cassandra
 
          - https://github.com/yc-huang/Hive-mongo
 
          - https://github.com/balshor/gdata-storagehandler
 
          - https://bitbucket.org/rodrigopr/redisstoragehandler
 
          - https://github.com/chimpler/hive-solr
 
          - https://github.com/bfemiano/accumulo-hive-storage-manager
 
        
       
      - 
        Libraries and tools
        
          - https://github.com/forward3d/rbhive
 
          - https://github.com/synctree/activerecord-hive-adapter
 
          - https://github.com/hrp/sequel-hive-adapter
 
          - https://github.com/forward/node-hive
 
          - https://github.com/recruitcojp/WebHive
 
          - 
            shib - WebUI for
            query engines: Hive and Presto
          
 
          - 
            https://github.com/dmorel/Thrift-API-HiveClient2 (Perl -
            HiveServer2)
          
 
          - 
            PyHive - Python
            interface to Hive and Presto
          
 
          - https://github.com/recruitcojp/OdbcHive
 
          - 
            Hive-Sharp
          
 
          - 
            HiveRunner - An
            open source unit test framework for Hadoop Hive queries based on
            JUnit4
          
 
          - 
            Beetest - A super
            simple utility for testing Apache Hive scripts locally for non-Java
            developers.
          
 
          - 
            Hive_test -
            Unit test framework for Hive and hive-service
          
 
        
       
      - 
        Flume Plugins
        
      
 
    
    Resources
    Various resources, such as books, websites and articles.
    Websites
    Useful websites and articles
    
    Presentations
    
    Books
    
    Hadoop and Big Data Events
    
    Other Awesome Lists
    
      Other amazingly awesome lists can be found in the
      awesome-awesomeness
      and awesome lists.