Awesome Spark
      
    
    
      A curated list of awesome
      Apache Spark packages and
      resources.
    
    
      Apache Spark is an open-source cluster-computing framework. Originally
        developed at the
        University of California, Berkeley’s AMPLab, the
        Spark codebase was later donated to the
        Apache Software Foundation, which
        has maintained it since. Spark provides an interface for programming
        entire clusters with implicit data parallelism and fault-tolerance
      (Wikipedia 2017).
    
    
      Users of Apache Spark may choose between the Python, R, Scala,
      and Java programming languages to interface with the Apache Spark APIs.
    
    Contents
    
    Packages
    Language Bindings
    
    Notebooks and IDEs
    
      - 
        almond
        
        - A Scala kernel for Jupyter.
       
      - 
        Apache Zeppelin
        
        - Web-based notebook that enables interactive data analytics with
        pluggable backends, integrated plotting, and extensive Spark support
        out-of-the-box.
       
      - 
        Polynote
        
        - An IDE-inspired polyglot notebook. It supports mixing
        multiple languages in one notebook, and sharing data between them
        seamlessly. It encourages reproducible notebooks with its immutable data
        model. Originating from
        Netflix.
       
      - 
        Spark Notebook
        
        - Scalable and stable Scala- and Spark-focused notebook bridging the gap
        between the JVM and data scientists (incl. extendable, typesafe and reactive
        charts).
       
      - 
        sparkmagic
        
        - Jupyter magics and kernels for
        interactively working with
        remote Spark clusters through
        Livy, in Jupyter
        notebooks.
       
    
    General Purpose Libraries
    
      - 
        Succinct
        
        - Support for efficient queries on compressed data.
       
      - 
        itachi
        
        - A library that brings useful functions from modern database management
        systems to Apache Spark.
       
      - 
        spark-daria
        
        - A Scala library with essential Spark functions and extensions to make
        you more productive.
       
      - 
        quinn
        
        - A native PySpark implementation of spark-daria.
       
      - 
        Apache DataFu
        
        - A library of general purpose functions and UDFs.
       
    
    SQL Data Sources
    
      SparkSQL has
      several built-in Data Sources
      for files. These include csv, json,
      parquet, orc, and avro. It also
      supports JDBC databases as well as Apache Hive. Additional data sources
      can be added by including the packages listed below, or writing your own.
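
      As a rough illustration of the built-in file sources listed above, the
      sketch below wraps the standard DataFrameReader chain
      (spark.read.format(...).option(...).load(...)) in a small helper. The
      helper name load_table, the constant BUILT_IN_FILE_FORMATS, and any file
      paths are assumptions made for this example only; spark is expected to
      be an existing SparkSession.

```python
from typing import Any, Mapping, Optional

# File formats the text above lists as built-in Spark SQL data sources.
BUILT_IN_FILE_FORMATS = ("csv", "json", "parquet", "orc", "avro")


def load_table(
    spark: Any,
    fmt: str,
    path: str,
    options: Optional[Mapping[str, str]] = None,
) -> Any:
    """Load `path` as a DataFrame using one of the built-in file formats.

    `spark` is assumed to be a pyspark.sql.SparkSession. `load_table` is a
    hypothetical helper written for this illustration, not a Spark API.
    """
    if fmt not in BUILT_IN_FILE_FORMATS:
        raise ValueError(f"not a built-in file format: {fmt!r}")
    # DataFrameReader calls return the reader itself, so they can be chained.
    reader = spark.read.format(fmt)
    for key, value in (options or {}).items():
        reader = reader.option(key, value)
    return reader.load(path)
```

      With a running session, a call might look like
      load_table(spark, "csv", "data/people.csv", {"header": "true"}), where
      the path is a placeholder. Note that the avro source is distributed as
      the separate spark-avro module, so it usually has to be added to the
      application before that format name resolves.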
    
    
    Storage
    
      - 
        Delta Lake
        
        - Storage layer with ACID transactions.
       
    
    Bioinformatics
    
      - 
        ADAM
        
        - Set of tools designed to analyse genomics data.
       
      - 
        Hail
        
        - Genetic analysis framework.
       
    
    GIS
    
      - 
        Magellan
        
        - Geospatial analytics using Spark.
       
      - 
        GeoSpark
        
        - Cluster computing system for processing large-scale spatial data.
       
    
    Time Series Analytics
    
      - 
        Spark-Timeseries
        
        - Scala / Java / Python library for interacting with time series data on
        Apache Spark.
       
      - 
        flint
        
        - A time series library for Apache Spark.
       
    
    Graph Processing
    
      - 
        Mazerunner
        
        - Graph analytics platform on top of Neo4j and GraphX.
       
      - 
        GraphFrames
        
        - DataFrame-based graph API.
       
      - 
        neo4j-spark-connector
        
        - Bolt protocol based, Neo4j Connector with RDD, DataFrame and GraphX /
        GraphFrames support.
       
      - 
        SparklingGraph
        
        - Library extending GraphX features with multiple functionalities useful
        in graph analytics (measures, generators, link prediction etc.).
       
    
    Machine Learning Extension
    
    Middleware
    
      - 
        Livy
        
        - REST server with extensive language support (Python, R, Scala),
        ability to maintain interactive sessions and object sharing.
       
      - 
        spark-jobserver
        
        - Simple Spark-as-a-Service that supports object sharing using
        so-called named objects. JVM only.
       
      - 
        Mist
        
        - Service for exposing Spark analytical jobs and machine learning models
        as realtime, batch or reactive web services.
       
      - 
        Apache Toree
        
        - IPython protocol based middleware for interactive applications.
       
      - 
        Kyuubi
        
        - Improved implementation of Thrift JDBC/ODBC Server.
       
    
    Monitoring
    
    Utilities
    
      - 
        silex
        
        - Collection of tools varying from ML extensions to additional RDD
        methods.
       
      - 
        sparkly
        
        - Helpers & syntactic sugar for PySpark.
       
      - 
        pyspark-stubs
        
        - Static type annotations for PySpark (obsolete since Spark 3.1; see
        SPARK-32681).
       
      - 
        Flintrock
        
        - A command-line tool for launching Spark clusters on EC2.
       
      - 
        Optimus
        
        - Data Cleansing and Exploration utilities with the goal of simplifying
        data cleaning.
       
    
    Natural Language Processing
    
    Streaming
    
      - 
        Apache Bahir
        
        - Collection of the streaming connectors excluded from Spark 2.0 (Akka,
        MQTT, Twitter, ZeroMQ).
       
    
    Interfaces
    
      - 
        Apache Beam
        
        - Unified data processing engine supporting both batch and streaming
        applications. Apache Spark is one of the supported execution
        environments.
       
      - 
        Blaze
        
        - Interface for querying larger-than-memory datasets using Pandas-like
        syntax. It supports both Spark DataFrames and
        RDDs.
       
      - 
        Koalas
        
        - Pandas DataFrame API on top of Apache Spark.
       
    
    Testing
    
      - 
        deequ
        
        - Deequ is a library built on top of Apache Spark for defining “unit
        tests for data”, which measure data quality in large datasets.
       
      - 
        spark-testing-base
        
        - Collection of base test classes.
       
      - 
        spark-fast-tests
        
        - A lightweight and fast testing framework.
       
    
    Web Archives
    
    Workflow Management
    
    Resources
    Books
    
    Papers
    
    MOOCS
    
    Workshops
    
    Projects Using Spark
    
      - 
        Oryx 2 -
        Lambda architecture
        platform built on Apache Spark and
        Apache Kafka with specialization
        for real-time large scale machine learning.
      
 
      - 
        Photon ML - A
        machine learning library supporting classical Generalized Mixed Model
        and Generalized Additive Mixed Effect Model.
      
 
      - 
        PredictionIO - Machine Learning
        server for developers and data scientists to build and deploy predictive
        applications in a fraction of the time.
      
 
      - 
        Crossdata - Data
        integration platform with extended DataSource API and multi-user
        environment.
      
 
    
    Blogs
    
      - 
        Spark Technology Center - Great
        source of highly diverse posts related to the Spark ecosystem, from
        practical advice to Spark committer profiles.
      
 
    
    Docker Images
    
    Miscellaneous
    
    References
    
      Wikipedia. 2017. “Apache Spark — Wikipedia, the Free Encyclopedia.”
      https://en.wikipedia.org/w/index.php?title=Apache_Spark&oldid=781182753.
    
    License
    
      
        
      
      
      This work (Awesome Spark, by
      https://github.com/awesome-spark/awesome-spark), identified by
      Maciej Szymkiewicz, is free of known copyright restrictions.
    
    
      Apache Spark, Spark, Apache, and the Spark logo are
      trademarks of
      The Apache Software Foundation. This
      compilation is not endorsed by The Apache Software Foundation.
    
    
      Inspired by
      sindresorhus/awesome.