Saturday, July 27, 2013

Hadoop / Hadoop-like frameworks and ecosystem projects

Source: hadoopsphere

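Technology changes by the day. While we are still working to get familiar with Hadoop and its main ecosystem projects (such as HBase, Hive, and Pig), the rest of the ecosystem is growing at a very fast pace. Every so often a new project appears, and Hadoop alternatives and architectural improvements keep emerging as well. It really is hard to keep up~ XD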

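Keep in mind, though, that every technology is created to solve a particular class of problem. Before diving headfirst into these projects, have we understood and thought through what our own problem actually is?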

Below is a list of projects I have recently heard about, read about, or am about to start looking into.

Real-Time Processing


Tez provides a general-purpose, highly customizable framework that simplifies data-processing tasks across both small-scale (low-latency) and large-scale (high-throughput) workloads in Hadoop.
Storm is a free and open source distributed realtime computation system.

Spark
Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

Yahoo storm-yarn
Storm-yarn enables Storm clusters to be deployed into machines managed by Hadoop YARN. It is still a work in progress.

Apache S4  (CEP)
S4 is a general-purpose, distributed, scalable, fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data

Fast/Real Time SQL-like Query

Apache Gora
The Apache Gora open source framework provides an in-memory data model and persistence for big data. Gora supports persisting to column stores, key value stores, document stores and RDBMSs, and analyzing the data with extensive Apache Hadoop™ MapReduce support.


Apache Drill is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Drill is an open source version of Google's Dremel system, which is available as an infrastructure service called Google BigQuery.

Berkeley Shark  [https://github.com/amplab/shark/wiki]
Shark is a large-scale data warehouse system for Spark designed to be compatible with Apache Hive
Cloudera Impala 
Cloudera Impala is an open source Massively Parallel Processing (MPP) query engine that runs natively on Apache Hadoop
Hortonworks Stinger
The Stinger Initiative is a collection of development threads in the Hive community that aims to deliver 100x performance improvements as well as SQL compatibility.
Baidu Hypertable
Hypertable is an open source, scalable database modeled on Google's Bigtable design. The project claims higher efficiency and performance than competing systems, which it says translates into major cost savings.

Cloudera Morphlines 
Cloudera Morphlines is an open source framework that reduces the time and skills necessary to build or change Search indexing applications
Apache Accumulo
The Apache Accumulo™ sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system. Apache Accumulo is based on Google's BigTable design and is built on top of Apache Hadoop, Zookeeper, and Thrift. Apache Accumulo features a few novel improvements on the BigTable design in the form of cell-based access control and a server-side programming mechanism that can modify key/value pairs at various points in the data management process
The Kiji Project is a modular, open-source framework that enables developers and analysts to collect, analyze and use data in real-time applications.

Graph processing


Apache Giraph
Apache Giraph is an iterative graph processing system built for high scalability.

Apache Hama is a pure BSP (Bulk Synchronous Parallel) computing framework on top of HDFS (Hadoop Distributed File System) for massive scientific computations such as matrix, graph and network algorithms.
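
To make the BSP idea a bit more concrete, here is a minimal single-process sketch in plain Python. It is not Hama's or Giraph's API; the function name `bsp_max_propagation` and the toy graph are made up for illustration. Each loop iteration plays the role of one superstep: every vertex computes on the messages it received, sends new messages, and then all vertices wait at a barrier before the next superstep starts. The example spreads the largest value through an undirected graph.

```python
def bsp_max_propagation(graph, values):
    """Propagate the maximum value through a graph in BSP-style supersteps.

    graph:  dict mapping vertex -> list of neighbour vertices (hypothetical toy input)
    values: dict mapping vertex -> initial integer value
    Returns (final values, number of supersteps executed).
    """
    values = dict(values)  # work on a copy, leave the caller's dict alone

    # Superstep 0: every vertex sends its own value to all neighbours.
    inbox = {v: [] for v in graph}
    for vertex, neighbours in graph.items():
        for n in neighbours:
            inbox[n].append(values[vertex])
    supersteps = 1

    # Keep running supersteps while any message is still in flight.
    while any(inbox.values()):
        outbox = {v: [] for v in graph}
        for vertex in graph:
            if not inbox[vertex]:
                continue  # no messages: this vertex stays idle in this superstep
            best = max(inbox[vertex])
            if best > values[vertex]:
                # Local computation changed the state, so notify the neighbours.
                values[vertex] = best
                for n in graph[vertex]:
                    outbox[n].append(best)
        # Barrier: messages sent in this superstep become visible only in the next.
        inbox = outbox
        supersteps += 1

    return values, supersteps


if __name__ == "__main__":
    toy_graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
    final, steps = bsp_max_propagation(toy_graph, {"a": 3, "b": 6, "c": 2})
    print(final, steps)
```

A real BSP framework runs the per-vertex work in parallel on many machines and handles the barrier and message delivery over the network; the loop above only mimics that structure.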

Resource Manager


Apache Mesos  [git://git.apache.org/mesos.git]  (the counterpart of Hadoop's YARN)
Apache Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks. It can run Hadoop, MPI, Hypertable, Spark, and other applications on a dynamically shared pool of nodes

Kitten is a set of tools for writing and running applications on YARN, the general-purpose resource scheduling framework that ships with Hadoop 2.0.0

Pub/sub system (Message Transfer)


Facebook Wormhole (not open source)
Wormhole connects Facebook's user databases (UDBs) with other services so they can operate on the most current data. Here are three examples of systems that receive updates via Wormhole:

1. Caches - Refills and invalidation messages need to be sent to each cache so they stay in sync with their local database and consistent with each other.
2. Services - Services such as Graph Search that build specialized social indexes need to remain current and receive updates to the underlying data.
3. Data warehouse - Data warehouses and bulk processing systems (Hadoop, HBase, Hive, Oracle) receive streaming, real-time updates instead of relying on periodic database dumps.

Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.
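
The "commit log" phrasing is easier to picture with a toy model. The sketch below is plain Python and is not Kafka's client API; the class `CommitLog` and its methods `publish`, `poll`, and `seek` are hypothetical names chosen for illustration. The point it tries to show is that messages are appended to an ordered log and never removed on read, while each consumer only tracks its own offset, so several consumers can read the same data independently and even rewind.

```python
class CommitLog:
    """Toy in-memory append-only log with per-consumer offsets (illustration only)."""

    def __init__(self):
        self._records = []   # the ordered, append-only log
        self._offsets = {}   # consumer name -> next offset to read

    def publish(self, message):
        """Append a message and return the offset it was stored at."""
        self._records.append(message)
        return len(self._records) - 1

    def poll(self, consumer, max_records=10):
        """Return up to max_records unread messages for this consumer."""
        start = self._offsets.get(consumer, 0)
        batch = self._records[start:start + max_records]
        # Reading does not delete anything; it only advances this consumer's offset.
        self._offsets[consumer] = start + len(batch)
        return batch

    def seek(self, consumer, offset):
        """Move a consumer's position, e.g. to replay older messages."""
        self._offsets[consumer] = offset


if __name__ == "__main__":
    log = CommitLog()
    for event in ["login", "click", "logout"]:
        log.publish(event)
    print(log.poll("cache-refresher"))      # reads everything published so far
    print(log.poll("warehouse-loader", 2))  # an independent consumer, own offset
```

A real system adds partitioning, replication, and durable storage on top of this idea; none of that is modeled here.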

Others


The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run.

Cloudera RecordBreaker
RecordBreaker is a project that automatically turns your text-formatted data (logs, sensor readings, etc) into structured Avro data, without any need to write parsers or extractors. Its goal is to dramatically reduce the time spent preparing data for analysis, enabling more time for the analysis itself.

Azkaban is a batch workflow job scheduler created at LinkedIn to run their Hadoop Jobs.

LinkedIn DataFu
DataFu is a collection of user-defined functions for working with large-scale data in Hadoop and Pig. This library was born out of the need for a stable, well-tested library of UDFs for data mining and statistics

It contains functions for:

    PageRank
    Statistics (e.g. quantiles, median, variance, etc.)
    Sampling (e.g. weighted, reservoir, etc.)
    Sessionization
    Convenience bag functions (e.g. enumerating items)
    Convenience utility functions (e.g. assertions, easier writing of EvalFuncs)
    Set operations (intersect, union)
    and more...
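
As a rough idea of what one of these functions does, here is a small reservoir sampling sketch in plain Python. It is not DataFu's code (DataFu ships its functions as Pig UDFs); the function name `reservoir_sample` and its parameters are made up for illustration. The technique picks k items uniformly at random from a stream of unknown length in a single pass, keeping only k items in memory.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return k items chosen uniformly at random from an iterable of unknown length.

    rng is an optional random.Random instance (an assumption of this sketch,
    useful for reproducible runs).
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    sample = reservoir_sample(range(1_000_000), 5, random.Random(42))
    print(sample)
```

If the stream has fewer than k items, the function simply returns all of them.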


Seeing this many projects, are you feeling excited or weak in the knees? XD

(Google says: you can all just keep chasing my taillights~~)


Reference:
[1] A real-time bonanza: Facebook’s Wormhole and Yahoo’s streaming Hadoop
[2] Storm-YARN Released as Open Source
[3] 5 reasons why the future of Hadoop is real-time (relatively speaking)
