Technology moves fast. While we are still getting up to speed with Hadoop and its main ecosystem projects (HBase, Hive, Pig, and so on), other related ecosystem projects keep growing at a remarkable pace: every so often a new project appears, and on top of that, Hadoop alternatives and architectural improvements keep rolling out one after another. It's almost impossible to keep up~ XD
But keep in mind that every technology exists to solve a particular class of problems. Before diving headfirst into studying all of these, have we figured out and thought clearly about what our own problem actually is?
Below is a list of projects I have recently heard about, seen, or started to look into....
Real-Time Processing
Tez
Provides a general-purpose, highly customizable framework that simplifies data-processing tasks across both small-scale (low-latency) and large-scale (high-throughput) workloads in Hadoop.
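Tez's core idea is expressing a job as a DAG of processing vertices connected by data edges, rather than chains of separate MapReduce jobs. A minimal sketch of that idea in plain Python (the `run_dag` function and the sample vertices are illustrative, not Tez's API):

```python
# Toy DAG executor: vertices are processing steps, edges carry data
# between them, and vertices run in topological order.
from collections import deque

def run_dag(vertices, edges, inputs):
    """Execute each vertex after all of its upstream vertices,
    feeding it their outputs (or an external input for sources)."""
    indegree = {v: 0 for v in vertices}
    downstream = {v: [] for v in vertices}
    for src, dst in edges:
        indegree[dst] += 1
        downstream[src].append(dst)
    ready = deque(v for v, d in indegree.items() if d == 0)
    results = {}
    while ready:
        v = ready.popleft()
        upstream = [results[s] for s, d in edges if d == v]
        results[v] = vertices[v](upstream or [inputs.get(v)])
        for d in downstream[v]:
            indegree[d] -= 1
            if indegree[d] == 0:
                ready.append(d)
    return results

# A three-vertex pipeline: tokenize -> count -> pick the top word.
dag = {
    "tokenize": lambda ins: ins[0].lower().split(),
    "count": lambda ins: {w: ins[0].count(w) for w in set(ins[0])},
    "top": lambda ins: max(ins[0], key=ins[0].get),
}
edges = [("tokenize", "count"), ("count", "top")]
out = run_dag(dag, edges, {"tokenize": "to be or not to be"})
```

The point is that intermediate results flow directly between vertices instead of being materialized to disk between separate jobs, which is where Tez's low-latency gains come from.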
Spark
Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.
Yahoo storm-yarn
Storm-yarn enables Storm clusters to be deployed into machines managed by Hadoop YARN. It is still a work in progress.
Apache S4 (CEP)
S4 is a general-purpose, distributed, scalable, fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data
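The "pluggable" part of S4 refers to small keyed, stateful operators (processing elements) that events flow through. A toy sketch of that pattern, assuming nothing about S4's actual API (the `CountPE` class and event shape are made up for illustration):

```python
# A per-key counting operator, standing in for an S4 processing
# element: it consumes keyed events one at a time and keeps state.
class CountPE:
    def __init__(self):
        self.counts = {}

    def process(self, event):
        key = event["key"]
        self.counts[key] = self.counts.get(key, 0) + 1

def stream():
    # In reality this would be an unbounded stream of keyed events.
    for k in ["a", "b", "a", "c", "a"]:
        yield {"key": k}

pe = CountPE()
for ev in stream():
    pe.process(ev)
```

In a real S4 deployment, events for different keys would be routed to processing-element instances spread across the cluster.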
Fast / Real-Time SQL-like Queries
Apache Gora
The Apache Gora open source framework provides an in-memory data model and persistence for big data. Gora supports persisting to column stores, key value stores, document stores and RDBMSs, and analyzing the data with extensive Apache Hadoop MapReduce support.
Apache Drill
Drill is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Drill is the open-source version of Google's Dremel system, which is available as an IaaS service called Google BigQuery.
Berkeley Shark [https://github.com/amplab/shark/wiki]
Shark is a large-scale data warehouse system for Spark designed to be compatible with Apache Hive
Cloudera Impala
Cloudera Impala is an open source Massively Parallel Processing (MPP) query engine that runs natively on Apache Hadoop.
Stinger Initiative
The Stinger Initiative is a collection of development threads in the Hive community that will deliver 100x performance improvements as well as SQL compatibility.
Hypertable
Hypertable delivers maximum efficiency and superior performance over the competition, which translates into major cost savings.
Cloudera Morphlines
Cloudera Morphlines is an open source framework that reduces the time and skills necessary to build or change Search indexing applications.
Apache Accumulo
The Apache Accumulo™ sorted, distributed key/value store is a robust, scalable, high-performance data storage and retrieval system. Apache Accumulo is based on Google's BigTable design and is built on top of Apache Hadoop, ZooKeeper, and Thrift. Apache Accumulo features a few novel improvements on the BigTable design in the form of cell-based access control and a server-side programming mechanism that can modify key/value pairs at various points in the data management process.
Kiji Project
The Kiji Project is a modular, open-source framework that enables developers and analysts to collect, analyze and use data in real-time applications.
Graph processing
Apache Giraph
Apache Giraph is an iterative graph processing system built for high scalability.
Apache Hama
Apache Hama is a pure BSP (Bulk Synchronous Parallel) computing framework on top of HDFS (Hadoop Distributed File System) for massive scientific computations such as matrix, graph and network algorithms.
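The BSP model that both Giraph and Hama build on runs in supersteps: every vertex sends messages, a global barrier is reached, then every vertex updates from the messages it received. A minimal single-process sketch using PageRank (the graph and function names here are illustrative, not either project's API):

```python
# BSP-style PageRank: per superstep, each vertex "sends" rank/outdegree
# to its neighbours, then all vertices update at once after the barrier.
def pagerank_bsp(graph, supersteps=20, d=0.85):
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    for _ in range(supersteps):
        # Send phase: accumulate incoming messages per vertex.
        inbox = {v: 0.0 for v in graph}
        for v, out in graph.items():
            for u in out:
                inbox[u] += rank[v] / len(out)
        # Barrier, then compute phase: every rank updates together.
        rank = {v: (1 - d) / n + d * inbox[v] for v in graph}
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank_bsp(graph)
```

In Giraph or Hama the send/compute phases are distributed across workers, but the superstep-and-barrier structure is the same.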
Resource Manager
Apache Mesos
Apache Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks. It can run Hadoop, MPI, Hypertable, Spark, and other applications on a dynamically shared pool of nodes.
Kitten
Kitten is a set of tools for writing and running applications on YARN, the general-purpose resource scheduling framework that ships with Hadoop 2.0.0.
Pub/sub system (Message Transfer)
Wormhole
Wormhole's purpose is to connect our user databases (UDBs) with other services so they can operate based on the most current data. Here are three examples of systems that receive updates via Wormhole:
1. Caches - Refills and invalidation messages need to be sent to each cache so they stay in sync with their local database and consistent with each other.
2. Services - Services such as Graph Search that build specialized social indexes need to remain current and receive updates to the underlying data.
3. Data warehouses - The data warehouses and bulk processing systems (Hadoop, HBase, Hive, Oracle) receive streaming, real-time updates instead of relying on periodic database dumps.
Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.
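"Pub/sub as a commit log" means producers append to an ordered log while each consumer tracks its own read offset and replays independently. A toy model of that idea (the `CommitLog` class is made up for illustration; it is not Kafka's API and ignores partitions and persistence):

```python
# Minimal commit-log model: an append-only list of messages plus a
# per-consumer offset, so consumers read at their own pace.
class CommitLog:
    def __init__(self):
        self.entries = []      # the append-only log
        self.offsets = {}      # per-consumer read positions

    def append(self, message):
        self.entries.append(message)
        return len(self.entries) - 1   # offset of the new entry

    def poll(self, consumer):
        """Return this consumer's unread messages and advance its offset."""
        start = self.offsets.get(consumer, 0)
        batch = self.entries[start:]
        self.offsets[consumer] = len(self.entries)
        return batch

log = CommitLog()
log.append("user signed up")
log.append("user placed order")
first = log.poll("cache-invalidator")   # both messages so far
log.append("user paid")
second = log.poll("cache-invalidator")  # only the new message
```

Because the log itself never mutates, a late-joining consumer (say, a data warehouse loader) can start at offset 0 and replay everything.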
Others
Apache Crunch
The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run.
Cloudera RecordBreaker
RecordBreaker is a project that automatically turns your text-formatted data (logs, sensor readings, etc) into structured Avro data, without any need to write parsers or extractors. Its goal is to dramatically reduce the time spent preparing data for analysis, enabling more time for the analysis itself.
Azkaban is a batch workflow job scheduler created at LinkedIn to run their Hadoop Jobs.
LinkedIn DataFu
DataFu is a collection of user-defined functions for working with large-scale data in Hadoop and Pig. This library was born out of the need for a stable, well-tested library of UDFs for data mining and statistics.
It contains functions for:
PageRank
Statistics (e.g. quantiles, median, variance, etc.)
Sampling (e.g. weighted, reservoir, etc.)
Sessionization
Convenience bag functions (e.g. enumerating items)
Convenience utility functions (e.g. assertions, easier writing of EvalFuncs)
Set operations (intersect, union)
and more...
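Two of the techniques DataFu packages as Pig UDFs, sketched in plain Python to show the underlying logic (these are standard algorithms, not DataFu's actual interfaces):

```python
# Reservoir sampling (Algorithm R) and gap-based sessionization,
# as plain functions; names and signatures here are illustrative.
import random

def reservoir_sample(stream, k, seed=42):
    """Keep a uniform random sample of k items from a stream
    whose total length is not known in advance."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)   # replace with decreasing probability
            if j < k:
                sample[j] = item
    return sample

def sessionize(timestamps, gap=30):
    """Split sorted event timestamps into sessions wherever the
    idle time between consecutive events exceeds `gap`."""
    sessions, current = [], [timestamps[0]]
    for t in timestamps[1:]:
        if t - current[-1] > gap:
            sessions.append(current)
            current = []
        current.append(t)
    sessions.append(current)
    return sessions

picked = reservoir_sample(range(1000), 5)
sessions = sessionize([0, 10, 20, 100, 110, 300])
```

Both are single-pass, which is why they suit Pig's streaming execution over large datasets.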
Seeing this many projects, do you feel excited or do your knees go weak? XD
(Google: go ahead, keep chasing my tail lights~~)