Tuesday, July 30, 2013

Musings on the Pains of Distributed System Development - API and Protocol


Image source: cloudave


Recently I've run into plenty of difficulties and challenges while building a system, and I've accumulated a bellyful of grievances. I kept trying to write them down, but words failed me and I couldn't decide what angle to take. Still, since this is supposed to be a technical post, maybe I should skip the rambling and complaining and cut to the chase? ╮(╯▽╰)╭

[The final conclusion: ...go back and re-read J2EE Design Patterns... solve the right problem with the right tool...]

First, the API


Designing good APIs, and naming them well, has always been challenging, headache-inducing work, so it's no wonder the web is full of articles on how to design a good API.

In programming, any interface used for communication counts as an API: from function calls within a single language, to a program invoking the operating system via system calls or IPC (inter-process communication), all the way to RPC (Remote Procedure Call) between systems. All of these fall under the umbrella of APIs, but the design difficulty and the considerations involved differ enormously. Designing a good API is hard enough for an ordinary application; for a distributed system that must integrate across systems and across languages, there is far more to think about.

How do we exchange data? How do we communicate? - Protocol

Going by the Wikipedia definition of protocol: a protocol typically serves to let two different technologies/languages integrate, communicate, and exchange data with each other, whereas an API, in the narrow sense, usually refers to a communication interface within a single language (although the two worlds can still be bridged through specific conversion libraries).

For different technologies/languages to exchange information, they must first agree on a language-neutral interchange format; for example, SOAP uses XML. Then, for convenience (object-oriented languages usually want to work with objects), each language provides its own tooling to convert back and forth, i.e. to marshal and unmarshal.
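The marshal/unmarshal round trip just described can be sketched with Python's built-in json module (Python stands in here for whichever language-specific tooling is in play; the record and its fields are made up for illustration):

```python
import json

# A native in-memory object (a dict standing in for a domain object).
order = {"id": 42, "items": ["apple", "banana"], "paid": False}

# Marshal: native object -> language-neutral wire format (JSON text).
wire = json.dumps(order)

# Unmarshal: wire format -> native object again, possibly in a
# different process written in a different language.
restored = json.loads(wire)

assert restored == order
print(wire)  # {"id": 42, "items": ["apple", "banana"], "paid": false}
```

Any language with a JSON library can parse `wire`, which is exactly what makes the format language-neutral.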

So how are messages actually exchanged? Usually through one of the following:

  • RPC (Remote Procedure Call)
    • RMI
    • Web Service
    • proprietary socket command
  • Message Exchange
    • JMS
    • ESB
    • Message Queue
  • Database
  • File Transfer
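As a concrete taste of the first category, here is a minimal RPC sketch using Python's stdlib xmlrpc modules (XML-RPC is one simple member of the RPC family, built on the same request/response idea as SOAP-based Web Services; the function name and values are purely illustrative):

```python
import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

def add(a, b):
    return a + b

# Server side: expose a function over the wire. Port 0 lets the OS pick
# a free port, so this sketch does not clash with anything running locally.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(add, "add")
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: the remote call reads exactly like a local function call,
# which is the whole point (and the whole danger) of RPC.
proxy = ServerProxy(f"http://127.0.0.1:{port}")
result = proxy.add(2, 3)
print(result)  # 5
server.shutdown()
```

Under the hood the arguments are marshalled to XML, sent over HTTP, and unmarshalled on the other side, so the caller never sees the wire format.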

When integrating across systems and languages, one of the following approaches is usually taken:
• Use plain text with a standardized protocol and a structured format, such as XML or JSON
  • Pros: clear structure; any language can produce and parse it
  • Cons: bulky, and comparatively slow
• Use a binary format with custom data structures and a serialization scheme, e.g. Thrift, protobuf, Avro, etc.
  • Pros: compact and fast
  • Cons: a matching SDK must be provided for every language, and it is harder to get started with
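The size trade-off between the two approaches can be shown with Python's stdlib alone, using json for the text route and struct as a toy stand-in for a schema-based binary encoding like Thrift or protobuf (the record and the schema are invented for illustration):

```python
import json
import struct

record = {"user_id": 123456, "score": 98.5, "active": True}

# Text route: self-describing JSON. Readable anywhere, but every message
# carries the field names along with the values.
as_json = json.dumps(record).encode("utf-8")

# Binary route: a fixed schema agreed upon out of band
# (little-endian int32, float64, bool), as Thrift/protobuf messages do.
# Compact, but both sides must share the schema to decode it.
as_binary = struct.pack("<id?", record["user_id"], record["score"], record["active"])

print(len(as_json), len(as_binary))  # 50 13
```

Even on this tiny record the binary form is roughly a quarter of the size, at the cost of being opaque without the schema, which mirrors the pros/cons listed above.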

The second approach interests me, but I haven't had time to study it in depth yet; those interested can refer to the further reading below, and also to the slide deck Thrift vs Protocol Buffers vs Avro - Biased Comparison.


The next post will focus on the problems encountered while developing with plain-text JSON.

[Update] For those interested, some further reading:

[1] A fault tolerant, protocol-agnostic RPC system http://twitter.github.io/finagle
[2] Don't Use JSON And XML As Internal Transfer Formats
[3] Modern distributed messaging and RPC

Monday, July 29, 2013

Eclipse (Juno/Kepler) Maven, M2E and EGit Compatibility Fix


Eclipse's defining trait is its huge range of plugins (arguably the most successful use of the OSGi architecture), yet its most criticized flaw is, ironically, plugin dependency problems (what makes it also breaks it): with so many plugins evolving so quickly, version conflicts that leave plugins unusable are common. The problem tackled here comes from using Maven + EGit together; it first appeared in the Juno release and remains unfixed in Kepler, which is disappointing.

Here's how the problem arises:
When using Eclipse's Import feature, you can choose "Check out Maven Project from SCM".


The SCM type list is empty by default; you must install SCM connectors through the m2e Marketplace link at the lower right.




But starting with Juno, selecting either egit or subclipse produces this error:

Cannot complete the install because one or more required items could not be found.
Missing requirement: Maven SCM Handler for EGit




A quick search turned up the article in [1]: it turns out the connector version offered on the m2e marketplace is incompatible with EGit... =_=

So the fix is as follows:

1. Click Help in the Eclipse menu bar
2. Select Install New Software
    (remember to uncheck "Group items by category", otherwise the items to install won't be visible)
3. Click Add and enter the URL below (select the latest 0.14.0 build, 0.14.0.201305250025)


         http://repository.tesla.io:8081/nexus/content/sites/m2e.extras/m2eclipse-egit/0.14.0/N/0.14.0.201305250025/




4. Click OK and install

Maven + EGit now works properly.





Reference:
[1] Eclipse Juno, Maven, M2E and EGit Compatibility Problem and Solution

Saturday, July 27, 2013

Hadoop / Hadoop-like frameworks and ecosystem projects

Source: hadoopsphere

Technology advances by the day. While we're still busy getting familiar with Hadoop and its main ecosystem projects (e.g. HBase, Hive, Pig), other parts of the ecosystem are flourishing at tremendous speed: every so often a new project pops up, and on top of that, Hadoop alternatives and architectural improvements keep rolling out. It's impossible to keep up~ XD

But note that every technology was created to solve a particular class of problems. Before diving headlong into any of them, have we first understood and thought through where our own problems lie?

Below is a roundup of projects I've recently heard about, seen, or begun to study...

Real-Time Processing


Apache Tez
Tez provides a general-purpose, highly customizable framework that simplifies data-processing tasks across both small-scale (low-latency) and large-scale (high-throughput) workloads in Hadoop.

Apache Storm
Storm is a free and open source distributed realtime computation system.

Spark
Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

Yahoo storm-yarn
Storm-yarn enables Storm clusters to be deployed into machines managed by Hadoop YARN. It is still a work in progress.

Apache S4 (CEP)
S4 is a general-purpose, distributed, scalable, fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data.

Fast/Real-Time SQL-like Query

Apache Gora
The Apache Gora open source framework provides an in-memory data model and persistence for big data. Gora supports persisting to column stores, key value stores, document stores and RDBMSs, and analyzing the data with extensive Apache Hadoop MapReduce support.

Apache Drill
Apache Drill is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Drill is the open source version of Google's Dremel system, which is available as an IaaS service called Google BigQuery.

Berkeley Shark [https://github.com/amplab/shark/wiki]
Shark is a large-scale data warehouse system for Spark designed to be compatible with Apache Hive.

Cloudera Impala
Cloudera Impala is an open source Massively Parallel Processing (MPP) query engine that runs natively on Apache Hadoop.

Hortonworks Stinger
The Stinger Initiative is a collection of development threads in the Hive community that will deliver 100x performance improvements as well as SQL compatibility.

Baidu Hypertable
Hypertable delivers maximum efficiency and superior performance over the competition, which translates into major cost savings.

Cloudera Morphlines
Cloudera Morphlines is an open source framework that reduces the time and skills necessary to build or change Search indexing applications.

Apache Accumulo
The Apache Accumulo sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system. Apache Accumulo is based on Google's BigTable design and is built on top of Apache Hadoop, Zookeeper, and Thrift. Apache Accumulo features a few novel improvements on the BigTable design in the form of cell-based access control and a server-side programming mechanism that can modify key/value pairs at various points in the data management process.

Kiji
The Kiji Project is a modular, open-source framework that enables developers and analysts to collect, analyze and use data in real-time applications.

Graph Processing


Apache Giraph
Apache Giraph is an iterative graph processing system built for high scalability.

Apache Hama
Apache Hama is a pure BSP (Bulk Synchronous Parallel) computing framework on top of HDFS (Hadoop Distributed File System) for massive scientific computations such as matrix, graph and network algorithms.

Resource Manager


Apache Mesos [git://git.apache.org/mesos.git] (the counterpart of Hadoop's YARN)
Apache Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks. It can run Hadoop, MPI, Hypertable, Spark, and other applications on a dynamically shared pool of nodes.

Kitten
Kitten is a set of tools for writing and running applications on YARN, the general-purpose resource scheduling framework that ships with Hadoop 2.0.0.

Pub/Sub System (Message Transfer)


Facebook Wormhole (not open source)
Wormhole connects Facebook's user databases (UDBs) with other services so they can operate on the most current data. Here are three examples of systems that receive updates via Wormhole:

1. Caches - Refills and invalidation messages need to be sent to each cache so they stay in sync with their local database and consistent with each other.
2. Services - Services such as Graph Search that build specialized social indexes need to remain current and receive updates to the underlying data.
3. Data warehouse - The data warehouses and bulk processing systems (Hadoop, HBase, Hive, Oracle) receive streaming, real-time updates instead of relying on periodic database dumps.

Apache Kafka
Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.

Others


Apache Crunch
The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run.

Cloudera RecordBreaker
RecordBreaker is a project that automatically turns your text-formatted data (logs, sensor readings, etc.) into structured Avro data, without any need to write parsers or extractors. Its goal is to dramatically reduce the time spent preparing data for analysis, enabling more time for the analysis itself.

Azkaban
Azkaban is a batch workflow job scheduler created at LinkedIn to run their Hadoop jobs.

LinkedIn DataFu
DataFu is a collection of user-defined functions for working with large-scale data in Hadoop and Pig. This library was born out of the need for a stable, well-tested library of UDFs for data mining and statistics.

It contains functions for:

    PageRank
    Statistics (e.g. quantiles, median, variance, etc.)
    Sampling (e.g. weighted, reservoir, etc.)
    Sessionization
    Convenience bag functions (e.g. enumerating items)
    Convenience utility functions (e.g. assertions, easier writing of EvalFuncs)
    Set operations (intersect, union)
    and more...


Looking at all these projects, are you feeling excited, or are your knees going weak? XD

(Google: keep chasing my taillights~~)


Reference:
[1] A real-time bonanza: Facebook's Wormhole and Yahoo's streaming Hadoop
[2] Storm-YARN Released as Open Source
[3] 5 reasons why the future of Hadoop is real-time (relatively speaking)