Apache Kafka: Yet Another Messaging System!?

Image source: the web

At ApacheCon 2013 I saw an introduction to the Apache Kafka project and thought: yet another messaging system!?

Intra-cluster Replication in Apache Kafka

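Curious, I went and collected a fair amount of material, starting with a slide deck on SlideShare that introduces Apache Kafka. According to it, Kafka's biggest selling point is that it combines two roles in one system: offline log processing and real-time messaging.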

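You may remember that in my previous post I mentioned several distributed log-aggregation systems (such as Scribe and Flume). They are all built on a push-driven architecture. They offer high performance and scale well, but they have the following drawbacks:

The expected end points are large clusters (e.g. Hadoop)
The end points cannot carry much real-time business logic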

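This is because their main job is to absorb, as quickly as possible, the logs that clients push to them. Traditional messaging systems (such as RabbitMQ and ActiveMQ), on the other hand, are said not to scale well enough:
(I do not fully understand this point yet and need to look into it further...)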

No API for batching, transactional (broker retains consumers' stream position)
No Message persistence means multiple consumers over time are impossible limiting architecture
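To make the two points above more concrete, here is a minimal sketch of the opposite design: an append-only log that keeps messages after they are read, accepts batches, and leaves each consumer responsible for its own read position. The names `MessageLog` and `LogConsumer` are hypothetical and are not Kafka's API; this is only a toy in-memory model under those assumptions.

```python
class MessageLog:
    """Toy append-only log (hypothetical, not Kafka's API).

    Messages are kept after being read, so any consumer can read them later.
    """

    def __init__(self):
        self._messages = []

    def append_batch(self, batch):
        """Append several messages at once; return the offset of the first one."""
        first_offset = len(self._messages)
        self._messages.extend(batch)
        return first_offset

    def read(self, offset, max_count):
        """Return up to max_count messages starting at the given offset."""
        return self._messages[offset:offset + max_count]


class LogConsumer:
    """Each consumer owns its position, so the log keeps no per-consumer state."""

    def __init__(self, log, offset=0):
        self.log = log
        self.offset = offset

    def poll(self, max_count=10):
        """Read the next batch and advance this consumer's own offset."""
        batch = self.log.read(self.offset, max_count)
        self.offset += len(batch)
        return batch


log = MessageLog()
log.append_batch(["login", "click", "logout"])

realtime = LogConsumer(log)
first_read = realtime.poll()        # the real-time consumer reads everything

late_batch_job = LogConsumer(log)   # a consumer that starts later
second_read = late_batch_job.poll() # still sees the same messages

realtime.offset = 1                 # rewind to re-read from the second message
replayed = realtime.poll()
```

Because the log does not delete a message once someone has read it, a batch job that starts later and a real-time consumer can both read the same data, and rewinding is just a matter of resetting an integer.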

Image source: Apache Kafka

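When it comes to message queues, what everyone cares about most is performance. Looking at the chart in the slides... can it really be that good!?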

Image source: Apache Kafka

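Looking closely at the benchmark conditions and environment settings, though, I found that it trades away some data integrity and delivery guarantees in exchange for throughput: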

Kafka producer currently doesn't wait for ack from the broker. Without ack, there is no guarantee that every published message is actually received by the broker

So for a system that must guarantee delivery, Kafka as benchmarked here may not be an option~
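The trade-off can be illustrated with a toy simulation. The `FlakyBroker` class and both send functions below are hypothetical, written only for this post; they do not reproduce Kafka's producer or its benchmark setup. The broker is assumed to lose every third delivery attempt.

```python
class FlakyBroker:
    """Toy broker that loses every third delivery attempt (hypothetical)."""

    def __init__(self):
        self.received = []
        self._attempts = 0

    def deliver(self, message):
        """Return True (an ack) if the message was stored, False if it was lost."""
        self._attempts += 1
        if self._attempts % 3 == 0:
            return False
        self.received.append(message)
        return True


def send_without_ack(broker, messages):
    """Fire and forget: the result is ignored, so losses go unnoticed."""
    for message in messages:
        broker.deliver(message)


def send_with_ack(broker, messages, max_retries=3):
    """Wait for an ack and retry on failure: slower, but losses are detected."""
    for message in messages:
        for _ in range(max_retries):
            if broker.deliver(message):
                break
        else:
            raise RuntimeError(f"gave up on {message!r}")


messages = [f"event-{i}" for i in range(6)]

no_ack_broker = FlakyBroker()
send_without_ack(no_ack_broker, messages)  # two of the six messages are lost

ack_broker = FlakyBroker()
send_with_ack(ack_broker, messages)        # every message arrives after retries
```

The producer without acks makes fewer calls and never waits, which is where the throughput comes from, but it has no way of knowing that two messages never arrived.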

RabbitMQ vs. Kafka

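Now that I understood Kafka's characteristics, I could not help wanting to compare it with RabbitMQ. The Quora thread "RabbitMQ vs Kafka which one for durable messaging with good query feature" sums up the strengths of the two systems; here is an excerpt: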

a) Use Kafka if you have a fire hose of events (100k+/sec) you need delivered in partitioned order 'at least once' with a mix of online and batch consumers, you want to be able to re-read messages, you can deal with current limitations around node-level HA (or can use trunk code), and/or you don't mind supporting incubator-level software yourself via forums/IRC.

b) Use Rabbit if you have messages (20k+/sec) that need to be routed in complex ways to consumers, you want per-message delivery guarantees, you don't care about ordered delivery, you need HA at the cluster-node level now, and/or you need 24x7 paid support in addition to forums/IRC.
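The "partitioned order" mentioned in (a) can be sketched as follows: messages are routed to a partition by key, and order is kept only within each partition, not across the whole stream. The partition count, the use of `zlib.crc32` as the hash, and the function names are all assumptions made for this illustration, not Kafka's actual partitioner.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3  # arbitrary value chosen for this illustration


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Map a key to a partition with a stable hash (crc32 is an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def publish(events):
    """Append each (key, value) event to the partition chosen by its key."""
    partitions = defaultdict(list)
    for key, value in events:
        partitions[partition_for(key)].append((key, value))
    return partitions


def values_for(partitions, key):
    """Read back one key's values in the order its partition stored them."""
    return [v for k, v in partitions[partition_for(key)] if k == key]


events = [
    ("user-a", "login"),
    ("user-b", "login"),
    ("user-a", "click"),
    ("user-b", "logout"),
    ("user-a", "logout"),
]
partitions = publish(events)

user_a_events = values_for(partitions, "user-a")
user_b_events = values_for(partitions, "user-b")
```

All events with the same key land in the same partition, so each user's events come back in the order they were published, while nothing is promised about the relative order of events in different partitions.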

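However, the author of that answer also points out that if what you need is real-time stream processing (filter/query), neither system is a good fit, and you should consider Storm instead: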

Neither offers great "filter/query" capabilities - if you need that, consider using Storm on top of one of these solutions to add computation, filtering, querying, on your streams
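To give a rough idea of what "computation, filtering, querying on your streams" means, here is a tiny pipeline built from plain Python generators. It does not use Storm or its API; the event fields (`level`, `service`) and function names are made up for this sketch, and a real stream would be unbounded rather than a short list.

```python
def only_errors(stream):
    """Filtering stage: pass through only the events whose level is 'error'."""
    for event in stream:
        if event["level"] == "error":
            yield event


def count_by_service(stream):
    """Computation stage: count the events seen per service."""
    counts = {}
    for event in stream:
        counts[event["service"]] = counts.get(event["service"], 0) + 1
    return counts


# In practice these events would be read from a message queue.
incoming = [
    {"service": "api", "level": "info"},
    {"service": "api", "level": "error"},
    {"service": "db", "level": "error"},
    {"service": "api", "level": "error"},
]

error_counts = count_by_service(only_errors(iter(incoming)))
```

The messaging system only moves the events; the filtering and counting live in a separate processing layer, which is the role the quoted answer assigns to Storm.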

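With these articles and explanations I am starting to understand what Kafka and Storm are for and where they fit, and why so many companies use the two systems to make up for Hadoop's weakness in real-time processing. Since this combination is so common, here is how one company, Infochimps, views Storm and Kafka: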

Why should you care?

With Storm and Kafka, you can conduct stream processing at linear scale, assured that every message gets processed in real-time, reliably. In tandem, Storm and Kafka can handle data velocities of tens of thousands of messages every second.
Stream processing solutions like Storm and Kafka have caught the attention of many enterprises due to their superior approach to ETL (extract, transform, load) and data integration.
Storm and Kafka are also great at in-memory analytics, and real-time decision support. Companies are quickly realizing that batch processing in Hadoop does not support real-time business needs. Real-time streaming analytics is a must-have component in any enterprise Big Data solution or stack, because of how elegantly they handle the “three V’s” — volume, velocity and variety.
Storm and Kafka are the two technologies on the list that we’re most committed to at Infochimps, and it is reasonable to expect that they’ll be a formal part of our platform soon.