Wednesday, January 23, 2013

Big Data Analytics - Weka vs Mahout vs R

Image source: drawn by the author

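As the saying goes, "plans can't keep up with changes, and changes can't keep up with one word from the boss ╮(╯▽╰)╭", so my top priority lately has become researching the technologies around so-called Big Data Analytics. That's fine with me, though, because I have long wanted to get these things straight... (call me the Survey Maniac.... (  ̄ c ̄)y·ξ )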

(Note: this is not a technical article, just a post of rambling notes~)

Machine Learning ? Data Mining ?


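First question: whenever Big Data Analytics comes up, the first things that pop into my head are a couple of recently popular terms: Data Mining (DM)? Machine Learning (ML)? I honestly can't tell them apart. How exactly do DM and ML depend on each other? Are they independent, or complementary? I also happened to find a diagram online that explains what Data Science is and which techniques it draws on, but looking at it only makes things more confusing~(/‵Д′)/~ ╧╧)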



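Only after I later found the article "Data Mining探索" (Exploring Data Mining) did I become fairly sure that DM and ML are best regarded as independent fields, but that ML techniques can be used to support DM, serving as one of its tools.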

WEKA? R? Mahout ?


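The second question is about WEKA? R? Mahout? How exactly do these technologies differ, and what is each one good at? Until now, all I knew was that they are all open source projects billed as tools for data mining.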

Here is what I found online:

The official WEKA site describes it as follows:
Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.

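So WEKA's strength is the large number of ML algorithms it makes available, but it is only suited to computation on relatively small datasets. Once the workload calls for large-scale distributed computation, you run into the following problem: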
WEKA can spend days for a single learn-and-test cycle, or it can simply run out of memory; and not with an average machine, even with a really big server!
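Mahout, by contrast, is designed for large-scale distributed computation (although it currently supports fewer algorithms and its range of application is more limited). The description below is excerpted from a Mahout mailing list thread: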

Both packages support supervised and unsupervised algorithms.  Due to scalability concerns, Mahout does not have much in the way of agglomerative algorithms.

The highlights of Mahout right now are:
  • very large scale SVD
  • very large scale clustering
  • scalable item-set detection
  • the beginnings of very strong supervised classifiers for large features sets x large training data
  • decent underlying math library
  • command line or API focus.  This is better than GUI focus for production work.

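So every tool has its own strengths, weaknesses, and limitations, which is how hybrid models came about. The figure below shows the combinations I have found so far: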

Image source: compiled by the author

Related references:

R + WEKA:
R + Hadoop:
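Besides, forcing R and Hadoop together is not necessarily what data scientists need (it may still be too much of a general programmer's way of thinking). In the article "Big Analytics with R, Cassandra, and Hive", the author notes: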
Most of the data scientists I’ve spoken to don’t really want this, they really want ways to get data into R and use data sampling and other estimation techniques (for example hive sampling)
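
To make the "data sampling and other estimation techniques" idea in the quote above a bit more concrete, here is a minimal sketch of reservoir sampling: draw a fixed-size uniform random sample from a stream that is too large to hold in memory, then estimate a statistic from the sample alone. This is my own illustration, not the quoted author's method and not how Hive sampling is implemented. It is written in Python rather than R only to keep it standard-library-only, and the names reservoir_sample and fake_rows are hypothetical.

```python
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k items chosen uniformly at random from a stream.

    Only k items are kept in memory at any time, no matter how long
    the stream is (classic reservoir sampling).
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1),
            # replacing a uniformly chosen existing slot.
            j = rng.randint(0, i)  # both ends inclusive
            if j < k:
                reservoir[j] = item
    return reservoir


def fake_rows(n: int) -> Iterator[dict]:
    """Hypothetical stand-in for rows streamed out of a large data store."""
    for i in range(n):
        yield {"id": i, "value": i % 7}


# Sample 1,000 rows out of 100,000 and estimate the mean of "value"
# from the sample instead of scanning all the data in the analysis tool.
sample = reservoir_sample(fake_rows(100_000), k=1000, seed=42)
estimate = sum(row["value"] for row in sample) / len(sample)
print(f"sample size = {len(sample)}, estimated mean = {estimate:.3f}")
```

The point is only that the heavy lifting (scanning everything) happens where the data lives, and the analysis tool sees a sample small enough to fit in memory.
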
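After looking at so many tools, methods, and introductions, is your head spinning more and more? So I'll close by quoting from these two articles, "R Is Not Enough For 'Big Data'" and "Yes, you need more than just R for Big Data Analytics", as my conclusion: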

If you ask the wrong question, you will be able to find statistics that give answers that are simply wrong (or, at best, misleading).

On net, having a degree in math, economics, AI, etc., isn’t enough. Tool expertise isn’t enough.  You need experience in solving real world problems, because there are a lot of important limitations to the statistics that you learned in school.  Big data isn’t about bits, it’s about talent.

This is a great illustration of why the data science process is a valuable one for extracting information from Big Data, because it combines tool expertise with statistical expertise and the domain expertise required to understand the problem and the data applicable to it

Neither the tools nor the data are the point; the point is what problem you want to solve....hmm...

(But I still have to keep figuring out which tools I have available and which problems each of them is meant to solve...Orz...)

Further reading:
[1] Machine Learning with Hadoop
[2] The Search for a Better BIG Data Analytics Pipeline
