Monday, September 30, 2013

Data Encryption for Hadoop - HadoopCryptoCompressor


As mentioned in a previous article (Security for Hadoop - Data Encryption), data encryption is still not officially supported in Hadoop.

Today I want to show you an interesting project called HadoopCryptoCompressor, which is a simple "compressor" for Hadoop (it doesn't really compress anything) that lets you encrypt your data with a secret key using "AES/CBC/PKCS5Padding".
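
To make that concrete, here is a minimal sketch of what an "AES/CBC/PKCS5Padding" encryption looks like with the plain JDK javax.crypto API. The hard-coded key and zero IV are demo-only assumptions; how HadoopCryptoCompressor actually derives its key and IV from the -aeskey argument is up to the project itself.

import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public class AesCbcSketch {

    // Encrypt with the same transformation the codec advertises.
    static byte[] encrypt(byte[] plain, byte[] key16, byte[] iv16) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        cipher.init(Cipher.ENCRYPT_MODE,
                new SecretKeySpec(key16, "AES"),   // 16 bytes = AES-128
                new IvParameterSpec(iv16));
        return cipher.doFinal(plain);
    }

    public static void main(String[] args) throws Exception {
        byte[] key = "0123456789abcdef".getBytes("UTF-8"); // demo key only
        byte[] iv  = new byte[16];                          // fixed IV, demo only
        byte[] ct  = encrypt("hello hadoop".getBytes("UTF-8"), key, iv);
        System.out.println(ct.length + " bytes of ciphertext");
    }
}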

This project has also been proposed to Hadoop; the JIRA issue is HADOOP-7857.

Unfortunately, the original version started by geisbruch did not work for me, so I decided to try to fix it. I also merged another branch (a fork by ubiquitousthey).


Here is my fork & patched version - howie/HadoopCryptoCompressor - and I will show you how to use this plugin.

Tutorial


1. Install

1.1 Clone from GitHub
# git clone https://github.com/howie/HadoopCryptoCompressor.git crypto


1.2 Build with Maven

# cd crypto
# mvn install


Maven will generate HadoopCryptoCompressor-0.0.6-SNAPSHOT.jar under crypto/target/.

1.3 Modify /etc/hadoop/conf/core-site.xml


<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.CryptoCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.SnappyCodec</value>
</property>



1.4 Copy the jar to the Hadoop classpath

There are two ways to put HadoopCryptoCompressor-0.0.6-SNAPSHOT.jar on the classpath:

A. Manually copy

Directly copy the jar to every machine's /usr/lib/hadoop/lib/ and modify /etc/hadoop/conf/hadoop-env.sh:

export HADOOP_CLASSPATH=/usr/lib/hadoop/lib/HadoopCryptoCompressor-0.0.6-SNAPSHOT.jar:${HADOOP_CLASSPATH}${JAVA_JDBC_LIBS}:${MAPREDUCE_LIBS}


P.S. This may not work in some fully distributed environments.

B. Use -libjars to copy

Run a hadoop-examples program such as wordcount with -libjars, and Hadoop will ship HadoopCryptoCompressor-0.0.6-SNAPSHOT.jar to each node's classpath.


After installing HadoopCryptoCompressor-0.0.6-SNAPSHOT.jar, let's look at some scenarios.

Scenario 1 - WordCount with encrypted data


1. Generate encrypted data

Choose any text file and encrypt it with HadoopCryptoCompressor-0.0.6-SNAPSHOT.jar:

# java -jar HadoopCryptoCompressor-0.0.6-SNAPSHOT.jar -e -aeskey 123456 test.txt test.crypto


Note that Hadoop picks its compression codec by file name extension, so decryption is triggered only if the encrypted file is named *.crypto.
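
If you're curious, you can verify this extension-based dispatch yourself with Hadoop's CompressionCodecFactory, which is the mechanism behind the behavior. A small sketch — it assumes the HadoopCryptoCompressor jar is on the classpath, and that CryptoCodec registers ".crypto" as its default extension, as the file naming above suggests:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CodecLookup {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Normally core-site.xml provides this list; set it explicitly for the demo.
        conf.set("io.compression.codecs",
                 "org.apache.hadoop.io.compress.CryptoCodec");
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        // getCodec matches on the file name suffix, e.g. ".crypto";
        // it returns null when no registered codec claims the extension.
        CompressionCodec codec = factory.getCodec(new Path("/tmp/test.crypto"));
        System.out.println(codec == null
                ? "no codec matched"
                : "matched codec: " + codec.getClass().getName());
    }
}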

2. Upload the file to HDFS

# hadoop fs -put test.crypto /tmp/


3. Run wordcount

# hadoop  jar /usr/lib/hadoop/hadoop-examples.jar  wordcount -libjars file:///root/HadoopCryptoCompressor-0.0.6-SNAPSHOT.jar -Dcrypto.secret.key=123456  /tmp/test.crypto /tmp/wc-test_data_aes


4. Check the result

Finally, you can check the wordcount result:

# hadoop fs -cat /tmp/wc-test_data_aes/part-*


Scenario 2 - Hive with encrypted data


In this scenario we load an encrypted file into Hive and query it with SELECT.

1. Prepare an example file

The following is the content of the example file (company_Info.txt):


# ID, company, tel, address



A1,Trend Micro,2-2378-9666, 台北市敦化南路一段198號

A2,Google,2-8729-6000, 台灣台北市信義區市府路45號

A3,Apple,0800-020-021,台北市信義區松智路1號19樓A


2. Encrypt the file

#java -jar HadoopCryptoCompressor-0.0.6-SNAPSHOT.jar -e -aeskey 123456 -in company_Info.txt -out company_Info.crypto




3. Create a Hive table and load company_Info.crypto into it

First, write a Hive script that uploads the encrypted file. Here is an example:

-- filename: upload.hive

CREATE TABLE IF NOT EXISTS companyInfo (
  ID STRING,
  Company_Name STRING,
  Tel_Number STRING,
  Address STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '${hiveconf:file}' OVERWRITE INTO TABLE companyInfo;


Second, use hive to execute upload.hive:

# hive -f upload.hive -hiveconf file=company_Info.crypto


4. Select data from Hive

Run hive to enter the Hive shell:
hive> set crypto.secret.key=123456;

hive> select * from companyInfo;

OK



A1    Trend Micro    2-2378-9666    台北市敦化南路一段198號

A2    Google         2-8729-6000    台灣台北市信義區市府路45號

A3    Apple          0800-020-021   台北市信義區松智路1號19樓A

Time taken: 7.128 seconds, Fetched: 3 row(s)

hive>



After writing this post... I'm seriously thinking about switching to Logdown... Orz..

Sunday, September 29, 2013

Scrum and Parenting



The picture above illustrates the chicken-and-pig story most often used to explain Scrum. The following is excerpted from Wikipedia:

Scrum defines a number of roles. Based on their involvement in the development process, these roles are divided into two groups: the "pigs" and the "chickens". The grouping comes from a joke about a pig and a chicken opening a restaurant together[4]:
One day, a pig and a chicken were walking down the road. The chicken said to the pig, "Hey, how about we open a restaurant together?" The pig looked back at the chicken and said, "Good idea; what would you name it?" The chicken thought it over and said, "How about 'Ham and Eggs'?" "No thanks," said the pig. "I'd be committed, but you'd only be involved."

It suddenly struck me today that, when it comes to having and raising children, dad and mom also play the pig and the chicken: mom has to dive in completely, from pregnancy to delivery to breastfeeding and beyond, while dad is mostly in a participating and supporting role.

(So dads really should be extra considerate to moms~ :P)

That said... I think Kanban actually fits parenting better, because:
  • There is no product backlog (though parents always want to mold their kids into their ideal image? :P)
  • There are no clear iterations either (although there are a few big milestones)
  • And urgent tasks will forever be jumping the queue~@@

The only thing dad (the PM) can do is list and track everything that might need doing and its progress, and also set a WIP (work in progress) limit, because accepting more work than you can bear could literally cost lives.... Orz..

(Mysterious voice: so a lot of unnecessary activities and plans have to be declined or put on hold....)


So if you manage projects well, will parenting go more smoothly?

Conversely,

if parenting goes smoothly, does that prove a certain level of project management skill!?


Let's keep watching.....

Tuesday, September 24, 2013

Security for Hadoop - Data Encryption



Commercial break: if you want a deeper understanding of Hadoop security topics, come to Hadoop Taiwan 2013, where the session "Hadoop Security: Now and future" will feature the experts from TCloud giving a much more detailed explanation~~


In data storage and databases, security mainly involves three levels of protection:

1. Authentication (who are you? is this data yours?)

2. Authorization (what permissions do you have on this data? what operations may you perform on it?)

3. Encryption (if the data isn't yours or you have no permission, not even the admin gets to read it!)


But in the Hadoop world, security seems to be a relatively unpopular topic. Why? Perhaps because:
  • We can barely keep up with the computation as it is, never mind encryption~
  • Operations are already complicated enough; adding security would just make it worse!!
  • The data is so big and so messy anyway~~ I'm not afraid of you peeking; find it first (kidding)

Actually that's not entirely true; it's just relatively less urgent, which is why the Kerberos security mechanism was only officially added in Hadoop 1.0.0 (solving problem 1; problem 2 still relies on the Linux file permission model).

In an InfoQ interview you can see the community's take on Hadoop security:

InfoQ: What type of security features does this release support, in terms of authentication, access control and data encryption?
Arun: 1.0.0 supports strong, end-to-end Kerberos based authentication for both HDFS (filesystem for storage) and MapReduce (data processing). Kerberos is by far the most popular network authentication protocol used in the enterprise.
It also provides strong access control at all levels for applications and data. For example, one can ensure that only a certain individual (or set of users) can view running applications, see application logs etc.

The remaining unsolved problem is encryption. Although people in the community have started designing and developing it (led mainly by Andrew Purtell), it is less urgent and the community seems less interested in this feature, so there is still no clear roadmap for when it will be released. If you're interested, you can follow these JIRA issues:

HADOOP-9331
HADOOP-9332
HADOOP-9333
MAPREDUCE-5025
HBASE-7544

Intel has also proposed project-rhino, which likewise aims to tackle the security problem.

But if the official Hadoop release doesn't provide it, how do I protect my data? How do I get encryption working!?
The quickest way is to hack the compression mechanism.

Why? Think about it for a moment: encryption and compression are actually very similar; both run data through some algorithm to re-encode it. So in theory, if we add our own compression codec (really an encryption codec) to the Hadoop parameter io.compression.codecs, we should get the effect of encryption. And someone on the Internet really did build such a plugin and even proposed it to Hadoop as HADOOP-7857 (which, likewise, attracted little attention~XD).
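
To show what "an encryption codec disguised as a compression codec" means in code, below is a minimal sketch of a Hadoop CompressionCodec whose streams run AES/CBC/PKCS5Padding instead of compressing. This is an illustration of the technique, not the actual HadoopCryptoCompressor source: the class name is made up, and the fixed IV plus naive key padding are shortcuts a real codec must not take.

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Arrays;

import javax.crypto.Cipher;
import javax.crypto.CipherInputStream;
import javax.crypto.CipherOutputStream;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionInputStream;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.Compressor;
import org.apache.hadoop.io.compress.Decompressor;

// Illustrative sketch only -- NOT the actual HadoopCryptoCompressor source.
// Hadoop injects the Configuration when it instantiates the codec by reflection.
public class SketchCryptoCodec extends Configured implements CompressionCodec {

    // Build a cipher from the same property the real plugin reads.
    private Cipher cipher(int mode) throws IOException {
        try {
            byte[] raw = getConf().get("crypto.secret.key", "").getBytes("UTF-8");
            Cipher c = Cipher.getInstance("AES/CBC/PKCS5Padding");
            c.init(mode, new SecretKeySpec(Arrays.copyOf(raw, 16), "AES"),
                    new IvParameterSpec(new byte[16])); // fixed IV: demo only!
            return c;
        } catch (Exception e) {
            throw new IOException(e);
        }
    }

    // Extension-based dispatch hinges on this.
    public String getDefaultExtension() { return ".crypto"; }

    public CompressionOutputStream createOutputStream(OutputStream out)
            throws IOException {
        final CipherOutputStream cos =
                new CipherOutputStream(out, cipher(Cipher.ENCRYPT_MODE));
        return new CompressionOutputStream(cos) {
            public void write(int b) throws IOException { cos.write(b); }
            public void write(byte[] b, int off, int len) throws IOException {
                cos.write(b, off, len);
            }
            // close() finalizes the padding when it closes the cipher stream
            public void finish() throws IOException { cos.flush(); }
            public void resetState() { }
        };
    }

    public CompressionInputStream createInputStream(InputStream in)
            throws IOException {
        final CipherInputStream cis =
                new CipherInputStream(in, cipher(Cipher.DECRYPT_MODE));
        return new CompressionInputStream(cis) {
            public int read() throws IOException { return cis.read(); }
            public int read(byte[] b, int off, int len) throws IOException {
                return cis.read(b, off, len);
            }
            public void resetState() { }
        };
    }

    // No pooled (de)compressors: the cipher streams do all the work.
    public CompressionOutputStream createOutputStream(OutputStream out,
            Compressor c) throws IOException { return createOutputStream(out); }
    public CompressionInputStream createInputStream(InputStream in,
            Decompressor d) throws IOException { return createInputStream(in); }
    public Class<? extends Compressor> getCompressorType() { return null; }
    public Compressor createCompressor() { return null; }
    public Class<? extends Decompressor> getDecompressorType() { return null; }
    public Decompressor createDecompressor() { return null; }
}

Register such a class in io.compression.codecs (as shown below) and Hadoop will pick it up for any file ending in .crypto.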


If you're interested, you can download the project from GitHub and play with it: HadoopCryptoCompressor. Note, though, that its tutorial contains some errors; the corrected configuration is as follows:

<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.CryptoCodec</value>
</property>


Just install the jar and add the setting above to /etc/hadoop/conf/core-site.xml, and you can use encryption~

What, that's too brief? I'll leave you hanging here for now....

Commercial break, take two: if you want a deeper understanding of Hadoop security topics, come to Hadoop Taiwan 2013, where the session "Hadoop Security: Now and future" will feature the experts from TCloud giving a much more detailed explanation~~XDDDD

TCloud will also have a booth at the venue~


(If you didn't manage to register, you'll just have to hope the slides get posted afterwards~~ :P)


Friday, September 13, 2013

Getting Started with the RESTful API Documentation Generator Enunciate



Continuing from my previous post, "A Survey of RESTful JSON API Documentation Generators", I ended up choosing Enunciate as my RESTful documentation generator. It's really simple to use: just add the following dependency to your Maven project:

 
<dependency>
  <groupId>org.codehaus.enunciate</groupId>
  <artifactId>enunciate-jboss-rt</artifactId>
  <version>1.27</version>
</dependency>

 

(Note: my REST framework is JBoss RESTEasy, so I chose enunciate-jboss-rt; if you use another framework, see the enunciate maven plugin documentation.)

Then add the following to the Maven plugins section:


<plugin>
  <groupId>org.codehaus.enunciate</groupId>
  <artifactId>maven-enunciate-jboss-plugin</artifactId>
  <version>1.27</version>
  <configuration>
    <!-- the element holding "true" did not survive the original page;
         the surrounding tag names here are reconstructed -->
    <configFile>${basedir}/src/main/resources/enunciate.xml</configFile>
    <docsSubdir>enunciate</docsSubdir>
  </configuration>
  <executions>
    <execution>
      <phase>install</phase>
      <goals>
        <goal>docs</goal>
      </goals>
      <configuration>
        <docsDir>${project.parent.basedir}/myproject-doc/docs</docsDir>
      </configuration>
    </execution>
  </executions>
</plugin>

After that, every time you run mvn install the documentation is generated automatically at the configured location, in my case ${project.parent.basedir}/myproject-doc/docs as specified above.
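
For reference, the kind of resource Enunciate documents looks like the following. This CompanyResource class is a made-up example (not from my actual project); Enunciate scrapes the JavaDoc comments and the standard JAX-RS annotations into the generated HTML:

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;

/**
 * Company lookup service. Enunciate turns this JavaDoc into the
 * service description in the generated documentation.
 */
@Path("/company")
public class CompanyResource {

    /**
     * Look up a company by its ID.
     *
     * @param id the company ID, e.g. "A1"
     * @return the company record, serialized as JSON
     */
    @GET
    @Path("/{id}")
    @Produces("application/json")
    public Company get(@PathParam("id") String id) {
        return new Company(id, "Trend Micro");
    }

    /** Minimal DTO for the example. */
    public static class Company {
        public String id;
        public String name;

        public Company() { }

        public Company(String id, String name) {
            this.id = id;
            this.name = name;
        }
    }
}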

The generated page looks like the screenshot below:






If you're interested, you can also customize the CSS. And the most impressive part of the Enunciate plugin is that it can even generate clients in several languages for your RESTful API, including Java, PHP, and Ruby.



Monday, September 2, 2013

[Notes] Which Exchange Type to Use When Designing with MQ




Before discussing when to use which exchange type in an MQ design, let's first look at how many exchange types there are to choose from, because each exchange type suits a different scenario.
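
Taking RabbitMQ as a concrete example (the post's "exchange" concept comes from AMQP, and RabbitMQ is one popular AMQP broker), the Java client below declares the four exchange types the broker offers. The exchange names are made up for the demo, and a broker is assumed to be running on localhost:

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class ExchangeTypes {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");
        Connection conn = factory.newConnection();
        Channel ch = conn.createChannel();

        // The four AMQP exchange types:
        ch.exchangeDeclare("orders.direct", "direct");  // exact routing-key match
        ch.exchangeDeclare("logs.fanout", "fanout");    // broadcast to every bound queue
        ch.exchangeDeclare("events.topic", "topic");    // wildcard patterns, e.g. "order.*"
        ch.exchangeDeclare("meta.headers", "headers");  // match on message headers, not keys

        ch.close();
        conn.close();
    }
}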



What principles should we pay attention to when designing with and using MQ? The following is excerpted from Best Practices for Maximizing Scalability and Cost Effectiveness of Queue-Based Messaging Solutions on Windows Azure:

Queue-Based Messaging Fundamentals

A typical messaging solution that exchanges data between its distributed components using message queues includes publishers depositing messages into queues and one or more subscribers intended to receive these messages. In most cases, the subscribers, sometimes referred to as queue listeners, are implemented as single- or multi-threaded processes, either continuously running or initiated on demand as per a scheduling pattern.

At a higher level, there are two primary dispatch mechanisms used to enable a queue listener to receive messages stored on a queue:

    Polling (pull-based model): A listener monitors a queue by checking the queue at regular intervals for new messages. When the queue is empty, the listener continues polling the queue, periodically backing off by entering a sleep state.

    Triggering (push-based model): A listener subscribes to an event that is triggered (either by the publisher itself or by a queue service manager) whenever a message arrives on a queue. The listener in turn can initiate message processing thus not having to poll the queue in order to determine whether or not any new work is available.

So before choosing an exchange type, we should first think about whether we want polling or triggering.

It is also worth mentioning that there are different flavors of both mechanisms. For instance, polling can be blocking and non-blocking. Blocking keeps a request on hold until a new message appears on a queue (or timeout is encountered) whereas a non-blocking request completes immediately if there is nothing on a queue. With a triggering model, a notification can be pushed to the queue listeners either for every new message, only when the very first message arrives to an empty queue or when queue depth reaches a certain level.
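
In RabbitMQ's Java client, the two dispatch models correspond roughly to basicGet (pull) and basicConsume (push). A minimal sketch, assuming a local broker and a queue named "work" (both demo assumptions):

import com.rabbitmq.client.AMQP;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DefaultConsumer;
import com.rabbitmq.client.Envelope;
import com.rabbitmq.client.GetResponse;

public class PullVsPush {
    public static void main(String[] args) throws Exception {
        Connection conn = new ConnectionFactory().newConnection();
        Channel ch = conn.createChannel();
        ch.queueDeclare("work", false, false, false, null);

        // Polling (pull): ask for one message; null means the queue is empty,
        // so a real listener would back off and retry.
        GetResponse r = ch.basicGet("work", true);
        System.out.println(r == null ? "queue empty" : new String(r.getBody(), "UTF-8"));

        // Triggering (push): the broker calls us back whenever a message
        // arrives, so there is no need to poll at all.
        ch.basicConsume("work", true, new DefaultConsumer(ch) {
            @Override
            public void handleDelivery(String tag, Envelope env,
                    AMQP.BasicProperties props, byte[] body) {
                System.out.println("pushed: " + new String(body));
            }
        });
    }
}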

So which patterns are the most commonly used?

Message Bus 






Image source: Message Bus



Command Message






Image source: Command Message



Message Router


Image source: Message Router



Dynamic Router


Sunday, September 1, 2013

A Survey of RESTful JSON API Documentation Generators

The most painful part of programming is writing documentation, and what's even more painful is that whenever the code or an interface changes, you must immediately find and update the corresponding docs. That is exactly why documentation generators exist; for ordinary Java programs, the most convenient choice is of course Javadoc.

But as soon as cross-system, cross-language integration is involved, things are not so simple. Take web services: there are two common kinds. XML-based SOAP comes with a rigorous documentation story of its own (WSDL/XSD), so it is not much of a problem. RESTful services are another matter, because they usually exchange messages as JSON, and JSON is a loose, free-form structure. Without a documentation generator built specifically for RESTful APIs, things get very painful....

To ease that pain, I went looking for the RESTful service documentation generators currently available for Java solutions.

Most of them, however, are designed around Spring MVC.....

SpringDoclet


SpringDoclet is a Javadoc doclet that generates documentation on Spring Framework artifacts in a project. The detection of Spring artifacts is based on the presence of Spring annotations on Java classes and methods.

RESTdoclet


IG Group's RESTdoclet is a Maven Doclet plugin for generating REST service documentation from services implemented with Spring 3 REST. (You must use the Spring 3 REST annotations.)

Wsdoc


WsDoc is a documentation generator for Spring MVC REST services. It supports multi-module/WAR projects and a unified report.

Swagger


Swagger is a specification and complete framework implementation for describing, producing, consuming, and visualizing RESTful web services. The overarching goal of Swagger is to enable client and documentation systems to update at the same pace as the server. The documentation of methods, parameters and models is tightly integrated into the server code, allowing APIs to always stay in sync. With Swagger, deploying, managing, and using powerful APIs has never been easier. (For Spring MVC, see swagger-springmvc.)

JsonDoc


JSONDoc has a set of completely MVC-framework-agnostic annotations.


I know Spring MVC has the lion's share of Java solutions, but surely you're not out of luck just because you don't develop with Spring MVC!? What if I want to use Jersey or RESTEasy?

In the end I found Enunciate. It supports the standard JAX-WS and JAX-RS annotations, so it is not tied to Spring MVC (though it supports that too). On the other hand, the interface it generates is about the ugliest of the lot... Orz..

Enunciate


Enunciate is an engine for dramatically enhancing your Java Web service API. (@javax.ws.rs.Path) It’s simple. You develop your Web service API using standard Java technologies and attach Enunciate to your build process. Suddenly, your Web service API is boasting some pretty impressive features:
- Full HTML documentation of your services, scraped from your JavaDocs.
- Client-side libraries (e.g. Java, .NET, iPhone, Ruby, Flex, AJAX, GWT, etc.) for developers who want to interface with your API.
- Interface Definition Documents (e.g. WSDL, XML-Schema, etc.)