As the previous article (Security for Hadoop - Data Encryption) mentioned, data encryption is still not officially supported in Hadoop.
Today I want to show you an interesting project called HadoopCryptoCompressor. It is a simple "compressor" for Hadoop that doesn't really compress anything, but lets you encrypt your data with a secret key using the cipher transformation "AES/CBC/PKCS5Padding".
This project has also been proposed to Hadoop; the JIRA ID is HADOOP-7857.
Unfortunately, the original version started by geisbruch did not work for me, so I decided to try to fix it. I also merged in another branch (forked by ubiquitousthey).
Here is my forked and patched version, howie/HadoopCryptoCompressor, and I will show you how to use this plugin.
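To make the "AES/CBC/PKCS5Padding" string from the introduction more concrete, here is a minimal round-trip sketch using only the JDK's javax.crypto classes. It is not the plugin's code: the class name AesCbcSketch, the SHA-256 key derivation, and the "IV followed by ciphertext" layout are all my own assumptions for illustration, and the plugin may handle its key and IV quite differently.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.SecureRandom;
import java.util.Arrays;

// Hypothetical sketch, not part of HadoopCryptoCompressor.
public class AesCbcSketch {
    private static final String TRANSFORMATION = "AES/CBC/PKCS5Padding";

    // Assumption: turn the passphrase into a 128-bit AES key by hashing it
    // with SHA-256 and keeping the first 16 bytes.
    static SecretKeySpec deriveKey(String passphrase) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(passphrase.getBytes(StandardCharsets.UTF_8));
        return new SecretKeySpec(Arrays.copyOf(digest, 16), "AES");
    }

    // Encrypts the input and returns a random 16-byte IV followed by the ciphertext.
    static byte[] encrypt(String passphrase, byte[] plain) throws Exception {
        byte[] iv = new byte[16];
        new SecureRandom().nextBytes(iv);
        Cipher cipher = Cipher.getInstance(TRANSFORMATION);
        cipher.init(Cipher.ENCRYPT_MODE, deriveKey(passphrase), new IvParameterSpec(iv));
        byte[] body = cipher.doFinal(plain);
        byte[] out = new byte[iv.length + body.length];
        System.arraycopy(iv, 0, out, 0, iv.length);
        System.arraycopy(body, 0, out, iv.length, body.length);
        return out;
    }

    // Reads the IV from the first 16 bytes and decrypts the remainder.
    static byte[] decrypt(String passphrase, byte[] data) throws Exception {
        IvParameterSpec iv = new IvParameterSpec(data, 0, 16);
        Cipher cipher = Cipher.getInstance(TRANSFORMATION);
        cipher.init(Cipher.DECRYPT_MODE, deriveKey(passphrase), iv);
        return cipher.doFinal(data, 16, data.length - 16);
    }

    public static void main(String[] args) throws Exception {
        byte[] plain = "hello hadoop".getBytes(StandardCharsets.UTF_8);
        byte[] encrypted = encrypt("123456", plain);
        byte[] decrypted = decrypt("123456", encrypted);
        System.out.println(new String(decrypted, StandardCharsets.UTF_8));
    }
}
```

The point to take away is that AES is a symmetric cipher: whoever reads the data needs the same key that was used to write it, which is why the scenarios below pass the key again at read time.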
Tutorial
1. Install
1.1 Clone from GitHub
# git clone https://github.com/howie/HadoopCryptoCompressor.git crypto
1.2 Build with Maven
# cd crypto
# mvn install
Maven will generate HadoopCryptoCompressor-0.0.6-SNAPSHOT.jar at ../crypto/target/
1.3 Modify /etc/hadoop/conf/core-site.xml
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.CryptoCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
1.4 Copy jar to Hadoop Classpath
There are two ways to put HadoopCryptoCompressor-0.0.6-SNAPSHOT.jar on the classpath.
A. Manually copy
Copy the jar directly to /usr/lib/hadoop/lib/ on every machine and modify /etc/hadoop/conf/hadoop-env.sh:
export HADOOP_CLASSPATH=/usr/lib/hadoop/lib/HadoopCryptoCompressor-0.0.6-SNAPSHOT.jar:${HADOOP_CLASSPATH}${JAVA_JDBC_LIBS}:${MAPREDUCE_LIBS}
P.S. This may not work in some fully distributed environments.
B. Use -libjars to copy
Run a Hadoop example program such as wordcount with -libjars; Hadoop will ship HadoopCryptoCompressor-0.0.6-SNAPSHOT.jar to each node and add it to the job's classpath.
After installing HadoopCryptoCompressor-0.0.6-SNAPSHOT.jar, let's look at some scenarios.
Scenario 1 - wordcount with encrypted data
1. Generate encrypted data
Choose any text file and encrypt it with HadoopCryptoCompressor-0.0.6-SNAPSHOT.jar
# java -jar HadoopCryptoCompressor-0.0.6-SNAPSHOT.jar -e -aeskey 123456 test.txt test.crypto
Note that Hadoop picks a compression codec by looking at the file name extension, so the crypto codec is triggered only if the encrypted file is named *.crypto.
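The extension rule above can be pictured with a small self-contained sketch. This is not Hadoop's implementation (in Hadoop the lookup is done by CompressionCodecFactory against the codecs listed in io.compression.codecs); the class CodecBySuffixSketch and its table are hypothetical and only mimic the idea of matching a file name suffix to a codec class name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of suffix-based codec lookup, not Hadoop code.
public class CodecBySuffixSketch {
    // Suffix -> codec class name. The ".crypto" entry reflects the rule
    // described in this post; ".gz" and ".deflate" are the usual suffixes
    // for GzipCodec and DefaultCodec.
    private static final Map<String, String> CODECS = new LinkedHashMap<>();
    static {
        CODECS.put(".crypto", "org.apache.hadoop.io.compress.CryptoCodec");
        CODECS.put(".gz", "org.apache.hadoop.io.compress.GzipCodec");
        CODECS.put(".deflate", "org.apache.hadoop.io.compress.DefaultCodec");
    }

    // Returns the codec class name for a file name, or null when no suffix
    // matches, in which case the file would be read as plain bytes.
    static String codecFor(String fileName) {
        for (Map.Entry<String, String> entry : CODECS.entrySet()) {
            if (fileName.endsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println("test.crypto -> " + codecFor("test.crypto"));
        System.out.println("test.txt    -> " + codecFor("test.txt"));
    }
}
```

So if you name the encrypted file something like test.enc, no codec matches and the job will read the raw encrypted bytes instead of decrypting them.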
2. Upload the file to HDFS
# hadoop fs -put test.crypto /tmp/
3. Run wordcount
# hadoop jar /usr/lib/hadoop/hadoop-examples.jar wordcount -libjars file:///root/HadoopCryptoCompressor-0.0.6-SNAPSHOT.jar -Dcrypto.secret.key=123456 /tmp/test.crypto /tmp/wc-test_data_aes
4. Check the result
Finally, you can check the wordcount result
# hadoop fs -cat /tmp/wc-test_data_aes/part-*
Scenario 2 - Hive with encrypted data
In this scenario we load an encrypted file into Hive and then query it from Hive.
1. Generate a file to encrypt
The following is the content of the example file (company_Info.txt):
# ID, company , tel , address
A1,Trend Micro,2-2378-9666, 台北市敦化南路一段198號
A2,Google,2-8729-6000, 台灣台北市信義區市府路45號
A3,Apple,0800-020-021,台北市信義區松智路1號19樓A
2. Encrypt the file
# java -jar HadoopCryptoCompressor-0.0.6-SNAPSHOT.jar -e -aeskey 123456 -in company_Info.txt -out company_Info.crypto
3. Create a Hive table and load company_Info.crypto into it
First, write a Hive script that uploads the encrypted file. Here is an example:
-- filename: upload.hive
CREATE TABLE IF NOT EXISTS companyInfo (
  ID STRING,
  Company_Name STRING,
  Tel_Number STRING,
  Address STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH '${hiveconf:file}' OVERWRITE INTO TABLE companyInfo;
Second, run upload.hive with Hive:
# hive -f upload.hive -hiveconf file=company_Info.crypto
4. Select data from Hive
Run hive to enter the Hive shell.
hive> set crypto.secret.key=123456;
hive> select * from companyInfo;
OK
A1	Trend Micro	2-2378-9666	台北市敦化南路一段198號
A2	Google	2-8729-6000	台灣台北市信義區市府路45號
A3	Apple	0800-020-021	台北市信義區松智路1號19樓A
Time taken: 7.128 seconds, Fetched: 3 row(s)
hive>
After finishing this post... I'm seriously considering switching to Logdown... Orz..