Notes:
1. Shark 0.7.0 supports the Hive 0.9.0 API and metastore schema. Hive 0.10.0 changed significantly in both the API and the metastore, so Shark 0.7.0 cannot run against it; according to the community mailing list, Hive 0.10.0+ support is planned for Shark 0.8.0. (So if your environment already has Hive and it is 0.10.0+, you will have to remove it and switch to 0.9.0....)
1-2. If you use Cloudera, install CDH 4.1.x, which ships with Hive 0.9.0.
2. To run Shark on a Spark cluster, Shark must be installed on every node.
3. Although Shark is billed as 100% Hive-compatible, a few things are still unsupported or unimplemented, so read the Compatibility-with-Apache-Hive article first. For example, I stepped on the following landmine:
When running on a Spark cluster, you must set mapred.reduce.tasks=<number> before issuing queries; without it, any .hql you run produces no results at all, and no error message either. I finally found the cause in that article (a workaround sketch follows the quote below):
Automatically determine the number of reducers for joins and groupbys: Currently in Shark, you need to control the degree of parallelism post-shuffle using "set mapred.reduce.tasks=[num_tasks];". We are going to add auto-setting of parallelism in the next release.
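In practice the workaround is a single SET before your queries. A minimal sketch from the shark shell (the table name logs and the value 8 are placeholders; pick a parallelism that fits your cluster):

shark> SET mapred.reduce.tasks=8;
shark> SELECT url, COUNT(*) FROM logs GROUP BY url;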
Installation
0. Install Hadoop
The previous post on installing Spark never mentioned Hadoop, and Spark does not strictly require it. For Shark, however, I recommend installing Hadoop, because Shark uses Hive's libraries and Hive's metastore. The simplest way to test is therefore to install Hive, use Hive to store data into HDFS and the metastore, and then have Shark access and analyze that data (using plain Hive syntax). As a bonus, this also lets you compare the performance of Hive and Shark.
So please install Hadoop yourself (I used CDH4 with YARN, HDFS, and MapReduce, which is enough...).
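If you also go the CDH4 route, a minimal sketch of the package installs, assuming the Cloudera CDH4 yum repository is already configured (the role-to-host layout here is just an example; adjust to your cluster):

# on the master node (example role layout)
$ yum install -y hadoop-hdfs-namenode hadoop-yarn-resourcemanager
# on each worker node
$ yum install -y hadoop-hdfs-datanode hadoop-yarn-nodemanager hadoop-mapreduce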
1. Download Shark (the tarball includes an already-patched Hive)
$ wget http://spark-project.org/download/shark-0.7.0-hadoop2-bin.tgz  # Hadoop 2/CDH4
$ tar xvfz shark-0.7.0-*-bin.tgz
$ mv shark-0.7.0 /usr/lib/
$ ln -s /usr/lib/shark-0.7.0 /usr/lib/shark
$ mv hive-0.9.0-bin /usr/lib
$ ln -s /usr/lib/hive-0.9.0-bin /usr/lib/hive
2. Set up the Hive metastore
2-1. Install MySQL
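A minimal sketch for a CentOS-style host (the package names, the mysqld service name, and the connector jar path are assumptions that vary by distro); note that Hive needs the MySQL JDBC driver on its classpath:

$ yum install -y mysql-server mysql-connector-java
$ service mysqld start
$ ln -s /usr/share/java/mysql-connector-java.jar /usr/lib/hive/lib/mysql-connector-java.jar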
2-2. Log in to MySQL and set up the metastore schema
mysql> CREATE DATABASE metastore;
mysql> USE metastore;
mysql> SOURCE /usr/lib/hive/scripts/metastore/upgrade/mysql/hive-schema-0.9.0.mysql.sql;
mysql> CREATE USER 'hive'@'lab-hadoop-m1' IDENTIFIED BY 'my password';
mysql> REVOKE ALL PRIVILEGES, GRANT OPTION FROM 'hive'@'lab-hadoop-m1';
mysql> GRANT SELECT,INSERT,UPDATE,DELETE,LOCK TABLES,EXECUTE ON metastore.* TO 'hive'@'lab-hadoop-m1';
mysql> FLUSH PRIVILEGES;
mysql> quit;
3. Configure Hive: /usr/lib/hive/conf/hive-site.xml
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://lab-hadoop-m1/metastore</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://lab-hadoop-m1:9083</value>
    <description>IP address (or fully-qualified domain name) and port of the metastore host</description>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://lab-hadoop-m2.tcloud:8020</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>lab-hadoop-m2.tcloud:8021</value>
  </property>
</configuration>
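Two follow-ups worth noting. First, the metastore service also needs JDBC credentials to log in as the hive user created in step 2-2, so you will likely also need these two standard properties (values shown assume the user from step 2-2):

<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>my password</value>
</property>

Second, since hive.metastore.uris points at thrift://lab-hadoop-m1:9083, the metastore service must actually be running on that host:

$ hive --service metastore &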
4. Configure Shark: /usr/lib/shark/conf/shark-env.sh
#!/usr/bin/env bash
# Copyright (C) 2012 The Regents of The University California.
# All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# (Required) Amount of memory used per slave node. This should be in the same
# format as the JVM's -Xmx option, e.g. 300m or 1g.
export SPARK_MEM=1g

# (Required) Set the master program's memory
export SHARK_MASTER_MEM=1g

# (Required) Point to your Scala installation.
export SCALA_HOME=$SCALA_HOME

# (Required) Point to the patched Hive binary distribution
export HIVE_HOME=/usr/lib/hive

# (Optional) Specify the location of Hive's configuration directory. By default,
# it points to $HIVE_HOME/conf
#export HIVE_CONF_DIR="$HIVE_HOME/conf"

# For running Shark in distributed mode, set the following:
export HADOOP_HOME=/usr/lib/hadoop
export SPARK_HOME=/usr/lib/spark
export MASTER=spark://lab-hadoop-m1:7077
#export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so

# (Optional) Extra classpath
#export SPARK_LIBRARY_PATH=""

# Java options
# On EC2, change the local.dir to /mnt/tmp
SPARK_JAVA_OPTS="-Dspark.local.dir=/tmp "
SPARK_JAVA_OPTS+="-Dspark.kryoserializer.buffer.mb=10 "
SPARK_JAVA_OPTS+="-verbose:gc -XX:-PrintGCDetails -XX:+PrintGCTimeStamps "
export SPARK_JAVA_OPTS
Once everything is set up, typing shark at the console drops you into the Shark shell:
[root@lab-hadoop-m1 ~]# shark
Starting the Shark Command Line Client
WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
Logging initialized using configuration in jar:file:/usr/lib/hive-0.9.0-bin/lib/hive-common-0.9.0-amplab-4.jar!/hive-log4j.properties
Hive history file=/tmp/root/hive_job_log_root_201308131708_291740705.txt
shark>
If you see this prompt, the installation should be complete, and you can run Hive syntax directly.
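As a quick end-to-end check of the Hive-writes / Shark-reads flow described in step 0, a minimal sketch (the table, input file, and query are all made up for illustration):

# load some data through Hive first
hive> CREATE TABLE kv (k STRING, v INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
hive> LOAD DATA LOCAL INPATH '/tmp/kv.tsv' INTO TABLE kv;

# then read it back through Shark (remember the reducer setting from the notes above)
shark> SET mapred.reduce.tasks=4;
shark> SELECT k, SUM(v) FROM kv GROUP BY k;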
Reference:
[1] Running Shark on a Cluster