Tuesday, August 13, 2013

Getting Started with Spark & Shark - Shark Installation

Following up on the previous post, "Getting Started with Spark & Shark - Spark Installation", the next step is installing Shark. Shark is considerably more troublesome to install, and there are several things to watch out for.

Things to note:

1. Shark 0.7.0 supports the Hive 0.9.0 API and Metastore schema. Hive 0.10.0 changed a great deal in both the API and the Metastore, so Shark 0.7.0 cannot run against it; according to the community mailing list, Hive 0.10.0+ support is planned for Shark 0.8.0. (So if your environment already has Hive and it is 0.10.0+, you will have to remove it and switch to 0.9.0....)

1-2. If you use Cloudera, install CDH 4.1.x, which ships Hive 0.9.0.

2. To run Shark on a Spark cluster, Shark must be installed on every node.

3. Although Shark is advertised as 100% Hive-compatible, some things are still unsupported or unimplemented, so read the Compatibility-with-Apache-Hive article first. For example, I stepped on the following landmine:

When running on a Spark cluster, you must set mapred.reduce.tasks=number before issuing any command; otherwise every .hql you run produces no results whatsoever, and no error message either. I finally found the cause in that article (a workaround is sketched after the quote):
Automatically determine the number of reducers for joins and groupbys: Currently in Shark, you need to control the degree of parallelism post-shuffle using "set mapred.reduce.tasks=[num_tasks];". We are going to add auto-setting of parallelism in the next release.
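In practice the workaround looks like this (the table name and the value 8 are hypothetical placeholders; pick a parallelism that fits your cluster):

shark> set mapred.reduce.tasks=8;
shark> SELECT page, COUNT(*) FROM logs GROUP BY page;

Without the first line, the GROUP BY completes without complaint on a cluster but produces no output at all.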


Installation


0. Install Hadoop

The previous Spark installation post never mentioned Hadoop, and it is not strictly required there. For Shark, however, I recommend treating Hadoop as mandatory: Shark uses Hive's libraries and Hive's Metastore, so the simplest way to test is to install Hive, load data into HDFS and the Metastore through Hive, and then have Shark query and analyze it (using plain Hive syntax). As a bonus, you can compare Hive and Shark performance on the same data.

So please install Hadoop on your own (I used CDH4 with YARN, HDFS, and MapReduce, which is enough...); a quick sanity check follows below.
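Before moving on, it is worth confirming that the daemons are up and HDFS answers. A minimal sketch, assuming the hadoop CLI is on the PATH:

$ jps              # should list NameNode/DataNode (and the YARN daemons)
$ hadoop fs -ls /  # HDFS responding means Hive will be able to write to it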


1. Download Shark (the tarball includes a pre-patched Hive)

$ wget http://spark-project.org/download/shark-0.7.0-hadoop2-bin.tgz  # Hadoop 2/CDH4
$ tar xvfz shark-0.7.0-*-bin.tgz
$ mv shark-0.7.0 /usr/lib/
$ ln -s /usr/lib/shark-0.7.0 /usr/lib/shark
$ mv hive-0.9.0-bin /usr/lib
$ ln -s /usr/lib/hive-0.9.0-bin /usr/lib/hive
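One small detail: the console session at the end of this post invokes shark directly, so the launcher script (bin/shark in the 0.7.0 binary tarball) has to be reachable. An assumed setup; adjust to your environment:

$ ls /usr/lib/shark/bin                  # the shark launcher script lives here
$ export PATH=$PATH:/usr/lib/shark/bin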

2. Set up the Hive Metastore

2-1. Install MySQL

2-2. Log in to MySQL and create the metastore schema (lab-hadoop-m1 is my metastore host; swap in your own hostname and a real password):


mysql> CREATE DATABASE metastore;
mysql> USE metastore;
mysql> SOURCE /usr/lib/hive/scripts/metastore/upgrade/mysql/hive-schema-0.9.0.mysql.sql;
mysql> CREATE USER 'hive'@'lab-hadoop-m1' IDENTIFIED BY 'my password';
mysql> REVOKE ALL PRIVILEGES, GRANT OPTION FROM 'hive'@'lab-hadoop-m1';
mysql> GRANT SELECT,INSERT,UPDATE,DELETE,LOCK TABLES,EXECUTE ON metastore.* TO 'hive'@'lab-hadoop-m1';
mysql> FLUSH PRIVILEGES;
mysql> quit;
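To confirm the schema import worked, list the freshly created tables; DBS and TBLS are two of the tables the Hive 0.9.0 schema script defines:

mysql> USE metastore;
mysql> SHOW TABLES;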


3. Configure Hive: /usr/lib/hive/conf/hive-site.xml


 

<configuration>

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://lab-hadoop-m1/metastore</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://lab-hadoop-m1:9083</value>
  <description>IP address (or fully-qualified domain name) and port of the metastore host</description>
</property>

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://lab-hadoop-m2.tcloud:8020</value>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>lab-hadoop-m2.tcloud:8021</value>
</property>

</configuration>
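One thing the file above does not show: for a MySQL-backed metastore, hive-site.xml normally also needs the JDBC credentials matching the user created in step 2-2, and the MySQL connector jar (mysql-connector-java) has to be on Hive's classpath, e.g. in /usr/lib/hive/lib. A sketch of the two extra properties, with placeholder values, inside the same <configuration> block:

<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>my password</value>
</property>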


4. Configure Shark: /usr/lib/shark/conf/shark-env.sh


#!/usr/bin/env bash

# Copyright (C) 2012 The Regents of The University California.
# All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# (Required) Amount of memory used per slave node. This should be in the same
# format as the JVM's -Xmx option, e.g. 300m or 1g.
export SPARK_MEM=1g

# (Required) Set the master program's memory
export SHARK_MASTER_MEM=1g

# (Required) Point to your Scala installation.
export SCALA_HOME=$SCALA_HOME

# (Required) Point to the patched Hive binary distribution
export HIVE_HOME=/usr/lib/hive
# (Optional) Specify the location of Hive's configuration directory. By default,
# it points to $HIVE_HOME/conf
#export HIVE_CONF_DIR="$HIVE_HOME/conf"

# For running Shark in distributed mode, set the following:
export HADOOP_HOME=/usr/lib/hadoop
export SPARK_HOME=/usr/lib/spark
export MASTER=spark://lab-hadoop-m1:7077
#export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so

# (Optional) Extra classpath
#export SPARK_LIBRARY_PATH=""

# Java options
# On EC2, change the local.dir to /mnt/tmp
SPARK_JAVA_OPTS="-Dspark.local.dir=/tmp "
SPARK_JAVA_OPTS+="-Dspark.kryoserializer.buffer.mb=10 "
SPARK_JAVA_OPTS+="-verbose:gc -XX:-PrintGCDetails -XX:+PrintGCTimeStamps "
export SPARK_JAVA_OPTS
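Per note 2 at the top, every node in the Spark cluster needs the same Shark and Hive trees, including the configuration above. One way to push everything out, with hypothetical worker hostnames:

$ for h in lab-hadoop-s1 lab-hadoop-s2; do
>   rsync -a /usr/lib/shark-0.7.0 /usr/lib/hive-0.9.0-bin $h:/usr/lib/
>   ssh $h 'ln -sfn /usr/lib/shark-0.7.0 /usr/lib/shark; ln -sfn /usr/lib/hive-0.9.0-bin /usr/lib/hive'
> done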


Once everything is configured, typing shark at the console drops you into the Shark shell:

[root@lab-hadoop-m1 ~]# shark

Starting the Shark Command Line Client
WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
Logging initialized using configuration in jar:file:/usr/lib/hive-0.9.0-bin/lib/hive-common-0.9.0-amplab-4.jar!/hive-log4j.properties
Hive history file=/tmp/root/hive_job_log_root_201308131708_291740705.txt
shark>


If you see this prompt, the installation should be complete and you can issue Hive statements directly; see the smoke test below.
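As a quick smoke test (smoke_test is just an example name), a metastore round-trip in plain HiveQL should work end to end; for real queries with joins or group-bys, remember the mapred.reduce.tasks gotcha from the notes above:

shark> CREATE TABLE smoke_test (id INT, msg STRING);
shark> SHOW TABLES;
shark> DROP TABLE smoke_test;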

Reference:
[1] Running Shark on a Cluster
