# Apache Spark Install
The [Spark installation][Spark-install] follows the Hortonworks guidelines and requires HDFS and YARN. Spark is installed in YARN cluster mode.
Resources:

* [Tips and Tricks from Altiscale](https://www.altiscale.com/blog/tips-and-tricks-for-running-spark-on-hadoop-part-2-2/)
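Once the recipe below has run, one way to validate the setup is to submit the bundled SparkPi example in YARN cluster mode. This is a sketch, not part of the recipe: the jar path and the final argument are assumptions matching the HDP layout used later in this file.

```bash
# Submit SparkPi in YARN cluster mode (jar path assumed from the HDP layout)
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  /usr/hdp/current/spark-client/lib/spark-examples-*.jar 10
```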
module.exports = header: 'Spark2 Client Install', handler: ({options}) ->
## Register
@registry.register 'hconfigure', 'ryba/lib/hconfigure'
@registry.register 'hdfs_mkdir', 'ryba/lib/hdfs_mkdir'
## Identities
By default, the "spark" package creates the following entries:
cat /etc/passwd | grep spark
spark:x:495:494:Spark:/var/lib/spark:/bin/bash
cat /etc/group | grep spark
spark:x:494:
@system.group header: 'Group', options.group
@system.user header: 'User', options.user
## Spark Service Installation
Install the Spark and Spark Python packages.
@call header: 'Packages', ->
@service
name: 'spark2'
@service
name: 'spark2-python'
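As a sanity check after installation, the packages can be queried directly; this assumes an RPM-based distribution, as used by HDP, and the generic package aliases rather than the versioned package names.

```bash
# Confirm both packages were installed (RPM-based systems)
rpm -q spark2 spark2-python
```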
## HDFS Layout
status = user_owner = group_owner = null
spark_yarn_jar = options.conf['spark.yarn.jar']
@system.execute
header: 'HDFS Layout'
cmd: mkcmd.hdfs options.hdfs_krb5_user, """
hdfs dfs -mkdir -p /apps/#{options.user.name}
hdfs dfs -chmod 755 /apps/#{options.user.name}
hdfs dfs -put -f /usr/hdp/current/spark-client/lib/spark-assembly-*.jar #{spark_yarn_jar}
hdfs dfs -chown #{options.user.name}:#{options.group.name} #{spark_yarn_jar}
hdfs dfs -chmod 644 #{spark_yarn_jar}
hdfs dfs -put /usr/hdp/current/spark-client/lib/spark-examples-*.jar /apps/#{options.user.name}/spark-examples.jar
hdfs dfs -chown -R #{options.user.name}:#{options.group.name} /apps/#{options.user.name}
"""
## Spark Worker Events Log Directory
@hdfs_mkdir
target: "/user/#{options.user.name}"
mode: 0o0755
user: options.user.name
group: options.group.name
krb5_user: options.hdfs_krb5_user
@hdfs_mkdir
target: options.conf['spark.eventLog.dir']
mode: 0o1777
user: options.user.name
group: options.group.name
krb5_user: options.hdfs_krb5_user
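For reference, the matching entries written later into `spark-defaults.conf` look like the following; the directory value is a placeholder, the real one comes from `options.conf`:

```
spark.eventLog.enabled true
spark.eventLog.dir hdfs:///user/spark/applicationHistory
```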
## SSL
Install the SSL certificates for Spark. At the time of this writing, Spark supports SSL only for the Akka protocol and the file server (file sharing and data streaming); the web UI does not support SSL.
SSL must be configured on each node, and for each component that communicates over a given protocol.
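The keystore and truststore created below are referenced through the standard `spark.ssl.*` properties. The paths and passwords in this excerpt are placeholders; the actual values come from `options.conf`:

```
spark.ssl.enabled true
spark.ssl.keyStore /etc/spark/conf/keystore.jks
spark.ssl.keyStorePassword secret
spark.ssl.trustStore /etc/spark/conf/truststore.jks
spark.ssl.trustStorePassword secret
```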
@call
header: 'JKS Keystore'
if: -> options.conf['spark.ssl.enabled'] is 'true'
, ->
@java.keystore_add
header: 'SSL'
keystore: options.conf['spark.ssl.keyStore']
storepass: options.conf['spark.ssl.keyStorePassword']
key: options.ssl.key.source
cert: options.ssl.cert.source
keypass: options.conf['spark.ssl.keyPassword']
name: options.ssl.key.name
local: options.ssl.key.local
@java.keystore_add
keystore: options.conf['spark.ssl.keyStore']
storepass: options.conf['spark.ssl.keyStorePassword']
caname: 'hadoop_root_ca'
cacert: options.ssl.cacert.source
local: options.ssl.cacert.local
@java.keystore_add
header: 'JKS Truststore'
keystore: options.conf['spark.ssl.trustStore']
storepass: options.conf['spark.ssl.trustStorePassword']
caname: 'hadoop_root_ca'
cacert: options.ssl.cacert.source
local: options.ssl.cacert.local
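The imported certificates can be inspected with `keytool`; the path and password below are the same placeholders as above.

```bash
# List the CA entries present in the truststore
keytool -list -keystore /etc/spark/conf/truststore.jks -storepass secret
```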
## Configuration Files
Configure the environment file `/etc/spark/conf/spark-env.sh` and the defaults file `/etc/spark/conf/spark-defaults.conf`. The Hadoop cluster version is set to the latest installed one; YARN cluster mode is supported starting with version 2.2.2-4. The `spark.eventLog.enabled` property is set to "true" so that logs remain available after a job has finished (logs are only available in yarn-cluster mode).
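For example, on an HDP 2.3 cluster the version detection below resolves to something like this; the exact build number is illustrative:

```bash
$ hdp-select versions | tail -1
2.3.4.0-3485
```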
@call header: 'Configure', ->
hdp_current_version = null
@system.execute
cmd: "hdp-select versions | tail -1"
, (err, {status, stdout, stderr}) ->
throw err if err
hdp_current_version = stdout.trim() if status
options.conf['spark.driver.extraJavaOptions'] ?= "-Dhdp.version=#{hdp_current_version}"
options.conf['spark.yarn.am.extraJavaOptions'] ?= "-Dhdp.version=#{hdp_current_version}"
options.version = hdp_current_version
@call ->
@file
target: "#{options.conf_dir}/java-opts"
content: "-Dhdp.version=#{hdp_current_version}"
@hconfigure
header: 'Hive Site'
target: "#{options.conf_dir}/hive-site.xml"
source: "/etc/hive/conf/hive-site.xml"
properties: options.hive_site
backup: true
@file.render
header: 'spark-env'
target: "#{options.conf_dir}/spark-env.sh"
source: "#{__dirname}/../resources/spark-env.sh.j2"
local: true
context: options: options
backup: true
@file.properties
target: "#{options.conf_dir}/spark-defaults.conf"
content: options.conf
merge: true
separator: ' '
@file
if: options.conf['spark.metrics.conf']
target: "#{options.conf_dir}/metrics.properties"
write: for k, v of options.metrics
match: ///^#{quote k}=.*$///mg
replace: if v is null then "" else "#{k}=#{v}"
append: v isnt null
backup: true
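As an illustration of the metrics file format, the following lines (from the stock Spark `metrics.properties.template`) enable the console sink; which sinks are actually enabled here depends on what the deployment provides in `options.metrics`:

```
*.sink.console.class=org.apache.spark.metrics.sink.ConsoleSink
*.sink.console.period=10
*.sink.console.unit=seconds
```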
## Dependencies
mkcmd = require '../../lib/mkcmd'
quote = require 'regexp-quote'
string = require 'nikita/lib/misc/string'