NASDAG.org

A data scientist blog, by Philippe Dagher

Step by Step Installation of a Local Data Lake (1/3)

This post will guide you through a step-by-step installation and configuration of a Local Data Lake on ubuntu, with packages such as Hadoop, Hive, Spark, Thriftserver, Maven, Scala, Python, Jupyter and Zeppelin.

It is the first of a series of 3 posts that will let you familiarize yourself with state-of-the-art tools for practicing Data Science on Big Data.

In the first post we will set up the environment on ubuntu using a cloud host or a virtual machine. In the second post we will crunch incoming data and expose it to data mining and machine learning tools. In the third post, we will apply machine learning and data science techniques to conventional business cases.

You will need to install ubuntu 15.10 on a virtual machine, either locally on a PC with at least 8 GB of RAM or in the cloud with DigitalOcean, Azure or AWS with at least 4 GB of RAM. If you choose to work on a local machine, we will also install and configure IntelliJ IDEA in our Data Lake environment to code with Scala. I have tested everything that follows on both a DigitalOcean droplet 15.10 x64 and a VMware Workstation 12 Pro running ubuntu-15.10-desktop-amd64.iso (if you prefer to download a preconfigured virtual machine, just follow this link - password: ghghgh).

Once your host is booted, make sure that you have access as root and as another user with root privileges - which I will call nasdag in this tutorial. Otherwise, add this user from your root account:

adduser nasdag

Grant nasdag root privileges: type visudo; this will open up nano to edit the sudoers file. Find the line that says root ALL=(ALL:ALL) ALL and give nasdag a line beneath it that says nasdag ALL=(ALL:ALL) ALL. Save by hitting Ctrl-o and then Enter when asked for the file name. Exit nano with Ctrl-x.

If you have an account with root privileges, you can execute a command as root by preceding it with sudo or operate as root with the following command: sudo su -

Now stop being root and start being nasdag, using su - nasdag from your root account or by logging in to your host directly as nasdag.

We will start by installing some basics. First make sure that your package lists are up to date:

sudo apt-get update

Git and ssh

Let’s first synchronize with your GitHub account. Install git, configure it with your e-mail and name, and generate a public key:

sudo apt-get install git

git config --global user.email your@email.address
git config --global user.name "Your Username"

ssh-keygen -t rsa -P ""
cat ~/.ssh/id_rsa.pub

Copy-paste it into GitHub.com / Personal Settings / SSH Keys / Add SSH Key.

Verify your access: ssh -T git@github.com

Later, nasdag will need to connect to localhost over ssh, so let’s add the key to the authorized keys and install the ssh server:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
sudo apt-get install openssh-server

Test with: ssh localhost

Python

Next we will install Python with some basic packages (numpy, scipy, matplotlib, pandas, etc.), textblob, the Jupyter notebook, and pyhs2, which we will use at the end of this post to test the JDBC connection to Hive:

sudo apt-get install python-pip
sudo apt-get install python-numpy python-scipy python-matplotlib ipython ipython-notebook python-pandas python-sympy python-nose
sudo pip install textblob
python -m textblob.download_corpora
sudo pip install --upgrade ipython
sudo pip install jupyter
sudo apt-get install libsasl2-dev
sudo pip install sasl
sudo pip install pyhs2
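
As an optional sanity check that the scientific stack and the textblob corpora installed correctly, you can run a couple of lines in a python shell (a minimal sketch; the sample sentence is arbitrary):

from textblob import TextBlob
import numpy, pandas

# the sentiment lookup only works if the textblob corpora downloaded correctly
print(TextBlob("Building a local data lake is fun").sentiment)
# printing the versions confirms the scientific stack is importable
print(numpy.__version__, pandas.__version__)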

Let’s now secure the connection to the ipython notebook.

Prepare a hashed password with ipython:

ipython
In [1]: from IPython.lib import passwd
In [2]: passwd()
Enter password:
Verify password:
Out[2]: 'sha1:c7a1db8c1db8:67e3xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'

Create a certificate valid for 365 days with both the key and certificate data written to the same file:

openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mycert.pem -out mycert.pem

Generate a Jupyter configuration file, clone the tutorial notebooks, and then edit the configuration as listed below:

jupyter notebook --generate-config
mkdir -p ~/tutorials
cd ~/tutorials
git clone http://github.com/nasdag/pyspark
vi ~/.jupyter/jupyter_notebook_config.py

Add the following lines at the beginning of the jupyter_notebook_config.py file:

c = get_config()
c.IPKernelApp.pylab = 'inline'  # if you want plotting support always
c.NotebookApp.certfile = u'/home/nasdag/mycert.pem'
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.password = u'sha1:c7a1db8c1db8:67e3xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
c.NotebookApp.port = 4334
c.NotebookApp.base_url = '/pyspark/'
c.NotebookApp.webapp_settings = {'static_url_prefix':'/pyspark/static/'}
c.NotebookApp.notebook_dir = '/home/nasdag/tutorials/pyspark/'

Later, we will edit ~/.ipython/profile_default/startup/initspark.py to include the path to pyspark.

Test the notebook server: start it with jupyter notebook and browse to https://host_ip_address:4334/pyspark/. Do not run the test notebooks that you downloaded from GitHub yet, as you first need to start other services, as explained later.

Java 7

sudo apt-get install software-properties-common
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java7-installer

Test your version: java -version

MySQL

sudo apt-get install mysql-server
sudo apt-get install libmysql-java

We need to prepare a metastore database for Hive. Download Hive from https://hive.apache.org/downloads.html (or a mirror) and extract hive-schema-1.2.0.mysql.sql and hive-txn-schema-0.13.0.mysql.sql:

wget http://apache.crihan.fr/dist/hive/stable/apache-hive-1.2.1-bin.tar.gz
tar -zxvf apache-hive-1.2.1-bin.tar.gz apache-hive-1.2.1-bin/scripts/metastore/upgrade/mysql/hive-schema-1.2.0.mysql.sql apache-hive-1.2.1-bin/scripts/metastore/upgrade/mysql/hive-txn-schema-0.13.0.mysql.sql
cd apache-hive-1.2.1-bin/scripts/metastore/upgrade/mysql/

mysql -u root -p
Enter password:
mysql> CREATE DATABASE metastore;
mysql> USE metastore;
mysql> SOURCE hive-schema-1.2.0.mysql.sql;
mysql> CREATE USER 'hiveuser'@'%' IDENTIFIED BY 'hivepassword'; 
mysql> GRANT all on *.* to 'hiveuser'@localhost identified by 'hivepassword';
mysql> flush privileges;
mysql> exit;

You can now delete all the downloaded Hive files, as we will no longer use them: cd; rm -r apache-hive-1.2.1-bin*.

Scala

Go to http://scala-lang.org/ - Download - All downloads - and get version 2.10.6:

wget http://downloads.typesafe.com/scala/2.10.6/scala-2.10.6.tgz
sudo tar -xzf scala-2.10.6.tgz -C /usr/local/share
rm scala-2.10.6.tgz

Maven

Go to https://maven.apache.org/download.cgi - and get the latest version:

wget http://mirrors.ircam.fr/pub/apache/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
sudo tar -xzf apache-maven-3.3.9-bin.tar.gz -C /usr/local/share
sudo mv /usr/local/share/apache-maven-3.3.9 /usr/local/share/maven-3.3.9
rm apache-maven-3.3.9-bin.tar.gz

Hadoop

Go to http://hadoop.apache.org/releases.html - and get version 2.6.2:

wget http://apache.mirrors.ovh.net/ftp.apache.org/dist/hadoop/common/hadoop-2.6.2/hadoop-2.6.2.tar.gz
sudo tar -xzf hadoop-2.6.2.tar.gz -C /usr/local/share
rm hadoop-2.6.2.tar.gz
sudo chown -R nasdag:nasdag /usr/local/share/hadoop-2.6.2/
sudo mkdir /var/local/hadoop
sudo chown -R nasdag:nasdag /var/local/hadoop

Now we have to edit the configuration files:

vi /usr/local/share/hadoop-2.6.2/etc/hadoop/core-site.xml and replace the content with:

<configuration>

<property>
<name>hadoop.tmp.dir</name>
<value>/var/local/hadoop/tmp</value>
</property>

<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
</property>

</configuration>

vi /usr/local/share/hadoop-2.6.2/etc/hadoop/mapred-site.xml and replace the content with:

<configuration>

<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
</property>

</configuration>

vi /usr/local/share/hadoop-2.6.2/etc/hadoop/hdfs-site.xml and replace the content with:

<configuration>

<property>
<name>dfs.replication</name>
<value>1</value>
</property>

</configuration>

vi /usr/local/share/hadoop-2.6.2/etc/hadoop/hadoop-env.sh and add the following at the very end to tell hadoop where java 7 is:

export JAVA_HOME=/usr/lib/jvm/java-7-oracle

It is time to set some environment variables in .bashrc - we will do it now for everything that is coming, anticipating spark, zeppelin and ideaIC as well:

vi ~/.bashrc and add the following at the very end:

export JAVA_HOME=/usr/lib/jvm/java-7-oracle
export SCALA_HOME=/usr/local/share/scala-2.10.6
export MAVEN_HOME=/usr/local/share/maven-3.3.9
export PATH=$PATH:$MAVEN_HOME/bin:$SCALA_HOME/bin:/home/nasdag/idea-IC/bin/
export IBUS_ENABLE_SYNC_MODE=1

export HADOOP_HOME=/usr/local/share/hadoop-2.6.2
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"

export SPARK_HOME=/usr/local/share/spark-1.5.2
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
export HADOOP_USER_CLASSPATH_FIRST=true
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop/

export PYSPARK_SUBMIT_ARGS="--packages com.databricks:spark-csv_2.11:1.1.0 pyspark-shell"

export PATH=$PATH:/home/nasdag/zeppelin/bin

Log out and log back in as nasdag for the settings to be applied.

Format the HDFS (hadoop filesystem):

hdfs namenode -format

Spark

Go to http://spark.apache.org/downloads.html - and get a prebuilt version for Hadoop 2.6:

wget http://mirrors.ircam.fr/pub/apache/spark/spark-1.5.2/spark-1.5.2-bin-hadoop2.6.tgz
sudo tar -xzf spark-1.5.2-bin-hadoop2.6.tgz -C /usr/local/share
sudo mv /usr/local/share/spark-1.5.2-bin-hadoop2.6 /usr/local/share/spark-1.5.2
rm spark-1.5.2-bin-hadoop2.6.tgz

Allow nasdag to write to the spark logs directory: sudo mkdir -p /usr/local/share/spark-1.5.2/logs; sudo chmod 777 /usr/local/share/spark-1.5.2/logs

Create hive-site.xml in the conf folder (sudo vi /usr/local/share/spark-1.5.2/conf/hive-site.xml) with the configuration below:

<configuration>
   <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://localhost/metastore?createDatabaseIfNotExist=true</value>
      <description>metadata is stored in a MySQL server</description>
   </property>
   <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>com.mysql.jdbc.Driver</value>
      <description>MySQL JDBC driver class</description>
   </property>
   <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>hiveuser</value>
      <description>user name for connecting to mysql server</description>
   </property>
   <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>hivepassword</value>
      <description>password for connecting to mysql server</description>
   </property>
</configuration>

sudo vi /usr/local/share/spark-1.5.2/conf/spark-defaults.conf and add the following to tell spark where the JDBC connector for mysql is:

spark.driver.extraClassPath        /usr/share/java/mysql-connector-java.jar
spark.master                       local[2]

vi ~/.ipython/profile_default/startup/initspark.py and add the following to tell ipython where pyspark is:

import sys
sys.path.append('/usr/local/share/spark-1.5.2/python/')
sys.path.append('/usr/local/share/spark-1.5.2/python/lib/py4j-0.8.2.1-src.zip')

When starting a jupyter notebook, you need to initialize the SparkContext with:

import pyspark
sc = pyspark.SparkContext()

Of course, you need to start hadoop first: start-dfs.sh.
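
With hadoop started and the SparkContext created, a quick sanity check in a notebook cell could look like this (a minimal illustrative sketch, not one of the tutorial notebooks):

# sum the integers 0..99 on the local Spark instance; the result should be 4950
print(sc.parallelize(range(100)).sum())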

Zeppelin

Get the latest source from GitHub and compile it with Maven:

cd ~
git clone http://github.com/apache/incubator-zeppelin
mv incubator-zeppelin zeppelin
cd zeppelin
export MAVEN_OPTS="-Xmx512m -XX:MaxPermSize=128m"
mvn install -DskipTests -Dspark.version=1.5.2 -Dhadoop.version=2.6.2

vi ~/zeppelin/conf/zeppelin-env.sh and input the following:

export SPARK_HOME=/usr/local/share/spark-1.5.2
export SPARK_SUBMIT_OPTIONS="--packages com.databricks:spark-csv_2.11:1.1.0 --jars /usr/share/java/mysql-connector-java.jar"

You can now run the tutorial at http://host_ip_address:8080/. But first you need to start hadoop and the zeppelin daemon:

start-dfs.sh
zeppelin-daemon.sh start
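
Once Zeppelin is up, you can verify that the Spark interpreter is wired correctly with a minimal paragraph like the following (an illustrative example, not part of the bundled tutorial):

%pyspark
# sc is provided by the Zeppelin Spark interpreter; expect the count to be 10
print(sc.parallelize(range(10)).count())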

Securing the access to Zeppelin is outside the scope of this post.

IntelliJ IDEA

Go to https://www.jetbrains.com/idea/download/ - and get the Community version for linux:

wget https://download.jetbrains.com/idea/ideaIC-15.0.2.tar.gz
tar -xzf ideaIC-15.0.2.tar.gz -C ~
mv ~/idea-IC-143.1184.17 ~/idea-IC
rm ideaIC-15.0.2.tar.gz

We will have to specify the scala and maven versions that we are using during setup.

Start Hadoop

start-dfs.sh and check the running servers with jps.

Start Thriftserver

start-thriftserver.sh
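
With the Thrift server running, you can check the JDBC/Thrift connection to Hive from python using the pyhs2 package installed earlier. This is a minimal sketch: port 10000 is the default, and since authentication is not configured in this local setup the user and password values are placeholders.

import pyhs2

# connect to the local Spark Thrift server (default port 10000)
with pyhs2.connect(host='localhost',
                   port=10000,
                   authMechanism='PLAIN',
                   user='nasdag',
                   password='x',
                   database='default') as conn:
    with conn.cursor() as cur:
        cur.execute("show tables")
        print(cur.fetch())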

Start Jupyter

You should already have cloned the tutorials from my GitHub (git clone http://github.com/nasdag/pyspark) - now browse to the notebooks https://host_ip_address:4334/pyspark/notebooks/test1.ipynb or https://host_ip_address:4334/pyspark/notebooks/test2.ipynb and follow the self-explanatory tests …

Setting up ideaIC

Launch idea.sh. Select I do not have a previous version - Skip All and Set defaults.

Configure - Plugins - Install Scala - Restart

Configure - Settings - Build/Build Tools - Maven - Maven Home Directory - /usr/local/share/maven-3.3.9

Configure - Project Defaults/Project Structure - Platform Settings - SDKs - Add New SDK - /usr/lib/jvm/java-7-oracle

Configure - Project Defaults/Project Structure - Project Settings - Project - Project SDK - 1.7

Create New Project - Maven - Project SDK 1.7 - … - Open Module Settings (F4) - Add Scala Support - /usr/local/share/scala-2.10.6 - … - main / New Directory / scala / Mark Directory As Sources Root - test / New Directory / scala / Mark Directory As Test Sources Root

Swap

If sudo swapon -s shows no swap, then I suggest creating a 4 GB swap file:

sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
sudo vi /etc/fstab

Add:

/swapfile   none    swap    sw    0   0

Continue with:

sudo sysctl vm.swappiness=10
sudo sysctl vm.vfs_cache_pressure=50
sudo vi /etc/sysctl.conf

Add:

vm.swappiness=10
vm.vfs_cache_pressure=50

Cheers,

Philippe

http://linkedin.com/in/nasdag