Finding data processing tools that suit the needs of our application is not easy, especially for a beginner who wants to get into the world of data processing. With a minimal budget and abundant resources, the choice will naturally fall on the open-source world, especially projects under the Apache umbrella. This guide was created to help beginners install Apache Spark on the Ubuntu 20.04 LTS operating system.
Introduction
Apache Spark is an open-source, multi-language analytics engine for data engineering, data science, and machine learning on a single machine or a cluster of nodes. Its in-memory technologies for batch processing and real-time streaming allow it to keep queries and data directly in the main memory of the cluster nodes. Apache Spark provides high-level APIs in languages such as Python, SQL, Scala, Java, and R.
Apache Spark Installation on Ubuntu 20.04 LTS
This guide targets beginners who want to try installing Apache Spark on an Ubuntu 20.04 LTS machine. Even as a beginner, you should be familiar with the basic Linux command lines used below.
Before we proceed to the installation, several prerequisites must be met:
- Updated Ubuntu 20.04 Server.
- Non-root user with sudo access.
- Sufficient disk space to accommodate the installation files.
- A good network connection to download the source files.
After all the prerequisites are met, let’s start the installation process.
1. Install Java Runtime and other dependencies packages
Apache Spark requires Java to run, so we have to make sure that Java is already installed on our system and working normally.
1.1 Installing Java
$ sudo apt install default-jdk -y
$ java -version
1.2 Installing Scala
$ sudo apt-get install scala
$ scala
1.3 Installing curl
$ sudo apt install curl
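As a quick sanity check before moving on, a small loop like the one below (a minimal sketch, not part of the original steps) reports whether each required tool is on the PATH:

```shell
# Report whether each prerequisite tool is available on the PATH.
# Any "missing" line means the corresponding install step above failed.
for tool in java scala curl; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: ok"
  else
    echo "$tool: missing"
  fi
done
```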
2. Install Apache Spark
In this stage, we will download the Apache Spark binary distribution and extract it on our local system.
2.1 Download Apache Spark
In this tutorial, we will use the root user to download and extract the Apache Spark archive. We will be using Spark version 2.4.5, and /opt/spark as the default directory for the Apache Spark installation.
$ mkdir /opt/spark
$ cd /opt/spark
root@bckinfo:/opt# wget http://apachemirror.wuchna.com/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
The output will be as follows:
--2020-05-28 21:06:20--  http://apachemirror.wuchna.com/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
Connecting to apachemirror.wuchna.com (apachemirror.wuchna.com)|159.65.154.237|:80... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 232530699 (222M), 18094043 (17M) remaining [application/x-gzip]
Saving to: ‘spark-2.4.5-bin-hadoop2.7.tgz.1’

spark-2.4.5-bin-hadoop2.7.tgz.1 100%[++++++++++++++++++++++++++++++++++++++++++++++++++====>] 221.76M 2.30MB/s in 7.4s

2020-05-28 21:07:51 (2.32 MB/s) - ‘spark-2.4.5-bin-hadoop2.7.tgz.1’ saved [232530699/232530699]
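Third-party mirrors such as the one above can go stale. As a fallback, the canonical URL on the Apache archive for any given release can be assembled like this (the version numbers shown are simply the ones used in this tutorial):

```shell
# Build the Apache archive download URL for a given Spark release.
SPARK_VERSION="2.4.5"
HADOOP_VERSION="2.7"
SPARK_TGZ="spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz"
SPARK_URL="https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/${SPARK_TGZ}"
echo "$SPARK_URL"
# The archive can then be fetched with: wget "$SPARK_URL"
```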
2.2 Extracting Source File
We extract it by running:
$ tar -xzvf spark-2.4.5-bin-hadoop2.7.tgz
The output will be as follows:
root@bckinfo:/opt# tar -xzvf spark-2.4.5-bin-hadoop2.7.tgz
spark-2.4.5-bin-hadoop2.7/
spark-2.4.5-bin-hadoop2.7/licenses/
spark-2.4.5-bin-hadoop2.7/licenses/LICENSE-jtransforms.html
spark-2.4.5-bin-hadoop2.7/licenses/LICENSE-zstd.txt
spark-2.4.5-bin-hadoop2.7/licenses/LICENSE-zstd-jni.txt
spark-2.4.5-bin-hadoop2.7/licenses/LICENSE-xmlenc.txt
spark-2.4.5-bin-hadoop2.7/licenses/LICENSE-vis.txt
spark-2.4.5-bin-hadoop2.7/licenses/LICENSE-spire.txt
spark-2.4.5-bin-hadoop2.7/licenses/LICENSE-sorttable.js.txt
spark-2.4.5-bin-hadoop2.7/licenses/LICENSE-slf4j.txt
spark-2.4.5-bin-hadoop2.7/licenses/LICENSE-scopt.txt
spark-2.4.5-bin-hadoop2.7/licenses/LICENSE-scala.txt
spark-2.4.5-bin-hadoop2.7/licenses/LICENSE-sbt-launch-lib.txt
spark-2.4.5-bin-hadoop2.7/licenses/LICENSE-respond.txt
...
spark-2.4.5-bin-hadoop2.7/data/mllib/gmm_data.txt
spark-2.4.5-bin-hadoop2.7/data/mllib/als/
spark-2.4.5-bin-hadoop2.7/data/mllib/als/test.data
spark-2.4.5-bin-hadoop2.7/data/mllib/als/sample_movielens_ratings.txt
spark-2.4.5-bin-hadoop2.7/data/graphx/
spark-2.4.5-bin-hadoop2.7/data/graphx/users.txt
spark-2.4.5-bin-hadoop2.7/data/graphx/followers.txt
spark-2.4.5-bin-hadoop2.7/NOTICE
Listing all files, the result is as shown below:
root@bckinfo:/opt# ls -ltr
total 227100
drwxr-xr-x 13 ramans ramans      4096 Feb  2 11:47 spark-2.4.5-bin-hadoop2.7
-rw-r--r--  1 root   root   232530699 Feb  2 12:27 spark-2.4.5-bin-hadoop2.7.tgz
drwx--x--x  4 root   root        4096 May 22 19:08 containerd
drwxr-xr-x  2 root   root        4096 May 28 20:53 spark
root@bckinfo:/opt# chmod -R 777 /opt/spark
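As an alternative to opening permissions with `chmod -R 777`, a common pattern is to move the extracted tree into /opt/spark and hand ownership to the non-root user that will run Spark. The snippet below rehearses that move in a throwaway temporary directory so it can be tried safely; the real commands against /opt are shown in the comments (the user name is a placeholder):

```shell
# In practice, run against the real paths (user name is a placeholder):
#   sudo mv spark-2.4.5-bin-hadoop2.7/* /opt/spark/
#   sudo chown -R youruser:youruser /opt/spark
# Below, the same move is rehearsed in a temp directory.
demo="$(mktemp -d)"
mkdir -p "$demo/spark-2.4.5-bin-hadoop2.7/bin"
touch "$demo/spark-2.4.5-bin-hadoop2.7/bin/spark-shell"
mkdir -p "$demo/spark"
mv "$demo/spark-2.4.5-bin-hadoop2.7"/* "$demo/spark/"
ls "$demo/spark/bin"
rm -rf "$demo"
```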
Edit the ~/.bashrc configuration file to add the Apache Spark installation directory to the system path.
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
Save the file and apply the changes by running:
root@bckinfo:/opt# source ~/.bashrc
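To confirm the new entries actually made it onto the PATH after sourcing, a quick check like the following can help (SPARK_HOME here matches the value set above):

```shell
# Verify that $SPARK_HOME/bin is present in PATH.
export SPARK_HOME=/opt/spark
export PATH="$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin"
case ":$PATH:" in
  *":$SPARK_HOME/bin:"*) echo "spark bin is on PATH" ;;
  *)                     echo "spark bin is MISSING from PATH" ;;
esac
```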
2.3 Starting Standalone Master Server
In this step, we will use a non-root user. To start the standalone master server, run the following from the Spark installation directory:
$ ./sbin/start-master.sh
The output will be as shown below:
ramans@bckinfo:/opt/spark/spark-2.4.5-bin-hadoop2.7$ ./sbin/start-master.sh
starting org.apache.spark.deploy.master.Master, logging to /opt/spark/spark-2.4.5-bin-hadoop2.7/logs/spark-ramans-org.apache.spark.deploy.master.Master-1-bckinfo.out
ramans@bckinfo:/opt/spark/spark-2.4.5-bin-hadoop2.7$ netstat -ant | grep 8080
tcp6       0      0 :::8080                 :::*                    LISTEN
2.4 Accessing Apache Spark Web Interface
The final step in this tutorial is to access the Apache Spark web interface by visiting the URL http://ip_address_server:8080.
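If a browser is not handy (for example, on a headless server), the web UI can also be probed from the shell. This sketch assumes the master is running on the same host; it prints either a success message or a diagnostic:

```shell
# Probe the standalone master web UI on localhost:8080.
# curl reports "000" as the status when nothing is listening.
code="$(curl -s -o /dev/null -m 5 -w '%{http_code}' http://localhost:8080 || true)"
if [ "$code" = "200" ]; then
  echo "web UI is up"
else
  echo "web UI unreachable (status: $code)"
fi
```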
Conclusion
In this tutorial, we have learned how to install Apache Spark on the Ubuntu 20.04 LTS operating system.