How to Install Apache Spark on Ubuntu 20.04

Finding data-processing tools that suit the needs of your application is not easy, especially for a beginner entering the world of data processing. With a minimal budget and abundant open-source resources, the choice naturally falls on the open-source world, particularly the Apache projects. This guide was created to help beginners install Apache Spark on the Ubuntu 20.04 LTS operating system.

Introduction

Apache Spark is an open-source, multi-language analytics engine for running data engineering, data science, and machine learning workloads on a single machine or on a cluster of nodes. Its in-memory technologies for batch processing and real-time streaming allow it to keep queries and data in the main memory of the cluster nodes, and it offers high-level APIs in languages such as Python, SQL, Scala, Java, and R.

Apache Spark Installation on Ubuntu 20.04 LTS

This guide targets beginners who want to install Apache Spark on an Ubuntu 20.04 LTS machine. Even so, you should be familiar with the basic Linux command line, since the steps below rely on it.

Before we proceed to the installation process, the following prerequisites must be met:

  1. An updated Ubuntu 20.04 server.
  2. A non-root user with sudo access.
  3. Sufficient disk space for the downloaded archive and the installation.
  4. A good network connection to download the installation files.

After all the prerequisites are met, let’s start the installation process.

1. Install Java Runtime and Other Dependency Packages

Apache Spark requires Java to run, so we first have to make sure that Java is installed on our system and running normally.

1.1 Installing Java

$ sudo apt install default-jdk -y
$ java -version
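
Some tools in the Spark ecosystem also expect the JAVA_HOME environment variable to be set. The lines below are an optional, minimal sketch that resolves the path from whichever java binary apt installed, rather than hard-coding it:

$ export JAVA_HOME=$(dirname $(dirname $(readlink -f $(which java))))
$ echo $JAVA_HOME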

1.2 Installing Scala

$ sudo apt-get install scala
$ scala
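
The scala command drops you into the interactive Scala REPL, which confirms that the installation works; type :quit to leave it. To check the installed version without opening the REPL:

$ scala -version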

1.3 Installing curl

$ sudo apt install curl
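
You can confirm that curl is available the same way:

$ curl --version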

2. Install Apache Spark

At this stage, we will download the prebuilt Apache Spark package and extract it onto our local system.

2.1 Download Apache Spark

In this tutorial, we will use the root user to download and extract the Apache Spark package. We will be using Spark version 2.4.5 and /opt/spark as the default directory for the Apache Spark installation.

$ mkdir /opt/spark
$ cd /opt
$ wget http://apachemirror.wuchna.com/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz

The output will be as follows:

--2020-05-28 21:06:20--  http://apachemirror.wuchna.com/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
Connecting to apachemirror.wuchna.com (apachemirror.wuchna.com)|159.65.154.237|:80... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 232530699 (222M), 18094043 (17M) remaining [application/x-gzip]
Saving to: ‘spark-2.4.5-bin-hadoop2.7.tgz.1’

spark-2.4.5-bin-hadoop2.7.tgz.1 100%[++++++++++++++++++++++++++++++++++++++++++++++++++====>] 221.76M  2.30MB/s    in 7.4s    

2020-05-28 21:07:51 (2.32 MB/s) - ‘spark-2.4.5-bin-hadoop2.7.tgz.1’ saved [232530699/232530699]
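
Before extracting, it is good practice to verify the integrity of the download. The sketch below assumes the matching .sha512 checksum file is still published on the Apache archive; compare the two digests by eye, since older Spark checksum files are not always in the format sha512sum -c expects:

$ wget https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz.sha512
$ sha512sum spark-2.4.5-bin-hadoop2.7.tgz
$ cat spark-2.4.5-bin-hadoop2.7.tgz.sha512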

2.2 Extracting the Source File

Extract the downloaded archive with the following command:

$ tar -xzvf spark-2.4.5-bin-hadoop2.7.tgz

The output will be as follows:

root@bckinfo:/opt# tar -xzvf spark-2.4.5-bin-hadoop2.7.tgz
spark-2.4.5-bin-hadoop2.7/
spark-2.4.5-bin-hadoop2.7/licenses/
spark-2.4.5-bin-hadoop2.7/licenses/LICENSE-jtransforms.html
spark-2.4.5-bin-hadoop2.7/licenses/LICENSE-zstd.txt
spark-2.4.5-bin-hadoop2.7/licenses/LICENSE-zstd-jni.txt
spark-2.4.5-bin-hadoop2.7/licenses/LICENSE-xmlenc.txt
spark-2.4.5-bin-hadoop2.7/licenses/LICENSE-vis.txt
spark-2.4.5-bin-hadoop2.7/licenses/LICENSE-spire.txt
spark-2.4.5-bin-hadoop2.7/licenses/LICENSE-sorttable.js.txt
spark-2.4.5-bin-hadoop2.7/licenses/LICENSE-slf4j.txt
spark-2.4.5-bin-hadoop2.7/licenses/LICENSE-scopt.txt
spark-2.4.5-bin-hadoop2.7/licenses/LICENSE-scala.txt
spark-2.4.5-bin-hadoop2.7/licenses/LICENSE-sbt-launch-lib.txt
spark-2.4.5-bin-hadoop2.7/licenses/LICENSE-respond.txt
...
spark-2.4.5-bin-hadoop2.7/data/mllib/gmm_data.txt
spark-2.4.5-bin-hadoop2.7/data/mllib/als/
spark-2.4.5-bin-hadoop2.7/data/mllib/als/test.data
spark-2.4.5-bin-hadoop2.7/data/mllib/als/sample_movielens_ratings.txt
spark-2.4.5-bin-hadoop2.7/data/graphx/
spark-2.4.5-bin-hadoop2.7/data/graphx/users.txt
spark-2.4.5-bin-hadoop2.7/data/graphx/followers.txt
spark-2.4.5-bin-hadoop2.7/NOTICE

Listing all files in /opt shows the result below:

root@bckinfo:/opt# ls -ltr
total 227100
drwxr-xr-x 13 ramans ramans      4096 Feb  2 11:47 spark-2.4.5-bin-hadoop2.7
-rw-r--r--  1 root   root   232530699 Feb  2 12:27 spark-2.4.5-bin-hadoop2.7.tgz
drwx--x--x  4 root   root        4096 May 22 19:08 containerd
drwxr-xr-x  2 root   root        4096 May 28 20:53 spark
root@bckinfo:/opt# chmod -R 777 /opt/spark
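
Note that the archive extracted into /opt itself, while our target directory is /opt/spark. To match the paths used in the rest of this guide (the later shell prompts show the installation at /opt/spark/spark-2.4.5-bin-hadoop2.7), move the extracted directory there and re-apply the permissions:

root@bckinfo:/opt# mv spark-2.4.5-bin-hadoop2.7 /opt/spark/
root@bckinfo:/opt# chmod -R 777 /opt/spark

The chmod -R 777 used here is a convenient shortcut for a single-user test machine; on a shared server, changing the owner to your non-root user with chown would be a safer choice.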

Edit the ~/.bashrc configuration file to add the Apache Spark installation directory to the system path:

export SPARK_HOME=/opt/spark/spark-2.4.5-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

Save the file and apply the changes by running:

root@bckinfo:/opt# source ~/.bashrc
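
To confirm that the Spark binaries are now on the PATH, query the version; this assumes SPARK_HOME points at the extracted directory as set above:

$ spark-submit --version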

2.3 Starting the Standalone Master Server

In this step, we will use a non-root user. To start the standalone master server, run the following command from the Spark installation directory:

$ ./sbin/start-master.sh

The output will be as shown below:

ramans@bckinfo:/opt/spark/spark-2.4.5-bin-hadoop2.7$ ./sbin/start-master.sh
starting org.apache.spark.deploy.master.Master, logging to /opt/spark/spark-2.4.5-bin-hadoop2.7/logs/spark-ramans-org.apache.spark.deploy.master.Master-1-bckinfo.out
ramans@bckinfo:/opt/spark/spark-2.4.5-bin-hadoop2.7$ netstat -ant | grep 8080
tcp6       0      0 :::8080                 :::*                    LISTEN     
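
The master by itself does not execute applications; it needs at least one worker registered with it. In Spark 2.4.x the helper script is sbin/start-slave.sh, and it takes the master URL, which listens on port 7077 by default. The hostname below is taken from the prompt in this tutorial; replace it with your own, as shown at the top of the web interface, and run the command from the same installation directory:

$ ./sbin/start-slave.sh spark://bckinfo:7077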

2.4 Accessing the Apache Spark Web Interface

The final step in this tutorial is to access the Apache Spark web interface by visiting the URL http://ip_address_server:8080, where ip_address_server is the IP address of your server.
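
As a final smoke test, you can attach an interactive shell to the running master and compute something small. The master URL below is an assumption; copy the exact spark:// URL displayed at the top of the web interface:

$ spark-shell --master spark://bckinfo:7077
scala> sc.parallelize(1 to 1000).sum()
res0: Double = 500500.0
scala> :quit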

Conclusion

In this tutorial, we have learned how to install Apache Spark on the Ubuntu 20.04 LTS operating system.
