Finding data processing tools that suit the needs of our application is not easy, especially for a beginner who wants to get into the world of data processing. With a minimal budget and abundant resources, the choice will naturally fall on the open-source world, especially projects under the Apache umbrella. This guide was created to help beginners install Apache Spark on the Ubuntu 20.04 LTS operating system.
Introduction
Apache Spark is an open-source, multi-language analytics engine for data engineering, data science, and machine learning on a single machine or a cluster of nodes. Its in-memory technologies for batch processing and real-time streaming allow it to keep queries and data directly in the main memory of the cluster nodes. Apache Spark provides high-level APIs in languages such as Python, SQL, Scala, Java, and R.
Apache Spark Installation on Ubuntu 20.04 LTS
This guide targets beginners who want to try installing Apache Spark on an Ubuntu 20.04 LTS machine. Even as a beginner, you should be familiar with the basic Linux command lines used below.
Before we proceed to the installation, several prerequisites must be met:
- Updated Ubuntu 20.04 Server.
- Non-root user with sudo access.
- Sufficient disk space to accommodate the installation files.
- A good network connection to download the source files.
After all the prerequisites are met, let’s start the installation process.
1. Install Java Runtime and other dependencies packages
Apache Spark requires Java to run, so we have to make sure that Java is already installed on our system and working normally.
1.1 Installing Java
$ sudo apt install default-jdk -y
$ java -version
1.2 Installing Scala
$ sudo apt-get install scala
$ scala
1.3 Installing curl
$ sudo apt install curl
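As a quick sanity check before moving on, a small loop like the one below (a minimal sketch, not part of the original steps) reports whether each required tool is on the PATH:

```shell
# Report whether each prerequisite tool is available on the PATH.
# Any "missing" line means the corresponding install step above failed.
for tool in java scala curl; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: ok"
  else
    echo "$tool: missing"
  fi
done
```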
2. Install Apache Spark
In this stage, we will download the Apache Spark binary distribution and extract it on our local system.
2.1 Download Apache Spark
In this tutorial, we will use the root user to download and extract the Apache Spark archive. We will be using Spark version 2.4.5, and /opt/spark as the default directory for the Apache Spark installation.
$ mkdir /opt/spark
$ cd /opt/spark
root@bckinfo:/opt# wget http://apachemirror.wuchna.com/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
The output will be as follows:
--2020-05-28 21:06:20--  http://apachemirror.wuchna.com/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
Connecting to apachemirror.wuchna.com (apachemirror.wuchna.com)|159.65.154.237|:80... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 232530699 (222M), 18094043 (17M) remaining [application/x-gzip]
Saving to: ‘spark-2.4.5-bin-hadoop2.7.tgz.1’

spark-2.4.5-bin-hadoop2.7.tgz.1 100%[++++++++++++++++++++++++++++++++++++++++++++++++++====>] 221.76M 2.30MB/s in 7.4s

2020-05-28 21:07:51 (2.32 MB/s) - ‘spark-2.4.5-bin-hadoop2.7.tgz.1’ saved [232530699/232530699]
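Third-party mirrors such as the one above can go stale. As a fallback, the canonical URL on the Apache archive for any given release can be assembled like this (the version numbers shown are simply the ones used in this tutorial):

```shell
# Build the Apache archive download URL for a given Spark release.
SPARK_VERSION="2.4.5"
HADOOP_VERSION="2.7"
SPARK_TGZ="spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz"
SPARK_URL="https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/${SPARK_TGZ}"
echo "$SPARK_URL"
# The archive can then be fetched with: wget "$SPARK_URL"
```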
2.2 Extracting Source File
We extract it by running:
$ tar -xzvf spark-2.4.5-bin-hadoop2.7.tgz
The output will be as follows:
root@bckinfo:/opt# tar -xzvf spark-2.4.5-bin-hadoop2.7.tgz
spark-2.4.5-bin-hadoop2.7/
spark-2.4.5-bin-hadoop2.7/licenses/
spark-2.4.5-bin-hadoop2.7/licenses/LICENSE-jtransforms.html
spark-2.4.5-bin-hadoop2.7/licenses/LICENSE-zstd.txt
spark-2.4.5-bin-hadoop2.7/licenses/LICENSE-zstd-jni.txt
spark-2.4.5-bin-hadoop2.7/licenses/LICENSE-xmlenc.txt
spark-2.4.5-bin-hadoop2.7/licenses/LICENSE-vis.txt
spark-2.4.5-bin-hadoop2.7/licenses/LICENSE-spire.txt
spark-2.4.5-bin-hadoop2.7/licenses/LICENSE-sorttable.js.txt
spark-2.4.5-bin-hadoop2.7/licenses/LICENSE-slf4j.txt
spark-2.4.5-bin-hadoop2.7/licenses/LICENSE-scopt.txt
spark-2.4.5-bin-hadoop2.7/licenses/LICENSE-scala.txt
spark-2.4.5-bin-hadoop2.7/licenses/LICENSE-sbt-launch-lib.txt
spark-2.4.5-bin-hadoop2.7/licenses/LICENSE-respond.txt
...
spark-2.4.5-bin-hadoop2.7/data/mllib/gmm_data.txt
spark-2.4.5-bin-hadoop2.7/data/mllib/als/
spark-2.4.5-bin-hadoop2.7/data/mllib/als/test.data
spark-2.4.5-bin-hadoop2.7/data/mllib/als/sample_movielens_ratings.txt
spark-2.4.5-bin-hadoop2.7/data/graphx/
spark-2.4.5-bin-hadoop2.7/data/graphx/users.txt
spark-2.4.5-bin-hadoop2.7/data/graphx/followers.txt
spark-2.4.5-bin-hadoop2.7/NOTICE
Listing all files, the result is as shown below:
root@bckinfo:/opt# ls -ltr
total 227100
drwxr-xr-x 13 ramans ramans      4096 Feb  2 11:47 spark-2.4.5-bin-hadoop2.7
-rw-r--r--  1 root   root   232530699 Feb  2 12:27 spark-2.4.5-bin-hadoop2.7.tgz
drwx--x--x  4 root   root        4096 May 22 19:08 containerd
drwxr-xr-x  2 root   root        4096 May 28 20:53 spark
root@bckinfo:/opt# chmod -R 777 /opt/spark
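As an alternative to opening permissions with `chmod -R 777`, a common pattern is to move the extracted tree into /opt/spark and hand ownership to the non-root user that will run Spark. The snippet below rehearses that move in a throwaway temporary directory so it can be tried safely; the real commands against /opt are shown in the comments (the user name is a placeholder):

```shell
# In practice, run against the real paths (user name is a placeholder):
#   sudo mv spark-2.4.5-bin-hadoop2.7/* /opt/spark/
#   sudo chown -R youruser:youruser /opt/spark
# Below, the same move is rehearsed in a temp directory.
demo="$(mktemp -d)"
mkdir -p "$demo/spark-2.4.5-bin-hadoop2.7/bin"
touch "$demo/spark-2.4.5-bin-hadoop2.7/bin/spark-shell"
mkdir -p "$demo/spark"
mv "$demo/spark-2.4.5-bin-hadoop2.7"/* "$demo/spark/"
ls "$demo/spark/bin"
rm -rf "$demo"
```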
Edit the ~/.bashrc configuration file to add the Apache Spark installation directory to the system path.
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
Save the file and apply the changes by running:
root@bckinfo:/opt# source ~/.bashrc
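To confirm the new entries actually made it onto the PATH after sourcing, a quick check like the following can help (SPARK_HOME here matches the value set above):

```shell
# Verify that $SPARK_HOME/bin is present in PATH.
export SPARK_HOME=/opt/spark
export PATH="$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin"
case ":$PATH:" in
  *":$SPARK_HOME/bin:"*) echo "spark bin is on PATH" ;;
  *)                     echo "spark bin is MISSING from PATH" ;;
esac
```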
2.3 Starting Standalone Master Server
In this step, we will use a non-root user. To start the standalone master server, run the following from the Spark installation directory:
$ ./sbin/start-master.sh
The output will be as shown below:
ramans@bckinfo:/opt/spark/spark-2.4.5-bin-hadoop2.7$ ./sbin/start-master.sh
starting org.apache.spark.deploy.master.Master, logging to /opt/spark/spark-2.4.5-bin-hadoop2.7/logs/spark-ramans-org.apache.spark.deploy.master.Master-1-bckinfo.out
ramans@bckinfo:/opt/spark/spark-2.4.5-bin-hadoop2.7$ netstat -ant | grep 8080
tcp6       0      0 :::8080                 :::*                    LISTEN
2.4 Accessing Apache Spark Web Interface
The final step in this tutorial is to access the Apache Spark web interface by visiting the URL http://ip_address_server:8080.
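If a browser is not handy (for example, on a headless server), the web UI can also be probed from the shell. This sketch assumes the master is running on the same host; it prints either a success message or a diagnostic:

```shell
# Probe the standalone master web UI on localhost:8080.
# curl reports "000" as the status when nothing is listening.
code="$(curl -s -o /dev/null -m 5 -w '%{http_code}' http://localhost:8080 || true)"
if [ "$code" = "200" ]; then
  echo "web UI is up"
else
  echo "web UI unreachable (status: $code)"
fi
```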
Conclusion
In this tutorial, we have learned how to install Apache Spark on the Ubuntu 20.04 LTS operating system.