How To Install Apache Spark On Ubuntu 20.04

Introduction

Apache Spark is an open-source, multi-language analytics engine for data engineering, data science, and machine learning on a single machine or a cluster of nodes. Its in-memory technologies for batch processing and real-time streaming allow it to keep queries and data directly in the main memory of the cluster nodes, and it offers high-level APIs in languages such as Python, SQL, Scala, Java, and R.

Finding data-processing tools that suit the needs of our application is not easy, especially for a beginner entering the world of data processing. With a minimal budget, the choice will naturally fall on the open-source world, especially projects under the Apache umbrella. This guide was created to help beginners install Apache Spark on the Ubuntu 20.04 LTS operating system.

Apache Spark Installation on Ubuntu 20.04 LTS

This guide targets beginners who want to try installing Apache Spark on an Ubuntu 20.04 LTS machine. Even as a beginner, you should be familiar with the basic Linux commands used along the way.

Before we proceed to the installation, several prerequisites must be met (a quick way to check them is shown after the list):

  • An updated Ubuntu 20.04 server.
  • A non-root user with sudo access.
  • Sufficient disk space for the download and installation.
  • A good network connection to download the source files.
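
A minimal pre-flight check along these lines covers the list above (the ~3 GB figure is a rough assumption based on the size of the Spark archive plus its extracted contents):

    $ sudo apt update && sudo apt upgrade -y    # bring the system up to date
    $ df -h /opt                                # confirm roughly 3 GB of free space is available
    $ ping -c 3 archive.apache.org              # confirm the network can reach Apache's servers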

    After all the prerequisites are met, let’s start the installation process. The Apache Spark installation consists of the following steps:

    1. Install Java Runtime and other dependency packages
    2. Install Apache Spark

    The details of each step are described below.

    Step 1: Install Java Runtime and other dependency packages

    Apache Spark requires Java to run, so we first have to make sure that Java is installed on the system and working properly.

    1.1 Installing Java

    $ sudo apt install default-jdk -y
    $ java -version
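
    Note that Spark 2.4.x officially supports Java 8, while default-jdk on Ubuntu 20.04 installs OpenJDK 11. If you run into Java compatibility errors later, one option is to install Java 8 explicitly and select it as the default:

    $ sudo apt install openjdk-8-jdk -y
    $ sudo update-alternatives --config java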

    1.2 Installing Scala

    $ sudo apt-get install scala
    $ scala
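
    Running scala with no arguments opens the Scala REPL, which is a quick way to confirm the installation. Evaluate any expression (the println below is just an illustration) and leave with the :quit command:

    scala> println("Scala is working")
    scala> :quit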

    1.3 Installing curl

    $ sudo apt install curl
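
    curl is not required by Spark itself; it simply gives us an alternative to wget for fetching the archive in Step 2, for example:

    $ curl -O http://apachemirror.wuchna.com/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz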

    Step 2: Install Apache Spark

    In this stage, we will download the prebuilt Apache Spark package and extract it onto our local system.

    2.1 Download Apache Spark

    In this tutorial, we will use the root user for downloading and extracting the Apache Spark archive. We will be using Spark version 2.4.5, and /opt/spark will be the default directory for the Apache Spark installation.

    $ mkdir /opt/spark
    $ cd /opt
    $ wget http://apachemirror.wuchna.com/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz

    The output will be as follows:

    --2020-05-28 21:06:20--  http://apachemirror.wuchna.com/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
    Resolving apachemirror.wuchna.com (apachemirror.wuchna.com)... 159.65.154.237
    Connecting to apachemirror.wuchna.com (apachemirror.wuchna.com)|159.65.154.237|:80... connected.
    HTTP request sent, awaiting response... 206 Partial Content
    Length: 232530699 (222M), 18094043 (17M) remaining [application/x-gzip]
    Saving to: ‘spark-2.4.5-bin-hadoop2.7.tgz.1’
    
    spark-2.4.5-bin-hadoop2.7.tgz.1 100%[++++++++++++++++++++++++++++++++++++++++++++++++++====>] 221.76M  2.30MB/s    in 7.4s    
    
    2020-05-28 21:07:51 (2.32 MB/s) - ‘spark-2.4.5-bin-hadoop2.7.tgz.1’ saved [232530699/232530699]
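
    Apache mirrors rotate old releases out over time, so the mirror URL above may stop serving this file. The permanent Apache archive keeps every release and can be used instead:

    $ wget https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz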

    2.2 Extracting the Archive

    Extract the archive with the following command:

    $ tar -xzvf spark-2.4.5-bin-hadoop2.7.tgz

    The output will be as follows:

    root@bckinfo:/opt# tar -xzvf spark-2.4.5-bin-hadoop2.7.tgz
    spark-2.4.5-bin-hadoop2.7/
    spark-2.4.5-bin-hadoop2.7/licenses/
    spark-2.4.5-bin-hadoop2.7/licenses/LICENSE-jtransforms.html
    spark-2.4.5-bin-hadoop2.7/licenses/LICENSE-zstd.txt
    spark-2.4.5-bin-hadoop2.7/licenses/LICENSE-zstd-jni.txt
    spark-2.4.5-bin-hadoop2.7/licenses/LICENSE-xmlenc.txt
    spark-2.4.5-bin-hadoop2.7/licenses/LICENSE-vis.txt
    spark-2.4.5-bin-hadoop2.7/licenses/LICENSE-spire.txt
    spark-2.4.5-bin-hadoop2.7/licenses/LICENSE-sorttable.js.txt
    spark-2.4.5-bin-hadoop2.7/licenses/LICENSE-slf4j.txt
    spark-2.4.5-bin-hadoop2.7/licenses/LICENSE-scopt.txt
    spark-2.4.5-bin-hadoop2.7/licenses/LICENSE-scala.txt
    spark-2.4.5-bin-hadoop2.7/licenses/LICENSE-sbt-launch-lib.txt
    spark-2.4.5-bin-hadoop2.7/licenses/LICENSE-respond.txt
    ...
    spark-2.4.5-bin-hadoop2.7/data/mllib/gmm_data.txt
    spark-2.4.5-bin-hadoop2.7/data/mllib/als/
    spark-2.4.5-bin-hadoop2.7/data/mllib/als/test.data
    spark-2.4.5-bin-hadoop2.7/data/mllib/als/sample_movielens_ratings.txt
    spark-2.4.5-bin-hadoop2.7/data/graphx/
    spark-2.4.5-bin-hadoop2.7/data/graphx/users.txt
    spark-2.4.5-bin-hadoop2.7/data/graphx/followers.txt
    spark-2.4.5-bin-hadoop2.7/NOTICE
    

    Listing the contents of /opt shows the result:

    root@bckinfo:/opt# ls -ltr
    total 227100
    drwxr-xr-x 13 ramans ramans      4096 Feb  2 11:47 spark-2.4.5-bin-hadoop2.7
    -rw-r--r--  1 root   root   232530699 Feb  2 12:27 spark-2.4.5-bin-hadoop2.7.tgz
    drwx--x--x  4 root   root        4096 May 22 19:08 containerd
    drwxr-xr-x  2 root   root        4096 May 28 20:53 spark
    root@bckinfo:/opt# chmod -R 777 /opt/spark
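
    Note that the archive was extracted to /opt/spark-2.4.5-bin-hadoop2.7, while our installation directory is /opt/spark, so move the extracted tree into place (this matches the /opt/spark/spark-2.4.5-bin-hadoop2.7 path used in the next step). A tighter alternative to the chmod -R 777 above is to hand the directory to the non-root user that will run Spark, ramans in this example:

    $ mv /opt/spark-2.4.5-bin-hadoop2.7 /opt/spark/
    $ chown -R ramans:ramans /opt/spark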
    

    Edit the ~/.bashrc configuration file to add the Apache Spark installation directory to the system path, appending the following lines:

    export SPARK_HOME=/opt/spark/spark-2.4.5-bin-hadoop2.7
    export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

    Save the file and apply the changes by running:

    root@bckinfo:/opt# source ~/.bashrc
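
    You can verify that the new variables are active before moving on:

    $ echo $SPARK_HOME
    $ spark-submit --version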

    2.3 Starting Standalone Master Server

    In this step, we will use a non-root user. To start the standalone master server, run the following from the $SPARK_HOME directory:

    $ ./sbin/start-master.sh

    The output will be as shown below:

    ramans@bckinfo:/opt/spark/spark-2.4.5-bin-hadoop2.7$ ./sbin/start-master.sh
    starting org.apache.spark.deploy.master.Master, logging to /opt/spark/spark-2.4.5-bin-hadoop2.7/logs/spark-ramans-org.apache.spark.deploy.master.Master-1-bckinfo.out
    ramans@bckinfo:/opt/spark/spark-2.4.5-bin-hadoop2.7$ netstat -ant | grep 8080
    tcp6       0      0 :::8080                 :::*                    LISTEN     
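
    With the master listening, you can optionally attach a worker to it and open an interactive shell against the cluster. The spark:// URL below assumes the master's default RPC port 7077; replace ip_address_server with your server's address:

    $ ./sbin/start-slave.sh spark://ip_address_server:7077
    $ ./bin/spark-shell --master spark://ip_address_server:7077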
    

    2.4 Accessing Apache Spark Web Interface

    The final step in this tutorial is to access the Apache Spark web interface by visiting http://ip_address_server:8080 in a browser (substitute your server's IP address).
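
    If the server has no desktop browser available, the curl utility installed in Step 1 can confirm that the interface responds locally; an HTTP 200 status in the reply indicates the master UI is up:

    $ curl -I http://localhost:8080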

    Conclusion

    In this tutorial, we have learned how to install Apache Spark on the Ubuntu 20.04 LTS operating system.
