Installing Apache Spark on CentOS Stream 9: A Beginner’s Tutorial


In this short article, we will show you how to install Apache Spark version 3.5.4 on the CentOS Stream 9 operating system. This tutorial is suitable for beginners.

Introduction

Apache Spark is an open-source, distributed computing system designed for big data processing and analytics. Renowned for its speed and versatility, Spark offers robust support for batch and stream processing, making it a preferred choice for handling massive datasets in real time or at scale. Its capability to run on various platforms, including Hadoop, Kubernetes, and standalone clusters, adds to its adaptability in diverse environments.

Apache Spark Installation on CentOS Stream 9

In this tutorial, we will use Apache Spark version 3.5.4, which was released on 20 December 2024. Apache Spark 3.5.4 is the fourth maintenance release of the 3.5 line, containing security and correctness fixes, and is based on the branch-3.5 maintenance branch of Spark. The Spark project strongly recommends that all 3.5 users upgrade to this stable release.

The installation consists of several steps:

  1. Update CentOS Stream 9 Repository
  2. Install Java (OpenJDK)
  3. Download Apache Spark 3.5.4 Packages
  4. Extract and Move Apache Spark
  5. Configure Environment Variables
  6. Testing Apache Spark
  7. Starting the Apache Spark Standalone Master

These steps are briefly explained in the sections below.

Prerequisites

Before starting the installation, we will verify the prerequisites:

  1. CentOS Stream Environment: A server running CentOS Stream 9.
  2. Java Development Kit (JDK): Apache Spark requires Java 8 or higher.
  3. Hadoop (Optional): If you intend to use HDFS or YARN, ensure Hadoop is installed.
  4. Python: If you plan to use PySpark, install Python 3.
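The prerequisite checks above can be sketched as a few quick commands (a minimal sketch; exact version strings and package availability depend on your environment):

```shell
# Check the Java prerequisite (Spark 3.5 supports Java 8/11/17).
if command -v java >/dev/null 2>&1; then
  java -version 2>&1 | head -n 1
else
  echo "java not found - see Step 2"
fi

# Check Python (only needed if you plan to use PySpark).
if command -v python3 >/dev/null 2>&1; then
  python3 --version
else
  echo "python3 not found - optional, needed for PySpark"
fi
```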

Step 1: Update CentOS Stream 9 System

Before starting the installation, update the system packages:

sudo dnf update -y

Step 2: Install Java (OpenJDK)

Install OpenJDK if it is not already installed; the installation procedure is covered in the OpenJDK Installation on CentOS Stream 9 : Beginner’s Guide article.

To verify the Java installation on the system, run the following command:

java -version
Install OpenJDK on CentOS Stream 9

Step 3: Download Apache Spark 3.5.4 Packages

At this stage, we will download the prebuilt Apache Spark 3.5.4 binary package from the official Apache Spark website using the wget command:

sudo wget https://dlcdn.apache.org/spark/spark-3.5.4/spark-3.5.4-bin-hadoop3.tgz

Output :

 [ramansah@node01 ~]$ sudo wget https://dlcdn.apache.org/spark/spark-3.5.4/spark-3.5.4-bin-hadoop3.tgz
--2024-12-26 17:27:19-- https://dlcdn.apache.org/spark/spark-3.5.4/spark-3.5.4-bin-hadoop3.tgz
Resolving dlcdn.apache.org (dlcdn.apache.org)... 151.101.2.132, 2a04:4e42::644
Connecting to dlcdn.apache.org (dlcdn.apache.org)|151.101.2.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 400879762 (382M) [application/x-gzip]
Saving to: ‘spark-3.5.4-bin-hadoop3.tgz’

spark-3.5.4-bin-hadoop3.tg 100%[======================================>] 382.31M 1013KB/s in 4m 40s

2024-12-26 17:32:00 (1.36 MB/s) - ‘spark-3.5.4-bin-hadoop3.tgz’ saved [400879762/400879762]
Download Apache Spark 3.5.4
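Before extracting, it is good practice to verify the download against the checksum published alongside it on the Apache mirror (the .sha512 URL below is an assumption based on the usual mirror layout next to the .tgz file):

```shell
# Fetch the published SHA-512 checksum and verify the downloaded archive.
# sha512sum -c prints "spark-3.5.4-bin-hadoop3.tgz: OK" on success.
wget https://dlcdn.apache.org/spark/spark-3.5.4/spark-3.5.4-bin-hadoop3.tgz.sha512
sha512sum -c spark-3.5.4-bin-hadoop3.tgz.sha512
```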

Step 4: Extract and Move Apache Spark

After the Apache Spark package has been downloaded, we will extract it and move it to the appropriate location, the /opt/spark directory. To do so, run the following commands:

tar -xvf spark-3.5.4-bin-hadoop3.tgz
sudo mkdir -p /opt/spark
sudo mv spark-3.5.4-bin-hadoop3 /opt/spark

The Apache Spark files are now located in the /opt/spark/spark-3.5.4-bin-hadoop3 directory.
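A quick listing confirms the expected layout of the installation directory:

```shell
# bin/ holds the user-facing commands (spark-shell, spark-submit, pyspark);
# sbin/ holds the cluster start/stop scripts used later in this tutorial.
ls /opt/spark/spark-3.5.4-bin-hadoop3
```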

Step 5: Configure Apache Spark Environment Variables

Add Spark to the system’s PATH by editing the .bashrc or .bash_profile file. For this purpose we will use the vi text editor to update the file:

vi ~/.bashrc

Then append the following lines:

export SPARK_HOME=/opt/spark/spark-3.5.4-bin-hadoop3
export PATH=$PATH:$SPARK_HOME/bin
Apache Spark for Bash profile file

Then apply the changes:

source ~/.bashrc
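A quick sanity check confirms that the new variables are visible in the current shell:

```shell
# Both commands should succeed once ~/.bashrc has been sourced.
echo "$SPARK_HOME"        # should print /opt/spark/spark-3.5.4-bin-hadoop3
command -v spark-shell    # should resolve under $SPARK_HOME/bin
```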

Step 6: Testing Apache Spark

After setting up the environment, we will test Apache Spark by running the Spark Shell. Run the following command to start it:

spark-shell

The output is shown below:

[ramansah@node01 spark-3.5.4-bin-hadoop3]$ spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/12/26 18:01:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/12/26 18:01:51 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Spark context Web UI available at http://node01.bckinfo:4041
Spark context available as 'sc' (master = local[*], app id = local-1735210911739).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.5.4
      /_/

Using Scala version 2.12.18 (OpenJDK 64-Bit Server VM, Java 17.0.13)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
Apache Spark 3.5.4 Scala
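Besides the interactive shell, the distribution bundles example jobs that make a convenient non-interactive smoke test. The run-example launcher ships under bin/ and is already on the PATH after Step 5:

```shell
# Compute an approximation of Pi on the local Spark runtime; the job
# ends with a line such as "Pi is roughly 3.14...".
run-example SparkPi 10 2>/dev/null | grep "Pi is roughly"
```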

Step 7: Starting the Apache Spark Standalone Master

In this section, we will start the standalone Spark master and monitor it via the provided web interface. The start script lives in the sbin directory of the Spark installation, so run:

cd $SPARK_HOME/sbin
./start-master.sh

The output is shown below:

[ramansah@node01 sbin]$ ./start-master.sh
starting org.apache.spark.deploy.master.Master, logging to /opt/spark/spark-3.5.4-bin-hadoop3/logs/spark-ramansah-org.apache.spark.deploy.master.Master-1-node01.bckinfo.out

We can also check which port the Apache Spark master service is listening on:

sudo ss -tunelp | grep 8080

The output is shown below:

[sudo] password for ramansah:
tcp LISTEN 0 1 *:8080 *:* users:(("java",pid=3584,fd=268)) uid:1000 ino:47396 sk:9 cgroup:/user.slice/user-1000.slice/user@1000.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-dac35eff-9069-4fab-8fd4-2040562ba099.scope v6only:0 <->

We can also verify that the Apache Spark service is running properly by opening http://<hostname_or_localhost>:8080 in a web browser, as shown below.

Apache Spark Master running on port 8080
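The same check can be done from the command line (this assumes the master runs on this host and listens on the default port 8080):

```shell
# The master UI page title reports the master URL, e.g.
# "Spark Master at spark://node01:7077".
curl -s http://localhost:8080 | grep -o "<title>[^<]*</title>"
```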

At this point, we have successfully installed Apache Spark 3.5.4 on CentOS Stream 9.

Conclusion

We have successfully installed Apache Spark version 3.5.4 on CentOS Stream 9. We can now use it for big data processing and analysis.

