Installing Apache Spark on CentOS Stream 9: A Beginner’s Tutorial


In this short article, we will show you how to install Apache Spark version 3.5.4 on the CentOS Stream 9 operating system. This tutorial is suitable for beginners.

Introduction

Apache Spark is an open-source, distributed computing system designed for big data processing and analytics. Renowned for its speed and versatility, Spark offers robust support for batch and stream processing, making it a preferred choice for handling massive datasets in real time or at scale. Its capability to run on various platforms, including Hadoop, Kubernetes, and standalone clusters, adds to its adaptability in diverse environments.

Apache Spark Installation on CentOS Stream 9

In this tutorial, we will use Apache Spark version 3.5.4, which was released on 20 December 2024. Apache Spark 3.5.4 is the fourth maintenance release of the 3.5 line, containing security and correctness fixes, and is based on the branch-3.5 maintenance branch of Spark. The Spark project strongly recommends that all 3.5 users upgrade to this stable release.

The installation consists of several steps:

  1. Update CentOS Stream 9 Repository
  2. Install Java (OpenJDK)
  3. Download Apache Spark 3.5.4 Packages
  4. Extract and Move Apache Spark
  5. Configure Environment Variables
  6. Testing Apache Spark
  7. Starting the Apache Spark Standalone Master

These steps are briefly explained in the sections below.

Prerequisites

Before starting the installation, we will verify the prerequisites:

  1. CentOS Stream Environment: A server running CentOS Stream 9.
  2. Java Development Kit (JDK): Apache Spark requires Java 8 or higher.
  3. Hadoop (Optional): If you intend to use HDFS or YARN, ensure Hadoop is installed.
  4. Python: If you plan to use PySpark, install Python 3.
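The prerequisite checks above can be sketched as a few quick commands (a minimal sketch; exact version strings and package availability depend on your environment):

```shell
# Check the Java prerequisite (Spark 3.5 supports Java 8/11/17).
if command -v java >/dev/null 2>&1; then
  java -version 2>&1 | head -n 1
else
  echo "java not found - see Step 2"
fi

# Check Python (only needed if you plan to use PySpark).
if command -v python3 >/dev/null 2>&1; then
  python3 --version
else
  echo "python3 not found - optional, needed for PySpark"
fi
```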

Step 1: Update CentOS Stream 9 System

Before starting the installation, update the system packages:

sudo dnf update -y

Step 2: Install Java (OpenJDK)

Install OpenJDK if it is not already installed; the installation procedure is covered in the OpenJDK Installation on CentOS Stream 9 : Beginner’s Guide article.

To verify the Java installation on the system, run the following command:

java -version
Install OpenJDK on CentOS Stream 9

Step 3: Download Apache Spark 3.5.4 Packages

At this stage, we will download the prebuilt Apache Spark 3.5.4 binary package from the official Apache Spark website using the wget command:

sudo wget https://dlcdn.apache.org/spark/spark-3.5.4/spark-3.5.4-bin-hadoop3.tgz

Output :

 [ramansah@node01 ~]$ sudo wget https://dlcdn.apache.org/spark/spark-3.5.4/spark-3.5.4-bin-hadoop3.tgz
--2024-12-26 17:27:19-- https://dlcdn.apache.org/spark/spark-3.5.4/spark-3.5.4-bin-hadoop3.tgz
Resolving dlcdn.apache.org (dlcdn.apache.org)... 151.101.2.132, 2a04:4e42::644
Connecting to dlcdn.apache.org (dlcdn.apache.org)|151.101.2.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 400879762 (382M) [application/x-gzip]
Saving to: ‘spark-3.5.4-bin-hadoop3.tgz’

spark-3.5.4-bin-hadoop3.tg 100%[======================================>] 382.31M 1013KB/s in 4m 40s

2024-12-26 17:32:00 (1.36 MB/s) - ‘spark-3.5.4-bin-hadoop3.tgz’ saved [400879762/400879762]
Download Apache Spark 3.5.4
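Before extracting, it is good practice to verify the download against the checksum published alongside it on the Apache mirror (the .sha512 URL below is an assumption based on the usual mirror layout next to the .tgz file):

```shell
# Fetch the published SHA-512 checksum and verify the downloaded archive.
# sha512sum -c prints "spark-3.5.4-bin-hadoop3.tgz: OK" on success.
wget https://dlcdn.apache.org/spark/spark-3.5.4/spark-3.5.4-bin-hadoop3.tgz.sha512
sha512sum -c spark-3.5.4-bin-hadoop3.tgz.sha512
```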

Step 4: Extract and Move Apache Spark

After the Apache Spark package has been downloaded, we will extract it and move it to the appropriate location, the /opt/spark directory. To do so, run the following commands:

tar -xvf spark-3.5.4-bin-hadoop3.tgz
sudo mkdir -p /opt/spark
sudo mv spark-3.5.4-bin-hadoop3 /opt/spark

The Apache Spark files are now located in the /opt/spark/spark-3.5.4-bin-hadoop3 directory.
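A quick listing confirms the expected layout of the installation directory:

```shell
# bin/ holds the user-facing commands (spark-shell, spark-submit, pyspark);
# sbin/ holds the cluster start/stop scripts used later in this tutorial.
ls /opt/spark/spark-3.5.4-bin-hadoop3
```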

Step 5: Configure Apache Spark Environment Variables

Add Spark to the system’s PATH by editing the .bashrc or .bash_profile file. For this purpose we will use the vi text editor to update the file:

vi ~/.bashrc

Then append the following lines:

export SPARK_HOME=/opt/spark/spark-3.5.4-bin-hadoop3
export PATH=$PATH:$SPARK_HOME/bin
Apache Spark for Bash profile file

Then apply the changes:

source ~/.bashrc
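A quick sanity check confirms that the new variables are visible in the current shell:

```shell
# Both commands should succeed once ~/.bashrc has been sourced.
echo "$SPARK_HOME"        # should print /opt/spark/spark-3.5.4-bin-hadoop3
command -v spark-shell    # should resolve under $SPARK_HOME/bin
```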

Step 6: Testing Apache Spark

After setting up the environment, we will test Apache Spark by running the Spark Shell. Run the following command to start it:

spark-shell

The output is shown below:

[ramansah@node01 spark-3.5.4-bin-hadoop3]$ spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/12/26 18:01:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/12/26 18:01:51 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Spark context Web UI available at http://node01.bckinfo:4041
Spark context available as 'sc' (master = local[*], app id = local-1735210911739).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.5.4
      /_/

Using Scala version 2.12.18 (OpenJDK 64-Bit Server VM, Java 17.0.13)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
Apache Spark 3.5.4 Scala
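Besides the interactive shell, the distribution bundles example jobs that make a convenient non-interactive smoke test. The run-example launcher ships under bin/ and is already on the PATH after Step 5:

```shell
# Compute an approximation of Pi on the local Spark runtime; the job
# ends with a line such as "Pi is roughly 3.14...".
run-example SparkPi 10 2>/dev/null | grep "Pi is roughly"
```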

Step 7: Starting the Apache Spark Standalone Master

In this section, we will start the standalone Spark master and monitor it via the provided web interface. The start script lives in the sbin directory of the Spark installation, so run:

cd $SPARK_HOME/sbin
./start-master.sh

The output is shown below:

[ramansah@node01 sbin]$ ./start-master.sh
starting org.apache.spark.deploy.master.Master, logging to /opt/spark/spark-3.5.4-bin-hadoop3/logs/spark-ramansah-org.apache.spark.deploy.master.Master-1-node01.bckinfo.out

We can also check which port the Apache Spark master service is listening on:

sudo ss -tunelp | grep 8080

The output is shown below:

[sudo] password for ramansah:
tcp LISTEN 0 1 *:8080 *:* users:(("java",pid=3584,fd=268)) uid:1000 ino:47396 sk:9 cgroup:/user.slice/user-1000.slice/user@1000.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-dac35eff-9069-4fab-8fd4-2040562ba099.scope v6only:0 <->

We can also verify that the Apache Spark service is running properly by opening http://<hostname_or_localhost>:8080 in a web browser, as shown below.

Apache Spark Master running on port 8080
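The same check can be done from the command line (this assumes the master runs on this host and listens on the default port 8080):

```shell
# The master UI page title reports the master URL, e.g.
# "Spark Master at spark://node01:7077".
curl -s http://localhost:8080 | grep -o "<title>[^<]*</title>"
```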

At this point, we have successfully installed Apache Spark 3.5.4 on CentOS Stream 9.

Conclusion

We have successfully installed Apache Spark version 3.5.4 on CentOS Stream 9. We can now use it for big data processing and analysis.

