Install and Configure Apache Spark on Oracle Linux 9/CentOS 9

Apache Spark is an open-source, fast, scalable, multi-language unified analytics engine. It is used for large-scale data processing, data engineering, data science, and machine learning on single-node machines or clusters. It offers high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs. It also includes Spark SQL for SQL and structured data processing, the Pandas API on Spark for pandas workloads, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for incremental computation and stream processing.

Apache Spark originated at UC Berkeley's AMPLab and was later donated to the Apache Software Foundation, becoming a top-level Apache project in 2014. It is licensed under the Apache License 2.0 and is written primarily in Scala, with APIs for Java, Python, and R.

As of this writing, the latest release is version 3.5.4. It is distributed as pre-built binaries or as source. Spark uses Hadoop’s client libraries for HDFS and YARN. Scala and Java developers can use Maven coordinates to include Spark in their projects, and Python developers can install PySpark from PyPI. Spark runs on both Windows and UNIX-like systems, including macOS, on any platform that runs a supported JVM on x86_64 or ARM64.

To install Apache Spark successfully, Java must be available on your system’s PATH, or the JAVA_HOME environment variable must point to your Java installation. Spark 3.5 runs on Java 8/11/17, Scala 2.12/2.13, Python 3.8+, and R 3.5+; later Java releases often work as well but are not officially listed for Spark 3.5.

Key features

Key features of Apache Spark include:

  • Batch/streaming data – Unify the processing of your data in batches and in real-time streaming, using Python, SQL, Scala, Java, or R.
  • SQL analytics – Run fast, distributed ANSI SQL queries for dashboards and ad-hoc reports. Spark runs faster than most data warehouses.
  • Data science at scale – Perform exploratory data analysis (EDA) on petabyte-scale data without resorting to downsampling.
  • Machine learning – Train machine learning algorithms on your laptop and use the same code to scale to fault-tolerant clusters of thousands of machines.
  • Lightweight – Spark is a light, unified analytics engine for large-scale data processing.
  • Integrations – Spark integrates with a majority of BI frameworks, e.g., Power BI and Tableau, and helps them scale to multiple machines.
  • Storage and infrastructure – Supports several storage and infrastructure systems, e.g., Elasticsearch, MongoDB, Kubernetes, Kafka, Cassandra, and SQL Server.
  • Spark SQL works on structured tables as well as semi-structured data such as JSON and unstructured data such as images.
  • Spark SQL adapts the execution plan at runtime, e.g., automatically setting the number of reducers and choosing join algorithms (see the short example after this list).
  • Hundreds of community contributors across the globe.
  • Easy to use – No need to worry about a cluster of computers; you can simply start on a single machine.
  • APIs for both Scala and Python – simply choose your preferred programming language.
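To make the Spark SQL and adaptive-execution points above concrete, below is a minimal sketch you can paste into the spark-shell installed later in this guide. It assumes the implicit spark session and the spark.implicits._ conversions the shell provides; the people table is purely illustrative.

import spark.implicits._

// Build a small DataFrame in memory and register it as a temporary SQL view
val people = Seq(("alice", 34), ("bob", 45), ("carol", 29)).toDF("name", "age")
people.createOrReplaceTempView("people")

// Run an ANSI SQL query against the view
spark.sql("SELECT name, age FROM people WHERE age > 30 ORDER BY age").show()

// Adaptive Query Execution has been enabled by default since Spark 3.2;
// it re-optimizes shuffles and joins at runtime
println(spark.conf.get("spark.sql.adaptive.enabled"))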

 Install and Configure Apache Spark on Oracle Linux 9/CentOS 9

To install Apache Spark, your system should meet the following minimum requirements:

  • At least 8 GB of RAM
  • Java 8 or later
  • At least 20 GB of free disk space
  • A stable internet connection
  • An account with sudo privileges

Before the installation process, update your system package repository.

sudo dnf update -y

Begin the installation process.

Step 1: Install Java

I will install the latest Java version, i.e., Java 21.

Install Java using yum / dnf

Run the commands below to install Java:

sudo yum install -y java-21-openjdk java-21-openjdk-devel

Manual installation of Java

Install required packages:

sudo dnf -y install wget curl

Navigate to the official Java Downloads page for the latest release and copy the link address of the x64 RPM package.

Download the package as shown below.

sudo wget https://download.oracle.com/java/21/latest/jdk-21_linux-x64_bin.rpm

Once the RPM file is downloaded, install Java 21 on your system with the following command.

sudo rpm -Uvh jdk-21_linux-x64_bin.rpm

Inspect the installed version of Java.

$ java -version
openjdk version "21.0.5" 2024-10-15 LTS
OpenJDK Runtime Environment (Red_Hat-21.0.5.0.11-1.0.1) (build 21.0.5+11-LTS)
OpenJDK 64-Bit Server VM (Red_Hat-21.0.5.0.11-1.0.1) (build 21.0.5+11-LTS, mixed mode, sharing)

Set Java Environment variables

sudo tee /etc/profile.d/java21.sh <<EOF
# Derive JAVA_HOME by resolving the javac symlink chain
export JAVA_HOME=\$(dirname \$(dirname \$(readlink \$(readlink \$(which javac)))))
export PATH=\$PATH:\$JAVA_HOME/bin
export CLASSPATH=.:\$JAVA_HOME/jre/lib:\$JAVA_HOME/lib:\$JAVA_HOME/lib/tools.jar
EOF

Activate the Java environment:

source /etc/profile.d/java21.sh

Examine the Java variables by running the commands below.

$ echo $JAVA_HOME
/usr/lib/jvm/java-21-openjdk-21.0.5.0.11-2.0.1.el9.x86_64

$ echo $PATH
/root/.local/bin:/root/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/lib/jvm/java-21-openjdk-21.0.5.0.11-2.0.1.el9.x86_64/bin

$ echo $CLASSPATH 
.:/usr/lib/jvm/java-21-openjdk-21.0.5.0.11-2.0.1.el9.x86_64/jre/lib:/usr/lib/jvm/java-21-openjdk-21.0.5.0.11-2.0.1.el9.x86_64/lib:/usr/lib/jvm/java-21-openjdk-21.0.5.0.11-2.0.1.el9.x86_64/lib/tools.jar

Step 2: Download and install Scala

Download the latest version of Scala by visiting the Scala Downloads page and copying the installation command.

Run the following command in your terminal:

sudo dnf install -y wget curl
curl -fL https://github.com/coursier/coursier/releases/latest/download/cs-x86_64-pc-linux.gz | gzip -d > cs && chmod +x cs && ./cs setup

Set Scala environment variables:

tee -a ~/.bashrc <<EOF
export SCALA_HOME=~/.local/share/coursier
export PATH=\$PATH:\$SCALA_HOME/bin
EOF

Source the file to apply the changes without logging out and back in:

source ~/.bashrc

Check the installed Scala version:

$ scala -version
Scala code runner version: 1.5.4
Scala version (default): 3.6.3

The latest version of Scala is now installed.
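As an optional sanity check (not part of the original steps), you can launch the scala REPL and evaluate a couple of simple expressions to confirm the toolchain works:

// A quick expression to confirm Scala evaluates code correctly
println((1 to 5).map(n => n * n).mkString(", "))   // prints: 1, 4, 9, 16, 25
println(List("spark", "scala", "java").sorted)     // prints: List(java, scala, spark)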

Step 3: Download and install Apache Spark

The next step is to download and install Apache Spark on your system. Download the latest version of Spark from the official website; the latest release is version 3.5.4. For the package type, choose ‘Pre-built for Apache Hadoop’. Apache Spark is not available in the default Oracle Linux repositories, so you have to download it and install it manually. I have placed my download in my home directory.

cd ~/
sudo wget https://dlcdn.apache.org/spark/spark-3.5.4/spark-3.5.4-bin-hadoop3.tgz

Extract the Spark tarball.

tar xvf spark-3.5.4-bin-hadoop3.tgz

Move the Spark software files to the respective directory (/usr/local/spark).

sudo mv spark-3.5.4-bin-hadoop3 /usr/local/spark

Set up the environment for Spark.

vim ~/.bashrc
# add this line
export PATH=$PATH:/usr/local/spark/bin

Source the ~/.bashrc file.

source ~/.bashrc

Step 4: Verify the Spark installation

To verify Spark installation, run the command below.

spark-shell --conf spark.driver.bindAddress=localhost

This launches the Spark shell and prints startup information, including the Spark version and the address of the Spark context Web UI.

We have now successfully installed Spark on the Oracle Linux system.

To check the version from the Spark shell, issue the command:

scala> sc.version
res0: String = 3.5.4
scala> spark.version
res1: String = 3.5.4

Spark context Web UI available at http://<server_ip_address>:4040.

To access the Web UI, allow port 4040 through the firewall.

sudo firewall-cmd --zone=public --add-port=4040/tcp --permanent
sudo firewall-cmd --reload

You can now access the Web UI in your browser at http://<server_ip_address>:4040.

Step 5: Create a simple RDD

For demonstration, I will create a simple RDD and a DataFrame. An RDD can be created in three ways: by parallelizing an existing collection, by loading an external dataset, or by transforming an existing RDD (a sketch of the other two ways follows the parallelize example below).

Define a list then parallelize it to create RDD.

scala> val nums = Array(1,2,3,5,6)
nums: Array[Int] = Array(1, 2, 3, 5, 6)

scala> val rdd = sc.parallelize(nums)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
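For completeness, here is a brief sketch of the other two ways to create an RDD. It is not part of the original walk-through, and the file path is hypothetical, so point it at a real text file on your system before running it.

// 1. Load an external dataset (hypothetical path -- replace with a real file)
val fileRdd = sc.textFile("/tmp/sample.txt")

// 2. Transform an existing RDD into a new one
val squared = rdd.map(n => n * n)
squared.collect()   // Array(1, 4, 9, 25, 36)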

Create a DataFrame from the RDD.

scala> import spark.implicits._
import spark.implicits._

scala> val df = rdd.toDF("num")
df: org.apache.spark.sql.DataFrame = [num: int]

To display the data in the DataFrame:

scala> df.show()
+---+
|num|
+---+
|  1|
|  2|
|  3|
|  5|
|  6|
+---+
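As a quick, illustrative follow-up (assuming the same df created above), you can also filter and aggregate the DataFrame directly instead of going through SQL:

// Keep only the rows where num is greater than 2
df.filter($"num" > 2).show()

// Sum the column with a built-in aggregate function
import org.apache.spark.sql.functions.sum
df.agg(sum("num").alias("total")).show()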

Conclusion

That marks the end of our guide today. We have covered the basic installation and looked at Apache Spark's key features. In coming guides, we will take a deeper dive. In the meantime, please visit the Apache Spark documentation for more insights. I hope your installation was successful.
