What is Apache Spark and how does it help us with big data?

Vikram Singh
Apr 25, 2021 · 5 min read

To understand Apache Spark, we first have to define big data clearly.

What is Big data?

Big data is data that is high in volume, variety, or velocity. This data is generated from many sources, such as application logs, inventories, marketing spend, and so on.

With the evolution of the internet, data started to grow rapidly in size. The scope and reach of software applications now extend beyond geographical boundaries and span the planet. For example, Facebook has around 2.6 billion active accounts, and over 1 billion people use Instagram every month.

Big data comes with big difficulties.

Difficulties with Big Data

What if the data is huge in volume? What problems does that create? Here are some of the most commonly faced issues with big data:

  • Data Collection
  • Data storage and management
  • Data processing and transformation
  • Data access and retrieval

Google was among the first to identify these issues and come up with solutions. It published its white paper on the Google File System in 2003 and a second one on MapReduce in 2004. Later, the Hadoop Distributed File System (HDFS) came into the picture, tackling the problem of storing large amounts of data in a way similar to the Google File System: HDFS lets us form a cluster of computers and use their combined capacity for storage. Hadoop MapReduce, modeled on Google's MapReduce, handles processing and transformation. Just as with HDFS, Hadoop MapReduce allows us to use the whole cluster to process the large amounts of data we store in HDFS.

Data Lake

In the beginning, "data lake" was used as a synonym for Hadoop, but it later evolved into a platform with the following four capabilities, addressing the same problems Google had recognized:

  • Ingest
  • Store
  • Process
  • Consume

We have different tools and frameworks for each of these capabilities:

  • Ingesting: HVR, AWS Glue, Informatica, Talend …
  • Consuming: JDBC, file download, REST interface, search …
  • Processing: Apache Spark …
  • Storing: HDFS, Azure Blob Storage, Google Cloud Storage, Amazon S3 …

Apache Spark

Apache Spark is a data processing framework that can perform processing tasks on very large data sets, and can also distribute data processing tasks across multiple computers.
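To make that concrete, here is a minimal PySpark sketch. The dataset and column names are made up for illustration; run with "local[*]" it uses all local CPU cores, and on a real cluster the same code would be distributed across machines.

```python
from pyspark.sql import SparkSession

# Start a Spark session; "local[*]" runs on all local cores, while on a real
# cluster the same code is distributed across many machines.
spark = SparkSession.builder.appName("spark-intro").master("local[*]").getOrCreate()

# A tiny, made-up dataset standing in for a "very large" one.
events = spark.createDataFrame(
    [("logs", 120), ("inventory", 45), ("marketing", 80), ("logs", 60)],
    ["source", "count"],
)

# The aggregation below is broken into tasks and executed in parallel by Spark.
events.groupBy("source").sum("count").show()
```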

Apache Spark Ecosystem

Let’s understand the Spark ecosystem. We can divide it into two parts: the first is the Spark Core layer, and the other is a set of APIs, DSLs, and libraries built on top of it.

All of this runs on the underlying computer hardware.

We can further divide the Spark Core layer into two parts:

  • Spark Engine
  • Spark Core

Cluster Manager

However, Spark does not ship with a cluster manager, also known as a resource manager. Since Spark initially grew out of the Hadoop ecosystem, the Hadoop YARN resource manager is the most widely used cluster manager, but Spark also supports others such as Apache Mesos and Kubernetes.
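As a rough illustration (assuming PySpark), the cluster manager is chosen through the master URL when the session is created or the job is submitted; the host names below are placeholders.

```python
from pyspark.sql import SparkSession

# The master URL decides which cluster manager schedules the job.
spark = (
    SparkSession.builder
    .appName("choosing-a-cluster-manager")
    # .master("yarn")                          # Hadoop YARN
    # .master("k8s://https://kube-api:6443")   # Kubernetes (placeholder host)
    # .master("mesos://mesos-master:5050")     # Apache Mesos (placeholder host)
    .master("local[*]")                        # local mode for experimenting
    .getOrCreate()
)
```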

Distributed Storage

Along with the cluster manager, we also need somewhere to store our data. Apache Spark doesn't come with an internal storage system, so we use an external distributed storage system such as the Hadoop Distributed File System (HDFS), Amazon S3, Google Cloud Storage (GCS), Azure Blob Storage …
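As an illustration only (the bucket and host names are placeholders, and each scheme needs the matching connector and credentials configured), Spark reads from these stores through ordinary paths:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("distributed-storage").getOrCreate()

# Placeholder paths: Spark picks the connector from the URL scheme.
df_hdfs = spark.read.parquet("hdfs://namenode:8020/data/events")        # HDFS
df_s3 = spark.read.csv("s3a://my-bucket/raw/events.csv", header=True)   # Amazon S3
df_gcs = spark.read.json("gs://my-bucket/raw/events.json")              # Google Cloud Storage
```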

Spark Engine

The Spark Engine does several things: it breaks your job down into smaller tasks, schedules those tasks across the cluster for parallel execution, provides data to them, and manages and monitors them.
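A rough way to see this in action (a sketch, not how you would tune a real job): the number of partitions controls how many tasks the engine schedules in parallel.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("engine-demo").master("local[*]").getOrCreate()

# Split one million numbers into 8 partitions; each partition becomes a task.
rdd = spark.sparkContext.parallelize(range(1_000_000), 8)
print(rdd.getNumPartitions())  # 8 -> the sum below runs as 8 parallel tasks
print(rdd.sum())               # 499999500000
```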

Spark Core

This is the programming layer that provides the core APIs in four programming languages: Python, Scala, Java, and R. These APIs are based on Resilient Distributed Datasets (RDDs). Some people find them hard to use in their day-to-day work, and they can also come with performance issues compared to the higher-level APIs. I'd suggest not reaching for the Spark Core APIs unless you have a very complex problem to solve.
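For context, this is roughly what the RDD-level API looks like: a classic word count, sketched here over an in-memory list instead of a real file.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").master("local[*]").getOrCreate()

lines = spark.sparkContext.parallelize(["big data", "big spark", "spark core"])
counts = (
    lines.flatMap(lambda line: line.split())  # split lines into words
         .map(lambda word: (word, 1))         # pair each word with a count of 1
         .reduceByKey(lambda a, b: a + b)     # sum the counts per word
)
print(counts.collect())  # e.g. [('big', 2), ('data', 1), ('spark', 2), ('core', 1)]
```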

What should we use instead of the Spark Core APIs?

To answer this question, here comes the last part of our Spark ecosystem.

Spark APIs, DSLs, and Libraries

We can divide this part into four categories. There are no hard boundaries between them; this is simply a logical grouping, and you may find them overlapping in your day-to-day projects. These libraries were developed by the Spark community on top of the Spark Core APIs, which they call internally.

  • Spark SQL and DataFrames: We can use plain SQL and DataFrames to process data, and this is available in all the programming languages Apache Spark supports (Java, Python, Scala, and R). A small sketch follows this list.
  • Streaming: It allows us to process a continuous stream of data.
  • MLlib: This is meant for machine learning and related computation.
  • GraphX: As the name suggests, it is used for graph computation.
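Here is the DataFrame/SQL sketch referred to above. The table and column names are made up for illustration, and the same aggregation is expressed once through the DataFrame DSL and once through plain SQL.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").master("local[*]").getOrCreate()

clicks = spark.createDataFrame(
    [("mobile", 3), ("web", 5), ("mobile", 2)],
    ["channel", "clicks"],
)
clicks.createOrReplaceTempView("clicks")

# Same aggregation, via the DataFrame DSL and via SQL.
clicks.groupBy("channel").sum("clicks").show()
spark.sql("SELECT channel, SUM(clicks) AS total FROM clicks GROUP BY channel").show()
```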

We’ve now come to the end, having discussed what big data is and the problems it brings, the data lake concept, and Apache Spark.
Please let me know if there is anything I can improve.
Thanks for reading :)

