What is Apache Spark and how does it help us with big data?

In order to understand Apache Spark, we first have to define big data clearly.

What is Big Data?

Difficulties with Big Data

What happens when the data is huge in volume? What exactly is the problem? Here are some of the most commonly faced issues with big data:

  • Data storage and management
  • Data processing and transformation
  • Data access and retrieval

Data Lake

In the very beginning, the term "data lake" was used as a synonym for Hadoop, but it later evolved into a platform with the following capabilities, addressing the same problems Google had recognized:

  • Store: HDFS, Azure Blob Storage, Google Cloud Storage, Amazon S3, …
  • Process: Apache Spark, …
  • Consume: JDBC, file download, REST interface, search, …

Apache Spark

Apache Spark Ecosystem

  • Cluster Manager
  • Distributed Storage
  • Spark Engine
  • Spark Core
  • Spark APIs, DSLs, and Libraries

Cluster Manager

However, Spark does not bundle a full-featured cluster manager (also known as a resource manager) beyond its simple standalone mode. Spark grew up in the Hadoop ecosystem, so the Hadoop YARN resource manager is the most widely used cluster manager, but Spark also supports Apache Mesos and Kubernetes.
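
To make this concrete, here is a minimal sketch of how the master URL picks the cluster manager when you create a session; the YARN, Mesos, and Kubernetes URLs in the comment are illustrative placeholders, not real cluster addresses.

```python
from pyspark.sql import SparkSession

# Local mode: no external cluster manager, uses all cores of this machine.
spark = (
    SparkSession.builder
    .appName("cluster-manager-demo")
    .master("local[*]")  # on a real cluster: "yarn", "mesos://host:5050",
                         # or "k8s://https://host:6443" (placeholder addresses)
    .getOrCreate()
)

print(spark.sparkContext.master)  # -> local[*]
spark.stop()
```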

Distributed Storage

Along with a cluster manager, we also need somewhere to store our data. Apache Spark doesn't come with an internal storage system; instead, we can plug in a distributed storage system such as the Hadoop Distributed File System (HDFS), Amazon S3, Google Cloud Storage (GCS), or Azure Blob Storage.
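
In practice, Spark reaches each of these stores through a filesystem URI scheme. The sketch below assumes made-up bucket, container, and host names, and that the matching connector libraries and credentials are already configured.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-demo").getOrCreate()

# Same read API, different storage backends (all paths are hypothetical):
df_hdfs = spark.read.csv("hdfs://namenode:9000/data/events.csv")  # HDFS
df_s3 = spark.read.csv("s3a://my-bucket/data/events.csv")         # Amazon S3
df_gcs = spark.read.csv("gs://my-bucket/data/events.csv")         # Google Cloud Storage
df_abs = spark.read.csv(                                          # Azure Blob Storage
    "wasbs://container@account.blob.core.windows.net/data/events.csv"
)
```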

Spark Engine

The Spark engine does several things: it breaks your job into smaller tasks, schedules those tasks across the cluster for parallel execution, provides data to those tasks, and manages and monitors them.
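
You can see this task splitting through the number of partitions: a dataset with four partitions yields four tasks per stage. A small sketch, run in local mode:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("engine-demo").master("local[4]").getOrCreate()
sc = spark.sparkContext

# 4 partitions -> the engine schedules 4 parallel tasks for this stage.
rdd = sc.parallelize(range(1_000_000), numSlices=4)
print(rdd.getNumPartitions())          # 4
print(rdd.map(lambda x: x * 2).sum())  # executed across the tasks in parallel
spark.stop()
```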

Spark Core

This is the programming application layer that provides the core APIs in four programming languages: Python, Scala, Java, and R. These APIs are based on Resilient Distributed Datasets (RDDs). Many people find them hard to use in their day-to-day work, and they also miss some of the performance optimizations that the higher-level APIs get for free. I'd suggest not using the Spark Core APIs unless you have a very complex problem to solve.
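
For a feel of that low-level style, here is the classic word count written directly against the RDD API, with toy input data; note how much manual bookkeeping it involves.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").master("local[*]").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["spark makes big data simple", "big data needs spark"])
counts = (
    lines.flatMap(lambda line: line.split())  # split each line into words
         .map(lambda word: (word, 1))         # pair each word with a count of 1
         .reduceByKey(lambda a, b: a + b)     # sum the counts per word
)
print(counts.collect())  # e.g. [('spark', 2), ('big', 2), ('data', 2), ...]
spark.stop()
```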

Spark APIs, DSLs, and Libraries

We categorize this part of the ecosystem into four groups. However, there are no hard boundaries between them; this is simply a logical grouping, and you may find them overlapping in your day-to-day projects. These libraries are developed by the Spark community on top of the Spark Core APIs, which they call internally.

  • SQL: Spark SQL and the DataFrame API let us work with structured data, as sketched after this list.
  • Streaming: It allows us to process a continuous stream of data.
  • MLlib: This is meant for machine learning and related computation.
  • GraphX: As the name suggests, it is used for graph computation.
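
As a small taste of the first group, here is the same filter expressed once through the DataFrame DSL and once through plain SQL; the column names and rows are made up for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").master("local[*]").getOrCreate()

df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])
df.createOrReplaceTempView("people")

df.filter(df.age > 30).show()                               # DataFrame DSL
spark.sql("SELECT name FROM people WHERE age > 30").show()  # plain SQL
spark.stop()
```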
