Spark vs. Hadoop: An Introduction

As enterprises grow more comfortable with moving their workloads to the cloud, interest in the capabilities these platforms offer grows with it. For a data strategist, these conversations aren't new: "Big Data" has been around for a while. And while even my technologically challenged mother has heard of Hadoop, there is still a general lack of understanding of how these platforms actually work.

What is Hadoop?

Apache Hadoop has been mainstream for nearly a decade now, but it's worth starting with how the Apache Software Foundation itself describes it:

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

The Apache™ Software Foundation
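The "simple programming models" the Foundation mentions usually means MapReduce: a map phase that emits key/value pairs from each input split, a shuffle that groups values by key across the cluster, and a reduce phase that combines each group. As a rough sketch of that model (plain Python standing in for Hadoop's actual Java API; every name here is illustrative, and the "splits" are just local strings rather than blocks on HDFS), a distributed word count looks like this:

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the input split."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine the grouped values for each key into a final count."""
    return {key: sum(values) for key, values in grouped.items()}

# Each "document" stands in for an input split stored on a different node.
splits = ["big data is big", "data is everywhere"]
pairs = [pair for doc in splits for pair in map_phase(doc)]
counts = reduce_phase(shuffle_phase(pairs))
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

In a real Hadoop cluster the framework handles the hard parts this toy version ignores: splitting the data across machines, moving the shuffle over the network, and rerunning tasks when a node fails.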

What is Spark?

While some still consider Spark part of the Hadoop ecosystem, at this point it is difficult to see it as anything but a competitor.
