Apache Spark Workshop

The workshop on Apache Spark was conducted on Wednesday, December 2nd, by Vinay Vaddiparthi, as a one-and-a-half-hour session. The agenda for the workshop was to cover an overview, a brief history, and applications of Apache Spark. Vinay was also to cover the MapReduce algorithm, RDDs, transformations and actions, and a use case for Apache Spark.

Apache Spark is an open source cluster computing framework, originally developed in the AMPLab at the University of California, Berkeley. It is a fast, general-purpose computational engine for large-scale data processing, written in Scala, a functional programming language. There are many advantages to using Apache Spark over Hadoop's MapReduce paradigm: it is ten to a hundred times faster than MapReduce when processing in memory, it offers powerful Scala, Python, Java and R APIs, and it brings the processing to the data rather than the data to the processing.

Vinay then gave an overview of the Spark stack and contrasted the use of MapReduce with Apache Spark. He explained the concept of Resilient Distributed Datasets, how they are created and used, and how transformations and actions operate on them. Vinay also demonstrated the installation of Apache Spark and a use case in Python, along the lines of the sketch below. At the end of the workshop he provided links to MOOCs with learning and practice material for Apache Spark.
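
The following is a minimal PySpark sketch, not the exact demo from the workshop, illustrating the ideas covered: an RDD created from local data, lazy transformations, and an action that triggers execution. It assumes a local Spark installation with the pyspark package available.

```python
from pyspark import SparkContext

# Start a local Spark context (uses all available cores).
sc = SparkContext("local[*]", "rdd-sketch")

# Create an RDD from an in-memory Python collection.
numbers = sc.parallelize(range(1, 11))

# Transformations are lazy: nothing executes until an action is called.
squares = numbers.map(lambda x: x * x)        # transformation
evens = squares.filter(lambda x: x % 2 == 0)  # transformation

# collect() is an action: it runs the job and returns results to the driver.
print(evens.collect())  # [4, 16, 36, 64, 100]

sc.stop()
```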
