Spark with Scala


Dive Into Scala

Learning Objectives

Contents

Hands On

OOPS and Functional Programming in Scala

Learning Objectives

Contents

Hands On

Big Data and need for Apache Spark

Learning Objectives

Contents

Deep Dive in Apache Spark

Hands On

Demystify Apache Spark

Learning Objectives

Hands On

Playing with RDDs

Learning Objectives

Contents

Hands On

Spark SQL

Objectives

Hands On

Apache Spark Streaming

Objectives

Contents

Understanding Apache Kafka and Kafka Cluster

Objectives

Contents

Hands On

Capturing Data with Apache Flume and Integration with Kafka

Objectives

Contents

Hands On

Live projects on Apache Spark

YouTube Data Analysis

Titanic Data Analysis

Twitter Trends Analysis


Dive Into Scala

Learning Objectives

Understand the basics of Scala that are required for programming Spark applications

Learn about the basic constructs of Scala, such as variable types, control structures, collections, and more.

Contents

What is Scala?

Why Scala for Spark?

Introduction to Scala REPL

Basic Scala operations

Variable types in Scala

Loops and collections: Array, Map, List, Tuple

Functions and procedures in Scala

Eclipse with Scala

Hands On

Scala REPL Detailed Demo

Configuring Scala with Eclipse


OOPS and Functional Programming in Scala

Learning Objectives

Learn about object-oriented programming and functional programming techniques in Scala

Contents

Introduction to object-oriented programming

Different OOPS concepts

Constructor, getter, setter, singleton, overloading and overriding

Nested Classes, Visibility Rules

Functional Structures

Functional programming constructs

Call by Name, Call by Value


Hands On

Create list of employee objects and sort them based on their firstname
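
The exercise above can be approached along the following lines. This is only a minimal sketch: the Employee case class, its field names and the sample data are assumptions made for illustration and are not part of the course material.

```scala
// Hypothetical Employee model; the field names are assumptions for illustration.
final case class Employee(firstName: String, lastName: String)

object EmployeeSort {
  // Sort by first name, ignoring case. sortBy is a stable sort, so employees
  // with the same first name keep their original relative order.
  def sortByFirstName(employees: List[Employee]): List[Employee] =
    employees.sortBy(_.firstName.toLowerCase)

  def main(args: Array[String]): Unit = {
    // Sample data invented for this sketch.
    val employees = List(
      Employee("Meera", "Nair"),
      Employee("arun", "Shah"),
      Employee("Zoya", "Khan")
    )
    sortByFirstName(employees).foreach(e => println(s"${e.firstName} ${e.lastName}"))
  }
}
```

Using sortBy with a key function keeps the ordering rule in one place; an implicit Ordering[Employee] or sortWith would be reasonable alternatives when several sort orders are needed.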


Big Data and need for Apache Spark

Learning Objectives

Understand what big data is, the challenges associated with it, and the different frameworks available

Contents

Introduction to big data

Challenges with big data

Batch vs. Real-Time big data analytics

Batch Analytics – Hadoop Ecosystem Overview

Real-time Analytics Options

Streaming Data: Spark

In-memory Data: Spark


Deep Dive in Apache Spark

What is Spark?

Spark Ecosystem

Modes of Spark

Spark installation demo

Overview of Spark on a cluster

Spark Standalone cluster

Spark Web UI

Configuring Spark in Eclipse

Running a Spark project with Eclipse


Hands On

Running Spark in Eclipse

Running Spark in standalone mode

Running word count program
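
A word count program of the kind named above might look like the sketch below. It assumes the Spark core library is on the classpath and runs in local mode; the object name, the countWords helper and the in-memory sample lines are all illustrative choices, and a real exercise would more likely read its input with sc.textFile.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  // Hypothetical helper: counts whitespace-separated words across the given lines.
  def countWords(sc: SparkContext, lines: Seq[String]): Map[String, Int] =
    sc.parallelize(lines)
      .flatMap(_.split("\\s+"))   // split each line into words
      .filter(_.nonEmpty)         // drop empty tokens from leading whitespace
      .map(word => (word, 1))     // pair each word with an initial count of 1
      .reduceByKey(_ + _)         // sum the counts per word
      .collect()
      .toMap

  def main(args: Array[String]): Unit = {
    // local[*] runs Spark inside this JVM using all available cores.
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Sample input invented for this sketch.
      val counts = countWords(sc, Seq("to be or not to be", "to do is to be"))
      counts.toSeq.sortBy(_._1).foreach { case (word, n) => println(s"$word\t$n") }
    } finally {
      sc.stop()
    }
  }
}
```

Collecting the result to the driver is only appropriate for small outputs such as this one; for large data sets the counts would normally be written out with saveAsTextFile instead.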


Demystify Apache Spark

Learning Objectives

Learn how to invoke Spark Shell and use it for various common operations.


Play with Spark shell

Execute Scala and Java statements in shell

Understand Spark Context and driver

Read data from local filesystem

Integrate Spark with HDFS

Cache the data in memory for further use

Distributed persistence

Hands On

Executing examples in Spark Shell

Running word count program


Playing with RDDs

Learning Objectives

Learn about RDDs, one of the fundamental building blocks of Spark, and the related manipulations for implementing business logic.

Contents

Transformations in RDD

Actions in RDD

Loading data in RDD

Saving data through RDD

Key-Value Pair RDD


Broadcast Variables

MapReduce and Pair RDD Operations

Spark and Hadoop Integration-HDFS

Handling Sequence Files and Partitioner

Hands On

Analyze NASA Apache web logs and find the top servers

Find out the median salary of developers in different countries through the Stack Overflow survey data


Spark SQL


Objectives

Understand techniques for executing SQL queries in Spark

Learn to load DBMS data into Spark


Introduction to Apache Spark SQL

The SQL context

Importing and saving data

Processing Text, JSON and Parquet files

User-defined functions

Using Hive

Local Hive Metastore server


Hands On

Explore the price trend by looking at the real estate data in California


Apache Spark Streaming


Objectives

Work on Spark Streaming, which is used to build scalable, fault-tolerant streaming applications

Learn about DStreams and the various transformations performed on them

Learn about the main streaming operators, sliding window operators and stateful operators.

Contents

What is Spark Streaming?

Spark Streaming Features

Spark Streaming Workflow

Streaming Context & DStreams

Transformations on DStreams

WordCount Program using Spark Streaming

Important Windowed Operators

Slice, Window and ReduceByWindow Operators

Stateful Operators

Perform word count using Spark Streaming

Hands On

Creating DStreams

Transformations and actions performed on DStreams

Output Operations in DStreams

Sliding Window Operations

Stateful Operations

Word count analysis


Understanding Apache Kafka and Kafka Cluster


Objectives

Understand Kafka and the Kafka architecture

Learn how to configure different types of Kafka cluster

Contents

Need for Kafka

What is Kafka?

Core Concepts of Kafka

Kafka Architecture

Where is Kafka Used?

Understanding the Components of Kafka Cluster

Configuring Kafka Cluster

Producer and Consumer

Hands On

Configuring Single Node Single Broker Cluster

Configuring Single Node Multi Broker Cluster


Capturing Data with Apache Flume and Integration with Kafka


Objectives

Understand Apache Flume and its basic architecture

Integrate Flume with Apache Kafka for event processing

Contents

Need for Apache Flume

What is Apache Flume?

Basic Flume Architecture

Flume Sources

Flume Sinks

Flume Channels

Flume Configuration

Integrating Apache Flume and Apache Kafka

Hands On

Flume Commands

Setting up Flume Agent

Streaming Access Logs into HDFS


Live projects on Apache Spark


YouTube Data Analysis

Analyze the YouTube data and generate insights such as the top 10 videos in various categories, user demographics, number of views, ratings, etc. The data contains fields like Id, Age, Category, Length, Views, Ratings, Comments, etc.


Titanic Data Analysis

The sinking of the Titanic was one of the biggest disasters in history, caused by a combination of natural events and human mistakes. The objective is to analyze Titanic data sets and generate insights related to age, gender, survival, class, port of embarkation, etc.


Twitter Trends Analysis

Collect Twitter data in real time and find out what is currently trending on Twitter in various categories. In this project, we will collect live Twitter streams and analyze them using Spark Streaming to generate insights such as the current trends in Politics, Finance, Entertainment, etc.