Spark with Scala



Dive Into Scala

Learning Objectives

Contents

Hands On

OOPS and Functional Programming in Scala

Learning Objectives

Contents

Hands On

Big Data and need for Apache Spark

Learning Objectives

Contents

Deep Dive in Apache Spark

Hands On

Demystify Apache Spark

Learning Objectives

Hands On

Playing with RDDs

Learning Objectives

Contents

Hands On

Spark SQL

Objectives

Hands On

Apache Spark Streaming

Objectives

Contents

Understanding Apache Kafka and Kafka Cluster

Objectives

Contents

Hands On

Capturing Data with Apache Flume and Integration with Kafka

Objectives

Contents

Hands On

Live projects on Apache Spark

YouTube Data Analysis

Titanic Data Analysis

Twitter Trends Analysis

 

Dive Into Scala

Learning Objectives

Understand the basics of Scala that are required for programming Spark applications

Learn about the basic constructs of Scala, such as variable types, control structures, and collections.

Contents

What is Scala?

Why Scala for Spark?

Introduction to Scala REPL

Basic Scala operations

Variable types in Scala

Loops and collections: Array, Map, List, Tuple

Functions and procedures in Scala

Eclipse with Scala

Hands On

Scala REPL Detailed Demo

Configuring Scala with Eclipse

 

OOPS and Functional Programming in Scala

Learning Objectives

Learn about object-oriented programming and functional programming techniques in Scala

Contents

Introduction to object-oriented programming

Different OOPs concepts

Constructor, getter, setter, singleton, overloading and overriding

Nested Classes, Visibility Rules

Functional Structures

Functional programming constructs

Call by Name, Call by Value

 

Hands On

Create a list of employee objects and sort them by first name
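
One possible approach to this exercise is sketched below. The Employee case class, its firstName and lastName fields, the EmployeeSort object, and the sample names are all hypothetical and chosen only for illustration; the course may model employees differently.

```scala
// Hypothetical Employee type; the field names are assumptions for illustration.
final case class Employee(firstName: String, lastName: String)

object EmployeeSort {
  // Returns a new list ordered by first name; the input list is left unchanged.
  def sortByFirstName(employees: List[Employee]): List[Employee] =
    employees.sortBy(_.firstName)

  def main(args: Array[String]): Unit = {
    // Sample data invented for this sketch.
    val employees = List(
      Employee("Meera", "Nair"),
      Employee("Arjun", "Rao"),
      Employee("Zoya", "Khan")
    )
    // Print one employee per line in first-name order.
    sortByFirstName(employees).foreach(e => println(s"${e.firstName} ${e.lastName}"))
  }
}
```

sortBy uses the default Ordering[String], which compares strings lexicographically and is case-sensitive. A case-insensitive sort could use employees.sortBy(_.firstName.toLowerCase) instead.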

 

Big Data and need for Apache Spark

Learning Objectives

Understand what big data is, the challenges associated with it, and the different frameworks available

 

Contents

Introduction to big data

Challenges with big data

Batch vs. real-time big data analytics

Batch Analytics – Hadoop Ecosystem Overview

Real-time Analytics Options

Streaming data: Spark

In-memory data: Spark

 

Deep Dive in Apache Spark

What is Spark?

Spark Ecosystem

Modes of Spark

Spark installation demo

Overview of Spark on a cluster

Spark Standalone cluster

Spark Web UI

Configuring Spark in Eclipse

Running a Spark project with Eclipse

 

Hands On

Running Spark in Eclipse

Running Spark in standalone mode

Running word count program

 

Demystify Apache Spark

Learning Objectives

Learn how to invoke Spark Shell and use it for various common operations.

 

Contents

Play with the Spark shell

Execute Scala and Java statements in shell

Understand the SparkContext and the driver

Read data from local filesystem

Integrate Spark with HDFS

Cache the data in memory for further use

Distributed persistence

Hands On

Executing examples in Spark Shell

Running word count program

 

Playing with RDDs

Learning Objectives

Learn about RDDs, one of the fundamental building blocks of Spark, and the related operations for implementing business logic.

 

Contents

RDDs

Transformations in RDD

Actions in RDD

Loading data in RDD

Saving data through RDD

Key-Value Pair RDD

Accumulators

Broadcast Variables

MapReduce and Pair RDD Operations

Spark and Hadoop integration: HDFS

Handling Sequence Files and Partitioner

Hands On

Analyze NASA Apache web logs and find the top servers

Find out the median salary of developers in different countries through the Stack Overflow survey data

 

Spark SQL

Objectives

Understand techniques of executing SQL queries in Spark

Learn how to load DBMS data into Spark

 

Contents

Introduction to Apache Spark SQL

The SQL context

Importing and saving data

Processing text, JSON, and Parquet files

DataFrames

User-defined functions

Using Hive

Local Hive Metastore server

 

Hands On

Explore the price trend by looking at the real estate data in California

 

Apache Spark Streaming

Objectives

Work with Spark Streaming, which is used to build scalable, fault-tolerant streaming applications

Learn about DStreams and the various transformations performed on them

Learn about the main streaming operators, sliding window operators, and stateful operators

 

Contents

What is Spark Streaming?

Spark Streaming Features

Spark Streaming Workflow

Streaming Context & DStreams

Transformations on DStreams

WordCount Program using Spark Streaming

Important Windowed Operators

Slice, Window and ReduceByWindow Operators

Stateful Operators

Perform word count using Spark Streaming

Hands On

Creating DStreams

Transformations and actions performed on DStreams

Output Operations in DStreams

Sliding Window Operations

Stateful Operations

Word count analysis

 

Understanding Apache Kafka and Kafka Cluster

Objectives

Understand Kafka and Kafka Architecture

Learn how to configure different types of Kafka clusters

 

Contents

Need for Kafka

What is Kafka?

Core Concepts of Kafka

Kafka Architecture

Where is Kafka Used?

Understanding the Components of Kafka Cluster

Configuring Kafka Cluster

Producer and Consumer

Hands On

Configuring Single Node Single Broker Cluster

Configuring Single Node Multi Broker Cluster

 

Capturing Data with Apache Flume and Integration with Kafka

Objectives

Understand Apache Flume and its basic architecture

Integrate Flume with Apache Kafka for event processing

Contents

Need for Apache Flume

What is Apache Flume?

Basic Flume Architecture

Flume Sources

Flume Sinks

Flume Channels

Flume Configuration

Integrating Apache Flume and Apache Kafka

Hands On

Flume Commands

Setting up Flume Agent

Streaming Access Logs into HDFS

 

Live projects on Apache Spark

 

YouTube Data Analysis

Analyze YouTube data and generate insights such as the top 10 videos in various categories, user demographics, number of views, and ratings. The data contains fields such as Id, Age, Category, Length, Views, Ratings, and Comments.

 

Titanic Data Analysis

The sinking of the Titanic was one of the deadliest peacetime maritime disasters in history, caused by a combination of natural events and human error. The objective is to analyze Titanic data sets and generate insights related to age, gender, survival, passenger class, port of embarkation, etc.

 

Twitter Trends Analysis

Collect Twitter data in real time and find out what is currently trending on Twitter in various categories. In this project, we will collect live Twitter streams and analyze them using Spark Streaming to find the current trends in Politics, Finance, Entertainment, etc.