Course Outline

Section 1: Introduction to Hadoop

  • Hadoop history and concepts
  • ecosystem
  • distributions
  • high-level architecture
  • Hadoop myths
  • Hadoop challenges
  • hardware / software
  • lab : first look at Hadoop

Section 2: HDFS

  • Design and architecture
  • concepts (horizontal scaling, replication, data locality, rack awareness)
  • daemons : NameNode, Secondary NameNode, DataNode
  • communications / heartbeats
  • data integrity
  • read / write path
  • NameNode High Availability (HA), Federation
  • labs : Interacting with HDFS

Section 3: MapReduce

  • concepts and architecture
  • daemons (MRv1) : JobTracker / TaskTracker
  • phases : driver, mapper, shuffle/sort, reducer
  • MapReduce version 1 and version 2 (YARN)
  • internals of MapReduce
  • introduction to writing a Java MapReduce program
  • labs : Running a sample MapReduce program
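
As a rough preview of the phases listed above (driver, mapper, shuffle/sort, reducer), here is a minimal word-count sketch in plain Java. It does not use the Hadoop API and runs in a single JVM; the class and method names (WordCountPhases, map, shuffle, reduce, run) are illustrative assumptions, not part of the course material or of Hadoop itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, in-memory illustration of the MapReduce phases.
// This is NOT the Hadoop API; it only mimics the data flow.
class WordCountPhases {

    // Map phase: emit a (word, 1) pair for every word in one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    // Shuffle/sort phase: group all values by key, with keys kept in sorted order.
    static TreeMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: collapse the list of values for one key into a single count.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Driver: wires the phases together for a small list of input lines.
    static TreeMap<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            mapped.addAll(map(line));
        }
        TreeMap<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(mapped).entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("big data", "big cluster")));
    }
}
```

In a real cluster the mappers and reducers run as separate tasks on different nodes and the shuffle moves data over the network; the sketch only shows how key/value pairs flow from one phase to the next.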

Section 4: Pig

  • Pig vs Java MapReduce
  • Pig job flow
  • Pig Latin language
  • ETL with Pig
  • Transformations & Joins
  • User defined functions (UDF)
  • labs : writing Pig scripts to analyze data

Section 5: Hive

  • architecture and design
  • data types
  • SQL support in Hive
  • Creating Hive tables and querying
  • partitions
  • joins
  • text processing
  • labs : various labs on processing data with Hive

Section 6: HBase

  • concepts and architecture
  • HBase vs RDBMS vs Cassandra
  • HBase Java API
  • Time series data on HBase
  • schema design
  • labs : interacting with HBase using the shell; programming with the HBase Java API; schema design exercise

Requirements

  • comfortable with the Java programming language (most programming exercises are in Java)
  • comfortable in a Linux environment (able to navigate the Linux command line and edit files using vi / nano)

Lab environment

Zero Install : There is no need to install Hadoop software on students’ machines. A working Hadoop cluster will be provided for students.

Students will need the following:

  • an SSH client (Linux and Mac already include SSH clients; for Windows, PuTTY is recommended)
  • a browser to access the cluster (Firefox is recommended)

Duration: 28 hours