Course Outline
Introduction
- Hadoop history and core concepts
- Ecosystem overview
- Distributions
- High-level architecture
- Common Hadoop myths
- Hadoop challenges (hardware and software)
- Labs: Discussion of participants' Big Data projects and challenges
Planning and installation
- Selecting software and Hadoop distributions
- Cluster sizing and growth planning
- Selecting hardware and network infrastructure
- Rack topology
- Installation procedures
- Multi-tenancy configurations
- Directory structure and log management
- Benchmarking techniques
- Labs: Cluster installation and performance benchmarking
HDFS operations
- Core concepts (horizontal scaling, replication, data locality, rack awareness)
- Nodes and daemons (NameNode, Secondary NameNode, HA Standby NameNode, DataNode)
- Health monitoring strategies
- Administration via command-line and browser interfaces
- Adding storage and replacing defective drives
- Labs: Getting familiar with the HDFS command line
Data ingestion
- Using Flume to ingest logs and other data into HDFS
- Using Sqoop to import data from SQL databases into HDFS and export it back
- Hadoop data warehousing with Hive
- Transferring data between clusters using distcp
- Leveraging S3 as a complement to HDFS
- Best practices and architectures for data ingestion
- Labs: Setting up and utilizing Flume and Sqoop
MapReduce operations and administration
- Parallel computing before MapReduce: Comparing HPC with Hadoop administration
- MapReduce cluster loads
- Nodes and daemons (JobTracker, TaskTracker)
- Walkthrough of the MapReduce UI
- MapReduce configuration
- Job configuration
- Optimizing MapReduce performance
- Ensuring robustness: Guidance for programmers
- Labs: Running MapReduce examples
YARN: New architecture and capabilities
- YARN design goals and implementation architecture
- New actors: ResourceManager, NodeManager, ApplicationMaster
- Installing YARN
- Job scheduling under YARN
- Labs: Investigating job scheduling mechanisms
Advanced topics
- Hardware monitoring
- Cluster monitoring
- Adding and removing servers, and upgrading Hadoop
- Backup, recovery, and business continuity planning
- Oozie job workflows
- Hadoop high availability (HA)
- Hadoop Federation
- Securing your cluster with Kerberos
- Labs: Setting up monitoring systems
Optional tracks
- Cloudera Manager for cluster administration, monitoring, and routine tasks; installation and usage. In this track, all exercises and labs are performed within the Cloudera distribution environment (CDH5)
- Ambari for cluster administration, monitoring, and routine tasks; installation and usage. In this track, all exercises and labs are performed within the Ambari cluster manager and Hortonworks Data Platform (HDP 2.0)
Requirements
- Familiarity with basic Linux system administration
- Fundamental scripting skills
No prior knowledge of Hadoop or distributed computing is required; both are introduced and explained during the course.
Lab environment
Zero Install: There is no need to install Hadoop software on students’ personal machines. A fully functional Hadoop cluster will be provided for all exercises.
Students will need the following:
- An SSH client (Linux and Mac systems come with built-in SSH clients; PuTTY is recommended for Windows)
- A web browser to access the cluster interface. We recommend using Firefox with the FoxyProxy extension installed.
21 Hours
Testimonials (1)
Hands-on exercises. Class should have been 5 days, but the 3 days helped to clear up a lot of questions that I had from working with NiFi already.