Hadoop Beginner's Guide

Get your mountain of data under control with Hadoop. This guide requires no prior knowledge of the software or cloud services – just a willingness to learn the basics from this practical step-by-step tutorial.

Garry Turkington

Book Details

ISBN 13: 9781849517300
Paperback: 398 pages

Book Description

Data is arriving faster than you can process it, and the overall volumes keep growing at a rate that keeps you awake at night. Hadoop can help you tame the data beast. Effective use of Hadoop, however, requires a mixture of programming, design, and system administration skills.

"Hadoop Beginner's Guide" removes the mystery from Hadoop, presenting Hadoop and related technologies with a focus on building working systems and getting the job done, using cloud services where it makes sense. From basic concepts and initial setup through developing applications and keeping the system running as the data grows, the book gives you the understanding needed to use Hadoop effectively to solve real-world problems.

Starting with the basics of installing and configuring Hadoop, the book explains how to develop applications, maintain the system, and use additional products to integrate with other systems.

As well as showing different ways to develop applications that run on Hadoop, the book covers tools such as Hive, Sqoop, and Flume, which show how Hadoop can be integrated with relational databases and log collection.

In addition to examples on Hadoop clusters running on Ubuntu, the book covers cloud services such as Amazon EC2 and Elastic MapReduce.

Table of Contents

Chapter 1: What It's All About
Big data processing
Cloud computing with Amazon Web Services
Summary
Chapter 2: Getting Hadoop Up and Running
Hadoop on a local Ubuntu host
Time for action – checking the prerequisites
Time for action – downloading Hadoop
Time for action – setting up SSH
Time for action – using Hadoop to calculate Pi
Time for action – configuring the pseudo-distributed mode
Time for action – changing the base HDFS directory
Time for action – formatting the NameNode
Time for action – starting Hadoop
Time for action – using HDFS
Time for action – WordCount, the Hello World of MapReduce
Using Elastic MapReduce
Time for action – WordCount on EMR using the management console
Comparison of local versus EMR Hadoop
Summary
Chapter 3: Understanding MapReduce
Key/value pairs
The Hadoop Java API for MapReduce
Writing MapReduce programs
Time for action – setting up the classpath
Time for action – implementing WordCount
Time for action – building a JAR file
Time for action – running WordCount on a local Hadoop cluster
Time for action – running WordCount on EMR
Time for action – WordCount the easy way
Walking through a run of WordCount
Time for action – WordCount with a combiner
Time for action – fixing WordCount to work with a combiner
Hadoop-specific data types
Time for action – using the Writable wrapper classes
Input/output
Summary
Chapter 4: Developing MapReduce Programs
Using languages other than Java with Hadoop
Time for action – implementing WordCount using Streaming
Analyzing a large dataset
Time for action – summarizing the UFO data
Time for action – summarizing the shape data
Time for action – correlating sighting duration to UFO shape
Time for action – performing the shape/time analysis from the command line
Time for action – using ChainMapper for field validation/analysis
Time for action – using the Distributed Cache to improve location output
Counters, status, and other output
Time for action – creating counters, task states, and writing log output
Summary
Chapter 5: Advanced MapReduce Techniques
Simple, advanced, and in-between
Joins
Time for action – reduce-side join using MultipleInputs
Graph algorithms
Time for action – representing the graph
Time for action – creating the source code
Time for action – the first run
Time for action – the second run
Time for action – the third run
Time for action – the fourth and last run
Using language-independent data structures
Time for action – getting and installing Avro
Time for action – defining the schema
Time for action – creating the source Avro data with Ruby
Time for action – consuming the Avro data with Java
Time for action – generating shape summaries in MapReduce
Time for action – examining the output data with Ruby
Time for action – examining the output data with Java
Summary
Chapter 6: When Things Break
Failure
Time for action – killing a DataNode process
Time for action – the replication factor in action
Time for action – intentionally causing missing blocks
Time for action – killing a TaskTracker process
Time for action – killing the JobTracker
Time for action – killing the NameNode process
Time for action – causing task failure
Time for action – handling dirty data by using skip mode
Summary
Chapter 7: Keeping Things Running
A note on EMR
Hadoop configuration properties
Time for action – browsing default properties
Setting up a cluster
Time for action – examining the default rack configuration
Time for action – adding a rack awareness script
Cluster access control
Time for action – demonstrating the default security
Managing the NameNode
Time for action – adding an additional fsimage location
Time for action – swapping to a new NameNode host
Managing HDFS
MapReduce management
Time for action – changing job priorities and killing a job
Scaling
Summary
Chapter 8: A Relational View on Data with Hive
Overview of Hive
Setting up Hive
Time for action – installing Hive
Using Hive
Time for action – creating a table for the UFO data
Time for action – inserting the UFO data
Time for action – validating the table
Time for action – redefining the table with the correct column separator
Time for action – creating a table from an existing file
Time for action – performing a join
Time for action – using views
Time for action – exporting query output
Time for action – making a partitioned UFO sighting table
Time for action – adding a new User Defined Function (UDF)
Hive on Amazon Web Services
Time for action – running UFO analysis on EMR
Summary
Chapter 9: Working with Relational Databases
Common data paths
Setting up MySQL
Time for action – installing and setting up MySQL
Time for action – configuring MySQL to allow remote connections
Time for action – setting up the employee database
Getting data into Hadoop
Time for action – downloading and configuring Sqoop
Time for action – exporting data from MySQL to HDFS
Time for action – exporting data from MySQL into Hive
Time for action – a more selective import
Time for action – using a type mapping
Time for action – importing data from a raw query
Getting data out of Hadoop
Time for action – importing data from Hadoop into MySQL
Time for action – importing Hive data into MySQL
Time for action – fixing the mapping and re-running the export
AWS considerations
Summary
Chapter 10: Data Collection with Flume
A note about AWS
Data data everywhere...
Time for action – getting web server data into Hadoop
Introducing Apache Flume
Time for action – installing and configuring Flume
Time for action – capturing network traffic in a log file
Time for action – logging to the console
Time for action – capturing the output of a command to a flat file
Time for action – capturing a remote file in a local flat file
Time for action – writing network traffic onto HDFS
Time for action – adding timestamps
Time for action – multi level Flume networks
Time for action – writing to multiple sinks
The bigger picture
Summary
Chapter 11: Where to Go Next
What we did and didn't cover in this book
Upcoming Hadoop changes
Alternative distributions
Other Apache projects
Other programming abstractions
AWS resources
Sources of information
Summary

What You Will Learn

  • The trends that led to Hadoop and cloud services, giving the background to know when to use the technology
  • Best practices for setup and configuration of Hadoop clusters, tailoring the system to the problem at hand
  • Developing applications to run on Hadoop with examples in Java and Ruby
  • How Amazon Web Services can be used to deliver a hosted Hadoop solution and how this differs from directly-managed environments
  • Integration with relational databases, using Hive for SQL queries and Sqoop for data transfer
  • How Flume can collect data from multiple sources and deliver it to Hadoop for processing
  • What other projects and tools make up the broader Hadoop ecosystem and where to go next
