Hadoop: Data Processing and Modelling

Unlock the power of your data with the Hadoop 2.X ecosystem and its data warehousing techniques across large data sets

Garry Turkington, Tanmay Deshpande, Sandeep Karanth



Book Details

ISBN 13: 9781787125162
Paperback: 979 pages

Book Description

As Marc Andreessen has said, "software is eating the world," and nowhere is that more visible than in today's age of Big Data: businesses produce data in huge volumes every day, and this rising tide of data needs to be organized and analyzed securely. With proper and effective use of Hadoop, you can build new, improved models, and on that basis make the right decisions.

The first module, Hadoop Beginner's Guide, walks you through understanding Hadoop with detailed instructions and shows you how to put it to work. Each command is explained in a "What just happened" section for extra clarity.

The second module, Hadoop Real-World Solutions Cookbook, Second Edition, is an essential tutorial for implementing an effective big data warehouse in your business, with detailed, hands-on recipes covering the latest technologies such as YARN and Spark.

Big data has become a key basis of competition and a new wave of productivity growth. Once you are familiar with the basics and have implemented end-to-end big data use cases, you will be ready to explore the third module, Mastering Hadoop.

So, if you need to take your Hadoop skill set to the next level after you have nailed the basics and the advanced concepts, this course is indispensable.
When you finish it, you will be able to tackle real-world scenarios and become a big data expert using the tools and knowledge built through its step-by-step tutorials and recipes, a flavor of which is sketched below.
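
The first of those tutorials builds up WordCount, described in the table of contents as "the Hello World of MapReduce". As a taste of that hands-on material, here is a minimal WordCount job against the Hadoop MapReduce Java API. This is an illustrative sketch in the spirit of the classic Apache example, not the book's own listing, and the input and output paths are placeholders supplied on the command line:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Mapper: emit (word, 1) for every token in the input line.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reducer: sum the per-word counts; also reused as a combiner.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // combiner cuts shuffle traffic
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Packaged into a JAR, it runs with hadoop jar wordcount.jar WordCount <input> <output>, where both arguments are HDFS paths; building the JAR and running it locally and then on EMR is exactly the flow Chapters 2 and 3 walk through.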

Table of Contents

Chapter 1: What It's All About
Big data processing
Cloud computing with Amazon Web Services
Summary
Chapter 2: Getting Hadoop Up and Running
Hadoop on a local Ubuntu host
Time for action – checking the prerequisites
Time for action – downloading Hadoop
Time for action – setting up SSH
Time for action – using Hadoop to calculate Pi
Time for action – configuring the pseudo-distributed mode
Time for action – changing the base HDFS directory
Time for action – formatting the NameNode
Time for action – starting Hadoop
Time for action – using HDFS
Time for action – WordCount, the Hello World of MapReduce
Using Elastic MapReduce
Time for action – WordCount on EMR using the management console
Comparison of local versus EMR Hadoop
Summary
Chapter 3: Understanding MapReduce
Key/value pairs
The Hadoop Java API for MapReduce
Writing MapReduce programs
Time for action – setting up the classpath
Time for action – implementing WordCount
Time for action – building a JAR file
Time for action – running WordCount on a local Hadoop cluster
Time for action – running WordCount on EMR
Time for action – WordCount the easy way
Walking through a run of WordCount
Time for action – WordCount with a combiner
Time for action – fixing WordCount to work with a combiner
Hadoop-specific data types
Time for action – using the Writable wrapper classes
Input/output
Summary
Chapter 4: Developing MapReduce Programs
Using languages other than Java with Hadoop
Time for action – implementing WordCount using Streaming
Analyzing a large dataset
Time for action – summarizing the UFO data
Time for action – summarizing the shape data
Time for action – correlating sighting duration to UFO shape
Time for action – performing the shape/time analysis from the command line
Time for action – using ChainMapper for field validation/analysis
Time for action – using the Distributed Cache to improve location output
Counters, status, and other output
Time for action – creating counters, task states, and writing log output
Summary
Chapter 5: Advanced MapReduce Techniques
Simple, advanced, and in-between
Joins
Time for action – reduce-side join using MultipleInputs
Graph algorithms
Time for action – representing the graph
Time for action – creating the source code
Time for action – the first run
Time for action – the second run
Time for action – the third run
Time for action – the fourth and last run
Using language-independent data structures
Time for action – getting and installing Avro
Time for action – defining the schema
Time for action – creating the source Avro data with Ruby
Time for action – consuming the Avro data with Java
Time for action – generating shape summaries in MapReduce
Time for action – examining the output data with Ruby
Time for action – examining the output data with Java
Summary
Chapter 6: When Things Break
Failure
Time for action – killing a DataNode process
Time for action – the replication factor in action
Time for action – intentionally causing missing blocks
Time for action – killing a TaskTracker process
Time for action – killing the JobTracker
Time for action – killing the NameNode process
Time for action – causing task failure
Time for action – handling dirty data by using skip mode
Summary
Chapter 7: Keeping Things Running
A note on EMR
Hadoop configuration properties
Time for action – browsing default properties
Setting up a cluster
Time for action – examining the default rack configuration
Time for action – adding a rack awareness script
Cluster access control
Time for action – demonstrating the default security
Managing the NameNode
Time for action – adding an additional fsimage location
Time for action – swapping to a new NameNode host
Managing HDFS
MapReduce management
Time for action – changing job priorities and killing a job
Scaling
Summary
Chapter 8: A Relational View on Data with Hive
Overview of Hive
Setting up Hive
Time for action – installing Hive
Using Hive
Time for action – creating a table for the UFO data
Time for action – inserting the UFO data
Time for action – validating the table
Time for action – redefining the table with the correct column separator
Time for action – creating a table from an existing file
Time for action – performing a join
Time for action – using views
Time for action – exporting query output
Time for action – making a partitioned UFO sighting table
Time for action – adding a new User Defined Function (UDF)
Hive on Amazon Web Services
Time for action – running UFO analysis on EMR
Summary
Chapter 9: Working with Relational Databases
Common data paths
Setting up MySQL
Time for action – installing and setting up MySQL
Time for action – configuring MySQL to allow remote connections
Time for action – setting up the employee database
Getting data into Hadoop
Time for action – downloading and configuring Sqoop
Time for action – exporting data from MySQL to HDFS
Time for action – exporting data from MySQL into Hive
Time for action – a more selective import
Time for action – using a type mapping
Time for action – importing data from a raw query
Getting data out of Hadoop
Time for action – importing data from Hadoop into MySQL
Time for action – importing Hive data into MySQL
Time for action – fixing the mapping and re-running the export
AWS considerations
Summary
Chapter 10: Data Collection with Flume
A note about AWS
Data data everywhere...
Time for action – getting web server data into Hadoop
Introducing Apache Flume
Time for action – installing and configuring Flume
Time for action – capturing network traffic in a log file
Time for action – logging to the console
Time for action – capturing the output of a command to a flat file
Time for action – capturing a remote file in a local flat file
Time for action – writing network traffic onto HDFS
Time for action – adding timestamps
Time for action – multi-level Flume networks
Time for action – writing to multiple sinks
The bigger picture
Summary
Chapter 11: Where to Go Next
What we did and didn't cover in this book
Upcoming Hadoop changes
Alternative distributions
Other Apache projects
Other programming abstractions
AWS resources
Sources of information
Summary
Chapter 12: Getting Started with Hadoop 2.X
Introduction
Installing a single-node Hadoop cluster
Installing a multi-node Hadoop cluster
Adding new nodes to existing Hadoop clusters
Executing the balancer command for uniform data distribution
Entering and exiting from the safe mode in a Hadoop cluster
Decommissioning DataNodes
Performing benchmarking on a Hadoop cluster
Chapter 13: Exploring HDFS
Introduction
Loading data from a local machine to HDFS
Exporting HDFS data to a local machine
Changing the replication factor of an existing file in HDFS
Setting the HDFS block size for all the files in a cluster
Setting the HDFS block size for a specific file in a cluster
Enabling transparent encryption for HDFS
Importing data from another Hadoop cluster
Recycling deleted data from trash to HDFS
Saving compressed data in HDFS
Chapter 14: Mastering Map Reduce Programs
Introduction
Writing the Map Reduce program in Java to analyze web log data
Executing the Map Reduce program in a Hadoop cluster
Adding support for a new writable data type in Hadoop
Implementing a user-defined counter in a Map Reduce program
Map Reduce program to find the top X
Map Reduce program to find distinct values
Map Reduce program to partition data using a custom partitioner
Writing Map Reduce results to multiple output files
Performing Reduce side Joins using Map Reduce
Unit testing the Map Reduce code using MRUnit
Chapter 15: Data Analysis Using Hive, Pig, and HBase
Introduction
Storing and processing Hive data in a sequential file format
Storing and processing Hive data in the RC file format
Storing and processing Hive data in the ORC file format
Storing and processing Hive data in the Parquet file format
Performing FILTER By queries in Pig
Performing Group By queries in Pig
Performing Order By queries in Pig
Performing JOINS in Pig
Writing a user-defined function in Pig
Analyzing web log data using Pig
Performing HBase operations in the CLI
Performing HBase operations in Java
Executing MapReduce programs with an HBase table
Chapter 16: Advanced Data Analysis Using Hive
Introduction
Processing JSON data in Hive using JSON SerDe
Processing XML data in Hive using XML SerDe
Processing Hive data in the Avro format
Writing a user-defined function in Hive
Performing table joins in Hive
Executing map side joins in Hive
Performing context Ngram in Hive
Call Data Record Analytics using Hive
Twitter sentiment analysis using Hive
Implementing Change Data Capture using Hive
Multiple table inserting using Hive
Chapter 17: Data Import/Export Using Sqoop and Flume
Introduction
Importing data from RDBMS to HDFS using Sqoop
Exporting data from HDFS to RDBMS
Using query operator in Sqoop import
Importing data using Sqoop in compressed format
Performing Atomic export using Sqoop
Importing data into Hive tables using Sqoop
Importing data into HDFS from Mainframes
Incremental import using Sqoop
Creating and executing a Sqoop job
Importing data from RDBMS to HBase using Sqoop
Importing Twitter data into HDFS using Flume
Importing data from Kafka into HDFS using Flume
Importing web log data into HDFS using Flume
Chapter 18: Automation of Hadoop Tasks Using Oozie
Introduction
Implementing a Sqoop action job using Oozie
Implementing a Map Reduce action job using Oozie
Implementing a Java action job using Oozie
Implementing a Hive action job using Oozie
Implementing a Pig action job using Oozie
Implementing an e-mail action job using Oozie
Executing parallel jobs using Oozie (fork)
Scheduling a job in Oozie
Chapter 19: Machine Learning and Predictive Analytics Using Mahout and R
Introduction
Setting up the Mahout development environment
Creating an item-based recommendation engine using Mahout
Creating a user-based recommendation engine using Mahout
Performing predictive analytics on bank data using Mahout
Clustering text data using K-Means
Performing Population Data Analytics using R
Performing Twitter Sentiment Analytics using R
Performing Predictive Analytics using R
Chapter 20: Integration with Apache Spark
Introduction
Running Spark standalone
Running Spark on YARN
Olympics Athletes analytics using the Spark Shell
Creating Twitter trending topics using Spark Streaming
Analyzing Parquet files using Spark
Analyzing JSON data using Spark
Processing graphs using GraphX
Conducting predictive analytics using Spark MLlib
Chapter 21: Hadoop Use Cases
Introduction
Call Data Record analytics
Web log analytics
Sensitive data masking and encryption using Hadoop
Chapter 22: Hadoop 2.X
The inception of Hadoop
The evolution of Hadoop
Hadoop 2.X
Hadoop distributions
Summary
Chapter 23: Advanced MapReduce
MapReduce input
The RecordReader class
Hadoop's "small files" problem
Filtering inputs
The Map task
The Reduce task
MapReduce output
MapReduce job counters
Handling data joins
Summary
Chapter 24: Advanced Pig
Pig versus SQL
Different modes of execution
Complex data types in Pig
Compiling Pig scripts
Development and debugging aids
The advanced Pig operators
User-defined functions
Pig performance optimizations
Best practices
Summary
Chapter 25: Advanced Hive
The Hive architecture
Data types
File formats
The data model
Hive query optimizers
Advanced DML
UDF, UDAF, and UDTF
Summary
Chapter 26: Serialization and Hadoop I/O
Data serialization in Hadoop
Avro serialization
File formats
Compression
Summary
Chapter 27: YARN – Bringing Other Paradigms to Hadoop
The YARN architecture
Developing YARN applications
Monitoring YARN
Job scheduling in YARN
YARN commands
Summary
Chapter 28: Storm on YARN – Low Latency Processing in Hadoop
Batch processing versus streaming
Apache Storm
Storm on YARN
Summary
Chapter 29: Hadoop on the Cloud
Cloud computing characteristics
Hadoop on the cloud
Amazon Elastic MapReduce (EMR)
Summary
Chapter 30: HDFS Replacements
HDFS – advantages and drawbacks
Amazon AWS S3
Implementing a filesystem in Hadoop
Implementing an S3 native filesystem in Hadoop
Summary
Chapter 31: HDFS Federation
Limitations of the older HDFS architecture
Architecture of HDFS Federation
HDFS high availability
HDFS block placement
Summary
Chapter 32: Hadoop Security
The security pillars
Authentication in Hadoop
Authorization in Hadoop
Data confidentiality in Hadoop
Audit logging in Hadoop
Summary
Chapter 33: Analytics Using Hadoop
Data analytics workflow
Machine learning
Apache Mahout
Document analysis using Hadoop and Mahout
RHadoop
Summary
Chapter 34: Hadoop for Microsoft Windows
Deploying Hadoop on Microsoft Windows
Summary

What You Will Learn

  • Apply best practices for setting up and configuring Hadoop clusters, tailoring the system to the problem at hand
  • Integrate with relational databases, using Hive for SQL queries and Sqoop for data transfer
  • Install and maintain a Hadoop 2.X cluster and its ecosystem
  • Perform advanced data analysis using Hive, Pig, and MapReduce programs
  • Apply machine learning principles with libraries such as Mahout, and perform batch and stream data processing with Apache Spark (see the sketch after this list)
  • Understand the changes involved in moving from Hadoop 1.0 to Hadoop 2.0
  • Dive into YARN and Storm, and use YARN to integrate Storm with Hadoop
  • Deploy Hadoop on Amazon Elastic MapReduce, discover HDFS replacements, and learn about HDFS Federation
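
Since the Spark point above covers batch and stream processing in only a sentence, here is the promised sketch: a small batch job using Spark's Java API. It runs in local mode for experimentation; the class name SparkLineCount and the line-counting task are illustrative assumptions, not code from the book:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class SparkLineCount {
      public static void main(String[] args) {
        // Local mode for experimentation; on a Hadoop 2.X cluster you would
        // submit with spark-submit --master yarn instead of hard-coding a master.
        SparkConf conf = new SparkConf().setAppName("line-count").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
          // args[0] can be a local file or an HDFS path such as hdfs://...
          JavaRDD<String> lines = sc.textFile(args[0]);
          long nonEmpty = lines.filter(line -> !line.trim().isEmpty()).count();
          System.out.println("Non-empty lines: " + nonEmpty);
        }
      }
    }

Submitted with --master yarn, the same code takes the "Running Spark on YARN" path covered in Chapter 20.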

Authors

Garry Turkington, Tanmay Deshpande, Sandeep Karanth

