Big Data Analysis with Python

The goal of this course is to learn how to use Python and Spark to ingest, process, and analyze large volumes of data with different structures to generate insights and useful metrics from the data, walking through real-life examples and use cases.

Ivan Marin, Sarang VK, Ankit Shukla

This title is available to pre-order now and is expected to be published in

Book Details

ISBN 13: 9781789955286
Paperback: 403 pages

Book Description

Big Data is here to stay, as more and more companies see the value of storing data, whether generated internally or not. But as with every new technology, using it is not enough if no value is generated from it. Analyzing these datasets is a fundamental step in extracting the value locked in the data. In this process, Python has become one of the most widely used programming languages for processing and analyzing data, thanks to its ease of use, rich ecosystem, and powerful libraries, and its adoption is still growing.

This course introduces data manipulation in Python using Pandas, including the generation of statistics, metrics, and plots. The next step is to run the same analysis distributed across several computers, using Dask. Data aggregation for plots when the data does not all fit into memory is also addressed. For really large problems and datasets, an introduction to Hadoop (HDFS and YARN) is presented. The rest of the course focuses on Spark and its interaction with the tools presented earlier.
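As a rough illustration of aggregating data that does not fit into memory, the sketch below streams a CSV source in fixed-size chunks and keeps only running totals. It uses only the Python standard library rather than Pandas or Dask, so that it runs without extra dependencies; the function name `stream_stats`, the column name `value`, and the chunk size are assumptions made for this example and are not taken from the course.

```python
import csv
import io
from itertools import islice


def stream_stats(lines, column, chunk_size=1000):
    """Compute count, mean, min and max of one numeric CSV column,
    reading at most `chunk_size` rows at a time."""
    reader = csv.DictReader(lines)
    count, total = 0, 0.0
    lo, hi = float("inf"), float("-inf")
    while True:
        # Pull the next chunk; only this chunk is held in memory.
        chunk = [float(row[column]) for row in islice(reader, chunk_size)]
        if not chunk:
            break
        # Fold the chunk into the running aggregates, then discard it.
        count += len(chunk)
        total += sum(chunk)
        lo = min(lo, min(chunk))
        hi = max(hi, max(chunk))
    if count == 0:
        raise ValueError("no rows to aggregate")
    return {"count": count, "mean": total / count, "min": lo, "max": hi}


# Hypothetical in-memory data standing in for a large file on disk.
sample = io.StringIO("value\n1\n2\n3\n4\n")
stats = stream_stats(sample, "value", chunk_size=2)
```

With Pandas, a comparable pattern is available through the `chunksize` parameter of `pandas.read_csv`, which yields one DataFrame per chunk instead of loading the whole file.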

By the end of the course, students will be able to bootstrap their own Python environment, read large files and more data than can fit into memory, connect to Hadoop systems and manipulate data from there, and generate statistics, metrics, and graphs that represent the information in the dataset.

This approach differs from the more common approaches to Big Data problems, which usually rely on MapReduce or SQL-over-HDFS tools such as Hive or Impala. Instead, the course builds from the small, single-machine case to the distributed one, using the similar interfaces across the presented stack to make the concepts easier to understand and the final goal easier to achieve.

What You Will Learn

  • Read and transform data into different formats using Python
  • Read large volumes of data on disk and manipulate it to generate basic statistics and metrics
  • Handle distributed computing tasks over a cluster or local machines interconnected by a network
  • Convert data from different sources to efficient formats for storage or querying, like Parquet
  • Process, transform and aggregate data to generate clean datasets ready to be used in statistical analysis, visualization, and machine learning
  • Explore data visually, enabling other analysts and decision makers to act on information extracted from data
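The distributed-computing item above can be pictured with a small partition-then-combine sketch: each partition is aggregated on its own, and the partial results are merged afterwards. This is only an illustration of the general idea, assuming threads on one machine in place of cluster workers; it is not how Dask or Spark are implemented, and the names `partial_sums`, `merge`, and the sample partitions are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def partial_sums(partition):
    """Map step: per-key [sum, count] for one partition of (key, value) rows."""
    acc = defaultdict(lambda: [0.0, 0])
    for key, value in partition:
        acc[key][0] += value
        acc[key][1] += 1
    return dict(acc)


def merge(partials):
    """Reduce step: combine per-partition results into per-key means."""
    merged = defaultdict(lambda: [0.0, 0])
    for part in partials:
        for key, (total, count) in part.items():
            merged[key][0] += total
            merged[key][1] += count
    return {key: total / count for key, (total, count) in merged.items()}


# Hypothetical partitions, as if the dataset were split across workers.
partitions = [
    [("a", 1.0), ("b", 10.0)],
    [("a", 3.0), ("b", 20.0), ("b", 30.0)],
]

# Each partition is aggregated independently; only the small
# per-key summaries are brought together in the merge step.
with ThreadPoolExecutor(max_workers=2) as pool:
    means = merge(pool.map(partial_sums, partitions))
```

Keeping sums and counts separate until the merge step is what makes the per-key mean correct even when partitions have different sizes.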
