In this guide, you will learn:
- The definition of a batch
- Why you should process datasets in batches
- How to split a dataset into batches in Python
- The dataset map() batched option approach
Let’s dive in!
What Is a Batch?
In the world of ML and data processing, a batch is nothing more than a subset of a dataset. Batches are typically used to efficiently handle large volumes of data. Instead of processing an entire dataset at once, data is split into smaller chunks, also called batches. Each batch can be processed independently, helping reduce memory usage and improve computational efficiency.
For example, suppose that you have some sample data in CSV format:
id,name,age,score
1,John,28,85
2,Jane,34,90
3,Bob,25,72
4,Alice,30,88
5,Charlie,29,91
6,David,35,79
7,Eve,22,95
8,Frank,31,82
9,Grace,27,86
10,Hannah,26,80
A batch of the above dataset is:
6,David,35,79
7,Eve,22,95
8,Frank,31,82
9,Grace,27,86
10,Hannah,26,80
This batch corresponds to the last five rows of the original dataset, i.e., the zero-indexed slice rows[5:10].
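To make this concrete, here is a minimal sketch that extracts that batch with a plain slice, assuming the CSV above is saved as data.csv (a hypothetical filename):
import csv

# load the sample CSV into a list of rows, skipping the header
with open("data.csv") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    rows = list(reader)

# rows[5:10] selects the five rows with ids 6 through 10
batch = rows[5:10]
print(batch)
# [['6', 'David', '35', '79'], ..., ['10', 'Hannah', '26', '80']]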
Benefits of Processing a Dataset in Batches
Suppose you have a dataset that you created using some data sourcing techniques. If you are unfamiliar with that process, follow our guide on how to create a dataset.
Now, why would you want to process this dataset in batches? Because that leads to the following benefits:
- Memory efficiency: Work with smaller, more manageable portions at a time, instead of loading the entire dataset into memory
- Faster processing: Batch processing can be parallelized, reducing the time required to process large datasets
- Better training for ML models: Help train machine learning models by updating weights incrementally, which can result in more stable and faster convergence
- Improved scalability: Make it easier to scale your processing to large datasets that may not fit into memory all at once
How To Split a Dataset Into Batches: Top 5 Approaches
Before exploring the best Python methods for splitting a dataset into batches, we should identify some criteria for evaluating these approaches. Here is a meaningful list of aspects to consider:
- Implementation: A snippet to show how to use the approach in a simple example.
- Scenarios: The real-world situations where the dataset-splitting approach is applicable.
- Input: The types of dataset files and data structures the splitting strategy supports.
- Pros: The benefits provided by the approach.
- Cons: The limitations or downsides of the method.
Time to analyze them one by one!
Approach #1: Array Slicing
Array slicing is a straightforward method for splitting a dataset into smaller, more manageable batches. The idea is to divide a dataset (represented by a list, array, or other sequence) into chunks by slicing it.
👨‍💻 Implementation:
def create_batches(data, batch_size):
    return [data[i:i + batch_size] for i in range(0, len(data), batch_size)]

# example usage
data = list(range(1, 51))  # sample dataset
batches = create_batches(data, batch_size=5)
print(batches)
# output: [[1, 2, 3, 4, 5], ..., [46, 47, 48, 49, 50]]
🎯 Scenarios:
- Data preprocessing tasks where memory limitations are minimal
- Parallel data processing tasks requiring manageable in-memory chunks
- Simple batch processing in data pipelines
🔠 Input:
- Lists, arrays, and tuples in Python
- NumPy arrays
- CSV data loaded into memory as a list of rows
- Pandas DataFrames, if converted to lists or arrays
👍 Pros:
- Simple and easy to implement
- Does not require external libraries
- Provides direct control over batch sizes
👎 Cons:
- Limited by available memory
- Doesn’t support extremely large datasets or complex data structures
- Requires custom logic for data shuffling (see the sketch below)
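On that last point, here is a minimal sketch of such custom shuffling logic, reusing the create_batches() function defined above: randomize the item order before slicing.
import random

data = list(range(1, 51))
random.shuffle(data)  # shuffle in place so batches do not follow the original order
batches = create_batches(data, batch_size=5)
print(batches[0])  # e.g., [17, 3, 42, 8, 25]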
Approach #2: Generators
Python generators allow you to split a dataset into batches by yielding one batch at a time. If you are not familiar with that mechanism, a generator is a special type of function that behaves like an iterator. Instead of returning data directly, it uses the yield keyword to produce an iterator object. This gives batches the ability to be accessed sequentially using a for loop or the next() function.
👨‍💻 Implementation:
def data_generator(data, batch_size):
    for i in range(0, len(data), batch_size):
        yield data[i:i + batch_size]

# example usage
data = list(range(1, 51))  # sample dataset
for batch in data_generator(data, batch_size=5):
    print(batch)
# output:
# [1, 2, 3, 4, 5]
# ...
# [46, 47, 48, 49, 50]
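Since the generator is an iterator, you can also pull batches on demand with next(), as mentioned above:
gen = data_generator(data, batch_size=5)
print(next(gen))  # [1, 2, 3, 4, 5]
print(next(gen))  # [6, 7, 8, 9, 10]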
🎯 Scenarios:
- Batch processing in data pipelines, from simple to complex
- Large-scale data preprocessing and augmentation tasks
🔠 Input:
- Lists, arrays, and tuples
- NumPy arrays
- File-based datasets where loading each batch from disk is feasible (see the sketch after this approach's cons)
👍 Pros:
- Can handle large datasets without fully loading them into memory
- Minimal setup and easy to implement
- Enables controlled, on-demand data loading
👎 Cons:
- Limited by data order unless additional shuffling is implemented
- Less effective for dynamic or variable batch sizes
- May not be the best solution for parallel processing, especially for multi-threaded operations
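As for the file-based input mentioned earlier, here is a sketch of the same pattern applied to a CSV file on disk (data.csv is a hypothetical filename), so the full file never needs to be loaded into memory:
import csv

def csv_batch_generator(path, batch_size):
    with open(path) as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        batch = []
        for row in reader:
            batch.append(row)
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:  # yield any leftover rows as a final, smaller batch
            yield batch

# example usage
for batch in csv_batch_generator("data.csv", batch_size=5):
    print(batch)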
Approach #3: PyTorch DataLoader
The DataLoader class from PyTorch helps you efficiently split datasets into manageable batches. As a specialized data structure for handling datasets, it also provides useful features like shuffling and parallel data loading.
Note that DataLoader works with TensorDataset, another PyTorch data structure meant to represent a dataset. Specifically, a TensorDataset accepts two arguments:
- inps: The input data, typically in the form of a Tensor
- tgts: The labels or target values, also typically as a Tensor, corresponding to the input data
TensorDataset pairs the inputs and targets together, which can then be loaded by the DataLoader for batching and training.
👨‍💻 Implementation:
from torch.utils.data import DataLoader, TensorDataset
import torch

# data to define a simple dataset
inputs = torch.arange(1, 51).float().reshape(-1, 1)  # a 1D tensor dataset (input)
targets = inputs ** 2  # square of the input values (simulating a regression task)

# create a TensorDataset and DataLoader
dataset = TensorDataset(inputs, targets)
dataloader = DataLoader(dataset, batch_size=5, shuffle=True)

# iterate through the DataLoader
for batch in dataloader:
    print(batch)
# sample output:
# [tensor([[46.],
# [42.],
# [25.],
# [10.],
# [34.]]), tensor([[2116.],
# [1764.],
# [ 625.],
# [ 100.],
# [1156.]])]
# ...
# [tensor([[21.],
# [ 9.],
# [ 2.],
# [38.],
# [44.]]), tensor([[ 441.],
# [ 81.],
# [ 4.],
# [1444.],
# [1936.]])]
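As for the parallel data loading mentioned earlier, enabling it is a matter of one extra parameter. A minimal variation of the example above (2 worker processes is an arbitrary choice; on Windows and macOS this typically needs to run under an if __name__ == "__main__": guard):
# spawn 2 worker processes to load batches in parallel
dataloader = DataLoader(dataset, batch_size=5, shuffle=True, num_workers=2)

for batch_inputs, batch_targets in dataloader:
    print(batch_inputs.shape, batch_targets.shape)  # torch.Size([5, 1]) torch.Size([5, 1])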
🎯 Scenarios:
- Training and testing machine learning models in PyTorch
- Shuffling data for unbiased training batches
- Large-scale data processing for deep learning tasks
🔠 Input:
- Custom datasets loaded into PyTorch TensorDatasets
- Tabular data and numerical arrays in Tensor format
👍 Pros:
- Optimized for large datasets with batching and shuffling
- Supports parallel data loading, speeding up batch retrieval
- Works seamlessly with PyTorch models and training loops
- GPU-processing compatible
👎 Cons:
- Requires PyTorch
- Needs data conversion to tensors
- Not ideal for non-ML batch processing tasks
Approach #4: TensorFlow batch() Method
The TensorFlow Dataset class exposes the batch() method for splitting datasets into batches. This method divides the dataset into smaller chunks, with features such as parallelization, control over processing order, and naming.
As a machine learning library, TensorFlow also offers additional features like shuffling, repeating, and prefetching.
👨‍💻 Implementation:
import tensorflow as tf

# create a sample dataset
inputs = tf.range(1, 51, dtype=tf.float32)  # a 1D tensor dataset (inputs)
targets = inputs ** 2  # square of input values (simulating a regression task)

# convert the inputs and targets into tf.data.Dataset
inputs_dataset = tf.data.Dataset.from_tensor_slices(inputs)
targets_dataset = tf.data.Dataset.from_tensor_slices(targets)

# create a dataset by zipping the inputs and targets together
dataset = tf.data.Dataset.zip((inputs_dataset, targets_dataset))

# produce a batched dataset
batched_dataset = dataset.batch(batch_size=5)
for batch in batched_dataset:
    print(batch)
# output:
# (<tf.Tensor: shape=(5,), dtype=float32, numpy=array([1., 2., 3., 4., 5.], dtype=float32)>, <tf.Tensor: shape=(5,), dtype=float32, numpy=array([ 1., 4., 9., 16., 25.], dtype=float32)>)
# ...
# (<tf.Tensor: shape=(5,), dtype=float32, numpy=array([46., 47., 48., 49., 50.], dtype=float32)>, <tf.Tensor: shape=(5,), dtype=float32, numpy=array([2116., 2209., 2304., 2401., 2500.], dtype=float32)>)
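Since batch() chains with the shuffling and prefetching features mentioned earlier, a variation of the pipeline above could look like this (the buffer size of 50 is an arbitrary choice):
# shuffle with a buffer, batch, and prefetch upcoming batches while the current one is consumed
pipeline = (
    dataset
    .shuffle(buffer_size=50)      # randomize the sample order
    .batch(batch_size=5)          # split into batches of 5
    .prefetch(tf.data.AUTOTUNE)   # overlap data preparation and consumption
)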
🎯 Scenarios:
- Training and testing machine learning models in TensorFlow
- Shuffling data for unbiased training batches
- Large-scale data processing for deep learning tasks
🔠 Input:
- TensorFlow tf.data.Dataset objects
- NumPy arrays (which can be converted to Dataset)
- TFRecord files, a special binary file format commonly used for storing large datasets in TensorFlow
👍 Pros:
- Optimized for efficient memory usage
- Seamlessly integrates with the TensorFlow ecosystem for model training and evaluation
- Supports shuffling, prefetching, and other useful features
- Supports a wide variety of data formats, including images, text, and structured data
👎 Cons:
- Requires TensorFlow
- For more complex datasets, additional setup may be needed to properly format and preprocess the data
- May introduce overhead for batching smaller datasets
Approach #5: HDF5 Format
HDF5 is a widely adopted data format for managing large datasets, especially when dealing with hierarchical data structures. It supports splitting a large dataset into chunks and storing them efficiently.
The h5py Python library provides tools for working with HDF5 files and loading them as NumPy data structures. This opens the door to batch processing of datasets by accessing specific slices or segments of data on demand.
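The example below reads from a dataset.h5 file, which it assumes already exists. Here is a minimal sketch to create a matching file, with the same input and target keys the example expects:
import h5py
import numpy as np

# write the sample inputs and their squares to dataset.h5
inputs = np.arange(1, 51).reshape(-1, 1)
targets = inputs ** 2

with h5py.File("dataset.h5", "w") as f:
    f.create_dataset("input", data=inputs)
    f.create_dataset("target", data=targets)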
👨‍💻 Implementation:
import h5py

# load and batch the data from the HDF5 file
def load_data_in_batches(batch_size=10):
    # open the HDF5 file
    with h5py.File("dataset.h5", "r") as f:
        inputs = f["input"]
        targets = f["target"]
        # batch the data with an iterator reading from disk
        for i in range(0, len(inputs), batch_size):
            yield inputs[i:i+batch_size], targets[i:i+batch_size]

# iterate through batches
for batch_input, batch_target in load_data_in_batches():
    print("Input Batch:", batch_input)
    print("Target Batch:", batch_target)
# output:
# Input Batch: [[ 1]
# [ 2]
# [ 3]
# [ 4]
# [ 5]
# [ 6]
# [ 7]
# [ 8]
# [ 9]
# [10]]
# Target Batch: [[ 1]
# [ 4]
# [ 9]
# [ 16]
# [ 25]
# [ 36]
# [ 49]
# [ 64]
# [ 81]
# [100]]
# ...
# Input Batch: [[41]
# [42]
# [43]
# [44]
# [45]
# [46]
# [47]
# [48]
# [49]
# [50]]
# Target Batch: [[1681]
# [1764]
# [1849]
# [1936]
# [2025]
# [2116]
# [2209]
# [2304]
# [2401]
# [2500]]
🎯 Scenarios:
- Ideal for very large datasets that cannot be loaded entirely into memory
- Useful when working with multi-dimensional data
- Suitable for storing and retrieving data from disk in an efficient, compressed format for machine learning tasks
🔠 Input:
- HDF5 files
👍 Pros:
- HDF5 supports data compression and chunking, reducing storage requirements for large datasets
- Allows efficient random access to parts of large datasets without loading everything into memory
- Can store multiple datasets in a single file, making it well-suited for complex datasets
- Supported by many scientific libraries, including NumPy, TensorFlow, and PyTorch
👎 Cons:
- Requires additional setup and knowledge of the HDF5 format
- Depends on the h5py library for a complete API to deal with HDF5 files
- Not all datasets are available in HDF5 format
Other Solutions
While the approaches presented above are among the best ways to split a dataset into batches, there are other viable solutions as well.
Another possible solution is to use the Hugging Face datasets library. This equips you with everything you need to apply transformations to an entire dataset while batch-processing it. By setting batched=True, you can define batch-level transformations without manually slicing the dataset, as in the example below:
from datasets import load_dataset

# load a sample dataset
dataset = load_dataset("imdb", split="train")

# define a batch processing function
def process_batch(batch):
    # simple tokenization task
    return {"tokens": [text.split() for text in batch["text"]]}

# apply batch processing
batched_dataset = dataset.map(process_batch, batched=True, batch_size=32)
The dataset map() method with the batched=True option is ideal when you need to apply transformations, such as tokenization, in batches.
Note that using map(batched=True) is highly efficient for processing batches, as it minimizes memory usage and accelerates transformation workflows. This method is particularly useful for handling text and tabular data in NLP and machine learning tasks.
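To verify the result of the map() call above, you can index the transformed dataset like any other; each example now carries the tokens column produced by process_batch() (the printed values below are illustrative):
print(batched_dataset[0]["tokens"][:5])
# e.g., ['I', 'rented', 'I', 'AM', 'CURIOUS-YELLOW']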
Conclusion
In this guide on how to split a dataset into batches, you explored the best approaches, libraries, and solutions for slicing data in Python. The goal is to divide a large dataset into more manageable parts for simplified and faster data processing.
No matter which approach you choose, all the above solutions depend on having access to a dataset with the data of interest. While some datasets are freely available for scientific research, that is not always the case.
If you need datasets spanning categories from finance to movie data, take a look at Bright Data's Dataset Marketplace, which provides access to hundreds of datasets from popular sites, categorized into:
- Business Datasets: Data from key sources like LinkedIn, CrunchBase, Owler, and Indeed.
- Ecommerce Datasets: Data from Amazon, Walmart, Target, Zara, Zalando, Asos, and many more.
- Real Estate Datasets: Data from websites such as Zillow, MLS, and more.
- Social Media Datasets: Data from Facebook, Instagram, YouTube, and Reddit.
- Financial Datasets: Data from Yahoo Finance, Market Watch, Investopedia, and more.
If these pre-made options do not meet your needs, consider our custom data collection services.
On top of that, Bright Data offers a wide range of powerful scraping tools, including Web Scraper APIs and Scraping Browser.
Create a Bright Data account for free to start exploring these datasets!
No credit card required