In this guide, you will learn:
- The definition of a batch
- Why you should process datasets in batches
- How to split a dataset into batches in Python
- The dataset map() batched option approach
Let’s dive in!
What Is a Batch?
In the world of ML and data processing, a batch is nothing more than a subset of a dataset. Batches are typically used to efficiently handle large volumes of data. Instead of processing an entire dataset at once, data is split into smaller chunks, also called batches. Each batch can be processed independently, helping reduce memory usage and improve computational efficiency.
For example, suppose that you have some sample data in CSV format:
id,name,age,score
1,John,28,85
2,Jane,34,90
3,Bob,25,72
4,Alice,30,88
5,Charlie,29,91
6,David,35,79
7,Eve,22,95
8,Frank,31,82
9,Grace,27,86
10,Hannah,26,80
A batch of the above dataset is:
6,David,35,79
7,Eve,22,95
8,Frank,31,82
9,Grace,27,86
10,Hannah,26,80
This batch corresponds to the last five rows of the original dataset, i.e., the zero-indexed slice rows[5:10].
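To make this concrete, here is a minimal sketch that extracts that batch with a plain slice, assuming the CSV above is saved as data.csv (a hypothetical filename):
import csv

# load the sample CSV into a list of rows, skipping the header
with open("data.csv") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    rows = list(reader)

# rows[5:10] selects the five rows with ids 6 through 10
batch = rows[5:10]
print(batch)
# [['6', 'David', '35', '79'], ..., ['10', 'Hannah', '26', '80']]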
Benefits of Processing a Dataset in Batches
Suppose you have a dataset that you created using some data sourcing techniques. If you are unfamiliar with that process, follow our guide on how to create a dataset.
Now, why would you want to process this dataset in batches? Because that leads to the following benefits:
- Memory efficiency: Work with smaller, more manageable portions at a time, instead of loading the entire dataset into memory
- Faster processing: Batch processing can be parallelized, reducing the time required to process large datasets
- Better training for ML models: Help train machine learning models by updating weights incrementally, which can result in more stable and faster convergence
- Improved scalability: Make it easier to scale your processing to large datasets that may not fit into memory all at once
How To Split a Dataset Into Batches: Top 5 Approaches
Before exploring the best Python methods for splitting a dataset into batches, we should identify some criteria for evaluating these approaches. Here is a meaningful list of aspects to consider:
- Implementation: A snippet to show how to use the approach in a simple example.
- Scenarios: The real-world situations where the dataset-splitting approach is applicable.
- Input: The types of dataset files and data structures the splitting strategy supports.
- Pros: The benefits provided by the approach.
- Cons: The limitations or downsides of the method.
Time to analyze them one by one!
Approach #1: Array Slicing
Array slicing is a straightforward method for splitting a dataset into smaller, more manageable batches. The idea is to divide a dataset (represented by a list, array, or other sequence) into chunks by slicing it.
👨‍💻 Implementation:
def create_batches(data, batch_size):
    return [data[i:i + batch_size] for i in range(0, len(data), batch_size)]

# example usage
data = list(range(1, 51))  # sample dataset
batches = create_batches(data, batch_size=5)
print(batches)
# output: [[1, 2, 3, 4, 5], ..., [46, 47, 48, 49, 50]]
🎯 Scenarios:
- Data preprocessing tasks where memory limitations are minimal
- Parallel data processing tasks requiring manageable in-memory chunks
- Simple batch processing in data pipelines
🔠 Input:
- Lists, arrays, and tuples in Python
- NumPy arrays
- CSV data loaded into memory as a list of rows
- Pandas DataFrames, if converted to lists or arrays
👍 Pros:
- Simple and easy to implement
- Does not require external libraries
- Provides direct control over batch sizes
👎 Cons:
- Limited by available memory
- Doesn’t support extremely large datasets or complex data structures
- Requires custom logic for data shuffling (see the sketch below)
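On that last point, here is a minimal sketch of such custom shuffling logic, reusing the create_batches() function defined above: randomize the item order before slicing.
import random

data = list(range(1, 51))
random.shuffle(data)  # shuffle in place so batches do not follow the original order
batches = create_batches(data, batch_size=5)
print(batches[0])  # e.g., [17, 3, 42, 8, 25]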
Approach #2: Generators
Python generators allow you to split a dataset into batches by yielding one batch at a time. If you are not familiar with that mechanism, a generator is a special type of function that behaves like an iterator. Instead of returning data directly, it uses the yield keyword to produce an iterator object. This gives batches the ability to be accessed sequentially using a for loop or the next() function.
👨‍💻 Implementation:
def data_generator(data, batch_size):
    for i in range(0, len(data), batch_size):
        yield data[i:i + batch_size]

# example usage
data = list(range(1, 51))  # sample dataset
for batch in data_generator(data, batch_size=5):
    print(batch)
# output:
# [1, 2, 3, 4, 5]
# ...
# [46, 47, 48, 49, 50]
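Since the generator is an iterator, you can also pull batches on demand with next(), as mentioned above:
gen = data_generator(data, batch_size=5)
print(next(gen))  # [1, 2, 3, 4, 5]
print(next(gen))  # [6, 7, 8, 9, 10]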
🎯 Scenarios:
- Batch processing in data pipelines, from simple to complex
- Large-scale data preprocessing and augmentation tasks
🔠 Input:
- Lists, arrays, and tuples
- NumPy arrays
- File-based datasets where loading each batch from disk is feasible (see the sketch after this approach's cons)
👍 Pros:
- Can handle large datasets without fully loading them into memory
- Minimal setup and easy to implement
- Enables controlled, on-demand data loading
👎 Cons:
- Limited by data order unless additional shuffling is implemented
- Less effective for dynamic or variable batch sizes
- May not be the best solution for parallel processing, especially for multi-threaded operations
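As for the file-based input mentioned earlier, here is a sketch of the same pattern applied to a CSV file on disk (data.csv is a hypothetical filename), so the full file never needs to be loaded into memory:
import csv

def csv_batch_generator(path, batch_size):
    with open(path) as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        batch = []
        for row in reader:
            batch.append(row)
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:  # yield any leftover rows as a final, smaller batch
            yield batch

# example usage
for batch in csv_batch_generator("data.csv", batch_size=5):
    print(batch)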
Approach #3: PyTorch DataLoader
The DataLoader class from PyTorch helps you efficiently split datasets into manageable batches. As a specialized data structure for handling datasets, it also provides useful features like shuffling and parallel data loading.
Note that DataLoader works with TensorDataset, another PyTorch data structure meant to represent a dataset. Specifically, a TensorDataset accepts two arguments:
- inps: The input data, typically in the form of a Tensor
- tgts: The labels or target values, also typically as a Tensor, corresponding to the input data
TensorDataset pairs the inputs and targets together, which can then be loaded by the DataLoader for batching and training.
👨‍💻 Implementation:
from torch.utils.data import DataLoader, TensorDataset
import torch

# data to define a simple dataset
inputs = torch.arange(1, 51).float().reshape(-1, 1)  # a 1D tensor dataset (input)
targets = inputs ** 2  # square of the input values (simulating a regression task)

# create a TensorDataset and DataLoader
dataset = TensorDataset(inputs, targets)
dataloader = DataLoader(dataset, batch_size=5, shuffle=True)

# iterate through the DataLoader
for batch in dataloader:
    print(batch)
# sample output:
# [tensor([[46.],
# [42.],
# [25.],
# [10.],
# [34.]]), tensor([[2116.],
# [1764.],
# [ 625.],
# [ 100.],
# [1156.]])]
# ...
# [tensor([[21.],
# [ 9.],
# [ 2.],
# [38.],
# [44.]]), tensor([[ 441.],
# [ 81.],
# [ 4.],
# [1444.],
# [1936.]])]
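As for the parallel data loading mentioned earlier, enabling it is a matter of one extra parameter. A minimal variation of the example above (2 worker processes is an arbitrary choice; on Windows and macOS this typically needs to run under an if __name__ == "__main__": guard):
# spawn 2 worker processes to load batches in parallel
dataloader = DataLoader(dataset, batch_size=5, shuffle=True, num_workers=2)

for batch_inputs, batch_targets in dataloader:
    print(batch_inputs.shape, batch_targets.shape)  # torch.Size([5, 1]) torch.Size([5, 1])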
🎯 Scenarios:
- Training and testing machine learning models in PyTorch
- Shuffling data for unbiased training batches
- Large-scale data processing for deep learning tasks
🔠 Input:
- Custom datasets loaded into PyTorch TensorDatasets
- Tabular data and numerical arrays in Tensor format
👍 Pros:
- Optimized for large datasets with batching and shuffling
- Supports parallel data loading, speeding up batch retrieval
- Works seamlessly with PyTorch models and training loops
- GPU-processing compatible
👎 Cons:
- Requires PyTorch
- Needs data conversion to tensors
- Not ideal for non-ML batch processing tasks
Approach #4: TensorFlow batch() Method
The TensorFlow Dataset class exposes the batch() method for splitting datasets into batches. This method divides the dataset into smaller chunks, with features such as parallelization, control over processing order, and naming.
As a machine learning library, TensorFlow also offers additional features like shuffling, repeating, and prefetching.
👨‍💻 Implementation:
import tensorflow as tf

# create a sample dataset
inputs = tf.range(1, 51, dtype=tf.float32)  # a 1D tensor dataset (inputs)
targets = inputs ** 2  # square of input values (simulating a regression task)

# convert the inputs and targets into tf.data.Dataset
inputs_dataset = tf.data.Dataset.from_tensor_slices(inputs)
targets_dataset = tf.data.Dataset.from_tensor_slices(targets)

# create a dataset by zipping the inputs and targets together
dataset = tf.data.Dataset.zip((inputs_dataset, targets_dataset))

# produce a batched dataset
batched_dataset = dataset.batch(batch_size=5)
for batch in batched_dataset:
    print(batch)
# output:
# (<tf.Tensor: shape=(5,), dtype=float32, numpy=array([1., 2., 3., 4., 5.], dtype=float32)>, <tf.Tensor: shape=(5,), dtype=float32, numpy=array([ 1., 4., 9., 16., 25.], dtype=float32)>)
# ...
# (<tf.Tensor: shape=(5,), dtype=float32, numpy=array([46., 47., 48., 49., 50.], dtype=float32)>, <tf.Tensor: shape=(5,), dtype=float32, numpy=array([2116., 2209., 2304., 2401., 2500.], dtype=float32)>)
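Since batch() chains with the shuffling and prefetching features mentioned earlier, a variation of the pipeline above could look like this (the buffer size of 50 is an arbitrary choice):
# shuffle with a buffer, batch, and prefetch upcoming batches while the current one is consumed
pipeline = (
    dataset
    .shuffle(buffer_size=50)      # randomize the sample order
    .batch(batch_size=5)          # split into batches of 5
    .prefetch(tf.data.AUTOTUNE)   # overlap data preparation and consumption
)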
🎯 Scenarios:
- Training and testing machine learning models in TensorFlow
- Shuffling data for unbiased training batches
- Large-scale data processing for deep learning tasks
🔠 Input:
- TensorFlow tf.data.Dataset objects
- NumPy arrays (which can be converted to Dataset)
- TFRecord files, a special binary file format commonly used for storing large datasets in TensorFlow
👍 Pros:
- Optimized for efficient memory usage
- Seamlessly integrates with the TensorFlow ecosystem for model training and evaluation
- Supports shuffling, prefetching, and other useful features
- Supports a wide variety of data formats, including images, text, and structured data
👎 Cons:
- Requires TensorFlow
- For more complex datasets, additional setup may be needed to properly format and preprocess the data
- May introduce overhead for batching smaller datasets
Approach #5: HDF5 Format
HDF5 is a widely adopted data format for managing large datasets, especially when dealing with hierarchical data structures. It supports splitting a large dataset into chunks and storing them efficiently.
The h5py Python library provides tools for working with HDF5 files and loading them as NumPy data structures. This opens the door to batch processing of datasets by accessing specific slices or segments of data on demand.
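The example below reads from a dataset.h5 file, which it assumes already exists. Here is a minimal sketch to create a matching file, with the same input and target keys the example expects:
import h5py
import numpy as np

# write the sample inputs and their squares to dataset.h5
inputs = np.arange(1, 51).reshape(-1, 1)
targets = inputs ** 2

with h5py.File("dataset.h5", "w") as f:
    f.create_dataset("input", data=inputs)
    f.create_dataset("target", data=targets)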
👨‍💻 Implementation:
import h5py

# load and batch the data from the HDF5 file
def load_data_in_batches(batch_size=10):
    # open the HDF5 file
    with h5py.File("dataset.h5", "r") as f:
        inputs = f["input"]
        targets = f["target"]
        # batch the data with an iterator reading from disk
        for i in range(0, len(inputs), batch_size):
            yield inputs[i:i+batch_size], targets[i:i+batch_size]

# iterate through batches
for batch_input, batch_target in load_data_in_batches():
    print("Input Batch:", batch_input)
    print("Target Batch:", batch_target)
# output:
# Input Batch: [[ 1]
# [ 2]
# [ 3]
# [ 4]
# [ 5]
# [ 6]
# [ 7]
# [ 8]
# [ 9]
# [10]]
# Target Batch: [[ 1]
# [ 4]
# [ 9]
# [ 16]
# [ 25]
# [ 36]
# [ 49]
# [ 64]
# [ 81]
# [100]]
# ...
# Input Batch: [[41]
# [42]
# [43]
# [44]
# [45]
# [46]
# [47]
# [48]
# [49]
# [50]]
# Target Batch: [[1681]
# [1764]
# [1849]
# [1936]
# [2025]
# [2116]
# [2209]
# [2304]
# [2401]
# [2500]]
🎯 Scenarios:
- Ideal for very large datasets that cannot be loaded entirely into memory
- Useful when working with multi-dimensional data
- Suitable for storing and retrieving data from disk in an efficient, compressed format for machine learning tasks
🔠 Input:
- HDF5 files
👍 Pros:
- HDF5 supports data compression and chunking, reducing storage requirements for large datasets
- Allows efficient random access to parts of large datasets without loading everything into memory
- Can store multiple datasets in a single file, making it well-suited for complex datasets
- Supported by many scientific libraries, including NumPy, TensorFlow, and PyTorch
👎 Cons:
- Requires additional setup and knowledge of the HDF5 format
- Depends on the h5py library for a complete API to deal with HDF5 files
- Not all datasets are available in HDF5 format
Other Solutions
While the approaches presented above are among the best ways to split a dataset into batches, there are other viable solutions as well.
Another possible solution is to use the Hugging Face datasets library. This equips you with everything you need to apply transformations to an entire dataset while batch-processing it. By setting batched=True, you can define batch-level transformations without manually slicing the dataset, as in the example below:
from datasets import load_dataset

# load a sample dataset
dataset = load_dataset("imdb", split="train")

# define a batch processing function
def process_batch(batch):
    # simple tokenization task
    return {"tokens": [text.split() for text in batch["text"]]}

# apply batch processing
batched_dataset = dataset.map(process_batch, batched=True, batch_size=32)
The dataset map() method with the batched=True option is ideal when you need to apply transformations, such as tokenization, in batches.
Note that using map(batched=True) is highly efficient for processing batches, as it minimizes memory usage and accelerates transformation workflows. This method is particularly useful for handling text and tabular data in NLP and machine learning tasks.
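To verify the result of the map() call above, you can index the transformed dataset like any other; each example now carries the tokens column produced by process_batch() (the printed values below are illustrative):
print(batched_dataset[0]["tokens"][:5])
# e.g., ['I', 'rented', 'I', 'AM', 'CURIOUS-YELLOW']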
Conclusion
In this guide on how to split a dataset into batches, you explored the best approaches, libraries, and solutions for slicing data in Python. The goal is to divide a large dataset into more manageable parts for simplified and faster data processing.
No matter which approach you choose, all the above solutions depend on having access to a dataset with the data of interest. While some datasets are freely available for scientific research, that is not always the case.
If you need datasets spanning categories from finance to movie data, take a look at Bright Data's Dataset Marketplace, which provides access to hundreds of datasets from popular sites, categorized into:
- Business Datasets: Data from key sources like LinkedIn, CrunchBase, Owler, and Indeed.
- Ecommerce Datasets: Data from Amazon, Walmart, Target, Zara, Zalando, Asos, and many more.
- Real Estate Datasets: Data from websites such as Zillow, MLS, and more.
- Social Media Datasets: Data from Facebook, Instagram, YouTube, and Reddit.
- Financial Datasets: Data from Yahoo Finance, Market Watch, Investopedia, and more.
If these pre-made options do not meet your needs, consider our custom data collection services.
On top of that, Bright Data offers a wide range of powerful scraping tools, including Web Scraper APIs and Scraping Browser.
Create a Bright Data account for free to start exploring these datasets!
No credit card required