Data Firehose & Web Archive Pricing

Stream real-time public web records with Data Firehose, or access 90PB+ of cached pages with Web Archive.

DATA FIREHOSE (LAST 24 HRS)
$0.20 / 1K HTMLs
Includes:
  • Fresh data: up to 24 hours old
  • ~1 hour delivery (depending on snapshot size)
  • API access with advanced filtering (domains, categories, dates, languages, countries, paths)
  • Flexible delivery: Amazon S3, Azure Blob Storage, Webhook
  • 24/7 support
  • Volume discounts for large-scale data needs
Best for: Continuously refreshed data pipelines
Common use cases:
  • AI search & analytics pipelines requiring up-to-date content
  • Aggregated Web Unlocker and SERP cache - updated hourly
  • Continuous web monitoring and indexing
ARCHIVED DATA (OVER 24 HRS)
$1 / 1K HTMLs
Includes:
  • Historical data: more than 24 hours old
  • Delivery takes at least 2 days (depending on snapshot size)
  • API access with advanced filtering (domains, categories, dates, languages, countries, paths)
  • Flexible delivery: Amazon S3, Azure Blob Storage, Webhook
  • 24/7 support
  • Volume discounts for large-scale data needs
Best for: Historical data at scale
Common use cases:
  • AI model training data backfilling at scale
  • Reproducible historical snapshots for research & indexing
  • Auditing or analyzing past web content across domains
* Volume discounts are available for large data volumes, long-term commitments, or multiple scraper projects
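
As a rough illustration of the list prices above, the sketch below estimates the cost of a request before any volume discount. The function name `estimate_cost` and the tier keys are hypothetical, and it assumes billing is prorated per page rather than rounded up to whole blocks of 1,000, which the pricing above does not specify.

```python
from decimal import Decimal

# List prices in USD per 1,000 HTML pages, taken from the two tiers above.
PRICE_PER_1K = {
    "firehose": Decimal("0.20"),  # data from the last 24 hours
    "archive": Decimal("1.00"),   # data older than 24 hours
}


def estimate_cost(pages: int, tier: str) -> Decimal:
    """Return the list-price cost in USD for `pages` HTML pages.

    Assumption: cost is prorated per page; volume discounts are ignored.
    """
    if pages < 0:
        raise ValueError("pages must be non-negative")
    if tier not in PRICE_PER_1K:
        raise ValueError(f"unknown tier: {tier!r}")
    return Decimal(pages) / Decimal(1000) * PRICE_PER_1K[tier]


if __name__ == "__main__":
    for tier in ("firehose", "archive"):
        print(tier, estimate_cost(5_000_000, tier))
```

At list price, the same page count costs five times as much from the archive tier as from the firehose tier, so it can be worth checking how much of a request falls inside the last 24 hours.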

Customer favorite features

  • Petabyte-scale repository
  • Full HTML pages & metadata
  • Advanced filtering & search
  • ~2.5 PB added daily
  • Text, images, video and audio
  • Flexible delivery options
  • 5T+ text tokens added daily
  • API-first access
  • AI-ready data
  • 2.5B+ image/video URLs added daily
  • Maintenance-free
  • 99.99% uptime + 24/7 support
STREAMLINED

Payments with AWS Marketplace

Leverage your purchases to meet your AWS commitments and enjoy streamlined procurement and invoicing all in one place. Benefit from AWS’s robust validation and compliance checks for partners.

COMPLIANT

Industry Leading Compliance

Our privacy practices comply with data protection laws, including the EU data protection regulatory framework, GDPR, and CCPA – respecting requests to exercise privacy rights and more.


Archive API FAQ

Data Firehose delivers a continuous, real-time stream of live web data as it is collected (~1 billion records ingested daily), which is ideal for active monitoring, price tracking, and ongoing AI training pipelines. The Web Archive provides access to a massive historical repository of over 624 billion cached web pages (90PB+), making it perfect for deep research, backtesting, and longitudinal analysis. Many enterprise teams use both: Firehose for fresh signals and Archive for historical context.

You can start accessing data immediately through our Data Firehose. The API allows you to search, retrieve, and filter data snapshots seamlessly and efficiently.

  • Data from the last day: delivery takes from a few minutes up to a few hours (depending on snapshot size)
  • Data older than one day: processing and delivery take from a few hours up to 3 days (depending on snapshot size)
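
The split between the two delivery windows can be sketched as a small helper that decides which tier a record falls into based on its age. This is only an illustration: `tier_for` is a hypothetical name, and treating a record that is exactly 24 hours old as still "fresh" is an assumption, since the page does not define the boundary.

```python
from datetime import datetime, timedelta, timezone

# Records up to this age are served as fresh data; older ones come from the archive.
FRESH_WINDOW = timedelta(hours=24)


def tier_for(captured_at: datetime, now: datetime) -> str:
    """Return "firehose" for records captured within the last 24 hours, else "archive".

    Both datetimes should be timezone-aware so the subtraction is unambiguous.
    """
    age = now - captured_at
    if age < timedelta(0):
        raise ValueError("captured_at is in the future")
    return "firehose" if age <= FRESH_WINDOW else "archive"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(tier_for(now - timedelta(hours=3), now))
    print(tier_for(now - timedelta(days=3), now))
```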

Archive offers two delivery options to ensure seamless integration into your existing workflows:

  • Amazon S3 bucket: Have your Data Snapshot delivered directly to your S3 bucket.
  • Webhook: Receive your Data Snapshot via webhook for real-time integration into your systems.

Absolutely! Both Data Firehose and Archive API allow filtering by category, domains, date, languages, and country before retrieving data, ensuring you only get what you need.

No, standard delivery methods are included in your cost. For both Data Firehose and Web Archive, you can choose to have your data delivered directly to an Amazon S3 bucket or retrieved via Webhook for seamless integration into your existing systems. Data Firehose also supports immediate, continuous streaming.

No, custom filtering is a core capability, not a paid add-on. We encourage strict filtering by category, domain, date, language, and country. By thoroughly scoping your stream or archive retrieval, you actually reduce the total volume of irrelevant records sent to you, which optimizes your overall data costs.
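
To make the cost effect of scoping concrete, here is a minimal local sketch of how narrowing a request by domain, language, and country shrinks the number of records you would receive. The `Record` and `Scope` classes are hypothetical and exist only for this illustration; they do not represent the API's actual request format or field names.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Record:
    """Hypothetical stand-in for one web record's metadata."""
    domain: str
    language: str
    country: str


@dataclass(frozen=True)
class Scope:
    """Hypothetical filter: a field set to None means "do not filter on this"."""
    domains: Optional[FrozenSet[str]] = None
    languages: Optional[FrozenSet[str]] = None
    countries: Optional[FrozenSet[str]] = None

    def matches(self, record: Record) -> bool:
        return (
            (self.domains is None or record.domain in self.domains)
            and (self.languages is None or record.language in self.languages)
            and (self.countries is None or record.country in self.countries)
        )


sample = [
    Record("example.com", "en", "US"),
    Record("example.com", "de", "DE"),
    Record("example.org", "en", "GB"),
    Record("example.net", "fr", "FR"),
]

# A tighter scope keeps fewer records, so fewer records are billed.
scope = Scope(domains=frozenset({"example.com"}), languages=frozenset({"en"}))
kept = [r for r in sample if scope.matches(r)]

if __name__ == "__main__":
    print(f"unfiltered: {len(sample)} records, scoped: {len(kept)} records")
```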

When working with large-scale web data, freshness, relevance, and accessibility are key. While Common Crawl provides a broad historical snapshot of the web, Bright Data's Archive API offers real-time, continuously updated data with advanced filtering and delivery options. Here's how they compare:

Data Collection
  • Bright Data's Archive: Continuously captures public web data in real time, providing results as recent as "now."
  • Common Crawl: Periodic web crawling (not real-time), updated monthly or bimonthly. Data can be outdated.

Data Volume
  • Bright Data's Archive: 17.5 PB collected in 8 months, covering 118 billion pages (28 billion unique URLs from 40 million domains). Adds ~2.5 PB and billions of unique URLs per week.
  • Common Crawl: 250 billion pages collected over 18 years.

Website Coverage & Relevance
  • Bright Data's Archive: Focuses on high-value, relevant website data based on real scraping business needs.
  • Common Crawl: Crawls indiscriminately, including outdated or low-quality pages.

Data Types
  • Bright Data's Archive: Full web pages (JS-rendered).
  • Common Crawl: 98.6% HTML and text.

Filtering & Delivery
  • Bright Data's Archive: Full discovery and delivery platform, with filtering by category, domain, language, date, etc. Delivered via Amazon S3 or webhook.
  • Common Crawl: No built-in filtering or delivery. Huge raw WARC files must be processed manually.