4.6 out of five star rating on Trustpilot

Web Archive API

Access Bright Data’s cached web collections for cost-effective HTML discovery across millions of domains. With roughly 1 PB added weekly, the archive keeps pace with the live web, so you can retrieve recent data without crawling it yourself.

Talk to an expert

Discover new sources through filterable metadata
Precisely target by modality, language, or domain
Curate custom datasets for ongoing or one-off needs
Optional annotation and labeling services available

Trusted by 20,000+ customers worldwide

Access large-scale web data

Bright Data’s Archive API offers real-time, continuously updated data with advanced filtering and delivery options.

Data Collection

Continuously captures public web data in real time, providing results as recent as “now.”

Data Volume

17.5 PB collected in 8 months, covering 118 billion pages with ~1 PB and 2 billion unique URLs added per week.

Filtering & Delivery

A full discovery and delivery platform: filter by category, domain, language, date, and more. Results are delivered via Amazon S3 or webhook.

Coverage & Relevance

Archive API focuses on high-value, relevant website data based on real scraping business needs.

Grab a slice of the Web with Archive API

Retrieve data from a petabyte-scale web archive with billions of HTML pages. Discover video and image URLs, text in 100+ languages, or historical SERPs.

Structured & Clean

Pre-processed data with consistent schemas, perfect for AI model training and inference.

Code Examples

Ready-to-use Python, Node.js, cURL, PHP, Go, Java, and Ruby snippets for easy integration with AI workflows.

Documentation

Comprehensive guides and notebooks for ChatGPT, Claude, and other LLM integrations.

# To initiate a search of our Archive, use the following /search endpoint. Endpoint: POST api.brightdata.com/webarchive/search

curl -X POST https://api.brightdata.com/webarchive/search \
  -H "Authorization: Bearer $API_KEY" \
  -H 'Content-Type: application/json' \
  --data '{"filters": {"max_age": "1d", "domain_whitelist": ["example.com"]}}'

# To check the status of a specific query that was made. Endpoint: GET api.brightdata.com/webarchive/search/

curl https://api.brightdata.com/webarchive/search/$SEARCH_ID \
  -H "Authorization: Bearer $API_KEY"

# Check the status of all current searches. Endpoint: GET api.brightdata.com/webarchive/searches

curl https://api.brightdata.com/webarchive/searches \
  -H "Authorization: Bearer $API_KEY"
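
The same three calls can be sketched in Python with only the standard library. This is a minimal sketch, not an official client: the helper names build_search_request and build_status_request are hypothetical, the endpoint paths and the "filters" body are copied from the cURL examples, and because the response format is not described on this page the sketch stops at building the requests rather than parsing a reply.

```python
import json
import urllib.request

# Base URL taken from the cURL examples on this page.
BASE_URL = "https://api.brightdata.com/webarchive"


def build_search_request(api_key, max_age, domains):
    """Build (but do not send) the POST /search request."""
    body = {"filters": {"max_age": max_age, "domain_whitelist": list(domains)}}
    return urllib.request.Request(
        f"{BASE_URL}/search",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def build_status_request(api_key, search_id=None):
    """Build the GET request for one search, or for all searches if no ID is given."""
    if search_id is None:
        url = f"{BASE_URL}/searches"
    else:
        url = f"{BASE_URL}/search/{search_id}"
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {api_key}"}, method="GET"
    )


if __name__ == "__main__":
    req = build_search_request("YOUR_API_KEY", "1d", ["example.com"])
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

To actually send a request, pass it to urllib.request.urlopen(req) with a valid API key; how to read the search ID out of the response depends on the response schema, which should be taken from the official documentation.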

Archive API use cases

Track content changes and analyze trends across billions of historical web snapshots. Access 17.5 PB of cached data from 40 million domains for longitudinal studies, competitive analysis, and market intelligence without re-crawling.

Build comprehensive search indices instantly with pre-scraped, JS-rendered content from millions of domains. Filter by category, language, and date to create focused indices while reducing infrastructure costs.

Train AI models with 17.5 PB of clean web data. Get fresh, high-quality content from diverse sources, with 1 PB added weekly, delivered in formats optimized for machine learning applications.

Seamless data retrieval from millions of domains

Easily discover and retrieve URLs for video, images, audio and more.

FLEXIBLE

Enterprise-grade infrastructure

Bright Data’s platform powers 20,000+ companies worldwide, offering peace of mind with 99.99% uptime and access to 150M+ real user IPs covering 195 countries.

SCALABLE

Advanced data discovery, collection and processing

Get maximum control and flexibility without maintaining proxy and unblocking infrastructure. Easily scrape data from any geo-location while avoiding CAPTCHAs and blocks.

STABLE

Tailored to your workflow

Get structured, validated data with customized delivery and integration options, including tailored reports, dashboards, and analytics, across historical crawls and multiple websites.

COMPLIANT

Industry-leading compliance

Our privacy practices comply with data protection laws, including the EU data protection regulatory framework, GDPR, and CCPA – respecting requests to exercise privacy rights and more.

Start collecting web data. Effortlessly.

Archive API FAQs

What is Archive API?

Archive API is a massive, continuously expanding, cached repository by Bright Data, designed to capture and deliver public web data at scale.

It provides full web pages and metadata, making it ideal for AI training, machine learning, and large-scale data analysis.

Unlike traditional web crawls, Archive API prioritizes relevance, freshness, and usability, giving you access to the most important parts of the internet as they are scraped daily.

How much data is available in Bright Data's Archive API?

Bright Data’s Archive API collected 17.5 PB of data, covering 28 billion unique URLs from 40 million domains, in the first 8 months after its launch alone.

We continue to add ~1 PB of new data every week, alongside ~2 billion unique URLs, making Archive the largest up-to-date web data repository available, well suited to AI and data-driven applications.

How quickly can I access archive data?

You can start accessing data immediately through our Archive API. The API allows you to search, retrieve, and filter data snapshots from Archive seamlessly and efficiently.

Data from the last 3 days: delivered within minutes to a few hours (depending on snapshot size)

Data older than 3 days: processed and delivered within a few hours to 3 days (depending on snapshot size)
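
The two delivery windows reduce to a single threshold on data age. A small sketch, assuming that data exactly 3 days old still counts as recent (the FAQ does not say which side of the cutoff it falls on); the function name and the returned strings are illustrative only, and actual times also depend on snapshot size.

```python
from datetime import datetime, timedelta, timezone

# Threshold stated in the FAQ: data from the last 3 days is delivered faster.
RECENT_CUTOFF = timedelta(days=3)


def expected_delivery_window(captured_at, now=None):
    """Return the FAQ's delivery estimate for data captured at `captured_at`."""
    now = now or datetime.now(timezone.utc)
    age = now - captured_at
    if age <= RECENT_CUTOFF:  # assumption: exactly 3 days counts as recent
        return "minutes to a few hours"
    return "a few hours to 3 days"


if __name__ == "__main__":
    yesterday = datetime.now(timezone.utc) - timedelta(days=1)
    print(expected_delivery_window(yesterday))
```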

How can my data be delivered?

Archive offers two delivery options to ensure seamless integration into your existing workflows:

Amazon S3 bucket: Have your Data Snapshot delivered directly to your S3 bucket.

Webhook: Receive your Data Snapshot via webhook for real-time integration into your systems.

Can I filter Archive's data to get only what I need?

Absolutely! Archive API allows filtering by category, domains, date, languages, and country before retrieving data, ensuring you only get what you need.
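
As a sketch of how such filters might be assembled before a search: only max_age and domain_whitelist appear in the cURL examples on this page, so those are the only keys the hypothetical build_filters helper uses by name. The key names for category, language, country, and date-range filters are not shown here, so the helper accepts them as an "extra" mapping to be filled in from the official documentation rather than guessing them.

```python
def build_filters(max_age=None, domains=None, extra=None):
    """Assemble the request body holding the `filters` object for a search."""
    filters = {}
    if max_age is not None:
        filters["max_age"] = max_age  # e.g. "1d", as in the cURL example
    if domains:
        filters["domain_whitelist"] = list(domains)
    # Other filters (category, language, country, date): their key names are
    # not given on this page, so they are passed through unchanged.
    filters.update(extra or {})
    return {"filters": filters}


if __name__ == "__main__":
    print(build_filters(max_age="1d", domains=["example.com"]))
    # prints {'filters': {'max_age': '1d', 'domain_whitelist': ['example.com']}}
```

The returned dictionary matches the JSON body used in the search example, so it can be serialized with json.dumps and sent as the --data payload.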