Archive API
Access Bright Data’s vast cached collections, offering cost-effective HTML discovery across billions of pages. With roughly 1 PB added weekly, stay ahead with the latest data insights. Experience seamless and efficient data retrieval like never before.
- Discover new sources through filterable metadata
- Precisely target by modality, language, or domain
- Curate custom datasets for ongoing or one-off needs
- Optional annotation and labeling services available
Trusted by 20,000+ customers worldwide
Access large-scale web data
Bright Data’s Archive API offers real-time, continuously updated data with advanced filtering and delivery options.
Data Collection
Continuously captures public web data in real time, providing results as recent as “now.”
Data Volume
17.5 PB collected in 8 months, covering 118 billion pages with ~1 PB and 2 billion unique URLs added per week.
Filtering & Delivery
Coverage & Relevance
Grab a slice of the Web with Archive API
Retrieve data from a petabyte-scale web archive with billions of HTML pages. Discover video and image URLs, text in 100+ languages, or historical SERPs.
Structured & Clean
Pre-processed data with consistent schemas, perfect for AI model training and inference.
Code Examples
Ready-to-use Python, Node.js, cURL, PHP, Go, Java, and Ruby snippets for easy integration with AI workflows.
Documentation
Comprehensive guides and notebooks for ChatGPT, Claude, and other LLM integrations.
# To initiate a search of our Archive, use the following /search endpoint. Endpoint: POST api.brightdata.com/webarchive/search
curl -X POST https://api.brightdata.com/webarchive/search \
  -H "Authorization: Bearer $API_KEY" \
  -H 'Content-Type: application/json' \
  --data '{"filters": {"max_age": "1d", "domain_whitelist": ["example.com"]}}'
# Check the status of a specific search. Endpoint: GET api.brightdata.com/webarchive/search/
curl "https://api.brightdata.com/webarchive/search/$SEARCH_ID" \
  -H "Authorization: Bearer $API_KEY"
# Check the status of all current searches. Endpoint: GET api.brightdata.com/webarchive/searches
curl https://api.brightdata.com/webarchive/searches \
  -H "Authorization: Bearer $API_KEY"
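The same three calls could be wrapped in Python roughly as follows. This is a minimal sketch using only the standard library: the endpoints, the bearer header, and the filters payload are taken from the cURL commands above, while the function names are illustrative assumptions. The response schema is not described on this page, so send() simply returns the parsed JSON without interpreting it.

```python
import json
import urllib.request

BASE_URL = "https://api.brightdata.com/webarchive"


def build_request(method, path, api_key, payload=None):
    """Build (but do not send) an authenticated Archive API request."""
    headers = {"Authorization": f"Bearer {api_key}"}
    data = None
    if payload is not None:
        # Only requests with a body need a JSON content type.
        headers["Content-Type"] = "application/json"
        data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}{path}", data=data, headers=headers, method=method
    )


def search_request(api_key, filters):
    """POST /search: start a new archive search with the given filters."""
    return build_request("POST", "/search", api_key, {"filters": filters})


def status_request(api_key, search_id):
    """GET /search/<search_id>: status of one search."""
    return build_request("GET", f"/search/{search_id}", api_key)


def list_request(api_key):
    """GET /searches: status of all current searches."""
    return build_request("GET", "/searches", api_key)


def send(request):
    """Send a prepared request and return the decoded JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Example usage (performs a real network call, so it is left commented out):
# filters = {"max_age": "1d", "domain_whitelist": ["example.com"]}
# result = send(search_request("YOUR_API_KEY", filters))
```

Building the request separately from sending it keeps the payload easy to inspect or log before any call is made.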
Archive API use cases
Seamless data retrieval from billions of pages
Easily discover and retrieve URLs for video, images, audio and more.

Enterprise-grade infrastructure
Bright Data’s platform powers 20,000+ companies worldwide, offering peace of mind with 99.99% uptime and access to 150M+ real user IPs covering 195 countries.

Advanced data discovery, collection and processing
Get maximum control and flexibility without maintaining proxy and unblocking infrastructure. Easily scrape data from any geo-location while avoiding CAPTCHAs and blocks.

Tailored to your workflow
Get structured, validated data with customized delivery and integration options, including tailored reports, dashboards, and analytics, across historical crawls and multiple websites.
Industry leading compliance
Our privacy practices comply with data protection laws, including the EU data protection regulatory framework (GDPR) and the CCPA, and we respect requests to exercise privacy rights.
Archive API FAQs
What is Archive API?
Archive API is a massive, continuously expanding, cached repository by Bright Data, designed to capture and deliver public web data at scale.
It provides full web pages and metadata, making it ideal for AI training, machine learning, and large-scale data analysis.
Unlike traditional web crawls, Archive API prioritizes relevance, freshness, and usability, giving you access to the most important parts of the internet as they are scraped daily.
How much data is available in Bright Data's Archive API?
Bright Data’s Archive API has already collected 17.5 PB of data, covering 28 billion unique URLs from 40 million domains, in the first 8 months since its launch alone.
We continue to add ~1 PB of new data and ~2 billion unique URLs every week, making Archive the largest up-to-date web data repository available and a strong fit for AI and data-driven applications.
How quickly can I access archive data?
You can start accessing data immediately through our Archive API. The API allows you to search, retrieve, and filter data snapshots from Archive seamlessly and efficiently.
Data from the last 3 days: Delivered within minutes to a few hours, depending on snapshot size.
Data older than 3 days: Processed and delivered within a few hours to 3 days, depending on snapshot size.
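As a rough planning aid, the two delivery tiers above can be expressed as a small helper. This is only a sketch under stated assumptions: it assumes the max_age filter (seen as "1d" in the search example) bounds how old the requested data can be, and that any search reaching back more than 3 days falls into the slower tier. The function name is hypothetical, and only whole-day values such as "1d" are handled.

```python
import re

RECENT_WINDOW_DAYS = 3  # threshold quoted in the FAQ answer above


def expected_delivery(max_age):
    """Map a max_age value such as "1d" to the delivery window quoted above.

    Assumption: only whole-day values ("<N>d") are supported here; any other
    format raises ValueError rather than being guessed at.
    """
    match = re.fullmatch(r"(\d+)d", max_age.strip())
    if match is None:
        raise ValueError(f"unsupported max_age format: {max_age!r}")
    days = int(match.group(1))
    if days <= RECENT_WINDOW_DAYS:
        return "minutes to a few hours"
    return "a few hours to 3 days"


print(expected_delivery("1d"))
print(expected_delivery("7d"))
```

Actual delivery time also depends on snapshot size, so treat the result as an upper-level estimate rather than a guarantee.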
How can my data be delivered?
Archive offers two delivery options to ensure seamless integration into your existing workflows:
Amazon S3 bucket: Have your Data Snapshot delivered directly to your S3 bucket.
Webhook: Have your Data Snapshot sent to your webhook for real-time integration into your systems.
Can I filter Archive's data to get only what I need?
Absolutely! Archive API allows filtering by category, domains, date, languages, and country before retrieving data, ensuring you only get what you need.
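A filter payload for the /search endpoint might be assembled like this. It is a minimal sketch: the only filter keys shown on this page are max_age and domain_whitelist, so those are the only ones named in the code. Keys for category, language, country, or date filters are not listed here, so they are passed through an extra dictionary whose key names you would take from the API reference. The build_filters name and the rule that at least one filter must be supplied are choices made for this sketch, not documented API behavior.

```python
def build_filters(max_age=None, domains=None, extra=None):
    """Return a request body of the form {"filters": {...}}.

    max_age and domain_whitelist appear in the search example on this page;
    any other filter keys must be supplied by the caller through `extra`.
    """
    filters = {}
    if max_age is not None:
        filters["max_age"] = max_age
    if domains:
        # Normalise domains: trim, lower-case, drop blanks and duplicates
        # while preserving the caller's order.
        unique = {}
        for domain in domains:
            cleaned = domain.strip().lower()
            if cleaned:
                unique.setdefault(cleaned, None)
        if unique:
            filters["domain_whitelist"] = list(unique)
    if extra:
        for key, value in extra.items():
            if key in filters:
                raise ValueError(f"filter {key!r} was given twice")
            filters[key] = value
    if not filters:
        # Sketch-level safeguard: avoid sending an unfiltered search by accident.
        raise ValueError("at least one filter is required")
    return {"filters": filters}


body = build_filters(max_age="1d", domains=["Example.com", " example.com "])
print(body)
```

The resulting dictionary can be serialized with json.dumps and sent as the --data body of the POST request.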