Archive API

Access Bright Data’s vast cached collections for cost-effective HTML discovery across billions of pages from 40 million domains. With over 1 PB added weekly, stay ahead with the latest data. Experience seamless and efficient data retrieval.

Talk to an expert
  • Discover new sources through filterable metadata
  • Precisely target by modality, language, or domain
  • Curate custom datasets for ongoing or one-off needs
  • Optional annotation and labeling services available

Access large-scale web data

Bright Data’s Archive API offers real-time, continuously updated data with advanced filtering and delivery options.

Data Collection

Continuously captures public web data in real time, providing results as recent as “now.”

Data Volume

17.5 PB collected in 8 months, covering 118 billion pages with ~1 PB and 2 billion unique URLs added per week.

Filtering & Delivery

A full discovery and delivery platform: filter by category, domain, language, date, and more. Delivered via Amazon S3 or webhook.

Coverage & Relevance

Archive API focuses on high-value, relevant website data based on real scraping business needs.

Ready to integrate Web Archive API?

Get started with our powerful Web Archive API. Access historical web data with our scalable infrastructure.

Grab a slice of the Web with Archive API

Retrieve data from a petabyte-scale web archive with billions of HTML pages. Discover video and image URLs, text in 100+ languages, or historical SERPs.

Structured & Clean

Pre-processed data with consistent schemas, perfect for AI model training and inference.

Code Examples

Ready-to-use Python, Node.js, cURL, PHP, Go, Java, and Ruby snippets for easy integration with AI workflows.

Documentation

Comprehensive guides and notebooks for ChatGPT, Claude, and other LLM integrations.

# To initiate a search of our Archive, use the /search endpoint.
# Endpoint: POST api.brightdata.com/webarchive/search

curl -X POST https://api.brightdata.com/webarchive/search \
  -H "Authorization: Bearer $API_KEY" \
  -H 'Content-Type: application/json' \
  --data '{"filters": {"max_age": "1d", "domain_whitelist": ["example.com"]}}'
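
The same search request can be sketched in Python with only the standard library. This is a minimal illustration that mirrors the cURL call above, not an official client: the helper name `build_search_request` and the placeholder key are assumptions, and the request is built but not sent, so nothing here depends on the shape of the API's response.

```python
import json
import urllib.request

SEARCH_URL = "https://api.brightdata.com/webarchive/search"


def build_search_request(api_key, max_age, domains):
    """Build (but do not send) the POST request shown in the cURL example.

    Hypothetical helper: only the URL, headers, and the two filter keys
    (max_age, domain_whitelist) come from the cURL snippet.
    """
    payload = {"filters": {"max_age": max_age, "domain_whitelist": list(domains)}}
    return urllib.request.Request(
        SEARCH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_search_request("YOUR_API_KEY", "1d", ["example.com"])
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

To actually submit the search, pass `request` to `urllib.request.urlopen` and parse the response body as JSON.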
                              
                            
# To check the status of a specific query that was made.
# Endpoint: GET api.brightdata.com/webarchive/search/

curl https://api.brightdata.com/webarchive/search/$SEARCH_ID \
  -H "Authorization: Bearer $API_KEY"
                              
                            
# Check the status of all current searches.
# Endpoint: GET api.brightdata.com/webarchive/searches

curl https://api.brightdata.com/webarchive/searches \
  -H "Authorization: Bearer $API_KEY"
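
The two status endpoints above lend themselves to a simple polling loop. The sketch below is one possible way to write it in Python, under stated assumptions: the response schema is not described on this page, so the caller supplies the `is_finished` check, and the `status` field used in the offline demo is purely hypothetical. The helper names (`status_request`, `wait_for_search`) are illustrative as well.

```python
import json
import time
import urllib.request

BASE_URL = "https://api.brightdata.com/webarchive"


def status_request(api_key, search_id=None):
    """GET request for one search (/search/<id>) or for all searches (/searches)."""
    url = f"{BASE_URL}/search/{search_id}" if search_id else f"{BASE_URL}/searches"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})


def fetch_json(request):
    """Send the request and decode the JSON body (performs a real network call)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_search(api_key, search_id, is_finished, fetch=fetch_json,
                    interval_seconds=30.0, max_attempts=20, sleep=time.sleep):
    """Poll the status endpoint until is_finished(body) is true or attempts run out."""
    for attempt in range(max_attempts):
        body = fetch(status_request(api_key, search_id))
        if is_finished(body):
            return body
        if attempt < max_attempts - 1:
            sleep(interval_seconds)
    raise TimeoutError(f"search {search_id} not finished after {max_attempts} polls")


# Offline demo with a fake transport; the "status" field and its values are
# hypothetical stand-ins for whatever the real response contains.
fake_bodies = iter([{"status": "running"}, {"status": "done"}])
result = wait_for_search(
    "YOUR_API_KEY",
    "SEARCH_ID",
    is_finished=lambda body: body.get("status") == "done",
    fetch=lambda request: next(fake_bodies),
    sleep=lambda seconds: None,
)
print(result)
```

Injecting `fetch` and `sleep` keeps the loop testable without network access; in real use, leave both at their defaults and adapt `is_finished` to the actual response.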
                              
                            

Archive API use cases

Track content changes and analyze trends across billions of historical web snapshots. Access 17.5 PB of cached data from 40 million domains for longitudinal studies, competitive analysis, and market intelligence without re-crawling.

Build comprehensive search indices instantly with pre-scraped, JS-rendered content from millions of domains. Filter by category, language, and date to create focused indices while reducing infrastructure costs.

Train AI models with 17.5 PB of clean web data. Get fresh, high-quality content from diverse sources, with 1 PB added weekly, delivered in formats optimized for machine learning applications.

Seamless data retrieval from billions of pages

Easily discover and retrieve URLs for video, images, audio and more.

FLEXIBLE

Enterprise-grade infrastructure

Bright Data’s platform powers 20,000+ companies worldwide, offering peace of mind with 99.99% uptime and access to 150M+ real user IPs covering 195 countries.

SCALABLE

Advanced data discovery, collection and processing

Get maximum control and flexibility without maintaining proxy and unblocking infrastructure. Easily scrape data from any geo-location while avoiding CAPTCHAs and blocks.

STABLE

Tailored to your workflow

Get structured, validated data with customized delivery and integration options, including tailored reports, dashboards, and analytics, across historical crawls and multiple websites.

COMPLIANT

Industry-leading compliance

Our privacy practices comply with data protection laws, including the EU data protection regulatory framework, GDPR, and CCPA – respecting requests to exercise privacy rights and more.

Start collecting web data. Effortlessly.

Archive API FAQs

Archive API is a massive, continuously expanding, cached repository by Bright Data, designed to capture and deliver public web data at scale.

It provides full web pages and metadata, making it ideal for AI training, machine learning, and large-scale data analysis.

Unlike traditional web crawls, Archive API prioritizes relevance, freshness, and usability, giving you access to the most important parts of the internet as they are scraped daily.

Bright Data’s Archive API has already collected 17.5 PB of data, covering 28 billion unique URLs from 40 million domains, in the first 8 months since its launch alone.

We continue to add ~1 PB of new data every week, alongside ~2 billion unique URLs, making Archive the largest up-to-date web data repository available, perfect for AI and data-driven applications.
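
For capacity planning, the figures above imply a rough average amount of stored data per unique URL. The back-of-envelope calculation below is only an estimate: it assumes decimal petabytes (10^15 bytes) and treats the published numbers as exact, and a single URL may be captured more than once, so this is not a per-page size.

```python
PB = 10**15  # assumption: decimal petabytes

# Figures quoted in the FAQ above.
total_bytes = 17.5 * PB
total_unique_urls = 28e9
weekly_bytes = 1 * PB
weekly_unique_urls = 2e9

# Average stored bytes per unique URL, converted to kilobytes (1 KB = 1000 bytes).
avg_kb_overall = total_bytes / total_unique_urls / 1000
avg_kb_weekly = weekly_bytes / weekly_unique_urls / 1000

print(f"Overall: ~{avg_kb_overall:.0f} KB per unique URL")
print(f"Weekly additions: ~{avg_kb_weekly:.0f} KB per unique URL")
```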

You can start accessing data immediately through our Archive API. The API allows you to search, retrieve, and filter data snapshots from Archive seamlessly and efficiently.

Data from the last 3 days: delivered within minutes to a few hours, depending on snapshot size.

Data older than 3 days: processed and delivered within a few hours to 3 days, depending on snapshot size.
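
When scheduling a pipeline around these delivery windows, the rule can be written as a small lookup. This is a sketch only: the function name is hypothetical, and treating data that is exactly 3 days old as "recent" is an assumption, since the boundary case is not specified above.

```python
from datetime import datetime, timedelta, timezone

RECENT_CUTOFF = timedelta(days=3)


def expected_delivery_window(oldest_requested, now=None):
    """Return the delivery window quoted above for the oldest data requested.

    Assumption: data exactly 3 days old counts as recent.
    """
    now = now or datetime.now(timezone.utc)
    if now - oldest_requested <= RECENT_CUTOFF:
        return "minutes to a few hours"
    return "a few hours to 3 days"


reference = datetime(2025, 1, 10, tzinfo=timezone.utc)
print(expected_delivery_window(reference - timedelta(days=1), now=reference))
print(expected_delivery_window(reference - timedelta(days=10), now=reference))
```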

Archive offers two delivery options to ensure seamless integration into your existing workflows:

Amazon S3 bucket: Have your Data Snapshot delivered directly to your S3 bucket.

Webhook: Receive your Data Snapshot via webhook for real-time integration into your systems.

Archive API allows filtering by category, domain, date, language, and country before retrieving data, ensuring you only get what you need.