How to Scrape Amazon ASIN With Python & Bright Data

Learn how to scrape Amazon ASINs at scale with Python, Bright Data proxies, and APIs.
12 min read

If you’re an Amazon seller or a market researcher, knowing a product’s ASIN helps you quickly find exact product matches, analyze competitor listings, and stay ahead in the marketplace. This article shows you simple, effective methods to scrape Amazon ASINs at scale. You will also learn about Bright Data’s solution, which can significantly speed up this process.

What is an ASIN on Amazon?

An ASIN (Amazon Standard Identification Number) is a 10-character code that combines letters and numbers (for example, B07PZF3QK9). Amazon assigns this unique code to every product in its catalog, from books to electronics to clothing.

There are two simple ways to find any product’s ASIN:

1. Look at the product URL – the ASIN appears right after “/dp/” in the address bar.

The ASIN that appears in the URL of the product

2. Scroll down to the product information section on any Amazon listing – you’ll find the ASIN listed there.

ASIN in the item details under product information
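
If you already have product URLs, you can also pull the ASIN out programmatically. Here is a minimal sketch using a regular expression; it assumes the standard “/dp/<ASIN>” URL pattern shown above, and the helper name extract_asin_from_url is just for illustration:

import re


def extract_asin_from_url(url: str) -> str | None:
    # ASINs are 10 uppercase letters/digits that follow "/dp/" in the product URL
    match = re.search(r"/dp/([A-Z0-9]{10})", url)
    return match.group(1) if match else None


print(extract_asin_from_url("https://www.amazon.com/dp/B07PZF3QK9"))  # -> B07PZF3QK9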

How to Extract ASINs from Amazon

Scraping data from Amazon might seem straightforward at first, but it’s quite challenging due to the site’s robust anti-scraping measures. Amazon actively protects against automated data collection through several sophisticated methods:

  • CAPTCHA challenges that appear when suspicious activity is detected
  • HTTP 503 errors that block access to requested pages
  • Frequent website layout changes that break parsing logic

Here’s a screenshot of a typical HTTP 503 error triggered by Amazon:

A 503 error triggered on Amazon
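
A quick way to tell whether a response like this was blocked is to check for the 503 status code or the text markers of Amazon’s CAPTCHA page. Here is a minimal heuristic sketch; the phrases checked below are common on Amazon’s interstitial pages, but they may change over time:

def looks_blocked(status_code: int, html: str) -> bool:
    # 503 responses and CAPTCHA interstitials both indicate the request was blocked
    if status_code == 503:
        return True
    text = html.lower()
    return "robot check" in text or "enter the characters you see below" in text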

You can try the following script to scrape Amazon ASINs from search result pages. It depends on a few third-party packages (install them with pip install curl_cffi beautifulsoup4 lxml tenacity):

import asyncio
import os
from curl_cffi import requests
from bs4 import BeautifulSoup
from tenacity import retry, stop_after_attempt, wait_random


class AsinScraper:
    def __init__(self):
        self.session = requests.Session()
        self.asins = set()

    def create_url(self, keyword: str, page: int) -> str:
        return f"https://www.amazon.com/s?k={keyword.replace(' ', '+')}&page={page}"

    @retry(stop=stop_after_attempt(3), wait=wait_random(min=2, max=5))
    async def fetch_page(self, url: str) -> str | None:
        try:
            print(f"Fetching URL: {url}")
            response = self.session.get(
                url, impersonate="chrome120", timeout=30)

            print(f"HTTP Status Code: {response.status_code}")

            if response.status_code == 200:
                # Check for any block indicators in the response
                if "Sorry" not in response.text:
                    return response.text
                else:
                    print("Sorry, request blocked!")
            else:
                print(f"Unexpected HTTP status code: {response.status_code}")

        except Exception as e:
            print(f"Exception occurred during fetch: {e}")

        return None

    def extract_asins(self, html: str) -> set[str]:
        soup = BeautifulSoup(html, "lxml")
        containers = soup.find_all(
            "div", {"data-component-type": "s-search-result"})

        new_asins = set()
        for container in containers:
            asin = container.get("data-asin")
            if asin and asin.strip():
                new_asins.add(asin)

        return new_asins

    def save_to_csv(self, keyword: str):
        if not self.asins:
            print("No ASINs to save")
            return

        # Create results directory if it doesn't exist
        os.makedirs("results", exist_ok=True)

        # Generate filename
        csv_path = f"results/amazon_asins_{keyword.replace(' ', '_')}.csv"

        # Save as CSV
        with open(csv_path, 'w') as f:
            f.write("asin\n")
            for asin in sorted(self.asins):
                f.write(f"{asin}\n")

        print(f"ASINs saved to: {csv_path}")


async def main():
    scraper = AsinScraper()
    keyword = "laptop"
    max_pages = 5

    for page in range(1, max_pages + 1):
        print(f"Scraping page {page}...")
        html = await scraper.fetch_page(scraper.create_url(keyword, page))

        if not html:
            print(f"Failed to fetch page {page}")
            break

        new_asins = scraper.extract_asins(html)
        if new_asins:
            scraper.asins.update(new_asins)
            print(f"Found {len(new_asins)} ASINs on page {
                  page}. Total ASINs: {len(scraper.asins)}")
        else:
            print("No more ASINs found. Ending scrape.")
            break

    # Save results to CSV
    scraper.save_to_csv(keyword)


if __name__ == "__main__":
    asyncio.run(main())

So, what is the solution for scraping Amazon ASINs? The most reliable approach involves using residential proxies from the best proxy providers along with proper HTTP headers.

Using Bright Data Proxies to Scrape Amazon ASINs

Bright Data is a leading proxy provider with a global network of proxies. It offers different types of proxies on both shared and private servers, catering to a wide range of use cases. These servers can route traffic using the HTTP, HTTPS, and SOCKS protocols.

Why Choose Bright Data for Amazon Scraping?

  1. Vast IP Network: Access to 72M+ IPs across 195 countries
  2. Precise Geolocation Targeting: Target specific cities, ZIP codes, or even carriers
  3. Multiple Proxy Types: Choose from residential, datacenter, mobile, or ISP proxies
  4. High Reliability: 99.9% success rate with optional 100% uptime
  5. Flexible Scaling: Pay-as-you-go options available for businesses of all sizes

Setting Up Bright Data for Amazon Scraping

If you want to use Bright Data proxies for Amazon ASIN scraping, follow these simple steps:

Step 1: Sign Up for Bright Data

Visit the Bright Data website and create an account. If you already have an account, proceed to the next step.

Step 2: Create a New Proxy Zone

Log in, go to the Proxy & Scraping Infrastructure section, and click Add to create a new proxy zone. Select Residential proxies, which are the best option for avoiding anti-scraping restrictions as they use real device IPs.

Adding a new residential proxies zone under the Proxies & Scraping Infrastructure screen

Step 3: Configure Proxy Settings

Choose the regions or countries for browsing. Name your zone appropriately (e.g., “asin_scraping”).

The basic settings for the new zone

Bright Data allows precise geolocation targeting, down to the city or ZIP code.

Advanced settings that include geographical targeting

Step 4: Complete KYC Verification

For full access to Bright Data’s residential proxies, complete the KYC verification process.

Step 5: Start Using Proxies

Once the proxy zone is created, you’ll see the credentials (host, port, username, and password) you need to start scraping.

The zone is ready to use

Yes, it’s that simple!
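
Before wiring the proxy into a full scraper, it’s worth verifying the credentials with a single test request. Here is a minimal sketch; replace the placeholder username and password with the values from your zone (the IP-echo endpoint used here is just an example):

from curl_cffi import requests

proxy_url = "http://YOUR_USERNAME:YOUR_PASSWORD@brd.superproxy.io:33335"

# Route a simple request through the proxy and print the exit IP it reports
response = requests.get(
    "https://httpbin.org/ip",
    proxies={"http": proxy_url, "https": proxy_url},
    timeout=30,
)
print(response.status_code, response.text)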

Implementing the Scraper

Step 1: Setting Up Browser Headers

headers = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.9",
    "sec-ch-ua": '"Chromium";v="119", "Not?A_Brand";v="24"',
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": '"Windows"',
    "sec-fetch-dest": "document",
    "sec-fetch-mode": "navigate",
    "sec-fetch-site": "none",
    "sec-fetch-user": "?1",
    "upgrade-insecure-requests": "1",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
}

Step 2: Configuring Proxy Settings

proxy_config = {
    "username": "YOUR_USERNAME",
    "password": "YOUR_PASSWORD",
    "server": "brd.superproxy.io:33335",
}

proxy_url = f"http://{proxy_config['username']}:{proxy_config['password']}@{proxy_config['server']}"

Step 3: Making Requests

Make a request using headers and proxies with the curl_cffi library:

response = session.get(
    url,
    headers=headers,
    impersonate="chrome120",
    proxies={"http": proxy_url, "https": proxy_url},
    timeout=30,
    verify=False,
)

Note: The curl_cffi library is an excellent choice for web scraping because it can impersonate real browsers at the TLS and HTTP fingerprint level, something the standard requests library cannot do.

Step 4: Running Your Scraper

To execute your scraper, you’ll need to configure your target keywords. Here is an example:

keywords = [
    "coffee maker",
    "office desk",
    "cctv camera"
]
max_pages = None  # Set to None for all pages
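
To see how these pieces fit together, here is a condensed sketch that loops over the keywords, sends each search request through the proxy with the headers from Step 1, and collects ASINs the same way as the earlier script. It assumes the headers, proxy_url, keywords, and max_pages variables defined above; the scrape_keyword helper is illustrative rather than part of the linked code:

from curl_cffi import requests
from bs4 import BeautifulSoup


def scrape_keyword(keyword: str, max_pages: int | None) -> set[str]:
    session = requests.Session()
    asins: set[str] = set()
    page = 1
    while max_pages is None or page <= max_pages:
        url = f"https://www.amazon.com/s?k={keyword.replace(' ', '+')}&page={page}"
        response = session.get(
            url,
            headers=headers,                                  # from Step 1
            impersonate="chrome120",
            proxies={"http": proxy_url, "https": proxy_url},  # from Step 2
            timeout=30,
            verify=False,
        )
        if response.status_code != 200:
            break
        soup = BeautifulSoup(response.text, "lxml")
        results = soup.find_all("div", {"data-component-type": "s-search-result"})
        if not results:
            break
        asins.update(r["data-asin"] for r in results if r.get("data-asin"))
        page += 1
    return asins


for keyword in keywords:
    print(keyword, len(scrape_keyword(keyword, max_pages)))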

Find the complete code here.

The scraper saves its results to a CSV file containing the collected ASINs:

A CSV file with the scraped ASINs

Using Bright Data Amazon Scraper API to Extract ASINs

While proxy-based scraping works, using a Bright Data Amazon Scraper API offers significant advantages:

  • No Infrastructure Management: No need to worry about proxies, IP rotation, or CAPTCHAs
  • Geo-Location Scraping: Scrape from any geographical region
  • Simple Integration: Implementation in minutes with any programming language
  • Multiple Data Delivery Options:
    • Export to Amazon S3, Google Cloud, Azure, Snowflake, or SFTP
    • Get data in JSON, NDJSON, CSV, or .gz formats
  • GDPR & CCPA Compliant: Ensures privacy compliance for ethical web scraping
  • 20 Free API Calls: Test the service before committing
  • 24/7 Support: Dedicated support to assist with any API-related questions or issues

Setting Up the Amazon Scraper API

Setting up the API is simple and can be completed in a few steps.

Step 1: Access the API

Navigate to Web Scraper API and search for “amazon products search” under available APIs:

Finding the Amazon API under the Web Scraper API

Click “Start setting an API call”:

Setting up an API call

Step 2: Get Your API Token

Click “Get API token”:

Getting a new API token

Select “Add token”:

Adding a new token for your account

Save your new API token securely:

Save the new API token

Step 3: Configure Data Collection

In the Data Collection APIs tab:

  1. Specify keywords for product search
  2. Set target Amazon domains
  3. Define the number of pages to scrape
  4. Apply additional filters (optional)

Specification of keywords you are interested in

Using the API with Python

Here’s an example Python script to trigger data collection and retrieve results:

import json
import requests
import time
from typing import Dict, List, Optional, Tuple
from datetime import datetime
import logging
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from enum import Enum

class SnapshotStatus(Enum):
    SUCCESS = "success"
    PROCESSING = "processing"
    FAILED = "failed"
    TIMEOUT = "timeout"

class BrightDataAmazonScraper:
    def __init__(self, api_token: str, dataset_id: str):
        self.api_token = api_token
        self.dataset_id = dataset_id
        self.base_url = "https://api.brightdata.com/datasets/v3"
        self.headers = {
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        }

        # Setup logging with custom format
        logging.basicConfig(
            level=logging.INFO,
            format='%(message)s'  # Simplified format to show only messages
        )
        self.logger = logging.getLogger(__name__)

        # Setup session with retry strategy
        self.session = self._create_session()

        # Track progress
        self.last_progress_update = 0

    def _create_session(self) -> requests.Session:
        """Create a session with retry strategy"""
        session = requests.Session()
        retry_strategy = Retry(
            total=3,
            backoff_factor=0.5,
            status_forcelist=[500, 502, 503, 504]
        )
        adapter = HTTPAdapter(max_retries=retry_strategy)
        session.mount("https://", adapter)
        session.mount("http://", adapter)
        return session

    def trigger_collection(self, datasets: List[Dict]) -> Optional[str]:
        """Trigger data collection for specified datasets"""
        trigger_url = f"{self.base_url}/trigger?dataset_id={self.dataset_id}"

        try:
            response = self.session.post(
                trigger_url,
                headers=self.headers,
                json=datasets
            )
            response.raise_for_status()

            snapshot_id = response.json().get("snapshot_id")
            if snapshot_id:
                self.logger.info("Initializing Amazon data collection...")
                return snapshot_id
            else:
                self.logger.error("Unable to initialize data collection.")
                return None

        except requests.exceptions.RequestException as e:
            self.logger.error(f"Collection initialization failed: {str(e)}")
            return None

    def check_snapshot_status(self, snapshot_id: str) -> Tuple[SnapshotStatus, Optional[Dict]]:
        """Check the current status of a snapshot"""
        snapshot_url = f"{self.base_url}/snapshot/{snapshot_id}?format=json"

        try:
            response = self.session.get(snapshot_url, headers=self.headers)

            if response.status_code == 200:
                return SnapshotStatus.SUCCESS, response.json()
            elif response.status_code == 202:
                return SnapshotStatus.PROCESSING, None
            else:
                return SnapshotStatus.FAILED, None

        except requests.exceptions.RequestException:
            return SnapshotStatus.FAILED, None

    def wait_for_snapshot_data(
        self,
        snapshot_id: str,
        timeout: Optional[int] = None,
        check_interval: int = 10,
        max_interval: int = 300,
        callback=None
    ) -> Optional[Dict]:
        """Wait for snapshot data with minimal console output"""
        start_time = datetime.now()
        current_interval = check_interval
        attempts = 0
        progress_shown = False

        while True:
            attempts += 1

            if timeout is not None:
                elapsed_time = (datetime.now() - start_time).total_seconds()
                if elapsed_time >= timeout:
                    self.logger.error("Data collection exceeded time limit.")
                    return None

            status, data = self.check_snapshot_status(snapshot_id)

            if status == SnapshotStatus.SUCCESS:
                self.logger.info(
                    "Amazon data collection completed successfully!")
                return data

            elif status == SnapshotStatus.FAILED:
                self.logger.error("Data collection encountered an error.")
                return None

            elif status == SnapshotStatus.PROCESSING:
                # Show progress indicator only every 30 seconds
                current_time = time.time()
                if not progress_shown:
                    self.logger.info("Collecting data from Amazon...")
                    progress_shown = True
                elif current_time - self.last_progress_update >= 30:
                    self.logger.info("Data collection in progress...")
                    self.last_progress_update = current_time

                if callback:
                    callback(attempts, (datetime.now() - start_time).total_seconds())

                time.sleep(current_interval)
                current_interval = min(current_interval * 1.5, max_interval)

    def store_data(self, data: Dict, filename: str = "amazon_data.json") -> None:
        """Store collected data to a JSON file"""
        if data:
            try:
                with open(filename, "w", encoding='utf-8') as file:
                    json.dump(data, file, indent=4, ensure_ascii=False)
                self.logger.info(f"Data saved successfully to {filename}")
            except IOError as e:
                self.logger.error(f"Error saving data: {str(e)}")
        else:
            self.logger.warning("No data available to save.")

def progress_callback(attempts: int, elapsed_time: float):
    """Minimal callback function - can be customized based on needs"""
    pass  # Silent by default

def main():
    # Configuration
    API_TOKEN = "YOUR_API_TOKEN"
    DATASET_ID = "gd_lwdb4vjm1ehb499uxs"

    # Initialize scraper
    scraper = BrightDataAmazonScraper(API_TOKEN, DATASET_ID)

    # Define search parameters
    datasets = [
        {"keyword": "X-box", "url": "https://www.amazon.com", "pages_to_search": 1},
        {"keyword": "PS5", "url": "https://www.amazon.de"},
        {"keyword": "car cleaning kit",
            "url": "https://www.amazon.es", "pages_to_search": 4},
    ]

    # Execute scraping process
    snapshot_id = scraper.trigger_collection(datasets)
    if snapshot_id:
        data = scraper.wait_for_snapshot_data(
            snapshot_id,
            timeout=None,
            check_interval=10,
            max_interval=300,
            callback=progress_callback
        )

        if data:
            scraper.store_data(data)
            print("\nScraping process completed successfully!\n")

if __name__ == "__main__":
    main()

To run this code, make the following changes:

  1. Replace API_TOKEN with your actual API token.
  2. Modify the datasets list to include the products or keywords you want to search for.

Here’s a sample JSON structure of the data retrieved:

{
    "asin": "B0CJ3XWXP8",
    "url": "https://www.amazon.com/Xbox-X-Console-Renewed/dp/B0CJ3XWXP8/ref=sr_1_1",
    "name": "Xbox Series X Console (Renewed) Xbox Series X Console (Renewed)Sep 15, 2023",
    "sponsored": "false",
    "initial_price": 449.99,
    "final_price": 449.99,
    "currency": "USD",
    "sold": 2000,
    "rating": 4.1,
    "num_ratings": 1529,
    "variations": null,
    "badge": null,
    "business_type": null,
    "brand": null,
    "delivery": ["FREE delivery Sun, Dec 1", "Or fastest delivery Fri, Nov 29"],
    "keyword": "X-box",
    "image": "https://m.media-amazon.com/images/I/51ojzJk77qL._AC_UY218_.jpg",
    "domain": "https://www.amazon.com/",
    "bought_past_month": 2000,
    "page_number": 1,
    "rank_on_page": 1,
    "timestamp": "2024-11-26T05:15:24.590Z",
    "input": {
        "keyword": "X-box",
        "url": "https://www.amazon.com",
        "pages_to_search": 1,
    },
}

You can view the full output by downloading this sample JSON file.
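
Since the goal here is ASINs, a short post-processing step can pull them out of the saved file. This sketch assumes amazon_data.json contains a list of product records shaped like the sample above:

import json

with open("amazon_data.json", encoding="utf-8") as f:
    products = json.load(f)

# Collect the unique ASINs returned across all keywords and marketplaces
asins = {item["asin"] for item in products if item.get("asin")}
print(f"Collected {len(asins)} unique ASINs")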

Conclusion

We have walked through the process of collecting Amazon ASINs using Python and the challenges that come with it: CAPTCHAs, rate limits, and frequent layout changes can significantly hinder data gathering. Tools like Bright Data’s proxies or the Amazon Scraper API speed up the process and help you bypass these common obstacles. If you prefer to avoid setting up your own scraping tools altogether, Bright Data also offers ready-made Amazon datasets that you can use immediately.

Sign up now and start your free trial!

No credit card required