If you’re a seller or doing market research, knowing a product’s ASIN can help you quickly find exact product matches, analyze competitor listings, and stay ahead in the marketplace. This article will show you simple, effective methods to scrape Amazon ASINs at scale. You will also learn about Bright Data’s solution, which can significantly speed up this process.
What is an ASIN on Amazon?
An ASIN is a 10-character code that combines letters and numbers (for example, B07PZF3QK9). Amazon assigns this unique code to every product in its catalogue, from books to electronics to clothing.
There are two simple ways to find any product’s ASIN:
1. Look at the product URL – the ASIN appears right after “/dp/” in the address bar.
2. Scroll down to the product information section on any Amazon listing – you’ll find the ASIN listed there.
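If you want to automate the first method, here is a minimal sketch that pulls the ASIN out of a product URL with a regular expression (assuming the standard /dp/<ASIN> URL pattern described above):

import re

def extract_asin(url: str) -> str | None:
    # ASINs are 10 characters of uppercase letters and digits right after "/dp/"
    match = re.search(r"/dp/([A-Z0-9]{10})", url)
    return match.group(1) if match else None

print(extract_asin("https://www.amazon.com/dp/B07PZF3QK9"))  # prints: B07PZF3QK9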
How to Extract ASINs from Amazon
Scraping data from Amazon might seem straightforward initially, but it’s quite challenging due to their robust anti-scraping measures. Amazon actively protects against automated data collection through several sophisticated methods:
- CAPTCHA challenges that appear when suspicious activity is detected
- HTTP 503 errors that block access to requested pages
- Frequent website layout changes that break parsing logic
Here’s a screenshot of a typical HTTP 503 error triggered by Amazon:
You can try this simple script to scrape Amazon ASINs:
import asyncio
import os
from curl_cffi import requests
from bs4 import BeautifulSoup
from tenacity import retry, stop_after_attempt, wait_random


class AsinScraper:
    def __init__(self):
        self.session = requests.Session()
        self.asins = set()

    def create_url(self, keyword: str, page: int) -> str:
        return f"https://www.amazon.com/s?k={keyword.replace(' ', '+')}&page={page}"

    @retry(stop=stop_after_attempt(3), wait=wait_random(min=2, max=5))
    async def fetch_page(self, url: str) -> str | None:
        try:
            print(f"Fetching URL: {url}")
            response = self.session.get(url, impersonate="chrome120", timeout=30)
            print(f"HTTP Status Code: {response.status_code}")
            if response.status_code == 200:
                # Check for any block indicators in the response
                if "Sorry" not in response.text:
                    return response.text
                else:
                    print("Sorry, request blocked!")
            else:
                print(f"Unexpected HTTP status code: {response.status_code}")
        except Exception as e:
            print(f"Exception occurred during fetch: {e}")
        return None

    def extract_asins(self, html: str) -> set[str]:
        soup = BeautifulSoup(html, "lxml")
        containers = soup.find_all("div", {"data-component-type": "s-search-result"})
        new_asins = set()
        for container in containers:
            asin = container.get("data-asin")
            if asin and asin.strip():
                new_asins.add(asin)
        return new_asins

    def save_to_csv(self, keyword: str):
        if not self.asins:
            print("No ASINs to save")
            return
        # Create results directory if it doesn't exist
        os.makedirs("results", exist_ok=True)
        # Generate filename
        csv_path = f"results/amazon_asins_{keyword.replace(' ', '_')}.csv"
        # Save as CSV
        with open(csv_path, "w") as f:
            f.write("asin\n")
            for asin in sorted(self.asins):
                f.write(f"{asin}\n")
        print(f"ASINs saved to: {csv_path}")


async def main():
    scraper = AsinScraper()
    keyword = "laptop"
    max_pages = 5

    for page in range(1, max_pages + 1):
        print(f"Scraping page {page}...")
        html = await scraper.fetch_page(scraper.create_url(keyword, page))
        if not html:
            print(f"Failed to fetch page {page}")
            break
        new_asins = scraper.extract_asins(html)
        if new_asins:
            scraper.asins.update(new_asins)
            print(f"Found {len(new_asins)} ASINs on page {page}. Total ASINs: {len(scraper.asins)}")
        else:
            print("No more ASINs found. Ending scrape.")
            break

    # Save results to CSV
    scraper.save_to_csv(keyword)


if __name__ == "__main__":
    asyncio.run(main())
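Note: This script relies on a few third-party packages. Assuming you use pip, they can be installed with a command along these lines:

pip install curl_cffi beautifulsoup4 lxml tenacity

Running the script prints per-page progress and writes the unique ASINs it finds to results/amazon_asins_laptop.csv. In practice, though, unproxied requests like these are quickly blocked by the measures described above.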
So, what is the solution for scraping Amazon ASINs? The most reliable approach is to route your requests through residential proxies from a reputable proxy provider and to send realistic, browser-like HTTP headers.
Using Bright Data Proxies to Scrape Amazon ASINs
Bright Data is a leading proxy provider with a global network of proxies. It offers different types of proxies on both shared and private servers, catering to a wide range of use cases. These servers can route traffic using the HTTP, HTTPS, and SOCKS protocols.
Why Choose Bright Data for Amazon Scraping?
- Vast IP Network: Access to 72M+ IPs across 195 countries
- Precise Geolocation Targeting: Target specific cities, ZIP codes, or even carriers
- Multiple Proxy Types: Choose from residential, datacenter, mobile, or ISP proxies.
- High Reliability: 99.9% success rate with optional 100% uptime
- Flexible Scaling: Pay-as-you-go options available for businesses of all sizes
Setting Up Bright Data for Amazon Scraping
If you want to use Bright Data proxies for Amazon ASIN scraping, follow these simple steps:
Step 1: Sign Up for Bright Data
Visit the Bright Data website and create an account. If you already have an account, proceed to the next step.
Step 2: Create a New Proxy Zone
Log in, go to the Proxy & Scraping Infrastructure section, and click Add to create a new proxy zone. Select Residential proxies, which are the best option for avoiding anti-scraping restrictions as they use real device IPs.
Step 3: Configure Proxy Settings
Choose the regions or countries for browsing. Name your zone appropriately (e.g., “asin_scraping”).
Bright Data allows precise geolocation targeting, down to the city or ZIP code.
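In practice, this targeting is usually applied by appending flags to the proxy username. The snippet below is an illustrative sketch only; confirm the exact format for your zone in the Bright Data dashboard:

# Hypothetical example: routing requests through US residential IPs by
# adding a country flag to the zone username (the format may differ per account)
username = "brd-customer-<CUSTOMER_ID>-zone-asin_scraping-country-us"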
Step 4: Complete KYC Verification
For full access to Bright Data’s residential proxies, complete the KYC verification process.
Step 5: Start Using Proxies
Once the proxy zone is created, you’ll see credentials (host, port, username, password) to start scraping.
Yes, it’s that simple!
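Before wiring the proxy into a scraper, you can sanity-check the credentials with a quick request to an IP-echo service. A minimal sketch (httpbin.org is just one convenient choice; substitute your own credentials):

import requests  # any HTTP client works for this quick check

proxy_url = "http://YOUR_USERNAME:YOUR_PASSWORD@brd.superproxy.io:33335"
response = requests.get(
    "https://httpbin.org/ip",
    proxies={"http": proxy_url, "https": proxy_url},
    timeout=30,
)
print(response.text)  # should show the proxy's exit IP, not your own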
Implementing the Scraper
Step 1: Setting Up Browser Headers
headers = {
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
"accept-language": "en-US,en;q=0.9",
"sec-ch-ua": '"Chromium";v="119", "Not?A_Brand";v="24"',
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-platform": '"Windows"',
"sec-fetch-dest": "document",
"sec-fetch-mode": "navigate",
"sec-fetch-site": "none",
"sec-fetch-user": "?1",
"upgrade-insecure-requests": "1",
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
}
Step 2: Configuring Proxy Settings
proxy_config = {
"username": "YOUR_USERNAME",
"password": "YOUR_PASSWORD",
"server": "brd.superproxy.io:33335",
}
proxy_url = f"http://{proxy_config['username']}:{proxy_config['password']}@{proxy_config['server']}"
Step 3: Making Requests
Make a request using headers and proxies with the curl_cffi library:
response = session.get(
url,
headers=headers,
impersonate="chrome120",
proxies={"http": proxy_url, "https": proxy_url},
timeout=30,
verify=False,
)
Note: The curl_cffi library is an excellent choice for web scraping, offering advanced browser impersonation capabilities that outperform the standard requests library.
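Putting these pieces together, the fetch step from the earlier script could be adapted roughly like this (a sketch that assumes the session, headers, and proxy_url objects defined above):

from curl_cffi import requests

session = requests.Session()

def fetch_page(url: str) -> str | None:
    # Route the request through the Bright Data proxy with browser-like headers
    response = session.get(
        url,
        headers=headers,
        impersonate="chrome120",
        proxies={"http": proxy_url, "https": proxy_url},
        timeout=30,
        verify=False,
    )
    # Treat anything other than a clean 200 (or a "Sorry" block page) as a failure
    if response.status_code == 200 and "Sorry" not in response.text:
        return response.text
    return None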
Step 4: Running Your Scraper
To execute your scraper, you’ll need to configure your target keywords. Here is an example:
keywords = [
"coffee maker",
"office desk",
"cctv camera"
]
max_pages = None # Set to None for all pages
Find the complete code here.
The scraper will output the collected ASINs for each keyword to a CSV file.
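For illustration, a driver loop over these keywords might look like the following sketch (it reuses the AsinScraper class from the first script, assuming its fetch step has been updated with the headers and proxy settings shown above):

async def run(keywords: list[str], max_pages: int | None) -> None:
    for keyword in keywords:
        scraper = AsinScraper()  # fresh ASIN set per keyword
        page = 1
        while max_pages is None or page <= max_pages:
            html = await scraper.fetch_page(scraper.create_url(keyword, page))
            if not html:
                break
            new_asins = scraper.extract_asins(html)
            if not new_asins:
                break  # no more results for this keyword
            scraper.asins.update(new_asins)
            page += 1
        scraper.save_to_csv(keyword)

asyncio.run(run(keywords, max_pages))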
Using Bright Data Amazon Scraper API to Extract ASINs
While proxy-based scraping works, Bright Data's Amazon Scraper API offers significant advantages:
- No Infrastructure Management: No need to worry about proxies, IP rotations, or captchas
- Geo-Location Scraping: Scrape from any geographical region
- Simple Integration: Implementation in minutes with any programming language
- Multiple Data Delivery Options:
- Export to Amazon S3, Google Cloud, Azure, Snowflake, or SFTP
- Get data in JSON, NDJSON, CSV, or .gz formats
- GDPR & CCPA Compliant: Ensures privacy compliance for ethical web scraping
- 20 Free API Calls: Test the service before committing
- 24/7 Support: Dedicated support to assist with any API-related questions or issues
Setting Up the Amazon Scraper API
Setting up the API is simple and can be completed in a few steps.
Step 1: Access the API
Navigate to Web Scraper API and search for “amazon products search” under available APIs:
Click “Start setting an API call”:
Step 2: Get Your API Token
Click “Get API token”:
Select “Add token”:
Save your new API token securely:
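Rather than hard-coding the token in your scripts, consider loading it from an environment variable. A small sketch (BRIGHTDATA_API_TOKEN is just an illustrative name):

import os

# Hypothetical variable name; export it in your shell before running the script
API_TOKEN = os.environ["BRIGHTDATA_API_TOKEN"]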
Step 3: Configure Data Collection
In the Data Collection APIs tab:
- Specify keywords for product search
- Set target Amazon domains
- Define the number of pages to scrape
- Apply additional filters (optional)
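Under the hood, each search is expressed as a small record containing the keyword, the target Amazon domain, and the number of pages. For example (a sketch mirroring the inputs used in the script below):

datasets = [
    {"keyword": "coffee maker", "url": "https://www.amazon.com", "pages_to_search": 2},
    {"keyword": "office desk", "url": "https://www.amazon.de", "pages_to_search": 1},
]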
Using the API with Python
Here’s an example Python script to trigger data collection and retrieve results:
import json
import requests
import time
from typing import Dict, List, Optional, Union, Tuple
from datetime import datetime, timedelta
import logging
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from enum import Enum
class SnapshotStatus(Enum):
SUCCESS = "success"
PROCESSING = "processing"
FAILED = "failed"
TIMEOUT = "timeout"
class BrightDataAmazonScraper:
def __init__(self, api_token: str, dataset_id: str):
self.api_token = api_token
self.dataset_id = dataset_id
self.base_url = "https://api.brightdata.com/datasets/v3"
self.headers = {
"Authorization": f"Bearer {api_token}",
"Content-Type": "application/json",
}
# Setup logging with custom format
logging.basicConfig(
level=logging.INFO,
format='%(message)s' # Simplified format to show only messages
)
self.logger = logging.getLogger(__name__)
# Setup session with retry strategy
self.session = self._create_session()
# Track progress
self.last_progress_update = 0
def _create_session(self) -> requests.Session:
"""Create a session with retry strategy"""
session = requests.Session()
retry_strategy = Retry(
total=3,
backoff_factor=0.5,
status_forcelist=[500, 502, 503, 504]
)
adapter = HTTPAdapter(max_retries=retry_strategy)
session.mount("https://", adapter)
session.mount("http://", adapter)
return session
def trigger_collection(self, datasets: List[Dict]) -> Optional[str]:
"""Trigger data collection for specified datasets"""
trigger_url = f"{self.base_url}/trigger?dataset_id={self.dataset_id}"
try:
response = self.session.post(
trigger_url,
headers=self.headers,
json=datasets
)
response.raise_for_status()
snapshot_id = response.json().get("snapshot_id")
if snapshot_id:
self.logger.info("Initializing Amazon data collection...")
return snapshot_id
else:
self.logger.error("Unable to initialize data collection.")
return None
except requests.exceptions.RequestException as e:
self.logger.error(f"Collection initialization failed: {str(e)}")
return None
def check_snapshot_status(self, snapshot_id: str) -> Tuple[SnapshotStatus, Optional[Dict]]:
"""Check the current status of a snapshot"""
snapshot_url = f"{self.base_url}/snapshot/{snapshot_id}?format=json"
try:
response = self.session.get(snapshot_url, headers=self.headers)
if response.status_code == 200:
return SnapshotStatus.SUCCESS, response.json()
elif response.status_code == 202:
return SnapshotStatus.PROCESSING, None
else:
return SnapshotStatus.FAILED, None
except requests.exceptions.RequestException:
return SnapshotStatus.FAILED, None
def wait_for_snapshot_data(
self,
snapshot_id: str,
timeout: Optional[int] = None,
check_interval: int = 10,
max_interval: int = 300,
callback=None
) -> Optional[Dict]:
"""Wait for snapshot data with minimal console output"""
start_time = datetime.now()
current_interval = check_interval
attempts = 0
progress_shown = False
while True:
attempts += 1
if timeout is not None:
elapsed_time = (datetime.now() - start_time).total_seconds()
if elapsed_time >= timeout:
self.logger.error("Data collection exceeded time limit.")
return None
status, data = self.check_snapshot_status(snapshot_id)
if status == SnapshotStatus.SUCCESS:
self.logger.info(
"Amazon data collection completed successfully!")
return data
elif status == SnapshotStatus.FAILED:
self.logger.error("Data collection encountered an error.")
return None
elif status == SnapshotStatus.PROCESSING:
# Show progress indicator only every 30 seconds
current_time = time.time()
if not progress_shown:
self.logger.info("Collecting data from Amazon...")
progress_shown = True
elif current_time - self.last_progress_update >= 30:
self.logger.info("Data collection in progress...")
self.last_progress_update = current_time
if callback:
callback(attempts, (datetime.now() -
start_time).total_seconds())
time.sleep(current_interval)
current_interval = min(current_interval * 1.5, max_interval)
def store_data(self, data: Dict, filename: str = "amazon_data.json") -> None:
"""Store collected data to a JSON file"""
if data:
try:
with open(filename, "w", encoding='utf-8') as file:
json.dump(data, file, indent=4, ensure_ascii=False)
self.logger.info(f"Data saved successfully to {filename}")
except IOError as e:
self.logger.error(f"Error saving data: {str(e)}")
else:
self.logger.warning("No data available to save.")
def progress_callback(attempts: int, elapsed_time: float):
"""Minimal callback function - can be customized based on needs"""
pass # Silent by default
def main():
# Configuration
API_TOKEN = "YOUR_API_TOKEN"
DATASET_ID = "gd_lwdb4vjm1ehb499uxs"
# Initialize scraper
scraper = BrightDataAmazonScraper(API_TOKEN, DATASET_ID)
# Define search parameters
datasets = [
{"keyword": "X-box", "url": "https://www.amazon.com", "pages_to_search": 1},
{"keyword": "PS5", "url": "https://www.amazon.de"},
{"keyword": "car cleaning kit",
"url": "https://www.amazon.es", "pages_to_search": 4},
]
# Execute scraping process
snapshot_id = scraper.trigger_collection(datasets)
if snapshot_id:
data = scraper.wait_for_snapshot_data(
snapshot_id,
timeout=None,
check_interval=10,
max_interval=300,
callback=progress_callback
)
if data:
scraper.store_data(data)
print("\nScraping process completed successfully!\n")
if __name__ == "__main__":
main()
To run this code:
- Replace API_TOKEN with your actual API token.
- Modify the datasets list to include the products or keywords you want to search for.
Here’s a sample JSON structure of the data retrieved:
{
"asin": "B0CJ3XWXP8",
"url": "https://www.amazon.com/Xbox-X-Console-Renewed/dp/B0CJ3XWXP8/ref=sr_1_1",
"name": "Xbox Series X Console (Renewed) Xbox Series X Console (Renewed)Sep 15, 2023",
"sponsored": "false",
"initial_price": 449.99,
"final_price": 449.99,
"currency": "USD",
"sold": 2000,
"rating": 4.1,
"num_ratings": 1529,
"variations": null,
"badge": null,
"business_type": null,
"brand": null,
"delivery": ["FREE delivery Sun, Dec 1", "Or fastest delivery Fri, Nov 29"],
"keyword": "X-box",
"image": "https://m.media-amazon.com/images/I/51ojzJk77qL._AC_UY218_.jpg",
"domain": "https://www.amazon.com/",
"bought_past_month": 2000,
"page_number": 1,
"rank_on_page": 1,
"timestamp": "2024-11-26T05:15:24.590Z",
"input": {
"keyword": "X-box",
"url": "https://www.amazon.com",
"pages_to_search": 1,
},
}
You can view the full output by downloading this sample JSON file.
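Since the goal of this article is collecting ASINs, you can pull just that field out of the saved results. A minimal sketch, assuming the amazon_data.json file written by store_data above contains a JSON array of product records like the one shown:

import json

with open("amazon_data.json", encoding="utf-8") as f:
    products = json.load(f)

# Deduplicate and sort the ASINs found in the product records
asins = sorted({item["asin"] for item in products if item.get("asin")})
print(f"Collected {len(asins)} unique ASINs")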
Conclusion
We have walked through collecting Amazon ASINs with Python and seen the challenges along the way: CAPTCHAs, rate limits, and frequent layout changes can significantly hinder data gathering. Tools like Bright Data's proxies or the Amazon Scraper API help you bypass these obstacles and speed up the process. If you prefer to avoid setting up scraping tools altogether, Bright Data also offers ready-made Amazon datasets that you can use immediately.
Sign up now and start your free trial!
No credit card required