
Building an AI-Powered Amazon Product Analyzer with Bright Data, Gemini, and Streamlit

Discover how to create an AI-driven Amazon product analyzer app using Bright Data’s API, Gemini AI, and Streamlit for powerful insights and dashboards.

Tired of manually comparing products on Amazon? Want to ask AI questions about your search results? Need better insights than just sorting by price or rating? In this tutorial, you’ll build an Amazon product analyzer that searches any of 23 Amazon marketplaces, analyzes the results with AI, and presents insights through interactive dashboards.

What you’ll build

By the end of this guide, you’ll have a functional web app that fetches Amazon product data and organizes it into an easy-to-browse dashboard with AI-powered insights.

[Screenshot: the finished Amazon product analytics dashboard]

Core features and user workflow

Here’s how it works:

  1. Search and data collection. Select one of 23 Amazon marketplaces (United States, Germany, Japan, etc.) and enter a product keyword like “wireless headphones”. The app uses Bright Data’s Web Scraper API to collect product information.
  2. Organized results display. Data is presented through a clean, tab-based interface:
    • Recommendations. View products ranked by a custom scoring algorithm (combining rating, review count, and discount) across 3 categories: “Best Overall Value”, “Highest Rated”, and “Best Deal”.
    • Market analysis. Explore interactive charts showing price distributions and rating patterns to understand the product landscape.
    • AI assistant. Ask questions in plain English, like “Which products have the highest ratings under $100?”. The AI analyzes your current search results and provides answers with product citations.
    • Product results. Browse, sort, and export the complete dataset to CSV for further analysis.

Now that we’ve seen what the app does, let’s look at the technologies that make it possible.

The technology stack and project architecture

Our app uses a modern, Python-based stack, with each component chosen for its specific strengths in data handling, AI, and web development.

  • Data source – Bright Data Amazon Scraper API: Reliable, enterprise-scale Amazon data collection without the hassle of managing proxies or solving CAPTCHAs.
  • Frontend – Streamlit: Rapidly build an interactive and beautiful web dashboard using only Python.
  • AI integration – Google Gemini: Natural-language insights, data summarization, and AI assistant functionality.
  • Data processing – Pandas: The cornerstone for all data cleaning, transformation, and analysis.
  • Mathematical operations – NumPy: Value scoring algorithms and statistical computations.
  • Visualizations – Plotly: Rich, interactive charts and graphs that users can explore.
  • HTTP(S) & retries – Requests + Tenacity: Robust and resilient communication with external APIs.

Project architecture

The project is organized into a modular structure to ensure clean separation of concerns, making the code easier to maintain and extend.

├── streamlit_app.py         # Main Streamlit application entry point
├── requirements.txt         # Project dependencies
├── .env                     # API keys and environment variables (private)
└── amazon_analytics/        # Core application logic module
    ├── __init__.py          # Package initialization
    ├── api.py               # Bright Data API integration
    ├── data_processor.py    # Data cleaning, normalization, and feature engineering
    ├── shopping_intelligence.py # Product recommendation and scoring engine
    ├── gemini_ai_engine.py  # AI analysis and prompt engineering with Gemini
    ├── ai_engine_interface.py   # Abstract AI engine interface
    ├── ai_response.py       # Standardized AI response objects
    └── config.py            # Configuration management

With the architecture clear, let’s get your development environment ready.

Prerequisites

Before we start coding, make sure you have the following ready:

  • Python 3.8+. If you don’t have it installed, download it from the official Python website.
  • A Bright Data account. You’ll need an API key to access the Amazon Scraper API. Sign up for a free trial to get started and generate your API key.
  • A Google API key. This is required for using the Gemini AI model. You can generate one from the Google AI Studio.
  • Basic knowledge. Familiarity with Python, Pandas, and the concept of web APIs will be helpful.

Once you have these, we can proceed with setting up the project.

Step 1 – setting up your development environment

First, let’s clone the project repository, create a virtual environment to isolate our dependencies, and install the required packages.

Installation

Open your terminal and run the following commands:

# Clone the repository
git clone https://github.com/triposat/amazon-product-analytics.git
cd amazon-product-analytics

# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows use: venv\Scripts\activate

# Install the required libraries
pip install -r requirements.txt

API key configuration

Next, create a .env file in the root directory of the project to securely store your API keys.

# Create the .env file
touch .env

Now, open the .env file in a text editor and add your keys:

BRIGHT_DATA_TOKEN=your_bright_data_token_here
GOOGLE_API_KEY=your_google_api_key_here
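
Behind the scenes, config.py is responsible for reading these variables. A minimal sketch of what that loading can look like with python-dotenv (the dataset ID variable and its placeholder value are illustrative, not the repo's exact code):

# amazon_analytics/config.py – illustrative sketch; the actual file in the repo may differ
import os

from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the project root

BRIGHT_DATA_TOKEN = os.getenv("BRIGHT_DATA_TOKEN")
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")

# ID of Bright Data's Amazon search dataset, copied from your Bright Data dashboard
# (the value below is a placeholder)
BRIGHT_DATA_DATASET_ID = os.getenv("BRIGHT_DATA_DATASET_ID", "gd_xxxxxxxxxxxxxx")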

Your environment is now fully configured. Let’s dive into the core logic, starting with data collection.

Step 2 – scraping Amazon product data with Bright Data

The foundation of our app is high-quality data. Manually scraping a site like Amazon is complex – you have to manage proxies, handle different page layouts, and find ways to bypass Amazon’s CAPTCHAs and blocking mechanisms.

Bright Data’s Amazon Web Scraper API abstracts all this complexity away. It provides:

  • Enterprise-grade reliability. Built on a network of over 150M ethically-sourced residential proxy IPs across 195 countries, ensuring consistent and uninterrupted access.
  • Zero infrastructure hassle. Automatic IP rotation, CAPTCHA solving, and proxy management are handled for you behind the scenes.
  • Comprehensive structured data. Delivers clean, structured JSON data with 20+ data points per product, including ASIN, prices, ratings, reviews, seller information, product descriptions, images, availability, and much more.
  • Cost-effective pricing. Pay-as-you-go model starting from $0.001 per record, making it scalable for projects of any size.

API integration (api.py)

Our BrightDataAPI class in api.py handles all interactions with the API. It uses a trigger-poll-download workflow, which is ideal for handling potentially long-running scraping jobs.

The trigger_search method initiates the scraping job. Notice the use of the @retry decorator from the tenacity library – this adds resilience by automatically retrying the request with exponential backoff if it fails.

# amazon_analytics/api.py

import time
from typing import Optional

import requests
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

# Token and dataset ID come from config.py (see the project structure above)
from .config import BRIGHT_DATA_TOKEN, BRIGHT_DATA_DATASET_ID


class BrightDataAPIError(Exception):
    """Raised when the Bright Data API reports a failure."""


class BrightDataAPI:
    def __init__(self, token: Optional[str] = None):
        self.token = token or BRIGHT_DATA_TOKEN
        self.base_url = "https://api.brightdata.com/datasets/v3"
        self.headers = {
            "Authorization": f"Bearer {self.token}",
            "Content-Type": "application/json"
        }

    @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=4, max=10),
        retry=retry_if_exception_type((requests.RequestException, BrightDataAPIError))
    )
    def trigger_search(self, keyword: str, amazon_url: str, pages_to_search: str = "") -> str:
        """Triggers a new scraping job and returns the snapshot ID."""
        payload = [{
            "keyword": keyword,
            "url": amazon_url,
            "pages_to_search": pages_to_search
        }]

        response = requests.post(
            f"{self.base_url}/trigger",
            headers=self.headers,
            json=payload,
            params={
                "dataset_id": BRIGHT_DATA_DATASET_ID,
                "include_errors": "true",
                "limit_multiple_results": "150"
            },
            timeout=30
        )
        response.raise_for_status()
        return response.json()["snapshot_id"]

After triggering a search, the wait_for_results method polls the API until the job is complete and then downloads the data. This prevents the app from hanging while waiting and includes a timeout to avoid infinite loops.
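
The article doesn't reproduce that method in full, but a minimal sketch of the poll-and-download loop (assuming the datasets API's standard progress and snapshot endpoints; the real method adds richer error handling) could look like this:

# amazon_analytics/api.py – illustrative sketch of the polling step
def wait_for_results(self, snapshot_id: str, timeout: int = 300, poll_interval: int = 10) -> list:
    """Polls the snapshot's progress, then downloads the results once they are ready."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        progress = requests.get(
            f"{self.base_url}/progress/{snapshot_id}",
            headers=self.headers,
            timeout=30
        )
        progress.raise_for_status()
        if progress.json().get("status") == "ready":
            data = requests.get(
                f"{self.base_url}/snapshot/{snapshot_id}",
                headers=self.headers,
                params={"format": "json"},
                timeout=60
            )
            data.raise_for_status()
            return data.json()
        time.sleep(poll_interval)  # still running – wait and poll again
    raise BrightDataAPIError(f"Snapshot {snapshot_id} was not ready after {timeout}s")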

With reliable data collection in place, the next step is to clean and enrich this raw data.

Step 3 – building the data processing pipeline

Raw data from any source is rarely in the perfect format for analysis. Our DataProcessor class in data_processor.py is responsible for cleaning, normalizing, and engineering new features from the scraped Amazon data, making it ready for our AI and visualization layers. For a broader look at data processing, see our guide on data analysis with Python.

Intelligent price parsing

A significant challenge in eCommerce data is handling international formats. For example, a price in Germany might be “1.234,56” while in the US it’s “1,234.56”. The parse_float_locale function intelligently handles these variations.

# amazon_analytics/data_processor.py (simplified for readability)

import re
from typing import Any, Optional

def parse_float_locale(self, value: Any) -> Optional[float]:
    """Robust float parser handling international number formats."""
    if value is None or value == "":
        return None
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        s = re.sub(r"[^0-9.,]", "", value)
        has_comma = "," in s
        has_dot = "." in s

        if has_comma and has_dot:
            # Determine decimal separator by last position
            if s.rfind(',') > s.rfind('.'):
                s = s.replace('.', '').replace(',', '.')  # European format
            else:
                s = s.replace(',', '')  # US format
        elif has_comma:
            # Check if comma is thousands separator or decimal
            if re.search(r",\d{3}$", s):
                s = s.replace(',', '')  # Thousands separator
            else:
                s = s.replace(',', '.')  # Decimal separator

        try:
            return float(s)
        except ValueError:
            # Nothing numeric survived the cleanup (e.g. "N/A")
            return None
    return None
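
For example, assuming a DataProcessor instance called processor, price strings from different locales resolve to the same float:

processor = DataProcessor()
processor.parse_float_locale("1.234,56")    # German format    -> 1234.56
processor.parse_float_locale("$1,234.56")   # US format        -> 1234.56
processor.parse_float_locale("1,99 €")      # comma as decimal -> 1.99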

Custom value scoring algorithm

To help users quickly identify the best products, we created a custom value_score. This composite metric combines multiple factors into a single, easy-to-understand score.

# amazon_analytics/data_processor.py

import math
from typing import Optional

def compute_value_score(
    self,
    rating: Optional[float],
    num_ratings: Optional[int],
    discount_pct: Optional[float],
    min_reviews: int = 10
) -> float:
    """Computes a composite value score based on quality, social proof, and deal value."""
    score = 0.0

    # 40% weight for product quality (rating)
    if rating and rating > 0:
        score += (rating / 5.0) * 0.4

    # 30% weight for social proof (number of ratings)
    if num_ratings and num_ratings >= min_reviews:
        # Logarithmic scale to prevent mega-popular items from dominating
        review_score = min(math.log10(num_ratings) / 4, 1.0)
        score += review_score * 0.3

    # 30% weight for deal value (discount percentage)
    if discount_pct and discount_pct > 0:
        discount_score = min(discount_pct / 50, 1.0)  # Cap at 50% discount
        score += discount_score * 0.3

    return round(score, 2)

This algorithm balances quality (rating), social proof (review volume), and deal value (discount) to provide a holistic measure of a product’s appeal.
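
To make the weighting concrete, here is the formula applied to invented numbers (processor is a DataProcessor instance):

# A product rated 4.0 stars with 10,000 reviews and a 25% discount:
#   quality:      (4.0 / 5.0) * 0.4               = 0.32
#   social proof: min(log10(10000) / 4, 1) * 0.3  = 0.30
#   deal value:   min(25 / 50, 1) * 0.3           = 0.15
score = processor.compute_value_score(rating=4.0, num_ratings=10_000, discount_pct=25.0)
print(score)  # 0.77
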
Now that our data is clean and enriched, we can feed it to our AI engine for deeper insights.

Step 4 – integrating AI for smart analysis with Gemini

This is where our app becomes truly intelligent. We use Google’s Gemini AI to analyze the processed data and answer user questions. A major challenge with LLMs is hallucination – inventing facts not present in the source data. Our GeminiAIEngine is designed to prevent this.

# amazon_analytics/gemini_ai_engine.py (significantly simplified for tutorial clarity)

import json

import pandas as pd

def _create_anti_hallucination_prompt(self, user_query: str, df: pd.DataFrame) -> str:
    """Creates a hallucination-proof prompt by including all data context."""

    # Note: The actual implementation includes detailed field mapping,
    # type conversion, and NaN handling for 20+ product attributes
    products_data = []
    for _, row in df.iterrows():
        product = {
            'name': str(row.get('name', 'N/A')),
            'asin': str(row.get('asin', 'N/A')),
            'final_price': float(row.get('final_price', 0)) if pd.notna(row.get('final_price')) else 0,
            'rating': float(row.get('rating', 0)) if pd.notna(row.get('rating')) else 0,
            'num_ratings': int(row.get('num_ratings', 0)) if pd.notna(row.get('num_ratings')) else 0,
            # ... additional fields with proper type handling
        }
        products_data.append(product)

    return f"""You are an expert Amazon product analyst with advanced reasoning capabilities.

ZERO HALLUCINATION RULES:
1. NEVER make up or invent ANY product information
2. ONLY use data explicitly provided below
3. If information is missing, clearly state "This information is not available"
4. Always cite specific product ASINs for verification
5. Use your reasoning to provide valuable insights based on the actual data

REASONING CAPABILITIES:
- Compare products by analyzing price, ratings, reviews, and features
- Identify best value products by considering price vs rating relationship  
- Assess product trust by evaluating rating quality and review volume
- Detect deals by comparing initial_price vs final_price

USER QUERY: {user_query}

AVAILABLE PRODUCT DATA ({len(df)} products):
{json.dumps(products_data, indent=2)}

Use your reasoning to analyze this data and provide helpful, accurate insights. Include specific ASINs and numbers for verification."""

Key anti-hallucination techniques:

  1. Complete data inclusion. All product information is provided to the AI, leaving no gaps for speculation.
  2. Explicit boundaries. Clear rules about what the AI can and cannot do.
  3. ASIN citations. Forces the AI to reference specific products for verification.
  4. Structured data format. JSON format makes data parsing reliable for the AI.

This prompt engineering approach transforms the AI into a reliable data analyst, making its output trustworthy and verifiable.
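
The article focuses on the prompt itself; the call into Gemini is comparatively small. A minimal sketch of how the engine might send that prompt with the google-generativeai SDK (the class layout, model name, and config import are assumptions, not the repo's exact code):

# amazon_analytics/gemini_ai_engine.py – illustrative sketch of the Gemini call
import google.generativeai as genai
import pandas as pd

from .config import GOOGLE_API_KEY  # assumed to be exposed by config.py

class GeminiAIEngine:
    def __init__(self, model_name: str = "gemini-1.5-flash"):  # assumed model name
        genai.configure(api_key=GOOGLE_API_KEY)
        self.model = genai.GenerativeModel(model_name)

    def answer_query(self, user_query: str, df: pd.DataFrame) -> str:
        """Grounds the user's question in the scraped data, then asks Gemini."""
        prompt = self._create_anti_hallucination_prompt(user_query, df)
        response = self.model.generate_content(prompt)
        return response.text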

With the AI engine ready, we can build the recommendation system.

Step 5 – creating the shopping intelligence engine

The ShoppingIntelligenceEngine in shopping_intelligence.py uses the processed data to generate 3 main recommendations: “Best Overall Value”, “Highest Rated”, and “Best Deal”. The engine applies sophisticated filtering criteria to ensure quality recommendations.

The system works with a list of product dictionaries and uses separate helper methods for each recommendation category, each with specific quality thresholds.

# amazon_analytics/shopping_intelligence.py

from typing import Any, Dict, List

class ShoppingIntelligenceEngine:
    def analyze_products(self, products: List[Dict[str, Any]]) -> Dict[str, Any]:
        """Generate shopping intelligence from product data."""
        if not products:
            return {'total_items': 0, 'top_picks': []}

        top_picks = self._generate_top_picks(products)

        return {
            'total_items': len(products),
            'top_picks': top_picks
        }

    def _generate_top_picks(self, products: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        """Generate top product recommendations with reasoning."""
        try:
            # First, filter for valid products only
            valid_products = []
            for product in products:
                rating = product.get('rating')
                price = product.get('final_price')

                if rating is not None and price is not None and rating > 0 and price > 0:
                    valid_products.append(product)

            if not valid_products:
                return []

            picks = []
            used_asins = set()

            # Find each category using specialized methods
            best_value = self._find_best_value(valid_products)
            if best_value and best_value.get('asin') not in used_asins:
                picks.append({
                    'product': best_value,
                    'reason': 'Best Overall Value',
                    'explanation': 'Excellent balance of quality, price, and customer reviews'
                })
                used_asins.add(best_value['asin'])

            highest_rated = self._find_highest_rated(valid_products)
            if highest_rated and highest_rated.get('asin') not in used_asins:
                picks.append({
                    'product': highest_rated,
                    'reason': 'Highest Rated',
                    'explanation': 'Top customer satisfaction with proven track record'
                })
                used_asins.add(highest_rated['asin'])

            best_deal = self._find_best_deal(valid_products)
            if best_deal and best_deal.get('asin') not in used_asins:
                picks.append({
                    'product': best_deal,
                    'reason': 'Best Deal',
                    'explanation': 'Great value with significant savings and good quality'
                })
                used_asins.add(best_deal['asin'])

            # Fill remaining slots with quality products if needed
            if len(picks) < 3:
                remaining_products = [p for p in valid_products if p.get('asin') not in used_asins]
                remaining_products.sort(key=lambda x: x.get('value_score', 0), reverse=True)

                for product in remaining_products[:3-len(picks)]:
                    picks.append({
                        'product': product,
                        'reason': 'Quality Choice',
                        'explanation': 'Good balance of quality and value'
                    })

            return picks[:3]

        except Exception:
            return []

Quality filtering methods

Each recommendation category has specific quality thresholds to ensure reliable recommendations:

def _find_best_value(self, products: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Find product with best value score - requires 10+ reviews."""
    candidates = [p for p in products if
                 p.get('value_score') is not None and
                 p.get('num_ratings', 0) >= 10]

    if not candidates:
        return None

    return max(candidates, key=lambda p: p.get('value_score', 0))

def _find_highest_rated(self, products: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Find highest rated product - requires 4.0+ rating and 50+ reviews."""
    candidates = [p for p in products if
                 p.get('rating', 0) >= 4.0 and
                 p.get('num_ratings', 0) >= 50]

    if not candidates:
        return None

    return max(candidates, key=lambda p: (p.get('rating', 0), p.get('num_ratings', 0)))

def _find_best_deal(self, products: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Find best discount - requires 10%+ discount and 3.5+ rating."""
    candidates = [p for p in products if
                 p.get('discount_pct') is not None and
                 p.get('discount_pct', 0) >= 10 and
                 p.get('rating', 0) >= 3.5]

    if not candidates:
        return None

    return max(candidates, key=lambda p: p.get('discount_pct', 0))

Key design decisions:

  1. Quality thresholds. Each category has minimum standards to prevent recommending poor products.
  2. No duplicates. used_asins set ensures each product appears only once.
  3. Fallback logic. If fewer than 3 recommendations found, fills with next-best value scores.
  4. Error handling. A try/except block prevents crashes on malformed data.

This approach ensures users get reliable, high-quality recommendations rather than just the first products found.
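
A quick usage sketch (processed_results is the list of product dictionaries produced by the DataProcessor in Step 3):

engine = ShoppingIntelligenceEngine()
intel = engine.analyze_products(processed_results)

print(f"Analyzed {intel['total_items']} products")
for pick in intel['top_picks']:
    print(f"{pick['reason']}: {pick['product']['name']} – {pick['explanation']}")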

Now we have all the backend components. Let’s build the user interface to bring it all together.

Step 6 – designing the interactive dashboard with Streamlit

The final piece is the user interface, handled by streamlit_app.py. Streamlit allows you to build a reactive, web-based dashboard with minimal code. The app uses a sophisticated tab-based layout with real-time progress tracking and multiple chart types.

Session state and component caching

The app uses specific session state variables to manage data flow and caches backend components for performance:

# streamlit_app.py - Session state initialization

if 'search_results' not in st.session_state:
    st.session_state.search_results = []
if 'shopping_intelligence' not in st.session_state:
    st.session_state.shopping_intelligence = {}
if 'current_run_id' not in st.session_state:
    st.session_state.current_run_id = None

@st.cache_resource
def get_backend_components():
    """Initialize and cache backend components."""
    api = BrightDataAPI()
    processor = DataProcessor()
    intelligence = ShoppingIntelligenceEngine()
    ai_engine = get_gemini_ai()
    return api, processor, intelligence, ai_engine
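
The main script then unpacks these once; thanks to @st.cache_resource, the heavy objects are only constructed on the first run and reused on every rerun:

api, processor, intelligence, ai_engine = get_backend_components()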

Inline search processing with progress tracking

The search logic is embedded directly in the main app flow with detailed progress tracking and data persistence:

# streamlit_app.py - Search processing (simplified)

# Search execution with progress tracking
if search_clicked and keyword.strip():
    progress_bar = st.progress(0)
    status_text = st.empty()
    start_time = time.time()

    try:
        # Trigger search
        status_text.text("Starting Amazon search...")
        snapshot_id = api.trigger_search(keyword, amazon_url)
        progress_bar.progress(25)

        # Wait for results with smart progress updates
        status_text.text("Amazon is processing your search...")
        results = smart_wait_for_results(api, snapshot_id, progress_bar, status_text)
        progress_bar.progress(75)

        # Process results
        status_text.text("Analyzing products...")
        processed_results = processor.process_raw_data(results)
        shopping_intel = intelligence.analyze_products(processed_results)

        # Store comprehensive results in session state
        st.session_state.search_results = processed_results
        st.session_state.shopping_intelligence = shopping_intel
        st.session_state.current_run_id = str(uuid.uuid4())
        st.session_state.raw_data = results
        st.session_state.search_metadata = {
            'keyword': keyword,
            'country': countries[selected_country],
            'domain': amazon_url,
            'timestamp': datetime.now(timezone.utc).isoformat()
        }

        elapsed_time = time.time() - start_time
        status_text.text(f"Found {len(processed_results)} products in {elapsed_time:.1f}s!")
        progress_bar.progress(100)

    except Exception as e:
        st.error(f"Search failed: {str(e)}")
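
The smart_wait_for_results helper isn't reproduced in the article. A minimal sketch of one way to implement it (assuming the wait_for_results polling method from Step 2; the progress math is invented for illustration):

# streamlit_app.py – illustrative sketch of the progress-aware wait helper
import time

def smart_wait_for_results(api, snapshot_id, progress_bar, status_text,
                           timeout=300, poll_interval=10):
    """Keeps the Streamlit progress bar moving while Bright Data finishes the job."""
    start = time.time()
    while time.time() - start < timeout:
        elapsed = int(time.time() - start)
        # Map elapsed time onto the 25–74% window reserved for the scraping phase
        progress_bar.progress(min(25 + int(elapsed / timeout * 50), 74))
        status_text.text(f"Amazon is processing your search... ({elapsed}s elapsed)")
        try:
            # Short, bounded poll; returns the product list once the snapshot is ready
            return api.wait_for_results(snapshot_id, timeout=poll_interval)
        except Exception:
            continue  # treated as "not ready yet" in this sketch; real code should catch a specific error
    raise TimeoutError(f"No results for snapshot {snapshot_id} after {timeout}s")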

Multiple interactive visualizations

The Market Analysis tab creates various chart types inline, each with specific styling and annotations:

# streamlit_app.py - Price distribution with median line

fig_price = px.histogram(
    x=display_prices,
    nbins=min(20, max(1, unique_prices)),
    title="Price Range",
    labels={'x': f'Price ({currencies.get(current_country_code, "USD")})', 'y': 'Number of Products'},
    color_discrete_sequence=['#667eea']
)

# Add median line for context
fig_price.add_vline(x=q50, line_dash="dash", line_color="orange", annotation_text="Median")
st.plotly_chart(fig_price, use_container_width=True)

# Rating vs Price scatter with size and color encoding
fig_scatter = px.scatter(
    df_scatter,
    x='final_price',
    y='rating',
    size='num_ratings',
    hover_data=['name', 'num_ratings'],
    title="Quality vs Price",
    labels={'final_price': f'Price ({currencies.get(current_country_code, "USD")})', 'rating': 'Rating (Stars)'},
    color='rating',
    color_continuous_scale='Viridis'
)
st.plotly_chart(fig_scatter, use_container_width=True)

# Value score distribution with percentile markers
fig_value = px.histogram(
    x=value_scores,
    nbins=20,
    title="Best Value Products",
    labels={'x': 'Value Score (0.0-1.0)', 'y': 'Number of Products'},
    color_discrete_sequence=['#28a745']
)
p50 = np.percentile(value_scores, 50)
p75 = np.percentile(value_scores, 75)
fig_value.add_vline(x=p50, line_dash="dash", line_color="orange", annotation_text="Median")
fig_value.add_vline(x=p75, line_dash="dash", line_color="red", annotation_text="75th %ile")
st.plotly_chart(fig_value, use_container_width=True)

Advanced chart features

The dashboard includes sophisticated visualizations with business intelligence:

  • Price histograms. With median and quartile markers for market positioning.
  • Rating scatter plots. Size represents review volume, color shows rating quality.
  • Position pie charts. Shows search ranking distribution (1-5, 6-10, 11-20, 21+).
  • Price category bar charts. Segments products into Budget/Value/Premium/Luxury tiers (a sketch follows this list).
  • Discount analysis. Identifies genuine deals vs. inflated pricing.
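
The price-tier segmentation isn't shown in the earlier snippets. A minimal sketch using quartile buckets (the actual app may use fixed price cut-offs instead):

# streamlit_app.py – illustrative price-tier chart; quartile buckets are an assumption
df_tiers = pd.DataFrame(st.session_state.search_results)
df_tiers = df_tiers[df_tiers['final_price'] > 0]

df_tiers['tier'] = pd.qcut(
    df_tiers['final_price'],
    q=4,
    labels=['Budget', 'Value', 'Premium', 'Luxury']
)
tier_counts = df_tiers['tier'].value_counts().reindex(['Budget', 'Value', 'Premium', 'Luxury'])

fig_tiers = px.bar(
    x=tier_counts.index.astype(str),
    y=tier_counts.values,
    title="Products by Price Tier",
    labels={'x': 'Price tier', 'y': 'Number of Products'},
    color_discrete_sequence=['#ff7f0e']
)
st.plotly_chart(fig_tiers, use_container_width=True)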

This comprehensive approach creates a professional analytics dashboard that provides actionable market insights.

Conclusion

You’ve successfully built an Amazon product analyzer that leverages enterprise-grade data collection, advanced AI, and interactive data visualization. The complete source code for this project is available for you to explore and adapt on GitHub.

You’ve seen how to:

  • Use Bright Data’s Web Scraper API to reliably scrape Amazon data at scale.
  • Implement a robust data processing pipeline to handle complex, real-world data challenges.
  • Engineer a hallucination-proof AI assistant with Google Gemini for trustworthy analysis.
  • Build an intuitive and interactive user interface with Streamlit and Plotly.

This project serves as a powerful template for any app that requires turning vast amounts of web data into actionable business intelligence. From here, you could extend it to build a dedicated Amazon price tracker or integrate other data sources.

The world of eCommerce data is vast. If you need pre-collected, ready-to-use datasets, explore Bright Data’s marketplace for a wide range of options.

Satyam Tripathi

Technical Writer

5 years experience

Satyam Tripathi helps SaaS and data startups turn complex tech into actionable content, boosting developer adoption and user understanding.

Expertise: Python, Developer Education, Technical Writing