In this guide, you will learn the following:
- What Dify is and why you should use it.
- Why you should integrate it with an all-in-one scraping plugin.
- Benefits of integrating Dify with the Bright Data scraping plugin.
- A step-by-step tutorial to create a Dify scraping workflow.
Let’s dive in!
Dify: The Power of Low-Code AI Development
Dify is an open-source LLM app development platform. It works as an LLM-ops solution that simplifies the creation of AI-powered applications.
More specifically, it helps developers build and launch ready-to-use agentic AI applications by providing:
- Visual workflow builder: Design multi-step AI processes using a drag-and-drop interface. You can chain together different models, tools, and logic without getting bogged down in boilerplate code.
- Model agnosticism: Integrate with a wide range of LLMs, from proprietary models like OpenAI’s GPT series to various open-source alternatives. This gives you the flexibility to choose the best one for your use case.
- Backend-as-a-service (BaaS): Handle the complexities of hosting, scaling, and managing the backend infrastructure. This allows you to focus on leveraging AI’s capabilities instead of managing the underlying infrastructure.
- Extensibility: Easily extend functionality through plugins and custom tools from third-party providers. This makes Dify adaptable to a wide range of use cases.
The Need for a Dedicated Scraping Plugin in Dify
Large-scale web scraping presents many challenges. Websites use anti-bot measures that can easily block simple data retrieval attempts. As a result, building and maintaining a system to overcome these hurdles is complex and resource-intensive.
This is precisely where the Bright Data Dify plugin comes into play. The plugin handles all the underlying complexities, from proxy rotation and IP management to solving CAPTCHAs and parsing data. In other words, it ensures your Dify agent receives consistent, high-quality web data.
In detail, the Bright Data plugin provides these tools:
- Structured data feeds: To get structured, organized data from over 50 platforms, such as e-commerce product pages or real estate listings.
- Scrape as markdown: Strip away ads, navigation bars, and other non-essential elements from any page, delivering a clean, Markdown-formatted version of its text.
- Search engine tool: Perform queries directly on search engines like Google, Bing, Yandex, and many others. You can use it to monitor search rankings for specific keywords, discover competitor content, or power SERP-based RAG workflows.
Benefits of Integrating Dify with the Bright Data Plugin
When you connect Dify’s AI orchestration capabilities with Bright Data’s scraping capabilities, you unlock the following benefits:
- Access to real-time data: Instead of relying on outdated data, your AI agent can query the live web for up-to-the-minute information. This guarantees that your AI applications operate with the most current data available.
- Automate complex research and analysis: By feeding data directly into an LLM within a Dify workflow, you can automate tasks that would otherwise require hours of manual work. For example, you could build a RAG workflow to monitor a list of competitor products on an e-commerce site.
- Simplify technical complexity: Web scraping is not easy, as sites employ sophisticated anti-scraping blocking techniques. The Bright Data plugin bypasses those blocks for you, while Dify provides a simple interface to harness that power.
- Versatility for diverse use cases: The plugin equips you with multiple tools, including getting structured data, scraping any page to clean markdown, and performing search engine queries. That makes the Dify + Bright Data integration adaptable for several use cases.
Integrating Dify with Bright Data for Product Summarization: Step-by-Step Tutorial
Time to go through a step-by-step tutorial to learn how to use the integration between Dify and Bright Data.
The goal of the workflow you will create is to take an Amazon product URL as input and receive a summary of that product. The product used here is an Apple AirTag:
To achieve the AI scraping objective, you will build a four-stage workflow by connecting different nodes. Each node has a specific job:
- A “Start” node to define the input variable, which is the URL of the Amazon product page.
- A “Structured Data Feeds” node to take that URL and scrape its content, extracting all the structured data from the Amazon page.
- An “LLM” node to process the scraped data. You will instruct it with a specific prompt to generate the product summary.
- An “End” node to present the summarized text generated by the LLM.
This entire four-step AI scraping process is completely visual. You will connect these nodes in a simple flow, and you will not have to write a single line of code.
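Although the workflow is entirely visual, it can help to picture the four nodes as a plain data pipeline. The sketch below is purely illustrative: the function structure and the sample data are assumptions made for clarity, not code that Dify generates or that you need to write.

```python
# Illustrative sketch only: in Dify, each stage below is a visual node,
# so none of this code has to be written by hand.

def run_workflow(product_url: str) -> str:
    # "Start" node: defines and validates the input variable.
    if not product_url.startswith("http"):
        raise ValueError("product_url must be a valid URL")

    # "Structured Data Feeds" node: scrapes the page into structured
    # data (hypothetical fields; the real plugin defines the schema).
    product_data = {
        "title": "Apple AirTag",
        "rating": 4.7,
        "features": ["Precision Finding", "Find My network"],
    }

    # "LLM" node: summarizes the structured data.
    summary = (
        f"{product_data['title']} (rated {product_data['rating']}/5): "
        + ", ".join(product_data["features"])
    )

    # "End" node: exposes the LLM output as the workflow result.
    return summary

print(run_workflow("https://www.amazon.com/dp/B0933BVK6T"))
```

In Dify, the arrows between nodes play the role of these function-to-function handoffs: each node consumes the variables exposed by the previous one.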
Follow the instructions to build your Bright Data-powered no-code web scraping AI workflow in Dify!
Requirements
To reproduce this tutorial on how to integrate Dify with Bright Data, you need:
- A Dify account (a free account is sufficient).
- A Bright Data API key.
If you do not have these yet, use the links above and follow the instructions to get everything set up.
Prerequisites
In order to use the LLM node, you first need to set up the LLM integration in Dify. To do so, click on your profile image and select the “Settings” option:
You will be redirected to the page that allows you to select a model (the “Model Provider” tab). For example, you can install the OpenAI provider plugin:
Very good! You are now ready to start your Dify web scraping workflow.
Step #1: Download the Bright Data Plugin and Integrate It
Download the latest Bright Data plugin package from the official Dify repository. Then, press “PLUGINS” and select the “Install from Local Package File” option:
Select the local file you downloaded earlier, and click the “Install” button:
Good! Bright Data’s integration package is now loaded and installed on Dify.
Step #2: Create a New Dify Application
From the Dify workspace homepage, create a new application from scratch by selecting “Create from Blank” as shown below:
Next, choose the “Workflow” type and click “Create”:
Below is what the new, blank workflow will look like:
Terrific! You have just created a new Dify workflow. Time to add the required nodes for web scraping.
Step #3: Configure Nodes for Web Scraping
Now, you can add the nodes to your workflow and set the needed parameters for the Dify web scraping workflow via Bright Data.
Begin by clicking on the “Start” node, then on “INPUT FIELD”:
Select “Paragraph” as the type, and fill in the “Variable Name” field with a name such as product_url. Change the “Max length” value to at least 200. This variable represents the URL of the target page to scrape, and you will need to pass it as an input to launch the workflow.
Confirm by clicking the “Save” button:
Perfect! The “Start” node is correctly set up.
Continue by clicking the “+” in the “Start” node. Select “Tools” > “Bright Data Web Scraper” > “Structured Data Feeds”:
The Bright Data node acts as the bridge that connects your Dify workflow to the [Bright Data AI infrastructure](/ai). It gives your AI scraping agent the ability to scrape the information it needs from the web.
By selecting the “Structured Data Feeds” tool, you will turn a messy Amazon product page into a structured JSON output with predictable data fields.
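To make that concrete, here is a hypothetical example of the kind of JSON such a feed could return for a product page. The field names below are illustrative assumptions; the actual schema is defined by the Bright Data feed you use.

```python
import json

# Hypothetical structured payload for an Amazon product page.
# Field names are illustrative, not the plugin's actual schema.
raw = """{
  "title": "Apple AirTag",
  "price": "29.00",
  "currency": "USD",
  "rating": 4.7,
  "reviews_count": 12034
}"""

product = json.loads(raw)

# Predictable fields make downstream steps (like an LLM prompt) reliable.
print(f"{product['title']}: {product['price']} {product['currency']}, "
      f"rated {product['rating']}/5 by {product['reviews_count']} reviewers")
```

This predictability is exactly why structured feeds beat raw HTML for AI workflows: the LLM receives clean key-value data instead of markup noise.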
Now, click on “Authorize” to enter your Bright Data API token:
Select product_url as the input variable. That way, the “Start” node will pass the actual value of the product URL as the input of the Bright Data node.
To do so, type “/” in the “Target URL” field, and it will show you a list of available variables. Also, add a description in the “Data Request Description” field:
Very well! The Bright Data node is set up. You can move to the next node.
Click on the “+” and add an LLM node:
In the “MODEL” section, select “Configure model” and select an LLM model from the list:
In the “SYSTEM” section, add a prompt, such as:
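For instance, a prompt along these lines works well (the wording is just an illustration, and the scraped content from the Bright Data node gets appended at the end, as explained below):

```
You are an expert e-commerce analyst. Given the scraped product data
below, write a concise product summary that includes:
- A one-sentence summary of the product.
- Its 5 key features.
- Its rating.
- A concluding sentence explaining who this product is for.

Product data:
```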
This prompt tells the LLM to act as an e-commerce analyst with the goal of creating a summary of the scraped product. It also asks for specific details to include, like the name of the product and some key features. Note that it includes the text result of the Bright Data plugin node in the end.
This is what the filled-out section will look like:
Under the “Data” section of the prompt, add text as the input variable. This will allow the LLM to use the content that the Bright Data node retrieved from the target URL. If you type “/”, you will get the list of available variables to select from.
Good! You can now add the last node to the workflow.
The output of the workflow can be obtained by adding an “End” node:
The output variable must be the string coming from the LLM node. To set it, click on the “OUTPUT VARIABLE” section and select “text” under “LLM”:
Amazing! Your workflow is correctly set up. You are now ready to run it.
Step #4: Run the Workflow
Below is the web scraping workflow in Dify via the Bright Data plugin:
As you can see, it consists of only four nodes, just as anticipated in the introduction to this section. Also, you did not have to write a single line of code to achieve the goal!
To run the workflow, click “Run”. At this point, enter the URL of the Amazon product in the “product_url” field. Then, click “Start Run” to launch the Dify web scraping workflow:
The result will be available in the “Result” tab:
Below is the result as text:
As instructed, the LLM returned exactly what the prompt asked for:
- A one-sentence summary of the product.
- 5 key features.
- The rating.
- A concluding sentence explaining who this product is for.
If you have ever tried to scrape major e-commerce sites like Amazon, you know how hard it is:
This is where the Bright Data integration makes all the difference. It handled all the complex anti-scraping measures behind the scenes, making sure that the data retrieval process works as expected.
Et voilà! You have successfully completed your first project integrating Dify with Bright Data.
Conclusion
In this article, you learned how to use Dify to build a no-code AI scraping workflow. This would not have been possible without the Bright Data Dify plugin. As shown here, that plugin exposes several advanced tools for web scraping within AI workflows.
Now, one of the main challenges in building a reliable scraping workflow for your AI agents is having access to high-quality web data. This requires tools for retrieving, validating, and transforming web content, which is exactly what Bright Data’s AI infrastructure is built to deliver.
Create a free Bright Data account and start experimenting with our AI-ready data tools today!