AutoScraper is a Python library that simplifies web scraping by automatically identifying and extracting data from websites without manual HTML inspection. Unlike traditional scraping tools, AutoScraper learns the structure of the data you want from a few example values, making it a great choice for both beginners and experienced developers. Ideal for tasks like collecting product info, aggregating content, or performing market research, AutoScraper handles static websites efficiently, without any complex setup.
In this article, you’ll learn how to use AutoScraper with Python for web scraping.
Prerequisites
Setting up AutoScraper is easy. You, of course, need to have Python 3 installed locally. As with any other Python web scraping project, you just need to run a few commands to create a project directory and create and activate a virtual environment inside it:
# Set up project directory
mkdir auto-scrape
cd auto-scrape
# Create virtual environment
python -m venv env
# For Mac & Linux users
source env/bin/activate
# For Windows users
env\Scripts\activate
Using a virtual environment simplifies dependency management in the project.
Next, install the autoscraper library by running the following command:
pip install autoscraper
You also need to install pandas to save the scraping results to a CSV file at the end. pandas is a Python library for data analysis and manipulation that makes it easy to process and save scraped results in various formats, such as CSV, XLSX, and JSON. Run the following command to install it:
pip install pandas
Select a Target Website
When scraping public websites, make sure you check the site’s Terms of Service (ToS) or robots.txt file to ensure the site allows scraping. This helps you avoid legal or ethical issues. Additionally, it’s best to select websites that provide data in a structured format, such as tables or lists, as it is easier to extract.
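If you’d like to check programmatically, Python’s built-in urllib.robotparser module can tell you whether a given path is allowed for your client. Here’s a minimal sketch using the sandbox site scraped later in this tutorial:
from urllib import robotparser

# Download and parse the site's robots.txt file
robots = robotparser.RobotFileParser()
robots.set_url("https://www.scrapethissite.com/robots.txt")
robots.read()

# Check whether a generic client ("*") may fetch the page you plan to scrape
print(robots.can_fetch("*", "https://www.scrapethissite.com/pages/simple/"))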
Traditional scraping tools often require analyzing the HTML structure of a web page to locate the target data elements. This can be time-consuming and requires familiarity with tools like browser developer consoles. AutoScraper simplifies this step by automatically learning the structure of the data from example values (known as the wanted_list), eliminating the need for manual inspection.
In this tutorial, you’ll start by scraping data from Scrape This Site’s Countries of the World: A Simple Example page, a beginner-friendly sandbox designed for testing scraping tools. This page has a straightforward structure, which is ideal for demonstrating basic scraping techniques. Once you’ve mastered the basic structure, you’ll move on to the Hockey Teams: Forms, Searching and Pagination page, which features a more complex layout.
Scrape Simple Data with AutoScraper
Now that you’ve identified two pages you want to scrape, it’s time to start scraping!
Since the Countries of the World: A Simple Example page is straightforward, the following script can be used to scrape a list of countries, along with their capital, population, and area:
# 1. Import dependencies
from autoscraper import AutoScraper
import pandas as pd
# 2. Define the URL of the site to be scraped
url = "https://www.scrapethissite.com/pages/simple/"
# 3. Instantiate the AutoScraper
scraper = AutoScraper()
# 4. Define the wanted list by using an example from the web page
# This list should contain some text or values that you want to scrape
wanted_list = ["Andorra", "Andorra la Vella", "84000", "468.0"]
# 5. Build the scraper based on the wanted list and URL
scraper.build(url, wanted_list)
# 6. Get the results for all the elements matched
results = scraper.get_result_similar(url, grouped=True)
# 7. Display the keys and sample data to understand the structure
print("Keys found by the scraper:", results.keys())
# 8. Assign column names in the same order as the wanted_list values
#    (country, capital, population, area)
columns = ["Country Name", "Capital", "Population", "Area (sq km)"]
# 9. Create a DataFrame with the extracted data
data = {columns[i]: results[list(results.keys())[i]] for i in range(len(columns))}
df = pd.DataFrame(data)
# 10. Save the DataFrame to a CSV file
csv_filename = 'countries_data.csv'
df.to_csv(csv_filename, index=False)
print(f"Data has been successfully saved to {csv_filename}")
This code has inline comments to explain what’s happening, but here’s a quick summary: The script starts by importing AutoScraper and pandas. Next, you define the URL of the target website. Then, you create an instance of the scraper.
Now, here’s the interesting part: instead of giving the scraper detailed instructions about where the target data sits on the page (as you would with other scrapers, typically through XPath or CSS selectors), you simply provide an example of the data that you’re looking for. Under the fourth comment, the data points for one of the countries are passed to the scraper as a list (also known as the wanted_list).
Once the wanted_list is ready, you build the scraper using the URL and the wanted_list. The scraper downloads the target page and generates rules that it stores in its stack list. It uses these rules to extract data from any target URL in the future.
In the code under comment six, you use the get_result_similar method of the scraper to extract data from the target URL that is similar to the data in the wanted_list. The next line is a simple print statement that shows the IDs of the rules under which data was found on the target URL. Your output should look like this:
Keys found by the scraper: dict_keys(['rule_4y6n', 'rule_gghn', 'rule_a6r9', 'rule_os29'])
The code under comments eight and nine creates the header schema for your CSV file and formats the extracted data into a pandas DataFrame; note that the column names are listed in the same order as the values in the wanted_list, which is the order the rules came back in for this page. Finally, the code under comment ten saves the data to the CSV.
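If you want the mapping to be more robust, you can bind each column to a specific rule ID instead of relying on the key order. Here’s a minimal sketch using the rule IDs from the sample output above; yours will differ, so print a few values from each rule and confirm the pairing before trusting it:
# Map each column to the rule that holds its values (replace the IDs with your own)
rule_map = {
    "Country Name": "rule_4y6n",
    "Capital": "rule_gghn",
    "Population": "rule_a6r9",
    "Area (sq km)": "rule_os29",
}
data = {column: results[rule_id] for column, rule_id in rule_map.items()}
df = pd.DataFrame(data)
This way, a change in key order won’t silently shuffle your columns.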
Once you run this script (by saving it in a file called script.py and running python script.py on the command line), you’ll notice a new file named countries_data.csv has been created in the project directory with content that looks like this:
Country Name,Capital,Population,Area (sq km)
Andorra,Andorra la Vella,84000,468.0
United Arab Emirates,Abu Dhabi,4975593,82880.0
... (246 rows omitted)
Zambia,Lusaka,13460305,752614.0
Zimbabwe,Harare,11651858,390580.0
That’s it! That’s how simple it is to scrape straightforward websites with AutoScraper.
Process and Extract Data from Websites with a Complex Design
When it comes to slightly more complex websites, like the Hockey Teams: Forms, Searching and Pagination page that contains a table with many similar values, the technique shown earlier can fail. You can try extracting the team name, year, wins, losses, and other fields from this page using the same method to see the issue for yourself.
Thankfully, AutoScraper allows for finer model training by pruning the collected rules during the build step before using the model to extract data. Here’s the code to help you do that:
from autoscraper import AutoScraper
import pandas as pd
# Define the URL of the site to be scraped
url = "https://www.scrapethissite.com/pages/forms/"
def setup_model():
    # Instantiate the AutoScraper
    scraper = AutoScraper()

    # Define the wanted list by using one example row from the web page
    # This list should contain some text or values that you want to scrape
    wanted_list = ["Boston Bruins", "1990", "44", "24", "0.55", "299", "264", "35"]

    # Build the scraper based on the wanted list and URL
    scraper.build(url, wanted_list)

    # Get the results for all the elements matched
    results = scraper.get_result_similar(url, grouped=True)

    # Display the data to understand the structure
    print(results)

    # Save the model
    scraper.save("teams_model.json")


def prune_rules():
    # Create an instance of AutoScraper
    scraper = AutoScraper()

    # Load the model saved earlier
    scraper.load("teams_model.json")

    # Keep only the rules that map cleanly to one table column each
    # (replace these IDs with the ones printed by your own setup_model() run)
    scraper.keep_rules(['rule_hjk5', 'rule_9sty', 'rule_2hml', 'rule_3qvv', 'rule_qmem', 'rule_mhl4', 'rule_h090', 'rule_xg34'])

    # Save the updated model again
    scraper.save("teams_model.json")


def load_and_run_model():
    # Create an instance of AutoScraper
    scraper = AutoScraper()

    # Load the pruned model
    scraper.load("teams_model.json")

    # Get the results for all the elements matched
    results = scraper.get_result_similar(url, grouped=True)

    # Assign columns in the same order as the kept rules
    columns = ["Team Name", "Year", "Wins", "Losses", "Win %", "Goals For (GF)", "Goals Against (GA)", "+/-"]

    # Create a DataFrame with the extracted data
    data = {columns[i]: results[list(results.keys())[i]] for i in range(len(columns))}
    df = pd.DataFrame(data)

    # Save the DataFrame to a CSV file
    csv_filename = 'teams_data.csv'
    df.to_csv(csv_filename, index=False)
    print(f"Data has been successfully saved to {csv_filename}")


# Uncomment these one at a time as you work through the steps below
# setup_model()
# prune_rules()
# load_and_run_model()
This script has three methods defined in it: setup_model, prune_rules, and load_and_run_model. The setup_model method is similar to what you saw earlier. It creates an instance of the scraper, defines a wanted_list, builds the scraper using the wanted_list, scrapes the data from the target URL, prints the keys (i.e., the rule IDs collected during this extraction), and saves the model as is in a file named teams_model.json in the project directory.
To run this, uncomment the # setup_model() line in the previous script, save the complete script in a file (e.g., script.py), and run python script.py. Your output should look like this:
{'rule_hjk5': ['Boston Bruins', 'Buffalo Sabres', 'Calgary Flames', 'Chicago Blackhawks', 'Detroit Red Wings', 'Edmonton Oilers', 'Hartford Whalers', 'Los Angeles Kings', 'Minnesota North Stars', 'Montreal Canadiens', 'New Jersey Devils', 'New York Islanders', 'New York Rangers', 'Philadelphia Flyers', 'Pittsburgh Penguins', 'Quebec Nordiques', 'St. Louis Blues', 'Toronto Maple Leafs', 'Vancouver Canucks', 'Washington Capitals', 'Winnipeg Jets', 'Boston Bruins', 'Buffalo Sabres', 'Calgary Flames', 'Chicago Blackhawks'], 'rule_uuj6': ['Boston Bruins', 'Buffalo Sabres', 'Calgary Flames', 'Chicago Blackhawks', 'Detroit Red Wings', 'Edmonton Oilers', 'Hartford Whalers', 'Los Angeles Kings', 'Minnesota North Stars', 'Montreal Canadiens', 'New Jersey Devils', 'New York Islanders', 'New York Rangers', 'Philadelphia Flyers', 'Pittsburgh Penguins', 'Quebec Nordiques', 'St. Louis Blues', 'Toronto Maple Leafs', 'Vancouver Canucks', 'Washington Capitals', 'Winnipeg Jets', 'Boston Bruins', 'Buffalo Sabres', 'Calgary Flames', 'Chicago Blackhawks'], 'rule_9sty': ['1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1991', '1991', '1991', '1991'], 'rule_9nie': ['1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1991', '1991', '1991', '1991'], 'rule_41rr': ['1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1991', '1991', '1991', '1991'], 'rule_ufil': ['1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1991', '1991', '1991', '1991'], 'rule_ere2': ['1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1991', '1991', '1991', '1991'], 'rule_w0vo': ['1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1991', '1991', '1991', '1991'], 'rule_rba5': ['1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1991', '1991', '1991', '1991'], 'rule_rmae': ['1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1991', '1991', '1991', '1991'], 'rule_ccvi': ['1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1991', '1991', '1991', '1991'], 'rule_3c34': ['1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1991', '1991', '1991', '1991'], 'rule_4j80': ['1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1991', '1991', '1991', '1991'], 'rule_oc36': ['1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', 
'1990', '1990', '1990', '1990', '1990', '1990', '1991', '1991', '1991', '1991'], 'rule_93k1': ['1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1991', '1991', '1991', '1991'], 'rule_d31n': ['1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1991', '1991', '1991', '1991'], 'rule_ghh5': ['1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1991', '1991', '1991', '1991'], 'rule_5rne': ['1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1991', '1991', '1991', '1991'], 'rule_4p78': ['1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1991', '1991', '1991', '1991'], 'rule_qr7s': ['1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1991', '1991', '1991', '1991'], 'rule_60nk': ['1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1991', '1991', '1991', '1991'], 'rule_wcj7': ['1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1991', '1991', '1991', '1991'], 'rule_0x7y': ['1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1990', '1991', '1991', '1991', '1991'], 'rule_2hml': ['44', '31', '46', '49', '34', '37', '31', '46', '27', '39', '32', '25', '36', '33', '41', '16', '47', '23', '28', '37', '26', '36', '31', '31', '36'], 'rule_swtb': ['24'], 'rule_e8x1': ['0.55', '14', '0.575', '0.613', '-25', '0', '-38', '0.575', '-10', '24', '8', '-67', '32', '-15', '0.512', '-118', '0.588', '-77', '-72', '0', '-28', '-5', '-10', '-9', '21'], 'rule_3qvv': ['24', '30', '26', '23', '38', '37', '38', '24', '39', '30', '33', '45', '31', '37', '33', '50', '22', '46', '43', '36', '43', '32', '37', '37', '29'], 'rule_n07w': ['24', '30', '26', '23', '38', '37', '38', '24', '39', '30', '33', '45', '31', '37', '33', '50', '22', '46', '43', '36', '43', '32', '37', '37', '29'], 'rule_qmem': ['0.55', '0.388', '0.575', '0.613', '0.425', '0.463', '0.388', '0.575', '0.338', '0.487', '0.4', '0.312', '0.45', '0.412', '0.512', '0.2', '0.588', '0.287', '0.35', '0.463', '0.325', '0.45', '0.388', '0.388', '0.45'], 'rule_b9gx': ['264', '278', '263', '211', '298', '272', '276', '254', '266', '249', '264', '290', '265', '267', '305', '354', '250', '318', '315', '258', '288', '275', '299', '305', '236'], 'rule_mhl4': ['299', '292', '344', '284', '273', '272', '238', '340', '256', '273', '272', '223', '297', '252', '342', '236', '310', '241', '243', '258', '260', '270', '289', '296', '257'], 'rule_24nt': ['264', '278', '263', '211', '298', '272', '276', '254', '266', '249', '264', '290', '265', '267', '305', '354', '250', '318', '315', '258', '288', '275', '299', '305', '236'], 'rule_h090': ['264', '278', 
'263', '211', '298', '272', '276', '254', '266', '249', '264', '290', '265', '267', '305', '354', '250', '318', '315', '258', '288', '275', '299', '305', '236'], 'rule_xg34': ['35', '14', '81', '73', '-25', '0', '-38', '86', '-10', '24', '8', '-67', '32', '-15', '37', '-118', '60', '-77', '-72', '0', '-28', '-5', '-10', '-9', '21']}
This shows the complete data collected by AutoScraper in its get_result_similar call against the target website. You’ll notice that this data contains many duplicates. That’s because AutoScraper not only collects the data from the target website but also tries to make sense of it by guessing relations between data points and grouping them into rules that it thinks belong together. If it can group the data correctly, you’ll be able to extract data from similar pages very easily, as you did in the previous example.
However, AutoScraper seems to struggle with this website. Since the page contains a lot of numbers, AutoScraper ends up assuming a large number of correlations between them, and you end up with a long list of rules full of duplicate data points.
Now, you need to carefully analyze this data set and pick out the rules that contain the right data (i.e., each rule holds the values of exactly one column, in the right order) for your scraping job.
For this output, the following rules contain the right data (found by checking a few data points manually and ensuring that each picked rule holds exactly twenty-five elements, the number of rows in the table on the target page, all from a single column):
['rule_hjk5', 'rule_9sty', 'rule_2hml', 'rule_3qvv', 'rule_qmem', 'rule_mhl4', 'rule_h090', 'rule_xg34']
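Rather than eyeballing the whole dump, you can narrow the candidates down programmatically by keeping only the rules whose value lists contain exactly one entry per table row and printing a small sample of each. Here’s a minimal sketch, assuming results is the grouped dictionary printed by setup_model:
# Keep only rules that returned exactly one value per table row (25 rows on this page)
candidates = {rule_id: values for rule_id, values in results.items() if len(values) == 25}

# Print the first few values of each candidate so you can match rules to columns by eye
for rule_id, values in candidates.items():
    print(rule_id, values[:3])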
You need to update this list of rules in the prune_rules method. Then, comment out the setup_model() line, uncomment the prune_rules() line in the script, and run it. This time, the script loads the previously created model from the teams_model.json file, removes every rule except the listed ones, and saves the model back to the same file. You can even check the contents of teams_model.json to see which rules are currently stored in it. Once you’ve completed that, your model is ready.
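If you want to confirm which rules survived the pruning from Python, you can open the saved model file directly. This is a rough sketch that assumes the current AutoScraper file format, where rules live under a stack_list key and each carries a stack_id:
import json

# Load the saved model and list the rule IDs it still contains
with open("teams_model.json") as f:
    model = json.load(f)

print([rule.get("stack_id") for rule in model.get("stack_list", [])])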
Now, you can run the load_and_run_model method by commenting out the setup_model() and prune_rules() lines, uncommenting the load_and_run_model() line in the same script, and rerunning it. It extracts the right data and saves it in a file named teams_data.csv in the project directory, printing the following output:
Data has been successfully saved to teams_data.csv
Here’s what the teams_data.csv file looks like after a successful run:
Team Name,Year,Wins,Losses,Win %,Goals For (GF),Goals Against (GA),+/-
Boston Bruins,1990,44,24,0.55,299,264,35
Buffalo Sabres,1990,31,30,0.388,292,278,14
... (21 rows omitted)
Calgary Flames,1991,31,37,0.388,296,305,-9
Chicago Blackhawks,1991,36,29,0.45,257,236,21
You can check out the code developed in this article in this GitHub repo.
Common Challenges with AutoScraper
While AutoScraper works well for simple use cases where your target website contains a relatively small data set with distinct data points, it can be cumbersome to set up for complex use cases, such as the table-heavy page you saw earlier. Additionally, AutoScraper doesn’t support JavaScript rendering, so you need to integrate it with a module like Splash or a full-fledged browser automation library like Selenium or Puppeteer.
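For example, you could render a JavaScript-heavy page with Selenium first and then feed the resulting HTML to AutoScraper, since both build and get_result_similar accept an html argument. The following is a rough sketch rather than part of this tutorial’s scripts; it assumes Selenium and a Chrome driver are installed, and the URL and wanted_list values are placeholders:
from autoscraper import AutoScraper
from selenium import webdriver

url = "https://example.com/some-dynamic-page"  # placeholder URL

# Render the page in a headless browser so JavaScript-generated content is present
options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)
driver.get(url)
html = driver.page_source
driver.quit()

# Build the scraper from the rendered HTML instead of letting it fetch the URL itself
scraper = AutoScraper()
wanted_list = ["Example value visible on the rendered page"]  # placeholder
scraper.build(url=url, html=html, wanted_list=wanted_list)
results = scraper.get_result_similar(url=url, html=html, grouped=True)
print(results)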
If you run into issues like IP blocks or need to customize headers when scraping, AutoScraper lets you pass additional request arguments that are forwarded to its underlying requests calls, like this:
# Build the scraper on an initial URL
scraper.build(
    url,
    wanted_list=wanted_list,
    request_args=dict(proxies=proxies)  # this is where you can pass in proxies or custom headers
)
For example, here’s how you can set a custom user agent and a proxy for scraping with AutoScraper:
request_args = {
    "headers": {
        # You can customize this value with your desired user agent; this is the default used by AutoScraper
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36"
    },
    "proxies": {
        # Example proxy showing how to pass the proxy username, password, host, and port
        "http": "http://username:password@proxyhost:3128/"
    }
}

# Build the scraper on an initial URL
scraper.build(
    url,
    wanted_list=wanted_list,
    request_args=request_args
)
However, if you want to avoid being blocked repeatedly, you need a good proxy that is optimized for web scraping. For this, consider using Bright Data residential proxies, which are spread over 72 million residential IP addresses across 195 countries.
The AutoScraper library uses the Python requests library internally to send requests to the target website, and it does not inherently support rate limiting. To handle rate-limiting restrictions from websites, you need to set up a throttling function manually or use a prebuilt solution like the ratelimit library.
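As a simple manual approach, you can pause between requests when running a saved model across several pages. The sketch below assumes you already have the pruned teams_model.json from earlier and uses hypothetical paginated URLs; tune the delay to whatever the site tolerates:
import time
from autoscraper import AutoScraper

scraper = AutoScraper()
scraper.load("teams_model.json")

# Hypothetical list of paginated URLs to scrape with the same model
urls = [
    "https://www.scrapethissite.com/pages/forms/?page_num=1",
    "https://www.scrapethissite.com/pages/forms/?page_num=2",
]

all_results = []
for page_url in urls:
    all_results.append(scraper.get_result_similar(page_url, grouped=True))
    time.sleep(2)  # wait a couple of seconds between requests to stay polite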
Since AutoScraper only works with static websites, it cannot handle CAPTCHA-protected sites at all. In such cases, it makes sense to use a more complete solution like the Bright Data Web Scraping API, which provides you with structured data from sites like LinkedIn, Amazon, and Zillow.
Conclusion
In this article, you learned what AutoScraper is all about and how to use it to extract data from both simple and more complex websites. As you saw toward the end, AutoScraper relies on a simple requests call to access target websites, which means it often struggles with dynamic websites and those protected by challenges like CAPTCHA. Additionally, you need to use proxies when web scraping, as most websites can identify clients with abnormally high traffic. In such cases, Bright Data can help.
Bright Data is a leading vendor for proxy networks, AI-powered web scrapers, and business-ready datasets. Sign up now and start exploring Bright Data’s products, including a free trial!
No credit card required