find() and find_all() are essential methods for web scraping with BeautifulSoup, helping you extract data from HTML. The find() method retrieves the first element matching your criteria, such as find("div") to get the first div on a page, returning None if no match is found. Meanwhile, find_all() finds all matching elements and returns them as a list, making it perfect for extracting multiple elements like all div tags. Before starting your web scraping journey with BeautifulSoup, ensure you have both Requests and BeautifulSoup installed.
Install dependencies
pip install requests
pip install beautifulsoup4
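Once both packages are installed, the difference in return values described above can be checked without any network access. The snippet below is only a minimal sketch: the inline HTML string is made up for illustration and is not taken from either practice site.

```python
from bs4 import BeautifulSoup

# Hypothetical inline markup so the example runs offline.
html = "<div>first</div><div>second</div>"
soup = BeautifulSoup(html, "html.parser")

first_div = soup.find("div")        # first match only
all_divs = soup.find_all("div")     # every match, in document order
missing = soup.find("table")        # there is no <table> in the markup
no_tables = soup.find_all("table")  # empty result rather than None

print(first_div.text)   # first
print(len(all_divs))    # 2
print(missing)          # None
print(len(no_tables))   # 0

# Check for None before touching attributes such as .text
if missing is not None:
    print(missing.text)
```

Because find() can return None, checking the result before reading .text avoids an AttributeError on pages where the element is absent.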
find()
Let’s get acquainted with find(). In the examples below, we’ll use Quotes To Scrape and the Fake Store API for finding elements on the page. Both of these sites were built for scraping. They don’t change much, so they’re perfect for learning.
Find by Class
To find an element using its class, we use the class_ keyword argument. You might wonder why class_ and not class? In Python, class is a reserved keyword used to define classes. The underscore in class_ prevents this keyword from causing conflicts with our code.
The example below finds the first div with the class quote.
import requests
from bs4 import BeautifulSoup
response = requests.get("https://quotes.toscrape.com")
soup = BeautifulSoup(response.text, "html.parser")
first_quote = soup.find("div", class_="quote")
print(first_quote.text)
Here is our output.
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
by Albert Einstein
(about)
Tags:
change
deep-thoughts
thinking
world
Find by ID
When scraping, you’ll also commonly need to look for elements using their id. In the example below, we use the id argument to find the menu on the page.
import requests
from bs4 import BeautifulSoup
response = requests.get("https://fakestoreapi.com")
soup = BeautifulSoup(response.text, "html.parser")
ul = soup.find("ul", id="menu")
print(ul.text)
Here is the menu once we’ve extracted it and printed it to the terminal.
Home
Docs
GitHub
Buy me a coffee
Find by Text
We can also search for items using their text. To do this, we use the string argument. The example below finds the Login button on the page.
import requests
from bs4 import BeautifulSoup
response = requests.get("https://quotes.toscrape.com")
soup = BeautifulSoup(response.text, "html.parser")
login_button = soup.find("a", string="Login")
print(login_button.text)
As you can see, Login is printed to the console.
Login
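One detail worth knowing: a plain string passed to the string argument has to equal the element's text exactly. For partial matches, the string argument also accepts a function. The sketch below illustrates this with made-up inline HTML; the helper name mentions_login is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two links whose text starts with "Login".
html = '<a href="/login">Login</a><a href="/help">Login help</a>'
soup = BeautifulSoup(html, "html.parser")

# A plain string must equal the tag's text exactly.
exact = soup.find_all("a", string="Login")

# A function receives the tag's string and returns True to keep the tag.
def mentions_login(text):
    return text is not None and "Login" in text

partial = soup.find_all("a", string=mentions_login)

print([a["href"] for a in exact])    # ['/login']
print([a["href"] for a in partial])  # ['/login', '/help']
```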
Find by Attribute
We can also use different attributes for more precise searching. This time, we once again find the first quote from the page. However, we look for a span with an itemprop of text. This once again finds our first quote, but without the extra content, like the author and tags.
import requests
from bs4 import BeautifulSoup
response = requests.get("https://quotes.toscrape.com")
soup = BeautifulSoup(response.text, "html.parser")
first_clean_quote = soup.find("span", attrs={"itemprop": "text"})
print(first_clean_quote.text)
Here’s the clean version of our first quote.
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
Find Using Multiple Criteria
As you may have noticed earlier, the attrs argument takes a dict instead of a single value. This allows us to pass in multiple criteria for even better filtering. Here, we find the first author on the page using the class and itemprop attributes.
import requests
from bs4 import BeautifulSoup
response = requests.get("https://quotes.toscrape.com")
soup = BeautifulSoup(response.text, "html.parser")
first_author = soup.find("small", attrs={"class": "author", "itemprop": "author"})
print(first_author.text)
When you run this, you should get Albert Einstein as output.
Albert Einstein
find_all()
Now, let’s go through these same examples using find_all(). Once again, we’ll use Quotes to Scrape and the Fake Store API. These examples are almost identical, but with one major difference. find() returns a single element. find_all() returns a list of page elements.
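If you only need the first few matches, find_all() also accepts a limit argument that stops the search early. This is a small sketch using made-up inline HTML rather than either practice site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with three matching elements.
html = """
<ul>
  <li class="item">one</li>
  <li class="item">two</li>
  <li class="item">three</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

every_item = soup.find_all("li", class_="item")
first_two = soup.find_all("li", class_="item", limit=2)

print([li.text for li in every_item])  # ['one', 'two', 'three']
print([li.text for li in first_two])   # ['one', 'two']
```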
Find by Class
To find elements using their class, we use the class_ keyword argument. The code below uses find_all() to extract each quote using its CSS class.
import requests
from bs4 import BeautifulSoup
response = requests.get("https://quotes.toscrape.com")
soup = BeautifulSoup(response.text, "html.parser")
quotes = soup.find_all("div", class_="quote")
for quote in quotes:
    print("-------------")
    print(quote.text)
When we extract and print the first page of quotes, it looks like this.
-------------
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
by Albert Einstein
(about)
Tags:
change
deep-thoughts
thinking
world
-------------
“It is our choices, Harry, that show what we truly are, far more than our abilities.”
by J.K. Rowling
(about)
Tags:
abilities
choices
-------------
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
by Albert Einstein
(about)
Tags:
inspirational
life
live
miracle
miracles
-------------
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
by Jane Austen
(about)
Tags:
aliteracy
books
classic
humor
-------------
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
by Marilyn Monroe
(about)
Tags:
be-yourself
inspirational
-------------
“Try not to become a man of success. Rather become a man of value.”
by Albert Einstein
(about)
Tags:
adulthood
success
value
-------------
“It is better to be hated for what you are than to be loved for what you are not.”
by André Gide
(about)
Tags:
life
love
-------------
“I have not failed. I've just found 10,000 ways that won't work.”
by Thomas A. Edison
(about)
Tags:
edison
failure
inspirational
paraphrased
-------------
“A woman is like a tea bag; you never know how strong it is until it's in hot water.”
by Eleanor Roosevelt
(about)
Tags:
misattributed-eleanor-roosevelt
-------------
“A day without sunshine is like, you know, night.”
by Steve Martin
(about)
Tags:
humor
obvious
simile
Find by ID
As we saw with find(), id is another common way to locate data on the page. To extract an element by its id, we use the id argument, just like we did earlier.
Here, we find all of the ul elements with an id of menu. There’s only one menu, so we’ll actually only find one.
import requests
from bs4 import BeautifulSoup
response = requests.get("https://fakestoreapi.com")
soup = BeautifulSoup(response.text, "html.parser")
uls = soup.find_all("ul", id="menu")
for ul in uls:
    print("-------------")
    print(ul.text)
Since there is only one menu on the page, our output is exactly the same as it was when using find().
-------------
Home
Docs
GitHub
Buy me a coffee
Find by Text
Now, we’ll extract items from a page using their text. We’ll use the string argument. In the example below, we find all a elements whose text is Login. Once again, there’s only one.
import requests
from bs4 import BeautifulSoup
response = requests.get("https://quotes.toscrape.com")
soup = BeautifulSoup(response.text, "html.parser")
login_buttons = soup.find_all("a", string="Login")
for button in login_buttons:
    print("-------------")
    print(button)
Your output should look like this.
-------------
<a href="/login">Login</a>
Find by Attribute
When you move on to scraping in the wild, you’ll often need to use other attributes to extract items from the page. Remember how messy the output from the first example was? In this next snippet, we’ll use the itemprop attribute and only extract the quotes this time.
import requests
from bs4 import BeautifulSoup
response = requests.get("https://quotes.toscrape.com")
soup = BeautifulSoup(response.text, "html.parser")
clean_quotes = soup.find_all("span", attrs={"itemprop": "text"})
for quote in clean_quotes:
    print("-------------")
    print(quote.text)
Look how clean our output is!
-------------
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
-------------
“It is our choices, Harry, that show what we truly are, far more than our abilities.”
-------------
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
-------------
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
-------------
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
-------------
“Try not to become a man of success. Rather become a man of value.”
-------------
“It is better to be hated for what you are than to be loved for what you are not.”
-------------
“I have not failed. I've just found 10,000 ways that won't work.”
-------------
“A woman is like a tea bag; you never know how strong it is until it's in hot water.”
-------------
“A day without sunshine is like, you know, night.”
Find Using Multiple Criteria
This time, we’ll use the attrs argument in a more complex way. Here, we find all small elements that have a class of author and an itemprop of author. We do this by passing both attributes into our attrs dictionary.
import requests
from bs4 import BeautifulSoup
response = requests.get("https://quotes.toscrape.com")
soup = BeautifulSoup(response.text, "html.parser")
authors = soup.find_all("small", attrs={"class": "author", "itemprop": "author"})
for author in authors:
    print("-------------")
    print(author.text)
Here’s our list of authors in the console.
-------------
Albert Einstein
-------------
J.K. Rowling
-------------
Albert Einstein
-------------
Jane Austen
-------------
Marilyn Monroe
-------------
Albert Einstein
-------------
André Gide
-------------
Thomas A. Edison
-------------
Eleanor Roosevelt
-------------
Steve Martin
Advanced Techniques
Here are some more advanced techniques. In the examples below, we use find_all(), but these techniques work equally well with find(). Just remember to ask yourself: do you want a single element, or a list of them?
Regex
Regex is a very powerful tool for string matching. In this code example, we combine it with the string argument to find all strings containing einstein, regardless of their capitalization.
import requests
import re
from bs4 import BeautifulSoup
response = requests.get("https://quotes.toscrape.com")
soup = BeautifulSoup(response.text, "html.parser")
pattern = re.compile(r"einstein", re.IGNORECASE)
tags = soup.find_all(string=pattern)
print(f"Total Einstein quotes: {len(tags)}")
The pattern matches three strings on the page, one author name for each Einstein quote.
Total Einstein quotes: 3
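It helps to know what find_all(string=pattern) actually returns: the matching text nodes, not the tags around them. The sketch below shows this on made-up inline HTML that loosely imitates a quote card; the markup is an assumption for illustration, not a copy of the real page.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup loosely modelled on a quote card.
html = """
<div class="quote">
  <span class="text">Quote A</span>
  <span>by <small class="author">Albert Einstein</small></span>
</div>
<div class="quote">
  <span class="text">Quote B</span>
  <span>by <small class="author">Jane Austen</small></span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
pattern = re.compile(r"einstein", re.IGNORECASE)

# With only string=..., the results are text nodes (NavigableString objects).
text_nodes = soup.find_all(string=pattern)
# Adding a tag name returns the enclosing tags instead.
author_tags = soup.find_all("small", string=pattern)

for node in text_nodes:
    print(str(node), "is inside", node.parent.name)
print([tag.name for tag in author_tags])
```

Each text node exposes its enclosing tag through .parent, which is what makes it possible to walk from a matched string back up to the surrounding element.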
Custom Functions
Now, let’s write a custom function to return all actual quotes from Einstein. In the example below, we expand on the regex. We use the parent attribute to climb from each matched string up to the card containing the quote. Next, we find all the spans. The first span on the card contains the actual quote. We print its contents to the console.
import requests
import re
from bs4 import BeautifulSoup
def find_einstein_quotes(http_response):
    soup = BeautifulSoup(http_response.text, "html.parser")
    # find all strings that mention einstein
    pattern = re.compile(r"einstein", re.IGNORECASE)
    tags = soup.find_all(string=pattern)
    for tag in tags:
        # follow the parents until we have the quote card
        full_card = tag.parent.parent.parent
        # find the spans
        spans = full_card.find_all("span")
        # print the first span, it contains the actual quote
        print(spans[0].text)
if __name__ == "__main__":
    response = requests.get("https://quotes.toscrape.com")
    find_einstein_quotes(response)
Here is our output.
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
“Try not to become a man of success. Rather become a man of value.”
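BeautifulSoup also lets you pass a function directly to find_all() as the filter: it is called once per tag and keeps the tags for which it returns True. The sketch below is one possible alternative to chaining parent lookups. The inline HTML and the helper name is_einstein_card are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup loosely modelled on three quote cards.
html = """
<div class="quote">
  <span class="text">Quote A</span>
  <span>by <small class="author">Albert Einstein</small></span>
</div>
<div class="quote">
  <span class="text">Quote B</span>
  <span>by <small class="author">Jane Austen</small></span>
</div>
<div class="quote">
  <span class="text">Quote C</span>
  <span>by <small class="author">Albert Einstein</small></span>
</div>
"""

def is_einstein_card(tag):
    """Return True for a div.quote whose author mentions 'einstein'."""
    if tag.name != "div" or "quote" not in tag.get("class", []):
        return False
    author = tag.find("small", class_="author")
    return author is not None and "einstein" in author.text.lower()

soup = BeautifulSoup(html, "html.parser")
cards = soup.find_all(is_einstein_card)
quotes = [card.find("span", class_="text").text for card in cards]
print(quotes)  # ['Quote A', 'Quote C']
```

Filtering on the card itself avoids depending on the exact number of parent hops between the author name and the card.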
Bonus: Find Using CSS Selectors
BeautifulSoup’s select() method works almost exactly like find_all(), but it’s a bit more flexible. This method takes in a CSS selector. If you can write a selector for an element, you can find it. In this code, we find all of our authors using multiple attributes again. However, we can pass these in as a single selector.
import requests
from bs4 import BeautifulSoup
response = requests.get("https://quotes.toscrape.com")
soup = BeautifulSoup(response.text, "html.parser")
authors = soup.select("small[class='author'][itemprop='author']")
for author in authors:
    print("-------------")
    print(author.text)
Here is our output.
-------------
Albert Einstein
-------------
J.K. Rowling
-------------
Albert Einstein
-------------
Jane Austen
-------------
Marilyn Monroe
-------------
Albert Einstein
-------------
André Gide
-------------
Thomas A. Edison
-------------
Eleanor Roosevelt
-------------
Steve Martin
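Just as find() pairs with find_all(), select() has a single-element counterpart, select_one(), which returns the first match or None. Here is a short sketch on made-up inline HTML; the small.editor selector is deliberately one that matches nothing.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two author elements.
html = """
<div class="quote"><small class="author" itemprop="author">Albert Einstein</small></div>
<div class="quote"><small class="author" itemprop="author">Jane Austen</small></div>
"""
soup = BeautifulSoup(html, "html.parser")

# First match only, like find()
first_author = soup.select_one("small.author[itemprop='author']")
# Every match, like find_all(); the selector also expresses nesting
all_authors = soup.select("div.quote small.author")
# No match: select_one() returns None
missing = soup.select_one("small.editor")

print(first_author.text)              # Albert Einstein
print([a.text for a in all_authors])  # ['Albert Einstein', 'Jane Austen']
print(missing)                        # None
```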
Conclusion
Now you know just about every aspect of find() and find_all() in BeautifulSoup. You don’t need to master all of these methods. The large variety of find methods allows you to choose what you’re comfortable with. Most importantly, you can use them to extract data from any web page. In production, especially for fast and reliable results with a high success rate, you might want to consider our Residential Proxies or even our Scraping Browser, which has a built-in proxy management system and CAPTCHA solving capabilities.