At face value, Shopify stores represent one of the most difficult challenges in data extraction. The product below represents a typical Shopify listing. The data is just about as nested as it gets.
It’s not impossible to extract data from the HTML above, but there’s an easier way.
Shopify Landing Pages
At https://hiutdenim.co.uk/, their landing page contains some product information, but it’s relatively limited. Scroll down far enough, and you’ll reach it.
At first glance, it seems like you’ll need to scrape every link to every section, then fetch and parse each of those pages. Because every Shopify store has its own page layout, the usual eCommerce scraping approach of reusing one set of HTML selectors doesn’t carry over well. However, there’s another way.
Shopify JSON Pages
You read that headline correctly. We can get all of the store’s products as a JSON object by default. We don’t even need BeautifulSoup or Selenium.
We just need to append /products.json to our URL. By default, Shopify storefronts serve their catalog from a products.json endpoint.
If we can request this content (which we can), we can get all the data we could possibly want. Once we’ve got it, we just need to decide which data we want to keep. You can verify this for the site we’ve been using by opening https://hiutdenim.co.uk/products.json.
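As a small illustration of the URL convention described above, the sketch below derives the products.json address from any page URL on a store. The helper name products_json_url is hypothetical, and it assumes the input URL includes its scheme (https://).

```python
from urllib.parse import urljoin


def products_json_url(store_url: str) -> str:
    """Return the root-level products.json URL for a store page URL.

    urljoin with an absolute path ("/products.json") discards whatever
    path the input had, so a collection or product URL works as well.
    Assumption: store_url includes the scheme, e.g. "https://".
    """
    return urljoin(store_url, "/products.json")
```

Calling products_json_url("https://hiutdenim.co.uk/") gives a URL you can paste into a browser to inspect the raw JSON.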
Scraping Shopify In Python
Now that we know what we’re looking for, this daunting task becomes far less difficult. Because we’re only dealing with JSON data, we have one dependency we need to install, Python Requests.
Individual Functions
Let’s take a look at the individual code pieces. We’ve got three separate chunks that make up the scraper.
Here’s our most important function. It actually performs the scraping logic.
- First, we append products.json to our url: json_url = f"{url}products.json".
- We initialize an empty array, items. As we scrape our items, we’re going to append them to this array. Once the scrape is finished, we return the array of parsed items.
- As long as we receive a good response, we retrieve the "products" key to get all of our products.
- We pull various pieces of data from each product to create a dict, product_data.
- product_data gets appended to the array.
- This process repeats until we’ve parsed all the products from the page.
We now have a function that performs our scrape and returns an array of products. Now, we need one that takes this array of products and writes it to a file. We could use CSV here, but this structure gets pretty nested, so we’ll use JSON. It supports more flexible data structures for later use and analysis.
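A writer along these lines would do the job. This is a sketch only: the function name save_products and the default filename are assumptions, and it uses nothing beyond Python’s standard json module.

```python
import json


def save_products(products: list, filename: str = "products.json") -> None:
    """Write the list of product dicts to a JSON file."""
    with open(filename, "w", encoding="utf-8") as file:
        # ensure_ascii=False keeps non-ASCII product names readable;
        # indent=4 makes the nested variants and images easy to inspect.
        json.dump(products, file, ensure_ascii=False, indent=4)
```

Because JSON preserves nesting, the variants and images lists are written exactly as they were scraped.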
That’s the actual code that we’re going to use. Now, we create a main block to run our scraper.
Putting Everything Together
When we put it all together, our scraper looks like this. What seemed like an intricate parsing project is now a fully functional scraper that only takes up about 50 lines of code.
The Return Data
Our data gets returned in an array of JSON objects. Each product holds a list of variants and images. These would be pretty difficult to accurately represent in CSV. The snippet you see below is one single product from our scrape.
Advanced Techniques
The world isn’t perfect, and it’s possible that you’ll run into difficulty with the scraper above. You might need to scrape multiple pages, or your scraper might sometimes get blocked.
Pagination
When you’re scraping larger stores, you’ll often run into paginated results. To handle pagination, we first want to request the maximum number of results per page. Then, we can add the following query param to control our result pages: page=<PAGE_NUMBER>.
We can slightly modify our scraping function to take a page in the URL and the page number.
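One way to make that change is sketched below. The function name and the session parameter are hypothetical, and the limit=250 default is an assumption about the largest page size the endpoint accepts, so lower it if a store rejects the request.

```python
def scrape_shopify_page(url: str, page: int = 1, limit: int = 250,
                        session=None, timeout: float = 10.0) -> list:
    """Return the raw products from one page of <url>products.json.

    Assumptions: url ends with a trailing slash, and limit=250 is the
    maximum page size the endpoint allows.
    """
    if session is None:
        # Imported here so a caller can inject its own session object.
        import requests
        session = requests.Session()

    response = session.get(
        f"{url}products.json",
        params={"page": page, "limit": limit},  # sent as ?page=...&limit=...
        timeout=timeout,
    )
    if response.status_code != 200:
        print(f"Page {page} failed: HTTP {response.status_code}")
        return []

    return response.json().get("products", [])
```

An empty list signals either a failed request or a page past the end of the catalog, which gives the caller a simple stopping condition.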
Then, we can adjust our main to reflect these changes.
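The adjusted main then needs a loop over page numbers. The sketch below shows one possible shape for that loop: scrape_all_pages and its max_pages safety cap are hypothetical, and fetch_page stands for whatever function you use to fetch a single page of products.

```python
from typing import Callable


def scrape_all_pages(fetch_page: Callable[[int], list], max_pages: int = 100) -> list:
    """Call fetch_page(1), fetch_page(2), ... until a page comes back empty.

    max_pages is a safety cap (an assumption) so a misbehaving endpoint
    cannot keep the loop running forever.
    """
    all_items = []
    for page in range(1, max_pages + 1):
        items = fetch_page(page)
        if not items:
            # An empty page means we have walked past the last product.
            break
        all_items.extend(items)
    return all_items
```

Passing the page fetcher in as an argument keeps the loop independent of how each request is made, so the same loop works with or without a proxy.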
Proxy Integration
Sometimes you might need to use a proxy service to prevent your scraper from getting blocked. With our Shopify Proxies, it’s as simple as creating a URL with your credentials.
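In Requests, a proxy is supplied as a dict that maps each scheme to a proxy URL. The sketch below builds that dict from credentials. The helper name build_proxies is hypothetical, and the host, port, username, and password you pass in are placeholders, so take the real values from your proxy provider’s dashboard.

```python
from urllib.parse import quote


def build_proxies(username: str, password: str, host: str, port: int) -> dict:
    """Return a proxies dict in the form Requests expects.

    All four arguments are placeholders supplied by the caller. The
    credentials are percent-encoded so characters such as "@" or ":"
    cannot break the URL.
    """
    user = quote(username, safe="")
    secret = quote(password, safe="")
    proxy_url = f"http://{user}:{secret}@{host}:{port}"
    # Route both plain and TLS traffic through the same proxy.
    return {"http": proxy_url, "https": proxy_url}
```

The result can then be passed to a request through the proxies keyword, for example requests.get(json_url, proxies=build_proxies(...), timeout=10).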
Other Solutions from Bright Data
Bright Data offers powerful turnkey alternatives that eliminate the need to build complex scrapers from scratch. Use our fully optimized Shopify Scraper for seamless data extraction or access our extensive library of pre-collected datasets available in multiple formats to jumpstart your projects immediately.
Conclusion
Scraping a Shopify store doesn’t need to be an impossible task. By simply leveraging their API with products.json, you can harvest a large amount of detailed product data quickly. You don’t even need to use an HTML parser! If you want, you can reduce development time with one of our premade scrapers, or you can get to work immediately with our datasets.
All our products come with a free trial. Sign up now!