How To Scrape a Next.js Site In Seconds Through Hydration Data

Discover efficient web scraping techniques for Next.js sites by leveraging React hydration in this concise tutorial.

In this tutorial on tricks for scraping Next.js sites, you will learn:

  • What Next.js is and why it is so popular
  • Why it is easy to scrape Next.js web pages thanks to how React hydration works
  • How to harness React hydration for web scraping

Let’s dive in!

What Is Next.js and How Does It Work?

Next.js is a JavaScript framework built on top of React for building server-side rendered and statically generated websites. It simplifies the development process by providing a rich API and a structured approach to building server-side React applications.

Next.js has gained a lot of popularity over the years, becoming the fifth most used web library according to Statista. This is due to its ease of use, great performance, similarities with React, extensive documentation, and community support. No wonder many large companies and startups choose Next.js for their web development needs.

At a high level, Next.js works by retrieving data on the server and passing it to React components to create pre-rendered HTML documents. This process enhances performance by generating HTML content on the server, which can then be sent to the client for faster initial page loads.

How To Take Advantage of React Hydration for Web Scraping

Hydration bridges the gap between server-side and client-side rendering. In detail, Next.js hydration is the process through which the HTML document generated by Next.js is converted into a fully functioning client-side React application.

During hydration—after the browser loads the HTML page returned by the server—React adds interactivity to the page. Specifically, it attaches event listeners and handles state in the DOM nodes that correspond to the React components rendered on the server.

These are the steps required by React to hydrate a pre-rendered page:

  1. Initial server rendering: The server generates the HTML document with the HTML representation of the React components used on the page.
  2. Client-side JavaScript execution: When the client receives the HTML markup, it runs the JavaScript bundle containing the React code.
  3. Reconciliation: React compares the HTML returned by the server with the virtual DOM representation generated on the fly. Learn more in the official docs.
  4. Hydration: If the two are the same, React completes the rendering by adding event handlers and handling state while reusing as much of the existing DOM as possible.

To perform this operation, React needs the same data used by the server to generate the HTML document. This is why Next.js adds some special DOM elements containing the props data to the generated page.

On some Next.js sites, you can find this data in the <script> element with the __NEXT_DATA__ ID. This special DOM node contains data in JSON format that React uses for hydration, as follows:

Finding the  __NEXT_DATA__ ID

On recent Next.js sites that utilize the new App Router, hydration data is instead stored in the self.__next_f.push() function calls in multiple <script> nodes:

self.__next_f.push() calls in script nodes

Note that these nodes may contain even more data than what is displayed on the site. How is this possible? Because those hydration elements store all API and database data retrieved from the server during page generation and passed to React components. However, not all attributes of those objects may actually be accessed and used in the components.
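
To make this concrete, here is a toy sketch of the idea in plain JavaScript. It is not the actual Next.js implementation: renderPage(), the product record, and its field names are hypothetical and chosen only for illustration. It shows how props embedded for hydration can carry fields that never appear in the visible markup.

```javascript
// Toy server-side step: render some HTML and embed the props as JSON.
// Simplified illustration only, not how Next.js is implemented internally.
function renderPage(pageProps) {
  // The component displays only the product name
  const markup = `<h1>${pageProps.product.name}</h1>`;

  // Escape "<" so that a "</script>" inside the data cannot close the tag early
  const payload = JSON.stringify({ props: { pageProps } }).replace(/</g, "\\u003c");

  return (
    `<div id="__next">${markup}</div>` +
    `<script id="__NEXT_DATA__" type="application/json">${payload}</script>`
  );
}

// Hypothetical record, as it might come back from an API or a database
const product = { name: "Desk lamp", price: 39.9, internalSku: "DL-0042" };
const html = renderPage({ product });

// The visible markup shows only the name...
console.log(html.includes("<h1>Desk lamp</h1>")); // true
// ...but the embedded JSON still carries every field of the record
console.log(html.includes('"internalSku":"DL-0042"')); // true
```

In this sketch, the price and internalSku fields are never rendered, yet anyone reading the page source can recover them from the embedded JSON.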

Now, it does not matter whether you fully understand why that data needs to be there for React to work. What matters is that the web pages generated via Next.js contain the data to be rendered in JSON format within special DOM nodes. As you can imagine, this has huge implications for Next.js web scraping!

Scraping Next.js Sites Through the Hydration Data

Extracting data from a page built in Next.js is so easy that you do not even need a scraping script. The DevTools of your browser will be enough.

Let’s now see how to take advantage of React hydration to scrape Next.js sites in seconds!

Extracting Data From __NEXT_DATA__

Suppose you verified that the target page to scrape is built with Next.js (see the FAQ at the end of this article to find out how).

Now, visit the page in your browser, right-click, and select “Inspect” to reach the DevTools. Move to the Console tab and run the JavaScript line below to select the desired <script> element:

const scriptNode = document.querySelector("#__NEXT_DATA__")

This will use the querySelector() function to select the element in the DOM with ID __NEXT_DATA__ and assign it to the scriptNode variable.

If you type scriptNode in the console and press Enter, you will get the desired node:

Typing scriptNode in the console

Access its inner HTML content and parse it as JSON content with:

const jsonData = JSON.parse(scriptNode.innerHTML)

Et voilà! The jsonData object will now contain all the data React used to render the components on the page:

The jsonData object with all the data

In detail, focus on the pageProps field inside props:

jsonData.props.pageProps

The pageProps field in the console

Next, right-click on the object and select the “Copy object” option:

Copying the object from the console tab

Lastly, create a data.json file and paste the desired content to it!

Great! You just performed web scraping on a Next.js site in under a minute.

Put it all together, and you will get this Next.js scraping script:

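// Select the <script> element that stores the hydration data
const scriptNode = document.querySelector("#__NEXT_DATA__")

// Parse its content as JSON
const jsonData = JSON.parse(scriptNode.innerHTML)

// Access the props passed to the page component
jsonData.props.pageProps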
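
If you would rather run the same logic outside the browser, the sketch below shows one possible way to do it in Node.js on the raw HTML returned by the server. The extractNextData() and fetchPageProps() names are hypothetical, the regular expression assumes the <script> element is written as in the examples above, and the global fetch() requires Node.js 18 or later.

```javascript
// Extract the JSON stored in <script id="__NEXT_DATA__"> from raw HTML text.
// Returns the parsed object, or null when the node is missing or malformed.
function extractNextData(html) {
  // Match the opening tag carrying id="__NEXT_DATA__" (other attributes may
  // surround it), then lazily capture everything up to the closing </script>
  const pattern = /<script\b[^>]*\bid=["']__NEXT_DATA__["'][^>]*>([\s\S]*?)<\/script>/i;
  const match = pattern.exec(html);
  if (!match) return null;

  try {
    return JSON.parse(match[1]);
  } catch {
    return null; // the content was not valid JSON
  }
}

// Hypothetical driver: downloads a page and returns its pageProps, if any
async function fetchPageProps(url) {
  const response = await fetch(url); // global fetch, Node.js 18+
  const nextData = extractNextData(await response.text());
  return nextData?.props?.pageProps ?? null;
}
```

You could then call fetchPageProps() with the URL of the target page and handle the returned object as you would jsonData.props.pageProps in the console.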

Retrieving Data From self.__next_f.push Functions

Next.js 13 introduced the App Router. This changes the way Next.js delivers the data to React for hydration. In this case, you need to select all <script> nodes that contain the self.__next_f.push string.

Again, visit the target page in the browser and reach the console. Launch the command below to select the <script> nodes:

const scriptNodes = document.querySelectorAll("script")

querySelectorAll() returns a NodeList object. Convert it to an array with Array.from() to apply the filter() method and get only the nodes of interest:

const hydrationScriptNodes = Array.from(scriptNodes).filter((e) => e.innerHTML.includes("self.__next_f.push"))

Now, hydrationScriptNodes will contain all hydration <script> elements on the page:

the hydrationScriptNodes containing all hydration elements

However, you generally want only the node that has the initialTree attribute. This is where all hydration data of interest is stored:

Finding the hydration data of interest

Select it with:

const scriptNode = hydrationScriptNodes.find((e) => e.innerHTML.includes("initialTree"))

Then, extract the data of interest with:

scriptNode.innerHTML

Note that the retrieved data contains the info of interest but requires additional parsing. You can convert it to a more readable format with a few extra lines.

converting the data to a more readable format
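
As a rough idea of what those extra lines could look like, here is a minimal sketch. It rests on two assumptions about an internal format that may change between Next.js versions: that each call looks like self.__next_f.push([1,"..."]) with an argument that is valid JSON, and that the concatenated strings form rows shaped like id:payload, one per line. The extractFlightText() and parseFlightRows() names are hypothetical.

```javascript
const PUSH_CALL = "self.__next_f.push(";

// Join the string chunks passed to self.__next_f.push() across all scripts.
// scriptTexts is an array with the inner text of each hydration <script>.
function extractFlightText(scriptTexts) {
  const chunks = [];
  for (const text of scriptTexts) {
    const start = text.indexOf(PUSH_CALL);
    const end = text.lastIndexOf(")");
    if (start === -1 || end <= start) continue;

    try {
      // Assumption: the call argument is a JSON array such as [1,"..."]
      const payload = JSON.parse(text.slice(start + PUSH_CALL.length, end));
      if (Array.isArray(payload) && typeof payload[1] === "string") {
        chunks.push(payload[1]);
      }
    } catch {
      // Not in the assumed shape: skip this script
    }
  }
  return chunks.join("");
}

// Split the joined text into rows and parse the ones that hold plain JSON.
// Rows that are not plain JSON are kept as raw strings.
function parseFlightRows(flightText) {
  const rows = {};
  for (const line of flightText.split("\n")) {
    const separator = line.indexOf(":");
    if (separator <= 0) continue;

    const id = line.slice(0, separator);
    const body = line.slice(separator + 1);
    try {
      rows[id] = JSON.parse(body);
    } catch {
      rows[id] = body;
    }
  }
  return rows;
}
```

In the browser console, you could feed it with hydrationScriptNodes.map((e) => e.innerHTML). Note that this line-based split is naive: rows whose content spans multiple lines will not be parsed correctly.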

This time, the Next.js scraping script is:

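// Select all <script> elements on the page
const scriptNodes = document.querySelectorAll("script")

// Keep only the ones that contain hydration data
const hydrationScriptNodes = Array.from(scriptNodes).filter((e) => e.innerHTML.includes("self.__next_f.push"))

// Pick the node that stores the initialTree attribute
const scriptNode = hydrationScriptNodes.find((e) => e.innerHTML.includes("initialTree"))

// Get the raw hydration data
scriptNode.innerHTML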
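
The same selection logic can also be applied to the raw HTML outside the browser. The following is a minimal sketch that works on an HTML string with a regular expression instead of the DOM. The findHydrationScripts() and findInitialTreeScript() names are hypothetical, and a regex-based approach is an assumption that only holds for simple, well-formed inline <script> elements.

```javascript
const HYDRATION_MARKER = "self.__next_f.push";

// Return the inline content of every <script> whose body contains the marker
function findHydrationScripts(html) {
  const scriptPattern = /<script\b[^>]*>([\s\S]*?)<\/script>/gi;
  const bodies = [];
  for (const match of html.matchAll(scriptPattern)) {
    if (match[1].includes(HYDRATION_MARKER)) bodies.push(match[1]);
  }
  return bodies;
}

// Mirror the find() call of the console script: return the first hydration
// script that mentions initialTree, or null when there is none
function findInitialTreeScript(html) {
  const found = findHydrationScripts(html).find((body) => body.includes("initialTree"));
  return found ?? null;
}
```

Here, html would be the text of the page as returned by the server, for example the result of response.text() after a fetch() call.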

Congratulations! Scraping Next.js sites has never been easier!

Limitations of This Next.js Scraping Approach

While this scraping approach based on React hydration data is quick and effective, it comes with some limitations. These are:

  • Partial data: The special <script> nodes added by Next.js only contain the data retrieved by the server and passed to the React components during hydration. That may not be all the data shown on the page, because React components can have hardcoded values or retrieve other data dynamically via AJAX. In this case, you need to perform web scraping with a browser automation tool.
  • Extra parsing required: The self.__next_f.push() calls carry data in a custom serialization format, and parsing it correctly may not always be easy.
  • Requires manual operations: Unless you translate the scripts written above into scraping scripts in JavaScript, Python, or a similar language and integrate the logic for data export, you must export the data manually into a text file. Find out more in our web scraping with JavaScript and Node.js guide.
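
As an example of what the export logic mentioned in the last point might look like, here is a small Node.js sketch that writes scraped data to a JSON file. The exportToJsonFile() name and the data.json path in the usage note are hypothetical, and the snippet assumes a CommonJS script.

```javascript
const fs = require("node:fs");

// Serialize the scraped data as pretty-printed JSON and write it to disk.
// Returns the text that was written.
function exportToJsonFile(data, filePath) {
  const text = JSON.stringify(data, null, 2) + "\n";
  fs.writeFileSync(filePath, text, "utf8");
  return text;
}
```

For instance, exportToJsonFile(pageProps, "data.json") would replace the manual copy-and-paste step, where pageProps is the object you extracted from the hydration data.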

Conclusion

In this article, you learned what Next.js is, why it is one of the most widely used technologies in the world for building sites, and how to scrape data from it. In particular, you saw that it relies on React hydration and what that entails. Because of that, the HTML pages returned by the server already contain the data passed to the React components, often in JSON format. This makes web scraping Next.js sites very easy.

The real challenge lies elsewhere: getting blocked by anti-bot technologies. These systems can detect and block your automated scraping script. Fortunately, Bright Data has several effective solutions for you:

  • Web Scraper IDE: A cloud IDE to build web scrapers that can automatically bypass and avoid any blocks.
  • Scraping Browser: A cloud-based controllable browser that offers JavaScript rendering capabilities while handling CAPTCHAs, browser fingerprinting, automated retries, and more for you. It integrates with the most popular automation browser libraries, such as Playwright and Puppeteer.
  • Web Unlocker: An unlocking API that can seamlessly return the raw HTML of any page, circumventing any anti-scraping measures.

Don’t want to deal with web scraping at all but are still interested in online data? Explore Bright Data’s ready-to-use datasets!

FAQ

Is it possible to hide or remove __NEXT_DATA__ from the DOM in Next.js?

No, you cannot remove or hide it. If you removed the __NEXT_DATA__ <script> element from the DOM, React would not be able to hydrate. As the data in that script is required by React to work properly, you cannot remove it without expecting some malfunction or degradation of functionality. Read the GitHub discussion dedicated to this topic.

Is it possible to remove self.__next_f.push calls from the DOM?

No, you cannot remove the self.__next_f.push calls in the <script> nodes added by Next.js. Those DOM elements are added by the server to make the client-side React application able to hydrate and function as expected. For more details, check out the GitHub discussion dedicated to that topic.

How to tell if a site is built in Next.js?

There are a few ways to tell that a site is built with Next.js. First, look for the X-Powered-By header set by default by some versions of Next.js:

checking the x-powered-by header

Otherwise, check whether the DOM contains a <script id="__NEXT_DATA__" ... > node or some <script>self.__next_f.push(...)</script> nodes.
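
These checks can be combined in a small helper. The sketch below is only a heuristic under the assumptions stated in this answer: looksLikeNextJs() is a hypothetical name, headers is expected to be a plain object of response headers, and html is the page source as text.

```javascript
// Heuristic check based on the signals described above.
// Returns which signals were found and whether at least one of them matched.
function looksLikeNextJs(html, headers = {}) {
  // Header names are case-insensitive, so compare them in lowercase
  const poweredBy = Object.entries(headers).find(
    ([name]) => name.toLowerCase() === "x-powered-by"
  );

  const signals = {
    poweredByHeader: Boolean(poweredBy) && /next\.js/i.test(String(poweredBy[1])),
    nextDataScript: /<script\b[^>]*\bid=["']__NEXT_DATA__["']/i.test(html),
    appRouterPush: html.includes("self.__next_f.push("),
  };

  return { isLikelyNextJs: Object.values(signals).some(Boolean), signals };
}
```

Keep in mind that a negative result does not prove the site is not built with Next.js, since the X-Powered-By header can be disabled.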

Is Next.js the only technology that relies on React hydration?

No, Next.js is not the only technology that relies on React hydration. Other React frameworks that pre-render HTML, such as Gatsby, also use React hydration to convert pre-rendered HTML into interactive React applications on the client side. This process is a common approach to pre-rendering with React and is not limited to Next.js.