Python and JavaScript dominate the web scraping industry, but if you need performance or portability, Scala offers a strong alternative: a compiled, portable, and strongly typed foundation to build on.
Today, we’re going over how to scrape the web using Scala and jsoup. While it isn’t written about as often as web scraping with Python, Scala provides a solid foundation and capable scraping tools.
Why Scala?
There are quite a few reasons you might choose Scala over Python or JavaScript.
- Performance: Scala compiles to JVM (Java Virtual Machine) bytecode, which the JVM’s JIT compiler translates into optimized machine code at runtime. This typically makes it significantly faster than interpreted Python.
- Static Typing: Type checking offers an additional layer of safety. Many common bugs get caught at compile time, before the program even runs.
- Portability: Because Scala compiles to JVM bytecode, your programs can run anywhere a Java runtime is installed.
- Fully Compatible With Java: You can use Java libraries directly from your Scala code, which greatly broadens the ecosystem available to you.
Getting Started
Before you get started, you need to make sure you’ve got Scala installed. We’ve got instructions below for Ubuntu, macOS, and Windows.
You can view the full documentation on installation here.
Ubuntu
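One common way to install Scala on Ubuntu is via the coursier launcher, which sets up Scala, sbt, and related tools. This sketch follows coursier's standard installation flow; check the official docs for the launcher URL matching your architecture.

```shell
# Download the coursier launcher (x86_64 build assumed), then run its setup
curl -fL "https://github.com/coursier/coursier/releases/latest/download/cs-x86_64-pc-linux.gz" | gzip -d > cs
chmod +x cs
./cs setup
```

After setup completes, open a new terminal and verify the install with `scala -version`.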
macOS
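On macOS, the simplest route is Homebrew plus coursier's setup command. This assumes you already have Homebrew installed.

```shell
# Install the coursier launcher via Homebrew, then let it set up Scala and sbt
brew install coursier/formulas/coursier
cs setup
```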
Windows
Download the Scala installer for Windows.
Creating a Scraper
Make a new project folder and `cd` into it.
Initialize a new Scala project. The command converts our new folder into a Scala project and creates a `build.sbt` file to hold our dependencies.
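One common way to do this is with sbt's template support; note that depending on your sbt version and the template you pick, the generated files may land in the current folder or in a new subfolder named after the project.

```shell
# Generate a minimal Scala project from the official hello-world template
sbt new scala/hello-world.g8
```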
Now, open up `build.sbt`. You’ll need to add jsoup as a dependency. Your complete build file should look like this.
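A minimal sketch of the build file — the project name and the Scala and jsoup version numbers here are assumptions; pin whichever current versions you prefer:

```scala
name := "quote-scraper"
version := "0.1.0"
scalaVersion := "3.3.1"

// jsoup is a Java library, so it's added with a single % (no Scala version suffix)
libraryDependencies += "org.jsoup" % "jsoup" % "1.17.2"
```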
Next, copy and paste the code below into your `Main.scala` file.
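A sketch of the Quote Scraper, assuming the classic practice site quotes.toscrape.com as the target (the `.quote`, `.text`, and `.author` classes referenced later in this article match that site's markup):

```scala
import org.jsoup.Jsoup
import scala.jdk.CollectionConverters._

object Main {
  def main(args: Array[String]): Unit = {
    // Fetch and parse the page
    val document = Jsoup.connect("https://quotes.toscrape.com").get()

    // Every quote on the page is wrapped in an element with the class "quote"
    val quotes = document.select(".quote").asScala

    for (quote <- quotes) {
      // Each quote element contains exactly one text and one author
      val text = quote.select(".text").text()
      val author = quote.select(".author").text()
      println(s"$author: $text")
    }
  }
}
```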
Running the Scraper
To run our scraper, run the following command from the root of the project.
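From the project root:

```shell
sbt run
```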
You should see an output similar to the one below.
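Assuming the scraper targets quotes.toscrape.com, the output will look something like this (the exact quotes depend on the page's current content):

```
Albert Einstein: “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
J.K. Rowling: “It is our choices, Harry, that show what we truly are, far more than our abilities.”
...
```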
Selection With jsoup
To find page elements with jsoup, we use the `select()` method. `select()` returns a list of all elements matching our selector. Let’s look at how this works in our Quote Scraper project.
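Here is the relevant selection step from the scraper (the `asScala` conversion is there because jsoup returns a Java collection):

```scala
val quotes = document.select(".quote").asScala
```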
In this line, we use `document.select(".quote")` to return all page elements with a `class` of `quote`.
We could also write these selectors with more structure: `element[attribute='some value']`. This allows us to apply stronger filters when searching for objects on the page.
The line below would still return the same page objects, but it’s much more expressive.
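A sketch of the structured form, assuming the quotes are `div` elements as they are on the practice site:

```scala
val quotes = document.select("div[class='quote']").asScala
```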
Let’s look at a couple of other instances of `select()` from our code. Since there is only one `text` element and one `author` in each quote, `select()` returns only one text object and one author. If a quote element contained multiple texts or authors, `select()` would return all of them for that quote.
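Those selections look like this inside the scraper's loop (each `quote` here is one element from the earlier `.quote` selection):

```scala
val text = quote.select(".text").text()
val author = quote.select(".author").text()
```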
Extraction With jsoup
To extract data with jsoup, we can use the following methods:
- `text()`: Extracts the text from a list of page elements. When you’re scraping prices from a website, for example, they show up on the page as text.
- `attr()`: Extracts a specific attribute from a single page element. Attributes are pieces of data located within HTML tags. This method is commonly used to extract links from a website.
text()
We saw examples of this with our initial scraper. `text()` returns the text of any elements we call it on. If the example below were to find two authors, `text()` would extract both of their text values and combine them into a single string.
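A sketch of the call, using the same quote elements as before:

```scala
// text() joins the text of every matched element into one string
val author = quote.select(".author").text()
```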
attr()
The `attr()` method behaves differently from `text()`. This method extracts a single attribute from a single page element.
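For example, each quote on the practice site links to an "about" page for its author. A sketch of pulling that link, assuming the author link is the `a` tag inside the quote's `span` (when `select()` matches several elements, `attr()` reads the attribute from the first one):

```scala
// Extract the href attribute of the author's "about" link
val aboutLink = quote.select("span a").attr("href")
println(s"$author: $text ($aboutLink)")
```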
With this line added in, our output now looks like this.
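With the link appended to each printed line, the output would look roughly like this (format and quotes are illustrative):

```
Albert Einstein: “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.” (/author/Albert-Einstein)
...
```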
Alternative Web Scraping Tools
- Scraping Browser: A remote browser fully integrated with proxies that you can use from Playwright and Selenium.
- Web Scraper APIs: Automate your scraping process by calling one of our APIs. When you call a scraper API, we scrape a site and send the data back to you.
- No Code Scraper: Tell us what site you want to scrape and which data you want. We’ll handle the rest.
- Datasets: Our datasets are perhaps the easiest of any extraction method. We scrape hundreds of sites and update our databases all the time. Datasets give you a clean set of data that’s ready for analysis.
Conclusion
Web scraping is pretty intuitive with Scala. You learned how to select page elements and extract their data using jsoup. If scraping isn’t your thing, you can always use one of our automated tools to guide the process along or entirely skip the scraping process with our ready-to-use datasets.
Sign up now and start your free trial today!
No credit card required