How To Scrape GitHub Repositories in Python

Ever wondered how to scrape GitHub? Find out here!

This tutorial will cover:

  • Why scrape GitHub repositories?
  • GitHub scraping libraries and tools
  • Build a GitHub repo scraper with Beautiful Soup

Why scrape GitHub repositories?

There are several reasons to scrape GitHub repositories. The most popular ones are:

  • Follow technology trends: By monitoring repository stars and releases, you can keep track of current trends in programming languages, frameworks, and libraries. Scraping GitHub lets you analyze which technologies are gaining popularity, monitor their growth, and identify emerging trends. This data can guide decisions about technology adoption, skill development, and resource allocation.
  • Gain access to a rich programming knowledge base: GitHub is a treasure trove of open-source projects, code samples, and solutions. This means you can gather a vast amount of programming knowledge and best practices from the platform. That is useful for educational purposes, improving coding skills, and understanding how different technologies are implemented.
  • Get insights into collaborative development: Repositories offer insights into how developers collaborate through pull requests, issues, and discussions. By collecting this data, you can study collaboration patterns to help you devise teamwork strategies, improve project management, and perfect software development processes.

GitHub is not the only cloud-based platform for hosting Git repositories; there are many alternatives. However, it remains the preferred choice for data scraping because of its:

  1. Large user base
  2. High user activity
  3. Established reputation

In particular, GitHub data is valuable for monitoring tech trends, discovering libraries and frameworks, and improving the software development process. This information plays a key role in staying ahead of the competition in the IT world.

GitHub scraping libraries and tools

Python is widely regarded as an excellent language for web scraping thanks to its straightforward syntax, developer-friendly nature, and extensive range of libraries. That is why it is the recommended programming language for scraping GitHub. Find out more in our in-depth guide on how to do web scraping with Python.

The next step is to select the most suitable scraping libraries from the wide range of options available. To make an informed decision, you should first explore the platform in your browser. Open the DevTools and take a look at the AJAX calls made by the repository pages on GitHub. You will notice that the majority of them can be ignored. In fact, most of the page data is embedded in the HTML documents returned by the server.

This implies that a library to make HTTP requests to the server combined with an HTML parser will be enough for the task. So, you should opt for:

  • Requests: The most popular HTTP client library in the Python ecosystem. It streamlines the process of sending HTTP requests and handling their corresponding responses.
  • Beautiful Soup: A comprehensive HTML and XML parsing library. It provides a robust DOM navigation and data extraction API for web scraping (see the quick example after this list).
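
To see how the two libraries fit together, here is a minimal sketch that downloads a page and parses it. The URL below is just a placeholder for illustration; the actual target page comes later in the tutorial.

import requests
from bs4 import BeautifulSoup

# download a sample page (placeholder URL, for illustration only)
page = requests.get('https://example.com')
# parse the HTML document returned by the server
soup = BeautifulSoup(page.text, 'html.parser')
# extract the text of the <h1> element, if any
heading = soup.select_one('h1')
print(heading.get_text() if heading else 'No <h1> found')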

Thanks to Requests and Beautiful Soup, you can effectively perform GitHub scraping using Python. Let’s jump into the details of how to accomplish that!

Build a GitHub Repo Scraper With Beautiful Soup

Follow this step-by-step tutorial and learn how to scrape GitHub in Python. Want to skip the whole coding and scraping process? Purchase a GitHub dataset instead.

Step 1: Python project setup

Before getting started, make sure you have Python 3 installed on your machine.

You now have everything required to set up a project in Python!

Launch the following commands in the terminal to create a github-scraper folder and initialize it with a Python virtual environment:


mkdir github-scraper
cd github-scraper
python -m venv env

On Windows, run the command below to activate the environment:

env\Scripts\activate.ps1

On Linux or macOS, execute:

source env/bin/activate

Then, add a scraper.py file containing the line below in the project folder:

print('Hello, World!')

Right now, your GitHub scraper only prints “Hello, World!” but it will soon contain the logic to extract data from public repositories.

You can launch the script with:

python scraper.py

If all went as planned, it should print this message in the terminal:

Hello, World!

Now that you know it works, open the project folder in your favorite Python IDE.

Fantastic! Get ready to write some Python code. 

Step 2: Install the scraping libraries

As mentioned before, Beautiful Soup and Requests help you perform web scraping on GitHub. In the activated virtual environment, execute the following command to add them to the project’s dependencies:

pip install beautifulsoup4 requests

Clear scraper.py and then import the two packages with these lines: 

import requests
from bs4 import BeautifulSoup
# scraping logic...

Make sure that your Python IDE does not report any errors. You may get some warnings because of unused imports. Ignore them, as you are about to use these scraping libraries to extract repository data from GitHub!

Step 3: Download the target page

Select a GitHub repository you want to retrieve data from. In this guide, you will see how to scrape the luminati-proxy repository. Keep in mind that any other repository will do, as the scraping logic will be the same.

Here is what the target page looks like in the browser:

GitHub Repository Selection Guide

Store the URL of the target page in a variable:

url = 'https://github.com/luminati-io/luminati-proxy'

Then, download the page with requests.get():

page = requests.get(url)

Behind the scenes, requests makes an HTTP GET request to that URL and saves the response produced by the server in the page variable. What you should focus your attention on is its text attribute. This contains the HTML document associated with the target webpage. Verify that with a simple print instruction:

print(page.text)

Run the scraper and you should see this in the terminal:


<!DOCTYPE html>
<html lang="en" data-color-mode="dark" data-light-theme="light" data-dark-theme="dark"  data-a11y-animated-images="system" data-a11y-link-underlines="false">
      <head>
        <meta charset="utf-8">
      <link rel="dns-prefetch" href="https://github.githubassets.com">
      <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">
      <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">
      <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">
      <link rel="preconnect" href="https://github.githubassets.com" crossorigin>
      <link rel="preconnect" href="https://avatars.githubusercontent.com">
<!-- Omitted for brevity... -->
Awesome! Let’s now learn how to parse this HTML document!

Step 4: Parse the HTML document

To parse the HTML document retrieved above, pass it to Beautiful Soup:

soup = BeautifulSoup(page.text, 'html.parser')

The BeautifulSoup() constructor takes two arguments:

  1. The string containing the HTML content: Here, it is provided via the page.text attribute.
  2. The parser that Beautiful Soup will use: “html.parser” is the name of the Python built-in HTML parser.

BeautifulSoup() will parse the HTML and return an explorable tree structure. In detail, the soup variable provides effective methods for selecting elements from the DOM tree, such as:

  • find(): Returns the first HTML element matching the selector strategy passed as a parameter.
  • find_all(): Returns the list of HTML elements matching the input selector strategy.
  • select_one(): Returns the first HTML element matching the CSS selector passed as a parameter.
  • select(): Returns the list of HTML elements matching the input CSS selector.

Note that these methods can also be called on a single node in the tree. In addition to them, a Beautiful Soup node object also exposes the following sibling-navigation methods, demonstrated in the short example below:

  • find_next_sibling(): Returns the first HTML node among the element’s siblings that matches the given filters.
  • find_next_siblings(): Returns all HTML nodes among the element’s siblings that match the filters passed as parameters.

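To make these methods concrete, here is a tiny self-contained example run against a made-up HTML snippet. The markup is invented purely for demonstration and does not come from GitHub:

from bs4 import BeautifulSoup

# toy HTML snippet, invented just to demonstrate the selection methods
html = '<div><svg class="octicon octicon-star"></svg><strong>1,234</strong></div>'
soup = BeautifulSoup(html, 'html.parser')

# select the icon node via a CSS selector
icon = soup.select_one('.octicon-star')
# navigate to its next <strong> sibling
strong = icon.find_next_sibling('strong')
print(strong.get_text())  # prints: 1,234
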
Thanks to these functions, you are ready to scrape GitHub. Let’s see how!

Step 5: Familiarize yourself with the target page

Before diving into coding, there is another crucial step to complete. Scraping data from a site is about selecting the HTML elements of interest and extracting data from them. Defining an effective selection strategy is not always easy, and you must spend some time analyzing the structure of your target webpage. 

Thus, open the GitHub target page in the browser and familiarize yourself with it. Right-click and select “Inspect” to open the DevTools:

Inspecting GitHub Page with DevTools

Digging into the HTML code, you will notice that the site does not qualify many of its elements with unique classes or attributes. So, it is usually difficult to navigate straight to the desired element, and you may need to go through sibling elements in creative ways.

Do not worry, though. Devising effective selector strategies for GitHub might not be easy, but it is not impossible. Continue to inspect the page in the DevTools until you feel ready to scrape it!

Step 6: Extract the repo data

The goal is to extract useful data from the GitHub repository, such as stars, description, last commit, and so on. So, you need to initialize a Python dictionary to keep track of this data. Add to your code:

repo = {}

First, inspect the name element:

Initializing Python Dictionary for Repo Data Extraction

Note that it has a distinctive itemprop="name" attribute. Select it and extract its text content with:

name_html_element = soup.select_one('[itemprop="name"]')
name = name_html_element.get_text().strip()

Given a Beautiful Soup node, use the get_text() method to access its text content. 

If you inspect name_html_element.text in the debugger, you will see:

\nluminati-proxy\n

GitHub text fields tend to contain spaces and newlines. Get rid of them with the strip() Python function.

Right below the repo name, there is the branch selector:

Extracting Repository Name Using Beautiful Soup

Note that there is no easy way to select the HTML element storing the name of the main branch. What you can do is select the .octicon-git-branch node and then look for the target span in its siblings:

git_branch_icon_html_element = soup.select_one('.octicon-git-branch')
main_branch_html_element = git_branch_icon_html_element.find_next_sibling('span')
main_branch = main_branch_html_element.get_text().strip()

The pattern for reaching an element of interest through the siblings of an icon is quite effective on GitHub. You will see it used several times in this section.
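
Since the same three lines will appear over and over, you could optionally wrap the pattern in a small helper function. The sketch below is not part of the tutorial’s final script, and its name and parameters are purely illustrative:

def get_text_next_to_icon(root, icon_selector, sibling_tag='span'):
    # select the icon node, then read the text of its next sibling of the given tag
    icon_element = root.select_one(icon_selector)
    if icon_element is None:
        return None
    sibling_element = icon_element.find_next_sibling(sibling_tag)
    return sibling_element.get_text().strip() if sibling_element else None

# for example, the main branch could be retrieved with:
# main_branch = get_text_next_to_icon(soup, '.octicon-git-branch')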

Now, focus on the branch header:

Selecting Main Branch Name via Sibling Element in GitHub

Select it with:

boxheader_html_element = soup.select_one('.Box .Box-header')

Then, extract the latest commit time with:

relative_time_html_element = boxheader_html_element.select_one('relative-time')
latest_commit = relative_time_html_element['datetime']

Given a node, you can access its HTML attributes as in a Python dictionary.
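
If you are not sure an attribute is present, the node also exposes a get() method that returns a default value instead of raising a KeyError. A quick optional sketch:

# dictionary-style access raises a KeyError if the attribute is missing
latest_commit = relative_time_html_element['datetime']
# get() returns a fallback value (here, None) instead
latest_commit = relative_time_html_element.get('datetime', None)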

Another important piece of information in this section is the number of commits:

Collect it with the icon pattern described before:

history_icon_html_element = boxheader_html_element.select_one('.octicon-history')
commits_span_html_element = history_icon_html_element.find_next_sibling('span')
commits_html_element = commits_span_html_element.select_one('strong')
commits = commits_html_element.get_text().strip().replace(',', '')

Note that find_next_sibling() gives access only to siblings at the same tree level. To select one of their children, you must first get the sibling element and then call select_one() as done above.

Since numbers over one thousand contain a comma on GitHub, remove it with the Python replace() string method.
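
If you prefer to store the counters as numbers rather than strings, you can also cast the cleaned value to an integer. This is an optional variation and not what the final script in this tutorial does:

# optional: convert the cleaned string (e.g. '1079') to an integer
commits = int(commits_html_element.get_text().strip().replace(',', ''))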

Next, turn your attention to the info box on the right:

Select it with:

bordergrid_html_element = soup.select_one('.BorderGrid')

Inspect the description element:

Selecting and Inspecting the Description Element on GitHub

Again, you can select it through a sibling:

about_html_element = bordergrid_html_element.select_one('h2')
description_html_element = about_html_element.find_next_sibling('p')
description = description_html_element.get_text().strip()

Then, apply the icon pattern to retrieve the repository stars, watches, and forks.

Focus on the icons and then on their text siblings:

star_icon_html_element = bordergrid_html_element.select_one('.octicon-star')
stars_html_element = star_icon_html_element.find_next_sibling('strong')
stars = stars_html_element.get_text().strip().replace(',', '')

eye_icon_html_element = bordergrid_html_element.select_one('.octicon-eye')
watchers_html_element = eye_icon_html_element.find_next_sibling('strong')
watchers = watchers_html_element.get_text().strip().replace(',', '')

fork_icon_html_element = bordergrid_html_element.select_one('.octicon-repo-forked')
forks_html_element = fork_icon_html_element.find_next_sibling('strong')
forks = forks_html_element.get_text().strip().replace(',', '')

Well done! You just scraped a GitHub repository.

Step 7: Scrape the readme

Another essential piece of information to retrieve is the README.md file. This is an optional text file that describes the GitHub repository and explains how to use the code.

If you click on the README.md file and then on the “Raw” button, you will be redirected to the URL below:

https://raw.githubusercontent.com/luminati-io/luminati-proxy/master/README.md

It can therefore be inferred that the URL of a GitHub repo’s readme file follows the format below:

https://raw.githubusercontent.com/<repo_id>/<repo_main_branch>/README.md

Since you have the <repo_main_branch> info stored in the main_branch variable, you can programmatically build this URL with a Python f-string:

readme_url = f'https://raw.githubusercontent.com/luminati-io/luminati-proxy/{main_branch}/README.md'

Then, use requests to retrieve the readme raw Markdown content:

readme_url = f'https://raw.githubusercontent.com/luminati-io/luminati-proxy/{main_branch}/README.md'
readme_page = requests.get(readme_url)

readme = None
# if there is a README.md file
if readme_page.status_code != 404:
    readme = readme_page.text

Note the 404 check to avoid storing the GitHub 404 page content when the repo does not have a readme file.
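
If you plan to reuse this logic for other repositories, you could generalize it into a small function. The sketch below is based on the URL format shown above; the repo_id parameter (e.g. 'luminati-io/luminati-proxy') is a hypothetical name introduced here for illustration:

def get_readme(repo_id, main_branch):
    # build the raw README.md URL following the format described above
    readme_url = f'https://raw.githubusercontent.com/{repo_id}/{main_branch}/README.md'
    readme_page = requests.get(readme_url)
    # return None when the repo does not have a README.md file
    return readme_page.text if readme_page.status_code != 404 else None

# e.g., readme = get_readme('luminati-io/luminati-proxy', main_branch)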

Step 8: Store the scraped data

Do not forget to add the scraped data variables to the repo dictionary:

repo['name'] = name
repo['latest_commit'] = latest_commit
repo['commits'] = commits
repo['main_branch'] = main_branch
repo['description'] = description
repo['stars'] = stars
repo['watchers'] = watchers
repo['forks'] = forks
repo['readme'] = readme

Use print(repo) to make sure the data extraction process works as desired. Run the Python GitHub scraper and you will get:

{'name': 'luminati-proxy', 'latest_commit': '2023-08-09T08:25:15Z', 'commits': '1079', 'main_branch': 'master', 'description': 'Luminati HTTP/HTTPS Proxy manager', 'stars': '645', 'watchers': '55', 'forks': '196', 'readme': '# Proxy manager\n\n (omitted for brevity...)'}

Fantastic! You know how to scrape GitHub!

Step 9: Export the scraped data to JSON

The final step is to make the collected data easier to share, read, and analyze. The best way to achieve that is to export the data in a human-readable format, such as JSON:

import json

# ...

with open('repo.json', 'w') as file:
    json.dump(repo, file, indent=4)

Import json from the Python standard library, initialize a repo.json file with open(), and finally use json.dump() to populate it. Check out our guide to learn more about how to parse JSON in Python.

Perfect! It is time to take a look at the entire GitHub Python scraper.

Step 10: Put it all together

This is what the complete scraper.py file looks like:

import requests
from bs4 import BeautifulSoup
import json

# the URL of the target repo to scrape
url = 'https://github.com/luminati-io/luminati-proxy'

# download the target page
page = requests.get(url)
# parse the HTML document returned by the server
soup = BeautifulSoup(page.text, 'html.parser')

# initialize the object that will contain
# the scraped data
repo = {}

# repo scraping logic
name_html_element = soup.select_one('[itemprop="name"]')
name = name_html_element.get_text().strip()

git_branch_icon_html_element = soup.select_one('.octicon-git-branch')
main_branch_html_element = git_branch_icon_html_element.find_next_sibling('span')
main_branch = main_branch_html_element.get_text().strip()

# scrape the repo history data
boxheader_html_element = soup.select_one('.Box .Box-header')

relative_time_html_element = boxheader_html_element.select_one('relative-time')
latest_commit = relative_time_html_element['datetime']

history_icon_html_element = boxheader_html_element.select_one('.octicon-history')
commits_span_html_element = history_icon_html_element.find_next_sibling('span')
commits_html_element = commits_span_html_element.select_one('strong')
commits = commits_html_element.get_text().strip().replace(',', '')

# scrape the repo details in the right box
bordergrid_html_element = soup.select_one('.BorderGrid')

about_html_element = bordergrid_html_element.select_one('h2')
description_html_element = about_html_element.find_next_sibling('p')
description = description_html_element.get_text().strip()

star_icon_html_element = bordergrid_html_element.select_one('.octicon-star')
stars_html_element = star_icon_html_element.find_next_sibling('strong')
stars = stars_html_element.get_text().strip().replace(',', '')

eye_icon_html_element = bordergrid_html_element.select_one('.octicon-eye')
watchers_html_element = eye_icon_html_element.find_next_sibling('strong')
watchers = watchers_html_element.get_text().strip().replace(',', '')

fork_icon_html_element = bordergrid_html_element.select_one('.octicon-repo-forked')
forks_html_element = fork_icon_html_element.find_next_sibling('strong')
forks = forks_html_element.get_text().strip().replace(',', '')

# build the URL for README.md and download it
readme_url = f'https://raw.githubusercontent.com/luminati-io/luminati-proxy/{main_branch}/README.md'
readme_page = requests.get(readme_url)

readme = None
# if there is a README.md file
if readme_page.status_code != 404:
    readme = readme_page.text

# store the scraped data 
repo['name'] = name
repo['latest_commit'] = latest_commit
repo['commits'] = commits
repo['main_branch'] = main_branch
repo['description'] = description
repo['stars'] = stars
repo['watchers'] = watchers
repo['forks'] = forks
repo['readme'] = readme

# export the scraped data to a repo.json output file
with open('repo.json', 'w') as file:
    json.dump(repo, file, indent=4)

In less than 100 lines of code, you can build a web spider to collect repo data.

Run the script with:

python scraper.py

Wait for the scraping process to complete, and you will find a repo.json file in the root folder of your project. Open it, and you will see:

{
    "name": "luminati-proxy",
    "latest_commit": "2023-08-09T08:25:15Z",
    "commits": "1079",
    "main_branch": "master",
    "description": "Luminati HTTP/HTTPS Proxy manager",
    "stars": "645",
    "watchers": "55",
    "forks": "196",
    "readme": "# Proxy manager\n\n[![dependencies Status](https://david-dm.org/luminati-io/luminati-proxy/status.svg)](https://david-dm.org/luminati-io/luminati-proxy)\n[![devDependencies Status](https://david-dm.org/luminati-io/luminati-proxy/dev-status.svg)](https://david-dm..."
}

Congrats! You started from raw data contained in a webpage and now have semi-structured data in a JSON file. You just learned how to build a GitHub repo scraper in Python!

Conclusion

In this step-by-step guide, you learned why you should build a GitHub repo scraper and how to do it in Python. As shown here, it takes only a few lines of code.

At the same time, more and more sites are adopting anti-scraping technologies. These can identify and block automated requests through IP banning and rate limiting, preventing your scraper from accessing the site. The best way to avoid them is to use a proxy. Explore Bright Data’s vast offer of top-notch proxy services and its dedicated GitHub proxies.
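
For reference, routing the scraper’s requests through a proxy only requires passing a proxies dictionary to Requests. The endpoint and credentials below are placeholders, not real values:

# placeholder proxy endpoint and credentials, for illustration only
proxies = {
    'http': 'http://<username>:<password>@<proxy_host>:<proxy_port>',
    'https': 'http://<username>:<password>@<proxy_host>:<proxy_port>',
}
page = requests.get(url, proxies=proxies)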

Bright Data controls the best proxies for web scraping, serving Fortune 500 companies and over 20,000 customers through its worldwide proxy network.

Overall, Bright Data operates one of the largest and most reliable scraping-oriented proxy networks on the market. Talk to one of our sales reps to see which of Bright Data’s products best suits your needs.