How to Parse XML in Python

Learn how to parse XML in Python using libraries like ElementTree, lxml, and SAX to enhance your data processing projects.

Extensible Markup Language (XML) is a widely used format for storing and exchanging structured data. XML files are commonly used to represent hierarchical data, such as configuration files, data interchange formats, web service responses, and web sitemaps.

Parsing XML files in Python is a common task, especially for automating manual processes like processing data retrieved from web APIs or web scraping.

In this article, you’ll learn about some of the libraries that you can use to parse XML in Python, including the ElementTree module, the lxml library, minidom, the Simple API for XML (SAX), and untangle.

Key Concepts of an XML File

Before you learn how to parse XML in Python, you must understand what XML Schema Definition (XSD) is and what elements make up an XML file. This understanding can help you select the appropriate Python library for your parsing task.

XSD is a schema specification that defines the structure, content, and data types allowed in an XML document. It serves as a syntax for validating the structure and content of XML files against a predefined set of rules.

An XML file usually includes a namespace, a root, attributes, elements, and text content, which collectively represent structured data.

  • Namespace allows elements and attributes in XML documents to be uniquely identified. Namespace helps avoid naming conflicts and enables interoperability between XML documents.
  • root is the top-level element in an XML document. It serves as the starting point for navigating the XML structure and contains all other elements as its children.
  • attributes provide additional information about the element. They’re specified within the start tag of an element and consist of a name-value pair.
  • elements are the building blocks of an XML document and represent the data or structure being described. Elements can be nested within other elements to create a hierarchical structure.
  • text content refers to the textual data enclosed within an element’s start and end tags. It can include plaintext, numbers, or other characters.
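To make these concepts concrete, here’s a minimal sketch that parses a small, hypothetical XML document (the catalog structure and the `http://example.com/books` namespace are invented for illustration) and touches each part in turn, using Python’s built-in ElementTree module:

```python
import xml.etree.ElementTree as ET

# A hypothetical XML document illustrating each concept:
# a namespace, a root element, an attribute, a nested element, and text content.
xml_data = """<?xml version="1.0"?>
<catalog xmlns="http://example.com/books">
    <book id="bk101">
        <title>XML Basics</title>
    </book>
</catalog>"""

root = ET.fromstring(xml_data)              # <catalog> is the root element
ns = {"b": "http://example.com/books"}      # prefix mapped to the namespace URI

book = root.find("b:book", ns)              # <book> is a child element
book_id = book.get("id")                    # an attribute (a name-value pair)
title_text = book.find("b:title", ns).text  # text content between the tags

print(book_id, title_text)
```

Note that ElementTree expands the default namespace into the tag name, so the root’s tag is `{http://example.com/books}catalog` rather than plain `catalog`; this is why the examples later in this article use the `{...}` prefix when searching for elements.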

For example, the Bright Data sitemap has the following XML structure:

  • urlset is the root element.
  • <urlset xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd"> is the namespace declaration on the urlset element, meaning that all elements under urlset must conform to the schema referenced by this declaration.
  • url is the first child of the root element.
  • loc is the child element of the url element.

Now that you know a little more about XSD and XML file elements, let’s use that information to help parse an XML file with a few libraries.

Various Ways to Parse XML in Python

For demonstration purposes, you’ll use the Bright Data sitemap for this tutorial, which is available in XML format. In the following examples, the Bright Data sitemap content is fetched using the Python requests library.

The Python requests library is not built-in, so you need to install it before proceeding. You can do so using the following command:

pip install requests

ElementTree

The ElementTree module provides a simple and intuitive API for parsing and creating XML data in Python. It’s part of Python’s standard library, which means you don’t need to install anything.

For example, you can use the findall() method to find all the url elements from the root and print the text value of the loc element, like this:

import xml.etree.ElementTree as ET
import requests

url = 'https://brightdata.com/post-sitemap.xml'

response = requests.get(url)
if response.status_code == 200:
    # Parse the XML content into an Element tree
    root = ET.fromstring(response.content)

    for url_element in root.findall('.//{http://www.sitemaps.org/schemas/sitemap/0.9}url'):
        loc_element = url_element.find('{http://www.sitemaps.org/schemas/sitemap/0.9}loc')
        if loc_element is not None:
            print(loc_element.text)
else:
    print("Failed to retrieve XML file from the URL.")

All the URLs in the sitemap are printed in the output:

https://brightdata.com/case-studies/powerdrop-case-study
https://brightdata.com/case-studies/addressing-brand-protection-from-every-angle
https://brightdata.com/case-studies/taking-control-of-the-digital-shelf-with-public-online-data
https://brightdata.com/case-studies/the-seo-transformation
https://brightdata.com/case-studies/data-driven-automated-e-commerce-tools
https://brightdata.com/case-studies/highly-targeted-influencer-marketing
https://brightdata.com/case-studies/data-driven-products-for-smarter-shopping-solutions
https://brightdata.com/case-studies/workplace-diversity-facilitated-by-online-data
https://brightdata.com/case-studies/alternative-travel-solutions-enabled-by-online-data-railofy
https://brightdata.com/case-studies/data-intensive-analytical-solutions
https://brightdata.com/case-studies/canopy-advantage-solutions
https://brightdata.com/case-studies/seamless-digital-automations

ElementTree is a user-friendly way to parse XML data in Python, featuring a straightforward API that makes it easy to navigate and manipulate XML structures. However, ElementTree does have its limitations; it lacks robust support for schema validation and is not ideal if you need to ensure strict adherence to a schema specification before parsing.

If you have a small script that reads an RSS feed, the user-friendly API of ElementTree is a good fit for extracting titles, descriptions, and links from each feed item. However, if your use case involves complex validation or very large files, it would be better to consider another library like lxml.
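The RSS use case can be sketched as follows. The feed below is a hypothetical inline string (in practice you would fetch it with requests, as in the earlier example), and since RSS 2.0 items don’t use a namespace, no `{...}` prefix is needed:

```python
import xml.etree.ElementTree as ET

# A hypothetical RSS 2.0 feed; real code would fetch this over HTTP.
rss = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <item>
      <title>First post</title>
      <link>https://example.com/first</link>
      <description>An example item</description>
    </item>
    <item>
      <title>Second post</title>
      <link>https://example.com/second</link>
      <description>Another example item</description>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(rss)
# iter("item") walks the whole tree, so the channel nesting doesn't matter
items = [
    (item.findtext("title"), item.findtext("link"))
    for item in root.iter("item")
]
for title, link in items:
    print(title, link)
```

findtext() is a convenient shortcut here: it returns the text of the first matching child, or None if the child is missing, so malformed items don’t raise an AttributeError.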

lxml

lxml is a fast, easy-to-use, and feature-rich library for parsing XML files in Python; however, it’s not part of Python’s standard library. While some Linux and macOS platforms ship with the lxml package preinstalled, other platforms require manual installation.

lxml is distributed via PyPI, and you can install it using the following pip command:

pip install lxml

Once installed, you can use lxml to parse XML files using various API methods, such as find(), findall(), findtext(), get(), and get_element_by_id().

For instance, you can use the findall() method to iterate over the url elements, find their loc elements (which are child elements of the url element), and then print the location text using the following code:

from lxml import etree
import requests

url = "https://brightdata.com/post-sitemap.xml"

response = requests.get(url)
if response.status_code == 200:
    # Parse the XML content into an lxml Element tree
    root = etree.fromstring(response.content)

    for url_element in root.findall(".//{http://www.sitemaps.org/schemas/sitemap/0.9}url"):
        loc = url_element.find("{http://www.sitemaps.org/schemas/sitemap/0.9}loc").text.strip()
        print(loc)
else:
    print("Failed to retrieve XML file from the URL.")

The output displays all the URLs found in the sitemap:

https://brightdata.com/case-studies/powerdrop-case-study
https://brightdata.com/case-studies/addressing-brand-protection-from-every-angle
https://brightdata.com/case-studies/taking-control-of-the-digital-shelf-with-public-online-data
https://brightdata.com/case-studies/the-seo-transformation
https://brightdata.com/case-studies/data-driven-automated-e-commerce-tools
https://brightdata.com/case-studies/highly-targeted-influencer-marketing
https://brightdata.com/case-studies/data-driven-products-for-smarter-shopping-solutions
https://brightdata.com/case-studies/workplace-diversity-facilitated-by-online-data
https://brightdata.com/case-studies/alternative-travel-solutions-enabled-by-online-data-railofy
https://brightdata.com/case-studies/data-intensive-analytical-solutions
https://brightdata.com/case-studies/canopy-advantage-solutions
https://brightdata.com/case-studies/seamless-digital-automations

So far, you’ve learned how to find elements and print their value. Now, let’s explore schema validation before parsing the XML. This process ensures that the file conforms to the specified structure defined by the schema.

The XSD for the sitemap looks like this:

<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://www.sitemaps.org/schemas/sitemap/0.9"
           xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
           elementFormDefault="qualified"
           xmlns:xhtml="http://www.w3.org/1999/xhtml">

  
  <xs:element name="urlset">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="url" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  
  <xs:element name="url">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="loc" type="xs:anyURI"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

</xs:schema>

To use this schema for validation, copy it into a local file named schema.xsd.

To validate the XML file using this XSD, use the following code:


from lxml import etree
import requests

url = "https://brightdata.com/post-sitemap.xml"

response = requests.get(url)
if response.status_code == 200:
    root = etree.fromstring(response.content)

    try:
        print("Schema Validation:")
        schema_doc = etree.parse("schema.xsd")
        schema = etree.XMLSchema(schema_doc)
        schema.assertValid(root)
        print("XML is valid according to the schema.")
    except etree.DocumentInvalid as e:
        print("XML validation error:", e)
else:
    print("Failed to retrieve XML file from the URL.")

Here, you parse the XSD file using the etree.parse() method. Then you create an XML Schema using the parsed XSD doc content. Finally, you validate the XML root document against the XML schema using the assertValid() method. If the schema validation passes, your output includes a message that says something like XML is valid according to the schema. Otherwise, the DocumentInvalid exception is raised.

Your output should look like this:

Schema Validation:
XML is valid according to the schema.

Now, let’s read the XML file using the xpath() method, which finds elements by their path.

To read the elements using the xpath() method, use the following code:

from lxml import etree

import requests

url = "https://brightdata.com/post-sitemap.xml"
response = requests.get(url)

if response.status_code == 200:
    root = etree.fromstring(response.content)

    print("XPath Support:")
    namespaces = {"ns": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    for loc in root.xpath(".//ns:url/ns:loc", namespaces=namespaces):
        print(loc.text.strip())
else:
    print("Failed to retrieve XML file from the URL.")

In this code, you register the namespace prefix ns and map it to the namespace URI http://www.sitemaps.org/schemas/sitemap/0.9. In the XPath expression, you use the ns prefix to specify elements in the namespace. Finally, the expression .//ns:url/ns:loc selects all loc elements that are children of url elements in the namespace.

Your output will look like this:

XPath Support:

https://brightdata.com/case-studies/powerdrop-case-study
https://brightdata.com/case-studies/addressing-brand-protection-from-every-angle
https://brightdata.com/case-studies/taking-control-of-the-digital-shelf-with-public-online-data
https://brightdata.com/case-studies/the-seo-transformation
https://brightdata.com/case-studies/data-driven-automated-e-commerce-tools
https://brightdata.com/case-studies/highly-targeted-influencer-marketing
https://brightdata.com/case-studies/data-driven-products-for-smarter-shopping-solutions
https://brightdata.com/case-studies/workplace-diversity-facilitated-by-online-data
https://brightdata.com/case-studies/alternative-travel-solutions-enabled-by-online-data-railofy
https://brightdata.com/case-studies/data-intensive-analytical-solutions
https://brightdata.com/case-studies/canopy-advantage-solutions
https://brightdata.com/case-studies/seamless-digital-automations

Note that the find() and findall() methods are generally faster than the xpath() method because xpath() collects all its results in memory before returning them. It’s recommended that you use find() unless you specifically need the expressiveness of XPath queries.

lxml offers powerful features for parsing and manipulating XML and HTML. It supports complex queries using XPath expressions, validates documents against schemas, and even allows for eXtensible Stylesheet Language Transformations (XSLT). This makes it ideal for scenarios where performance and advanced functionality are crucial. However, keep in mind that lxml requires a separate installation as it’s not part of the core Python package.

If you’re dealing with large or complex XML data that requires both high performance and advanced manipulation, you should consider using lxml. For instance, if you’re processing financial data feeds in XML format, you might need to use XPath expressions to extract specific elements like stock prices, validate the data against a financial schema to ensure accuracy, and potentially transform the data using XSLT for further analysis.
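The XSLT support mentioned above can be sketched like this. Both the stock-price document and the stylesheet are hypothetical, kept deliberately tiny: the stylesheet extracts each stock’s `symbol` attribute as plain text:

```python
from lxml import etree

# Hypothetical stock-price feed.
xml_doc = etree.fromstring(
    "<prices><stock symbol='ABC'>10.5</stock><stock symbol='XYZ'>20.0</stock></prices>"
)

# A minimal XSLT stylesheet that emits each symbol as text.
xslt_doc = etree.fromstring("""\
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/prices">
    <xsl:for-each select="stock">
      <xsl:value-of select="@symbol"/>
      <xsl:text> </xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>""")

transform = etree.XSLT(xslt_doc)   # compile the stylesheet once
result = str(transform(xml_doc))   # apply it to the document
print(result)
```

Compiling the stylesheet with etree.XSLT() once and reusing the resulting transform object is the usual pattern when the same transformation is applied to many documents.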

minidom

minidom is a lightweight and simple XML parsing library that’s included in Python’s standard library. While it’s not as feature-rich or efficient as parsing with lxml, it offers a straightforward way to parse and manipulate XML data in Python.

You can use the various methods available in the DOM object to access elements. For example, you can use the getElementsByTagName() method to retrieve the value of an element using its tag name.

The following example demonstrates how to use the minidom library to parse an XML file and fetch the elements using their tag names:

import requests
import xml.dom.minidom

url = "https://brightdata.com/post-sitemap.xml"

response = requests.get(url)
if response.status_code == 200:
    dom = xml.dom.minidom.parseString(response.content)
    
    urlset = dom.getElementsByTagName("urlset")[0]
    for url in urlset.getElementsByTagName("url"):
        loc = url.getElementsByTagName("loc")[0].firstChild.nodeValue.strip()
        print(loc)
else:
    print("Failed to retrieve XML file from the URL.")

Your output would look like this:

https://brightdata.com/case-studies/powerdrop-case-study
https://brightdata.com/case-studies/addressing-brand-protection-from-every-angle
https://brightdata.com/case-studies/taking-control-of-the-digital-shelf-with-public-online-data
https://brightdata.com/case-studies/the-seo-transformation
https://brightdata.com/case-studies/data-driven-automated-e-commerce-tools
https://brightdata.com/case-studies/highly-targeted-influencer-marketing
https://brightdata.com/case-studies/data-driven-products-for-smarter-shopping-solutions
https://brightdata.com/case-studies/workplace-diversity-facilitated-by-online-data
https://brightdata.com/case-studies/alternative-travel-solutions-enabled-by-online-data-railofy
https://brightdata.com/case-studies/data-intensive-analytical-solutions
https://brightdata.com/case-studies/canopy-advantage-solutions
https://brightdata.com/case-studies/seamless-digital-automations

minidom works with XML data by representing it as a DOM tree. This tree structure makes it easy to navigate and manipulate data, and it’s best suited for basic tasks such as reading, changing, or building simple XML structures.

If your program involves reading default settings from an XML file, the DOM approach of minidom allows you to easily access specific settings within the XML file using methods such as finding child nodes or attributes. With minidom, you can easily retrieve specific settings from the XML file, such as the font-size node, and utilize its value within your application.
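The settings scenario above can be sketched with a hypothetical configuration document; the `settings` structure and the `font-size` node are invented for illustration:

```python
import xml.dom.minidom

# Hypothetical application settings stored as XML.
settings_xml = """<?xml version="1.0"?>
<settings>
    <appearance>
        <font-size>14</font-size>
        <theme>dark</theme>
    </appearance>
</settings>"""

dom = xml.dom.minidom.parseString(settings_xml)

# getElementsByTagName returns a list; take the first match and read its text node
font_size = int(dom.getElementsByTagName("font-size")[0].firstChild.nodeValue)
theme = dom.getElementsByTagName("theme")[0].firstChild.nodeValue

print(font_size, theme)
```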

SAX Parser

The SAX parser is an event-driven XML parsing approach in Python that processes XML documents sequentially and generates events as it encounters various parts of the document. Unlike DOM-based parsers that construct a tree structure representing the entire XML document in memory, SAX parsers do not build a complete representation of the document. Instead, they emit events such as start tags, end tags, and text content as they move through the document.

SAX parsers are good for processing large XML files or streams where memory efficiency is a concern as they operate on XML data incrementally without loading the entire document into memory.

When using the SAX parser, you need to define the event handlers that respond to specific XML events, such as the startElement and endElement emitted by the parser. These event handlers can be customized to perform actions based on the structure and content of the XML document.

The following example demonstrates how to parse an XML file using the SAX parser by defining the startElement and endElement events and retrieving the URL information from the sitemap file:

import requests
import xml.sax.handler
from io import BytesIO

class MyContentHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        self.in_url = False
        self.in_loc = False
        self.url = ""

    def startElement(self, name, attrs):
        if name == "url":
            self.in_url = True
        elif name == "loc" and self.in_url:
            self.in_loc = True

    def characters(self, content):
        if self.in_loc:
            self.url += content

    def endElement(self, name):
        if name == "url":
            print(self.url.strip())
            self.url = ""
            self.in_url = False
        elif name == "loc":
            self.in_loc = False

url = "https://brightdata.com/post-sitemap.xml"

response = requests.get(url)
if response.status_code == 200:

    xml_content = BytesIO(response.content)
    
    content_handler = MyContentHandler()
    parser = xml.sax.make_parser()
    parser.setContentHandler(content_handler)
    parser.parse(xml_content)
else:
    print("Failed to retrieve XML file from the URL.")

Your output would look like this:

https://brightdata.com/case-studies/powerdrop-case-study
https://brightdata.com/case-studies/addressing-brand-protection-from-every-angle
https://brightdata.com/case-studies/taking-control-of-the-digital-shelf-with-public-online-data
https://brightdata.com/case-studies/the-seo-transformation
https://brightdata.com/case-studies/data-driven-automated-e-commerce-tools
https://brightdata.com/case-studies/highly-targeted-influencer-marketing
https://brightdata.com/case-studies/data-driven-products-for-smarter-shopping-solutions
https://brightdata.com/case-studies/workplace-diversity-facilitated-by-online-data
https://brightdata.com/case-studies/alternative-travel-solutions-enabled-by-online-data-railofy
https://brightdata.com/case-studies/data-intensive-analytical-solutions
https://brightdata.com/case-studies/canopy-advantage-solutions
https://brightdata.com/case-studies/seamless-digital-automations

Unlike parsers that load the entire file into memory, SAX processes files incrementally, conserving memory and improving performance on large documents. However, SAX requires more code, since you must manage parser state yourself, and it cannot revisit and analyze earlier parts of the document once they’ve been parsed.

If you need to scan a large XML file (e.g., a log file containing various events) to extract specific information (e.g., error messages), SAX can help you efficiently navigate through the file. However, if your analysis requires understanding the relationships between different data segments, SAX may not be the best choice.
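The log-scanning scenario can be sketched as follows; the event log below is a hypothetical inline document, and the handler collects only the messages of events whose `level` attribute is `"error"`:

```python
import io
import xml.sax

# A hypothetical XML event log; only error messages are collected.
log_xml = b"""<?xml version="1.0"?>
<log>
  <event level="info"><message>started</message></event>
  <event level="error"><message>disk full</message></event>
  <event level="error"><message>timeout</message></event>
</log>"""

class ErrorHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        super().__init__()
        self.errors = []
        self._is_error_event = False
        self._in_error_message = False
        self._buffer = ""

    def startElement(self, name, attrs):
        if name == "event":
            self._is_error_event = attrs.get("level") == "error"
        elif name == "message" and self._is_error_event:
            self._in_error_message = True
            self._buffer = ""

    def characters(self, content):
        # characters() may fire multiple times per text node, so accumulate
        if self._in_error_message:
            self._buffer += content

    def endElement(self, name):
        if name == "message" and self._in_error_message:
            self.errors.append(self._buffer)
            self._in_error_message = False

handler = ErrorHandler()
xml.sax.parse(io.BytesIO(log_xml), handler)
print(handler.errors)
```

Because the parser streams through the document, this approach would use roughly constant memory even if the log file contained millions of events.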

untangle

untangle is a lightweight XML parsing library for Python that simplifies the process of extracting data from XML documents. Unlike traditional XML parsers that require navigating through hierarchical structures, untangle lets you access XML elements and attributes directly as Python objects.

With untangle, you can convert XML documents into nested Python objects, where XML elements become object attributes and their XML attributes and text content are exposed as properties. This approach makes it easy to access and manipulate XML data using native Python syntax.

untangle is not part of Python’s standard library and needs to be installed from PyPI using the following pip command:

pip install untangle

The following example demonstrates how to parse the XML file using the untangle library and access the XML elements:

import untangle
import requests

url = "https://brightdata.com/post-sitemap.xml"

response = requests.get(url)

if response.status_code == 200:
    obj = untangle.parse(response.text)

    for url_element in obj.urlset.url:
        print(url_element.loc.cdata.strip())
else:
    print("Failed to retrieve XML file from the URL.")

Your output will look like this:

https://brightdata.com/case-studies/powerdrop-case-study
https://brightdata.com/case-studies/addressing-brand-protection-from-every-angle
https://brightdata.com/case-studies/taking-control-of-the-digital-shelf-with-public-online-data
https://brightdata.com/case-studies/the-seo-transformation
https://brightdata.com/case-studies/data-driven-automated-e-commerce-tools
https://brightdata.com/case-studies/highly-targeted-influencer-marketing
https://brightdata.com/case-studies/data-driven-products-for-smarter-shopping-solutions
https://brightdata.com/case-studies/workplace-diversity-facilitated-by-online-data
https://brightdata.com/case-studies/alternative-travel-solutions-enabled-by-online-data-railofy
https://brightdata.com/case-studies/data-intensive-analytical-solutions
https://brightdata.com/case-studies/canopy-advantage-solutions
https://brightdata.com/case-studies/seamless-digital-automations

untangle offers a user-friendly approach to working with XML data in Python. It simplifies the parsing process with clear syntax and automatically converts the XML structure into easy-to-use Python objects, eliminating the need for complex navigation techniques. However, keep in mind that untangle requires separate installation as it’s not part of the core Python package.

You should consider using untangle if you have a well-formed XML file and need to quickly convert it into Python objects for further processing. For example, if you have a program that downloads weather data in XML format, untangle could be a good fit to parse the XML and create Python objects representing the current temperature, humidity, and forecast. These objects could then be easily manipulated and displayed within your application.

Conclusion

In this article, you learned about the key concepts of XML files and several methods for parsing them in Python.

Whether you’re working with small configuration files, parsing large web service responses, or extracting data from extensive sitemaps, Python offers versatile libraries to automate and streamline your XML parsing tasks. However, when accessing files from the web using the requests library without proxy management, you may encounter quota exceptions and throttling issues. Bright Data is an award-winning proxy network that provides reliable and efficient proxy solutions to ensure seamless data retrieval and parsing. With Bright Data, you can tackle XML parsing tasks without worrying about limitations or disruptions. Contact our sales team to learn more.

Want to skip the whole scraping and parsing process? Try our dataset marketplace for free!