Extensible Markup Language (XML) is a widely used format for storing and exchanging structured data. XML files are commonly used to represent hierarchical data, such as configuration files, data interchange formats, web service responses, and web sitemaps.
Parsing XML files in Python is a common task, especially for automating manual processes like processing data retrieved from web APIs or web scraping.
In this article, you’ll learn about some of the libraries that you can use to parse XML in Python, including the ElementTree module, the lxml library, minidom, the Simple API for XML (SAX), and untangle.
Key Concepts of an XML File
Before you learn how to parse XML in Python, you must understand what XML Schema Definition (XSD) is and what elements make up an XML file. This understanding can help you select the appropriate Python library for your parsing task.
XSD is a schema specification that defines the structure, content, and data types allowed in an XML document. It serves as a syntax for validating the structure and content of XML files against a predefined set of rules.
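As an illustration, a minimal XSD that requires a root element named config with a single integer timeout child might look like this (the element names here are hypothetical, chosen purely to show the shape of a schema):

```xml
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- The document must have a single root element named "config" -->
  <xs:element name="config">
    <xs:complexType>
      <xs:sequence>
        <!-- "config" must contain one "timeout" element holding an integer -->
        <xs:element name="timeout" type="xs:integer"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

A validator would accept `<config><timeout>30</timeout></config>` against this schema but reject a document whose timeout value isn’t an integer.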
An XML file usually includes a namespace, a root, attributes, elements, and text content, which collectively represent structured data.
- Namespace allows elements and attributes in XML documents to be uniquely identified. Namespaces help avoid naming conflicts and enable interoperability between XML documents.
- root is the top-level element in an XML document. It serves as the starting point for navigating the XML structure and contains all other elements as its children.
- attributes provide additional information about an element. They’re specified within the start tag of an element and consist of name-value pairs.
- elements are the building blocks of an XML document and represent the data or structure being described. Elements can be nested within other elements to create a hierarchical structure.
- text content refers to the textual data enclosed within an element’s start and end tags. It can include plaintext, numbers, or other characters.
For example, the Bright Data sitemap has the following XML structure:
- urlset is the root element.
- <urlset xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd"> is the namespace declaration specific to the urlset element, implying that this declaration’s rules extend to the urlset element. All elements under it must conform to the schema outlined by this namespace.
- url is the first child of the root element.
- loc is the child element of the url element.
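The structure described above can be sketched as a minimal XML excerpt (illustrative only; the real sitemap carries additional namespace attributes and many more url entries, and the page URL below is a placeholder):

```xml
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://brightdata.com/example-page</loc>
  </url>
</urlset>
```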
Now that you know a little more about XSD and XML file elements, let’s use that information to help parse an XML file with a few libraries.
Various Ways to Parse XML in Python
For demonstration purposes, you’ll use the Bright Data sitemap for this tutorial, which is available in XML format. In the following examples, the Bright Data sitemap content is fetched using the Python requests library.
The Python requests library is not built-in, so you need to install it before proceeding. You can do so using the following command:
pip install requests
ElementTree
The ElementTree XML API provides a simple and intuitive API for parsing and creating XML data in Python. It’s a built-in module in Python’s standard library, which means you don’t need to install anything explicitly.
For example, you can use the findall() method to find all the url elements from the root and print the text value of the loc element, like this:
import xml.etree.ElementTree as ET
import requests

url = 'https://brightdata.com/post-sitemap.xml'
response = requests.get(url)

if response.status_code == 200:
    root = ET.fromstring(response.content)
    for url_element in root.findall('.//{http://www.sitemaps.org/schemas/sitemap/0.9}url'):
        loc_element = url_element.find('{http://www.sitemaps.org/schemas/sitemap/0.9}loc')
        if loc_element is not None:
            print(loc_element.text)
else:
    print("Failed to retrieve XML file from the URL.")
All the URLs in the sitemap are printed in the output:
https://brightdata.com/case-studies/powerdrop-case-study
https://brightdata.com/case-studies/addressing-brand-protection-from-every-angle
https://brightdata.com/case-studies/taking-control-of-the-digital-shelf-with-public-online-data
https://brightdata.com/case-studies/the-seo-transformation
https://brightdata.com/case-studies/data-driven-automated-e-commerce-tools
https://brightdata.com/case-studies/highly-targeted-influencer-marketing
https://brightdata.com/case-studies/data-driven-products-for-smarter-shopping-solutions
https://brightdata.com/case-studies/workplace-diversity-facilitated-by-online-data
https://brightdata.com/case-studies/alternative-travel-solutions-enabled-by-online-data-railofy
https://brightdata.com/case-studies/data-intensive-analytical-solutions
https://brightdata.com/case-studies/canopy-advantage-solutions
https://brightdata.com/case-studies/seamless-digital-automations
ElementTree is a user-friendly way to parse XML data in Python, featuring a straightforward API that makes it easy to navigate and manipulate XML structures. However, ElementTree does have its limitations; it lacks robust support for schema validation and is not ideal if you need to ensure strict adherence to a schema specification before parsing.
If you have a small script that reads an RSS feed, the user-friendly API of ElementTree would be a useful tool for extracting titles, descriptions, and links from each feed item. However, if you have a use case with complex validation or massive files, it would be better to consider another library like lxml.
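The RSS scenario above can be sketched with ElementTree alone. The feed below is a small, hypothetical example embedded inline so the snippet is self-contained; a real script would fetch the feed over HTTP first:

```python
import xml.etree.ElementTree as ET

# A tiny, hypothetical RSS 2.0 feed used purely for illustration
rss = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item>
      <title>First post</title>
      <link>https://example.com/first</link>
      <description>Intro article</description>
    </item>
    <item>
      <title>Second post</title>
      <link>https://example.com/second</link>
      <description>Follow-up article</description>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(rss)

# iter("item") walks every <item> element; findtext() returns the
# child's text content (or None if the child is missing)
items = []
for item in root.iter("item"):
    items.append(
        (item.findtext("title"), item.findtext("link"), item.findtext("description"))
    )

for title, link, description in items:
    print(f"{title}: {link} - {description}")
```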
lxml
lxml is a fast, easy-to-use, and feature-rich API for parsing XML files in Python; however, it’s not a prebuilt library in Python. While some Linux and Mac platforms have the lxml package already installed, other platforms need manual installation.
lxml is distributed via PyPI, and you can install it using the following pip command:
pip install lxml
Once installed, you can use lxml to parse XML files using various API methods, such as find(), findall(), findtext(), get(), and get_element_by_id().
For instance, you can use the findall() method to iterate over the url elements, find their loc elements (which are child elements of the url element), and then print the location text using the following code:
from lxml import etree
import requests

url = "https://brightdata.com/post-sitemap.xml"
response = requests.get(url)

if response.status_code == 200:
    root = etree.fromstring(response.content)
    for url_element in root.findall(".//{http://www.sitemaps.org/schemas/sitemap/0.9}url"):
        loc_element = url_element.find("{http://www.sitemaps.org/schemas/sitemap/0.9}loc")
        if loc_element is not None:
            print(loc_element.text.strip())
else:
    print("Failed to retrieve XML file from the URL.")
The output displays all the URLs found in the sitemap:
https://brightdata.com/case-studies/powerdrop-case-study
https://brightdata.com/case-studies/addressing-brand-protection-from-every-angle
https://brightdata.com/case-studies/taking-control-of-the-digital-shelf-with-public-online-data
https://brightdata.com/case-studies/the-seo-transformation
https://brightdata.com/case-studies/data-driven-automated-e-commerce-tools
https://brightdata.com/case-studies/highly-targeted-influencer-marketing
https://brightdata.com/case-studies/data-driven-products-for-smarter-shopping-solutions
https://brightdata.com/case-studies/workplace-diversity-facilitated-by-online-data
https://brightdata.com/case-studies/alternative-travel-solutions-enabled-by-online-data-railofy
https://brightdata.com/case-studies/data-intensive-analytical-solutions
https://brightdata.com/case-studies/canopy-advantage-solutions
https://brightdata.com/case-studies/seamless-digital-automations
So far, you’ve learned how to find elements and print their value. Now, let’s explore schema validation before parsing the XML. This process ensures that the file conforms to the specified structure defined by the schema.
The XSD for the sitemap looks like this:
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://www.sitemaps.org/schemas/sitemap/0.9"
           xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
           elementFormDefault="qualified"
           xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <xs:element name="urlset">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="url" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="url">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="loc" type="xs:anyURI"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
To use this schema for validation, copy it manually into a file named schema.xsd.
To validate the XML file using this XSD, use the following code:
from lxml import etree
import requests

url = "https://brightdata.com/post-sitemap.xml"
response = requests.get(url)

if response.status_code == 200:
    root = etree.fromstring(response.content)
    try:
        print("Schema Validation:")
        schema_doc = etree.parse("schema.xsd")
        schema = etree.XMLSchema(schema_doc)
        schema.assertValid(root)
        print("XML is valid according to the schema.")
    except etree.DocumentInvalid as e:
        print("XML validation error:", e)
else:
    print("Failed to retrieve XML file from the URL.")
Here, you parse the XSD file using the etree.parse() method. Then you create an XMLSchema object from the parsed XSD document. Finally, you validate the XML root document against the schema using the assertValid() method. If the schema validation passes, your output includes a message like XML is valid according to the schema. Otherwise, a DocumentInvalid exception is raised.
Your output should look like this:
Schema Validation:
XML is valid according to the schema.
Now, let’s read the XML file using the xpath() method, which finds elements by their path:
from lxml import etree
import requests

url = "https://brightdata.com/post-sitemap.xml"
response = requests.get(url)

if response.status_code == 200:
    print("XPath Support:")
    root = etree.fromstring(response.content)
    namespaces = {"ns": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    for loc in root.xpath(".//ns:url/ns:loc", namespaces=namespaces):
        print(loc.text.strip())
else:
    print("Failed to retrieve XML file from the URL.")
In this code, you register the namespace prefix ns and map it to the namespace URI http://www.sitemaps.org/schemas/sitemap/0.9. In the XPath expression, you use the ns prefix to specify elements in that namespace. Finally, the expression .//ns:url/ns:loc selects all loc elements that are children of url elements in the namespace.
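As a side note, the same prefix-to-URI mapping idea also works in the standard library’s ElementTree, whose find() and findall() methods accept a namespaces dictionary. Here is a sketch using a minimal inline sitemap (the page URLs are placeholders):

```python
import xml.etree.ElementTree as ET

# A minimal inline sitemap used only to illustrate the namespaces argument
xml_data = """<?xml version="1.0"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/page-1</loc></url>
  <url><loc>https://example.com/page-2</loc></url>
</urlset>"""

# The "ns" prefix in the path below is resolved through this mapping
ns = {"ns": "http://www.sitemaps.org/schemas/sitemap/0.9"}

root = ET.fromstring(xml_data)
locs = [loc.text for loc in root.findall(".//ns:url/ns:loc", ns)]
print(locs)
```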
Your output will look like this:
XPath Support:
https://brightdata.com/case-studies/powerdrop-case-study
https://brightdata.com/case-studies/addressing-brand-protection-from-every-angle
https://brightdata.com/case-studies/taking-control-of-the-digital-shelf-with-public-online-data
https://brightdata.com/case-studies/the-seo-transformation
https://brightdata.com/case-studies/data-driven-automated-e-commerce-tools
https://brightdata.com/case-studies/highly-targeted-influencer-marketing
https://brightdata.com/case-studies/data-driven-products-for-smarter-shopping-solutions
https://brightdata.com/case-studies/workplace-diversity-facilitated-by-online-data
https://brightdata.com/case-studies/alternative-travel-solutions-enabled-by-online-data-railofy
https://brightdata.com/case-studies/data-intensive-analytical-solutions
https://brightdata.com/case-studies/canopy-advantage-solutions
https://brightdata.com/case-studies/seamless-digital-automations
Note that the find() and findall() methods are generally faster than the xpath() method because xpath() collects all the results in memory before returning them. It’s recommended that you use find() unless there’s a specific reason to use XPath queries.
lxml offers powerful features for parsing and manipulating XML and HTML. It supports complex queries using XPath expressions, validates documents against schemas, and even allows for eXtensible Stylesheet Language Transformations (XSLT). This makes it ideal for scenarios where performance and advanced functionality are crucial. However, keep in mind that lxml requires a separate installation as it’s not part of the core Python package.
If you’re dealing with large or complex XML data that requires both high performance and advanced manipulation, you should consider using lxml. For instance, if you’re processing financial data feeds in XML format, you might need to use XPath expressions to extract specific elements like stock prices, validate the data against a financial schema to ensure accuracy, and potentially transform the data using XSLT for further analysis.
minidom
minidom is a lightweight and simple XML parsing library that’s included in Python’s standard library. While it’s not as feature-rich or efficient as lxml, it offers a straightforward way to parse and manipulate XML data in Python.
You can use the various methods available on the DOM object to access elements. For example, you can use the getElementsByTagName() method to retrieve elements using their tag name.
The following example demonstrates how to use the minidom library to parse an XML file and fetch the elements using their tag names:
import requests
import xml.dom.minidom

url = "https://brightdata.com/post-sitemap.xml"
response = requests.get(url)

if response.status_code == 200:
    dom = xml.dom.minidom.parseString(response.content)
    urlset = dom.getElementsByTagName("urlset")[0]
    for url_element in urlset.getElementsByTagName("url"):
        loc = url_element.getElementsByTagName("loc")[0].firstChild.nodeValue.strip()
        print(loc)
else:
    print("Failed to retrieve XML file from the URL.")
Your output would look like this:
https://brightdata.com/case-studies/powerdrop-case-study
https://brightdata.com/case-studies/addressing-brand-protection-from-every-angle
https://brightdata.com/case-studies/taking-control-of-the-digital-shelf-with-public-online-data
https://brightdata.com/case-studies/the-seo-transformation
https://brightdata.com/case-studies/data-driven-automated-e-commerce-tools
https://brightdata.com/case-studies/highly-targeted-influencer-marketing
https://brightdata.com/case-studies/data-driven-products-for-smarter-shopping-solutions
https://brightdata.com/case-studies/workplace-diversity-facilitated-by-online-data
https://brightdata.com/case-studies/alternative-travel-solutions-enabled-by-online-data-railofy
https://brightdata.com/case-studies/data-intensive-analytical-solutions
https://brightdata.com/case-studies/canopy-advantage-solutions
https://brightdata.com/case-studies/seamless-digital-automations
minidom works with XML data by representing it as a DOM tree. This tree structure makes it easy to navigate and manipulate data, and it’s best suited for basic tasks such as reading, changing, or building simple XML structures.
If your program involves reading default settings from an XML file, the DOM approach of minidom lets you easily access specific settings, such as a font-size node, by finding child nodes or attributes and utilizing their values within your application.
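That settings scenario might look like the sketch below. The settings file and its element names (including font-size) are hypothetical, embedded inline so the snippet is self-contained:

```python
import xml.dom.minidom

# A hypothetical application-settings file; element names are illustrative only
settings_xml = """<?xml version="1.0"?>
<settings>
  <appearance>
    <font-size>14</font-size>
    <theme>dark</theme>
  </appearance>
</settings>"""

dom = xml.dom.minidom.parseString(settings_xml)

# getElementsByTagName returns a node list; take the first match,
# then read the text node inside it via firstChild.nodeValue
font_size_node = dom.getElementsByTagName("font-size")[0]
font_size = int(font_size_node.firstChild.nodeValue)
print(font_size)
```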
SAX Parser
The SAX parser is an event-driven XML parsing approach in Python that processes XML documents sequentially and generates events as it encounters various parts of the document. Unlike DOM-based parsers that construct a tree structure representing the entire XML document in memory, SAX parsers do not build a complete representation of the document. Instead, they emit events such as start tags, end tags, and text content as they parse through the document.
SAX parsers are a good fit for processing large XML files or streams where memory efficiency is a concern, as they operate on XML data incrementally without loading the entire document into memory.
When using the SAX parser, you need to define event handlers that respond to specific XML events, such as the startElement and endElement events emitted by the parser. These event handlers can be customized to perform actions based on the structure and content of the XML document.
The following example demonstrates how to parse an XML file using the SAX parser by defining the startElement and endElement handlers and retrieving the URL information from the sitemap file:
import requests
import xml.sax.handler
from io import BytesIO

class MyContentHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        self.in_url = False
        self.in_loc = False
        self.url = ""

    def startElement(self, name, attrs):
        if name == "url":
            self.in_url = True
        elif name == "loc" and self.in_url:
            self.in_loc = True

    def characters(self, content):
        if self.in_loc:
            self.url += content

    def endElement(self, name):
        if name == "url":
            print(self.url.strip())
            self.url = ""
            self.in_url = False
        elif name == "loc":
            self.in_loc = False

url = "https://brightdata.com/post-sitemap.xml"
response = requests.get(url)

if response.status_code == 200:
    xml_content = BytesIO(response.content)
    content_handler = MyContentHandler()
    parser = xml.sax.make_parser()
    parser.setContentHandler(content_handler)
    parser.parse(xml_content)
else:
    print("Failed to retrieve XML file from the URL.")
Your output would look like this:
https://brightdata.com/case-studies/powerdrop-case-study
https://brightdata.com/case-studies/addressing-brand-protection-from-every-angle
https://brightdata.com/case-studies/taking-control-of-the-digital-shelf-with-public-online-data
https://brightdata.com/case-studies/the-seo-transformation
https://brightdata.com/case-studies/data-driven-automated-e-commerce-tools
https://brightdata.com/case-studies/highly-targeted-influencer-marketing
https://brightdata.com/case-studies/data-driven-products-for-smarter-shopping-solutions
https://brightdata.com/case-studies/workplace-diversity-facilitated-by-online-data
https://brightdata.com/case-studies/alternative-travel-solutions-enabled-by-online-data-railofy
https://brightdata.com/case-studies/data-intensive-analytical-solutions
https://brightdata.com/case-studies/canopy-advantage-solutions
https://brightdata.com/case-studies/seamless-digital-automations
Unlike other parsers that load the entire file into memory, SAX processes files incrementally, conserving memory and enhancing performance. However, SAX necessitates writing more code to manage each data segment dynamically. Additionally, it cannot revisit and analyze specific parts of the data later on.
If you need to scan a large XML file (e.g., a log file containing various events) to extract specific information (e.g., error messages), SAX can help you efficiently navigate through the file. However, if your analysis requires understanding the relationships between different data segments, SAX may not be the best choice.
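The log-scanning scenario might look like the following sketch. The XML log and its element names are hypothetical, embedded inline so the snippet is self-contained; in practice you would stream a large file from disk instead:

```python
import xml.sax

# A hypothetical XML event log; element and attribute names are illustrative
log_xml = """<?xml version="1.0"?>
<log>
  <event level="info">Service started</event>
  <event level="error">Disk quota exceeded</event>
  <event level="error">Connection timed out</event>
</log>"""

class ErrorHandler(xml.sax.ContentHandler):
    """Collects the text of every <event level="error"> element."""

    def __init__(self):
        super().__init__()
        self.in_error = False
        self.errors = []

    def startElement(self, name, attrs):
        # Only start accumulating text for error-level events
        if name == "event" and attrs.get("level") == "error":
            self.in_error = True
            self.errors.append("")

    def characters(self, content):
        if self.in_error:
            # characters() may fire multiple times per element, so append
            self.errors[-1] += content

    def endElement(self, name):
        if name == "event":
            self.in_error = False

handler = ErrorHandler()
xml.sax.parseString(log_xml.encode(), handler)
print(handler.errors)
```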
untangle
untangle is a lightweight XML parsing library for Python that simplifies the process of extracting data from XML documents. Unlike traditional XML parsers that require navigating through hierarchical structures, untangle lets you access XML elements and attributes directly as Python objects.
With untangle, you can convert XML documents into nested Python objects, where child elements are accessed as object attributes, and XML attributes and text content are available as corresponding values. This approach makes it easy to access and manipulate XML data using familiar Python syntax.
untangle is not available by default in Python and needs to be installed using the following pip command:
pip install untangle
The following example demonstrates how to parse the XML file using the untangle library and access the XML elements:
import untangle
import requests

url = "https://brightdata.com/post-sitemap.xml"
response = requests.get(url)

if response.status_code == 200:
    obj = untangle.parse(response.text)
    for url_element in obj.urlset.url:
        print(url_element.loc.cdata.strip())
else:
    print("Failed to retrieve XML file from the URL.")
Your output will look like this:
https://brightdata.com/case-studies/powerdrop-case-study
https://brightdata.com/case-studies/addressing-brand-protection-from-every-angle
https://brightdata.com/case-studies/taking-control-of-the-digital-shelf-with-public-online-data
https://brightdata.com/case-studies/the-seo-transformation
https://brightdata.com/case-studies/data-driven-automated-e-commerce-tools
https://brightdata.com/case-studies/highly-targeted-influencer-marketing
https://brightdata.com/case-studies/data-driven-products-for-smarter-shopping-solutions
https://brightdata.com/case-studies/workplace-diversity-facilitated-by-online-data
https://brightdata.com/case-studies/alternative-travel-solutions-enabled-by-online-data-railofy
https://brightdata.com/case-studies/data-intensive-analytical-solutions
https://brightdata.com/case-studies/canopy-advantage-solutions
https://brightdata.com/case-studies/seamless-digital-automations
untangle offers a user-friendly approach to working with XML data in Python. It simplifies the parsing process with clear syntax and automatically converts the XML structure into easy-to-use Python objects, eliminating the need for complex navigation techniques. However, keep in mind that untangle requires separate installation as it’s not part of the core Python package.
You should consider using untangle if you have a well-formed XML file and need to quickly convert it into Python objects for further processing. For example, if you have a program that downloads weather data in XML format, untangle could be a good fit to parse the XML and create Python objects representing the current temperature, humidity, and forecast. These objects could then be easily manipulated and displayed within your application.
Conclusion
In this article, you learned all about XML files and the various methods for parsing XML files in Python.
Whether you’re working with small configuration files, parsing large web service responses, or extracting data from extensive sitemaps, Python offers versatile libraries to automate and streamline your XML parsing tasks. However, when accessing files from the web using the requests library without proxy management, you may encounter quota exceptions and throttling issues. Bright Data is an award-winning proxy network that provides reliable and efficient proxy solutions to ensure seamless data retrieval and parsing. With Bright Data, you can tackle XML parsing tasks without worrying about limitations or disruptions. Contact our sales team to learn more.
Want to skip the whole scraping and parsing process? Try our dataset marketplace for free!
No credit card required