The lxml.etree Tutorial (2024)

Author:	Stefan Behnel

This tutorial briefly overviews the main concepts of the ElementTree API as implemented by lxml.etree, and some simple enhancements that make your life as a programmer easier.

Contents

The Element class
Elements are lists
Elements carry attributes
Elements contain text
Tree iteration

A common way to import lxml.etree is as follows:

>>> from lxml import etree

If your code only uses the ElementTree API and does not rely on any functionality that is specific to lxml.etree, you can also use the following import chain as a fall-back to the original ElementTree:

try:
    from lxml import etree
    print "running with lxml.etree"
except ImportError:
    try:
        # Python 2.5
        import xml.etree.cElementTree as etree
        print "running with cElementTree on Python 2.5+"
    except ImportError:
        try:
            # Python 2.5
            import xml.etree.ElementTree as etree
            print "running with ElementTree on Python 2.5+"
        except ImportError:
            try:
                # normal cElementTree install
                import cElementTree as etree
                print "running with cElementTree"
            except ImportError:
                try:
                    # normal ElementTree install
                    import elementtree.ElementTree as etree
                    print "running with ElementTree"
                except ImportError:
                    print "Failed to import ElementTree from any known place"
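The chain above targets Python 2. As a minimal sketch for Python 3, where the separate cElementTree and elementtree packages no longer apply, the same idea reduces to two steps. The `backend` variable is a name introduced here for illustration only.

```python
# Python 3 sketch: prefer lxml.etree, fall back to the standard library.
try:
    from lxml import etree
    backend = "lxml.etree"  # illustrative name, not part of either API
except ImportError:
    import xml.etree.ElementTree as etree
    backend = "xml.etree.ElementTree"

print("running with", backend)

# Code that sticks to the shared ElementTree API works with either backend.
root = etree.Element("root")
print(root.tag)
```

Only code that avoids the parts marked "lxml.etree only!" in this tutorial stays portable under this fall-back.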

To aid in writing portable code, this tutorial makes it clear in the examples which part of the presented API is an extension of lxml.etree over the original ElementTree API, as defined by Fredrik Lundh's ElementTree library.

The Element class

An Element is the main container object for the ElementTree API. Most of the XML tree functionality is accessed through this class. Elements are easily created through the Element factory:

>>> root = etree.Element("root")

The XML tag name of elements is accessed through the tag property:

>>> print root.tag
root

Elements are organised in an XML tree structure. To create child elements and add them to a parent element, you can use the append() method:

>>> root.append( etree.Element("child1") )

However, a much more efficient and more common way to do this is through the SubElement factory. It accepts the same arguments as the Element factory, but additionally requires the parent as first argument:

>>> child2 = etree.SubElement(root, "child2")
>>> child3 = etree.SubElement(root, "child3")

To see that this is really XML, you can serialise the tree you have created:

>>> print etree.tostring(root, pretty_print=True)
<root>
  <child1/>
  <child2/>
  <child3/>
</root>
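The session above is written for Python 2. The following is a minimal Python 3 sketch of the same build-and-serialise steps, restricted to calls that lxml.etree and the standard library's ElementTree share. On Python 3, tostring() returns bytes by default, so the sketch passes encoding="unicode" to get a str. The variable names `xml_text`, `reparsed` and `child_tags` are illustrative.

```python
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

# Build the same small tree: one append() and two SubElement() calls.
root = etree.Element("root")
root.append(etree.Element("child1"))
child2 = etree.SubElement(root, "child2")
child3 = etree.SubElement(root, "child3")

# encoding="unicode" asks for a str result instead of bytes.
xml_text = etree.tostring(root, encoding="unicode")
print(xml_text)

# Parsing the serialised text back yields the same child order.
reparsed = etree.fromstring(xml_text)
child_tags = [child.tag for child in reparsed]
print(child_tags)
```

The exact spelling of empty tags can differ between the two libraries, which is why the sketch checks the result by parsing it again rather than by comparing strings.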

Elements are lists

To make access to these subelements as easy and straightforward as possible, elements mimic the behaviour of normal Python lists as closely as possible:

>>> child = root[0]
>>> print child.tag
child1

>>> for child in root:
...     print child.tag
child1
child2
child3

>>> if root:
...     print "root has children!"
root has children!

>>> root.insert(0, etree.Element("child0"))
>>> start = root[:1]
>>> end = root[-1:]

>>> print start[0].tag
child0
>>> print end[0].tag
child3

>>> root[0] = root[-1]
>>> for child in root:
...     print child.tag
child3
child1
child2

Note how the last element was moved to a different position in the last example. This is a difference from the original ElementTree (and from lists), where elements can sit in multiple positions of any number of trees. In lxml.etree, elements can only sit in one position of one tree at a time.

If you want to copy an element to a different position, consider creating an independent deep copy using the copy module from Python's standard library:

>>> from copy import deepcopy
>>> element = etree.Element("neu")
>>> element.append( deepcopy(root[1]) )
>>> print element[0].tag
child1
>>> print [ c.tag for c in root ]
['child3', 'child1', 'child2']
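As a minimal Python 3 sketch of the list-like operations that behave the same in lxml.etree and the standard library's ElementTree, the block below uses indexing, slicing, len() and insert(). It deliberately leaves out the item assignment that moves an element, since that is where the two libraries differ. It also uses len(root) instead of `if root:`, because recent releases of both libraries warn about truth-testing an element.

```python
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

root = etree.Element("root")
for name in ("child1", "child2", "child3"):
    etree.SubElement(root, name)

count = len(root)            # number of direct children
first_tag = root[0].tag      # indexing from the front
last_tag = root[-1].tag      # negative indices count from the end
tail_tags = [child.tag for child in root[1:]]  # a slice is a plain list

root.insert(0, etree.Element("child0"))        # insert at a position
all_tags = [child.tag for child in root]

if len(root) > 0:
    print("root has", len(root), "children:", all_tags)
```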
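To illustrate what "independent" means here, the following Python 3 sketch copies an element, changes the copy, and checks that the original is untouched. The "state" attribute and the variable names are made up for this example.

```python
from copy import deepcopy

try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

root = etree.Element("root")
child = etree.SubElement(root, "child1")
child.set("state", "original")   # hypothetical attribute for the demo

# Append a deep copy to another tree, then modify only the copy.
container = etree.Element("neu")
container.append(deepcopy(child))
container[0].set("state", "copied")

original_state = root[0].get("state")
copied_state = container[0].get("state")
root_tags = [c.tag for c in root]
print(original_state, copied_state, root_tags)
```

Because the copy is a separate element, appending it does not move anything out of the original tree.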

To retrieve a 'real' Python list of all children (or a shallow copy of the element children list), you can call the getchildren() method:

>>> children = root.getchildren()
>>> print type(children) is type([])
True
>>> for child in children:
...     print child.tag
child3
child1
child2

The way up in the tree is provided through the getparent() method:

>>> root is root[0].getparent() # lxml.etree only!
True

The siblings (or neighbours) of an element are accessed as next and previous elements:

>>> root[0] is root[1].getprevious() # lxml.etree only!
True
>>> root[1] is root[0].getnext() # lxml.etree only!
True
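Since getparent(), getprevious() and getnext() are lxml.etree extensions, portable code needs another route. One common workaround, sketched here for Python 3 and not taken from lxml itself, is to build a child-to-parent mapping once and look siblings up through it. The names `parent_of` and `previous_sibling` are hypothetical helpers for this example.

```python
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

root = etree.Element("root")
for name in ("child1", "child2", "child3"):
    etree.SubElement(root, name)

# Map every element to its parent by walking the tree once.
# The mapping goes stale if the tree is modified afterwards.
parent_of = {child: parent for parent in root.iter() for child in parent}

def previous_sibling(element):
    """Return the sibling before `element`, or None if there is none."""
    parent = parent_of.get(element)
    if parent is None:          # the root element has no parent
        return None
    siblings = list(parent)
    index = siblings.index(element)
    return siblings[index - 1] if index > 0 else None

print(parent_of[root[0]].tag)
print(previous_sibling(root[1]).tag)
```

With lxml.etree installed, the direct methods are simpler and do not need rebuilding after the tree changes.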

Elements carry attributes

XML elements support attributes. You can create them directly in the Element factory:

>>> root = etree.Element("root", interesting="totally")
>>> print etree.tostring(root)
<root interesting="totally"/>

Fast and direct access to these attributes is provided by the set() and get() methods of elements:

>>> print root.get("interesting")
totally
>>> root.set("interesting", "somewhat")
>>> print root.get("interesting")
somewhat

However, a very convenient way of dealing with them is through the dictionary interface of the attrib property:

>>> attributes = root.attrib
>>> print attributes["interesting"]
somewhat
>>> print attributes.get("hello")
None
>>> attributes["hello"] = "Guten Tag"
>>> print attributes.get("hello")
Guten Tag
>>> print root.get("hello")
Guten Tag
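A few more dictionary-style operations are sketched below for Python 3, using only calls shared by lxml.etree and the standard library's ElementTree. The attribute name "missing" and the default value "n/a" are placeholders.

```python
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

root = etree.Element("root", interesting="totally")
root.set("hello", "Guten Tag")

missing = root.get("missing")           # None when the attribute is absent
fallback = root.get("missing", "n/a")   # or a default of your choice
pairs = sorted(root.attrib.items())     # (name, value) tuples
has_hello = "hello" in root.attrib      # membership test on attrib

del root.attrib["interesting"]          # remove an attribute
remaining = sorted(root.keys())         # attribute names left on the element
print(pairs, remaining)
```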

Elements contain text

Elements can contain text:

>>> root = etree.Element("root")
>>> root.text = "TEXT"
>>> print root.text
TEXT
>>> print etree.tostring(root)
<root>TEXT</root>

In many XML documents (so-called data-centric documents), this is the only place where text can be found. It is encapsulated by a leaf tag at the very bottom of the tree hierarchy.

However, if XML is used for tagged text documents such as (X)HTML, text can also appear between different elements, right in the middle of the tree:

<html><body>Hello<br/>World</body></html>

Here, the <br/> tag is surrounded by text. This is often referred to as document-style XML. Elements support this through their tail property. It contains the text that directly follows the element, up to the next element in the XML tree:

>>> html = etree.Element("html")
>>> body = etree.SubElement(html, "body")
>>> body.text = "TEXT"
>>> print etree.tostring(html)
<html><body>TEXT</body></html>
>>> br = etree.SubElement(body, "br")
>>> print etree.tostring(html)
<html><body>TEXT<br/></body></html>
>>> br.tail = "TAIL"
>>> print etree.tostring(html)
<html><body>TEXT<br/>TAIL</body></html>
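The same split between text and tail shows up when a document is parsed rather than built by hand. This Python 3 sketch parses the "Hello ... World" snippet from above and inspects where each piece of text ends up. The variable names are illustrative.

```python
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

html = etree.fromstring("<html><body>Hello<br/>World</body></html>")
body = html[0]
br = body[0]

body_text = body.text   # text before the first child of <body>
br_text = br.text       # <br/> is empty, so it has no text of its own
br_tail = br.tail       # text that follows <br/> inside <body>
body_tail = body.tail   # nothing follows </body>

print(repr(body_text), repr(br_text), repr(br_tail), repr(body_tail))
```

Note that "World" belongs to the br element's tail, not to the body element's text.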

These two properties are enough to represent any text content in an XML document. If you want to read the text without the intermediate tags, however, you have to recursively concatenate all text and tail attributes in the correct order. A simpler way to do this is XPath:

>>> print html.xpath("string()") # lxml.etree only!
TEXTTAIL
>>> print html.xpath("//text()") # lxml.etree only!
['TEXT', 'TAIL']

If you want to use this more often, you can wrap it in a function:

>>> buildTextList = etree.XPath("//text()") # lxml.etree only!
>>> print buildTextList(html)
['TEXT', 'TAIL']
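For code that cannot rely on XPath, the recursive concatenation of text and tail mentioned earlier can be written by hand. The following is a minimal Python 3 sketch; `collect_text` is a hypothetical helper name, and the function assumes a tree made of ordinary elements only (no comments or processing instructions).

```python
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

def collect_text(element):
    """Return the text pieces below `element` in document order."""
    parts = []
    if element.text:
        parts.append(element.text)          # text before the first child
    for child in element:
        parts.extend(collect_text(child))   # everything inside the child
        if child.tail:
            parts.append(child.tail)        # text after the child's end tag
    return parts

html = etree.fromstring("<html><body>TEXT<br/>TAIL</body></html>")
text_parts = collect_text(html)
joined = "".join(text_parts)
print(text_parts, joined)
```

Both lxml.etree and the standard library's ElementTree also offer an itertext() method on elements, which yields the same pieces without a hand-written recursion.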

Tree iteration

For problems like the above, where you want to recursively traverse the tree and do something with its elements, tree iteration is a very convenient solution. Elements provide a tree iterator for this purpose. It yields elements in document order, i.e. in the order their tags would appear if you serialised the tree to XML:

>>> root = etree.Element("root")
>>> etree.SubElement(root, "child").text = "Child 1"
>>> etree.SubElement(root, "child").text = "Child 2"
>>> etree.SubElement(root, "another").text = "Child 3"
>>> print etree.tostring(root, pretty_print=True)
<root>
  <child>Child 1</child>
  <child>Child 2</child>
  <another>Child 3</another>
</root>
>>> for element in root.getiterator():
...     print element.tag, '-', element.text
root - None
child - Child 1
child - Child 2
another - Child 3

If you know you are only interested in a single tag, you can pass its name to getiterator() to have it filter for you:

>>> for element in root.getiterator("child"):
...     print element.tag, '-', element.text
child - Child 1
child - Child 2

In lxml.etree, elements provide further iterators for all directions in the tree: children, parents (or rather ancestors) and siblings.
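In newer releases of both lxml.etree and the standard library's ElementTree, the tree iterator is available under the name iter(), and getiterator() has been removed from the standard library in recent Python versions. The following Python 3 sketch repeats the two loops above with iter(). The list names are illustrative.

```python
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

root = etree.Element("root")
etree.SubElement(root, "child").text = "Child 1"
etree.SubElement(root, "child").text = "Child 2"
etree.SubElement(root, "another").text = "Child 3"

# Unfiltered: the element itself first, then all descendants in document order.
all_items = [(element.tag, element.text) for element in root.iter()]

# Filtered: only elements with the given tag name.
only_children = [element.text for element in root.iter("child")]

for tag, text in all_items:
    print(tag, "-", text)
print(only_children)
```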