cssselect parses CSS3 Selectors and translates them to XPath 1.0 expressions. Such expressions can be used in lxml or another XPath engine to find the matching elements in an XML or HTML document.
This module used to live inside of lxml as lxml.cssselect before it was extracted as a stand-alone project.
Quick facts:
Use HTMLTranslator for HTML documents, GenericTranslator for “generic” XML documents. (The former has a more useful translation for some selectors, based on HTML-specific element types or attributes.)
>>> from cssselect import GenericTranslator, SelectorError
>>> try:
...     expression = GenericTranslator().css_to_xpath('div.content')
... except SelectorError:
...     print('Invalid selector.')
...
>>> print(expression)
descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' content ')]
The resulting expression can be used with lxml’s XPath engine:
>>> from lxml.etree import fromstring
>>> document = fromstring('''
... <div id="outer">
... <div id="inner" class="content body">text</div>
... </div>
... ''')
>>> [e.get('id') for e in document.xpath(expression)]
['inner']
In CSS3 Selectors terms, the top-level object is a group of selectors, a sequence of comma-separated selectors. For example, div, h1.title + p is a group of two selectors.
Parse a CSS group of selectors.
If you don’t care about pseudo-elements or selector specificity, you can skip this and use css_to_xpath().
Parameters: | css – A group of selectors as a Unicode string. |
---|---|
Raises: | SelectorSyntaxError on invalid selectors. |
Returns: | A list of parsed Selector objects, one for each selector in the comma-separated group. |
Represents a parsed selector.
selector_to_xpath() accepts this object, but ignores pseudo_element. It is the user’s responsibility to account for pseudo-elements and reject selectors with unknown or unsupported pseudo-elements.
The identifier for the pseudo-element as a string, or None.
Syntax | Selector | Pseudo-element |
---|---|---|
CSS3 syntax | a::before | 'before' |
Older syntax | a:before | 'before' |
From the Lists3 draft, not in Selectors3 | li::marker | 'marker' |
Invalid pseudo-class | li:marker | None |
Return the specificity of this selector as a tuple of 3 integers.
Translator for “generic” XML documents.
Everything is case-sensitive, and no assumption is made about the meaning of element names and attribute names.
Translate a group of selectors to XPath.
Pseudo-elements are not supported here since XPath only knows about “real” elements.
Parameters: | css – A group of selectors as a Unicode string. |
---|---|
Raises: | SelectorSyntaxError on invalid selectors, ExpressionError on unknown/unsupported selectors, including pseudo-elements. |
Returns: | The equivalent XPath 1.0 expression as a Unicode string. |
Translate a parsed selector to XPath.
The pseudo_element attribute of the selector is ignored. It is the caller’s responsibility to reject selectors with pseudo-elements, or to account for them somehow.
Parameters: | selector – A parsed Selector object. |
---|---|
Raises: | ExpressionError on unknown/unsupported selectors. |
Returns: | The equivalent XPath 1.0 expression as a Unicode string. |
Translator for (X)HTML documents.
Has a more useful implementation of some pseudo-classes based on HTML-specific element names and attribute names, as described in the HTML5 specification. It assumes no-quirks mode. The API is the same as GenericTranslator.
Parameters: | xhtml – If false (the default), element names and attribute names are case-insensitive. |
---|
Common parent for SelectorSyntaxError and ExpressionError.
You can just use except SelectorError: when calling css_to_xpath() and handle both exception types.
Parsing a selector that does not match the grammar.
Unknown or unsupported selector (e.g. pseudo-class).
This library implements CSS3 selectors as described in the W3C specification. In this context, however, there is no interactivity or history of visited links. Therefore, these pseudo-classes are accepted but never match anything:
Additionally, these depend on document knowledge and only have a useful implementation in HTMLTranslator. In GenericTranslator, they never match:
These applicable pseudo-classes are not yet implemented:
On the other hand, cssselect supports some selectors that are not in the Level 3 specification:
Just like HTMLTranslator is a subclass of GenericTranslator, you can make new sub-classes of either of them and override some methods. This enables you, for example, to customize how some pseudo-class is implemented without forking or monkey-patching cssselect.
The “customization API” is the set of methods in translation classes and their signature. You can look at the source code to see how it works. However, be aware that this API is not very stable yet. It might change and break your sub-class.
In CSS you can use namespace-prefix|element, similar to namespace-prefix:element in an XPath expression. In fact, it maps one-to-one. How prefixes are mapped to namespace URIs depends on the XPath implementation.
Released on 2012-06-14. Code name remember-to-test-with-tox.
Version 0.7 broke the parser in Python 2.4 and 2.5, and the tests in Python 2.x. Both are fixed now.
Also, pseudo-elements are now correctly made lower-case. (They are supposed to be case-insensitive.)
Released on 2012-06-14.
Bug fix release: see #2, #7 and #10 on GitHub.
Released on 2012-04-25.
Make sure that internal token objects do not “leak” into the public API and that Selector.pseudo_element is a unicode string.
Released on 2012-04-24.
Released on 2012-04-20.
Released on 2012-04-18.
Released on 2012-04-17.
Discussion is open if anyone is interested in implementing e.g. :target or :visited differently, but they can always do it in a Translator subclass.
Released on 2012-04-16.
These changes allow cssselect to be used without lxml. (Hey, this was the whole point of this project.) The tests still require lxml, though. The removed parts are expected to stay in lxml for backward-compatibility.
:contains() only existed in an early draft of the Selectors specification, and was removed before Level 3 stabilized. Internally, it used a custom XPath extension function which can be difficult to express outside of lxml.
Subclasses of Translator can be made to change the way that some selector (e.g. a pseudo-class) is implemented.
Released on 2012-04-13.
Extract lxml.cssselect from the rest of lxml and make it a stand-alone project.
Commit ea53ceaf7e44ba4fbb5c818ae31370932f47774e was taken on 2012-04-11 from the ‘master’ branch of lxml’s git repository. This is somewhere between versions 2.3.4 and 2.4.
The commit history has been rewritten to:
This project has its own import name, tests and documentation. But the code itself is unchanged and still depends on lxml.
Search for cssselect in lxml’s changelog