python - How can I parse HTML with html5lib, and query the parsed HTML with XPath?

Tags : python, parsing, xpath, lxml, html5lib





Top 5 Answers for python - How can I parse HTML with html5lib, and query the parsed HTML with XPath?

Votes: 90

Lack of documentation is a good reason to avoid a library IMO, no matter how cool it is. Are you wedded to using html5lib? Have you looked at lxml.html?

Here is a way to do this with lxml:

    from lxml import html
    tree = html.fromstring(text)
    [td.text for td in tree.xpath("//td")]

Result:

['Header', 'Want This'] 
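
For completeness, here is a self-contained sketch of the same approach. It assumes the sample table used elsewhere on this page as the input `text`, and it adds a second, purely illustrative XPath expression that picks out only the cell in the second row.

```python
from lxml import html

# Sample markup, assumed to match the table used in the other answers.
text = """
<html>
    <table>
        <tr><td>Header</td></tr>
        <tr><td>Want This</td></tr>
    </table>
</html>
"""

tree = html.fromstring(text)

# Text of every <td> in the document.
all_cells = [td.text for td in tree.xpath("//td")]

# Text of the <td> inside the second <tr> of the table.
second_row = tree.xpath("//table//tr[2]/td/text()")

print(all_cells)
print(second_row)
```

Note that lxml's own HTML parser does not insert a `<tbody>` element, so `tr[2]` is counted among the direct children of `<table>` here.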

Votes: 85

What you want to use is the namespaceHTMLElements argument, which for some reason defaults to True. With the default, html5lib puts every element in the XHTML namespace, so an unprefixed XPath such as //td matches nothing.

    import html5lib
    import lxml.html

    doc = html5lib.parse('''<html>
        <table>
            <tr><td>Header</td></tr>
            <tr><td>Want This</td></tr>
        </table>
    </html>
    ''', treebuilder='lxml', namespaceHTMLElements=False)

    print(lxml.html.tostring(doc))

It's probably still easier to use lxml.html, however.
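
If you would rather keep the namespaced tree, the alternative is to qualify every element name in the query. The sketch below illustrates the effect with the standard library's `xml.etree.ElementTree` as a stand-in (it does not call html5lib); the XHTML-namespaced string is an assumed approximation of what a namespaced tree looks like, and the prefix `h` is an arbitrary choice.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Stand-in for a tree whose elements all live in the XHTML namespace.
markup = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body><table>'
    "<tr><td>Header</td></tr>"
    "<tr><td>Want This</td></tr>"
    "</table></body></html>"
)
root = ET.fromstring(markup)

# An unqualified name does not match namespaced elements.
unqualified = root.findall(".//td")

# Mapping a prefix to the XHTML namespace makes the query match.
qualified = root.findall(".//h:td", {"h": XHTML_NS})
cell_texts = [td.text for td in qualified]

print(unqualified)
print(cell_texts)
```

With lxml the same idea applies: pass a `namespaces` mapping to `xpath()` and write the query as `//h:td`.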

Votes: 71

I always recommend trying the lxml library. It's blazingly fast and has many features.

It also supports the html5lib parser if you need that: lxml.html.html5parser

    >>> from lxml.html import fromstring, tostring
    >>> html = """
    ... <html>
    ...     <table>
    ...         <tr><td>Header</td></tr>
    ...         <tr><td>Want This</td></tr>
    ...     </table>
    ... </html>
    ... """
    >>> doc = fromstring(html)
    >>> tr = doc.cssselect('table tr')[1]
    >>> print(tostring(tr, encoding='unicode').strip())
    <tr><td>Want This</td></tr>

Votes: 63

I believe you can run CSS selector queries on lxml objects, like so:

    elements = root.cssselect('div.content')
    data = elements[0].text
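
Since the question asks about XPath, here is a rough XPath counterpart of the `div.content` selector. This is only a sketch: the sample markup is hypothetical, and the expression is the common idiom for matching one class name within a space-separated `class` attribute.

```python
from lxml import html

# Hypothetical markup, used only to exercise the query.
root = html.fromstring(
    '<div>'
    '<div class="content main">Hello</div>'
    '<div class="other">Bye</div>'
    '</div>'
)

# Match <div> elements whose class list contains the token "content".
xpath_expr = (
    '//div[contains(concat(" ", normalize-space(@class), " "), " content ")]'
)
elements = root.xpath(xpath_expr)
data = elements[0].text

print(data)
```

Padding the attribute value with spaces keeps the test from matching class names that merely contain the substring, such as `contents`.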

Votes: 51

With BeautifulSoup, you can do that with:

    >>> soup = BeautifulSoup.BeautifulSoup('<html><table><tr><td>Header</td></tr><tr><td>Want This</td></tr></table></html>')
    >>> soup.findAll('td')[1].string
    u'Want This'
    >>> soup.findAll('tr')[1].td.string
    u'Want This'

(Obviously that's a really crude example, but ya.)
