python - Beautiful Soup and extracting a div and its contents by ID


Tags : python, beautifulsoup



96

You should post your example document, because the code works fine:

>>> import BeautifulSoup
>>> soup = BeautifulSoup.BeautifulSoup('<html><body><div id="articlebody"> ... </div></body></html>')
>>> soup.find("div", {"id": "articlebody"})
<div id="articlebody"> ... </div>

Finding <div>s inside <div>s works as well:

>>> soup = BeautifulSoup.BeautifulSoup('<html><body><div><div id="articlebody"> ... </div></div></body></html>')
>>> soup.find("div", {"id": "articlebody"})
<div id="articlebody"> ... </div>
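Once the div has been found, its contents can be pulled out in a few ways. The following is a minimal sketch that assumes the newer bs4 package rather than the BeautifulSoup 3 import used above; the sample markup and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document: the target div is nested inside another div.
html = (
    '<html><body><div class="wrapper">'
    '<div id="articlebody"><p>First paragraph.</p><p>Second paragraph.</p></div>'
    '</div></body></html>'
)

soup = BeautifulSoup(html, "html.parser")
article = soup.find("div", {"id": "articlebody"})

outer_html = str(article)                  # the div itself plus everything inside it
inner_html = article.decode_contents()     # only the markup inside the div
text = article.get_text(" ", strip=True)   # only the text, joined with spaces

print(outer_html)
print(inner_html)
print(text)
```

Which of the three you want depends on whether you need the wrapping tag, the inner markup, or plain text.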

87

To find an element by its id:

div = soup.find(id="articlebody") 
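A short, self-contained sketch of the same call, assuming the bs4 package and a made-up sample document. It also shows that .find() returns None when nothing matches, which is worth checking before using the result.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document for illustration.
html = '<html><body><div id="articlebody">Hello</div></body></html>'
soup = BeautifulSoup(html, "html.parser")

div = soup.find(id="articlebody")       # matches any tag with this id
missing = soup.find(id="no-such-id")    # no match, so this is None

if div is not None:
    print(div.get_text())
print(missing)
```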

80

Beautiful Soup 4 supports most CSS selectors through the .select() method, so you can use an id selector such as:

soup.select('#articlebody') 

If you need to specify the element's type, you can add a type selector before the id selector:

soup.select('div#articlebody') 

The .select() method returns a list of matching elements, so it gives the same results as the following .find_all() example:

soup.find_all('div', id="articlebody")
# or
soup.find_all(id="articlebody")

If you only want to select a single element, then you could just use the .find() method:

soup.find('div', id="articlebody")
# or
soup.find(id="articlebody")
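To compare the calls side by side, here is a small sketch assuming bs4 and an invented sample document. It also uses .select_one(), the single-element counterpart of .select() in Beautiful Soup 4, which is not mentioned above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document: one div and one span, each with an id.
html = """
<html><body>
  <div id="articlebody">Article</div>
  <span id="sidebar">Sidebar</span>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

selected = soup.select("div#articlebody")            # list of matches
found_all = soup.find_all("div", id="articlebody")   # list of matches
single = soup.find("div", id="articlebody")          # first match or None
single_css = soup.select_one("div#articlebody")      # first match or None

# The type selector matters: the element with id "sidebar" is a span, not a div.
no_match = soup.select("div#sidebar")

print(len(selected), len(found_all), len(no_match))
print(single.get_text(), single_css.get_text())
```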

70

I think there is a problem when the div tags are too deeply nested. I am trying to parse some contacts from a Facebook HTML file, and Beautiful Soup is not able to find div tags with class "fcontent".

This happens with other classes as well. When I search for divs in general, it returns only those that are not deeply nested.

The HTML source can be any Facebook page showing the friends list of a friend of yours (not your own friends list). If someone can test it and give some advice, I would really appreciate it.

This is my code, which just tries to print the number of div tags with class "fcontent":

from BeautifulSoup import BeautifulSoup

f = open('/Users/myUserName/Desktop/contacts.html')
soup = BeautifulSoup(f)
# Use a name other than "list" so the built-in is not shadowed.
divs = soup.findAll('div', attrs={'class': 'fcontent'})
print(len(divs))

50

Most probably this is because the default Beautiful Soup parser has a problem with that markup. Switch to a different parser, such as 'lxml', and try again.
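One way to check whether the parser is the cause is to run the same search with each parser that is installed and compare the counts. This is only a sketch, assuming bs4: the helper name count_divs_by_class and the sample markup are hypothetical, and lxml and html5lib are optional packages that may not be present, so missing ones are skipped.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_divs_by_class(html, css_class, parsers=("lxml", "html5lib", "html.parser")):
    """Return {parser_name: number of matching divs} for each installed parser."""
    counts = {}
    for parser in parsers:
        try:
            soup = BeautifulSoup(html, parser)
        except FeatureNotFound:
            # The requested parser library is not installed; skip it.
            continue
        counts[parser] = len(soup.find_all("div", class_=css_class))
    return counts


# Hypothetical sample markup with one nested and one shallow "fcontent" div.
html = (
    '<html><body><div><div>'
    '<div class="fcontent">a</div></div>'
    '<div class="fcontent">b</div>'
    '</div></body></html>'
)

counts = count_divs_by_class(html, "fcontent")
for name, count in counts.items():
    print(name, count)
```

If the counts differ between parsers on your real file, the markup is probably malformed and the parsers are repairing it differently.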
