I was trying to pull out a big description block for an item in a recent scraping project. Of course, this contains all kinds of weird and wonderful HTML formatting as it is probably built in a WYSIWYG editor.
I found that Scrapy doesn’t have a good way to handle even the simpler cases. Take this HTML for example:
<div id="complex-text">
  <p>This div contains <i>complex</i> text</p>
  <ul>
    <li>List item 1</li>
    <li>List item 2</li>
  </ul>
  <blockquote>Including quotes</blockquote>
</div>
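Try to get the text of this using Scrapy’s ::text pseudo-selector, like this:
response.css('#complex-text::text').get()
and all you will get is whitespace, which is an empty string once stripped. Why?
Because ::text only selects the text nodes that are direct children of the matched element, and here those are just the whitespace between the child tags. That is not what a browser gives you through the innerText property. I think that in most cases, except for really simple ones, we need to get the full innerText, styling ignored.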
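The difference between an element's own text nodes and the text of everything inside it can be reproduced with the standard library alone. The sketch below uses xml.etree.ElementTree as a stand-in for Scrapy's selectors (an assumption, made so that it runs without Scrapy installed); direct_text and descendant_text are hypothetical helper names, not part of any library.

```python
import xml.etree.ElementTree as ET

HTML = (
    '<div id="complex-text"> <p>This div contains <i>complex</i> text</p> '
    '<ul> <li>List item 1</li> <li>List item 2</li> </ul> '
    '<blockquote>Including quotes</blockquote> </div>'
)


def direct_text(element):
    """Text nodes that are direct children of the element.

    This is roughly what '#complex-text::text' selects in Scrapy.
    """
    parts = [element.text or ""]
    # In ElementTree, text that follows a child tag is stored as that child's tail
    parts.extend(child.tail or "" for child in element)
    return "".join(parts)


def descendant_text(element):
    """Text of the element and all of its descendants, in document order."""
    return "".join(element.itertext())


root = ET.fromstring(HTML)
print(repr(direct_text(root).strip()))  # ''
print(descendant_text(root).strip())
```

The div itself only owns the whitespace between its children, so the first call comes back empty after stripping, while the second one reaches the paragraph, the list items and the quote.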
Join the elements
def innertext_quick(elements, delimiter=""):
    # One joined string per element, built from all of its descendant text nodes
    return [delimiter.join(text.strip() for text in element.css('*::text').getall())
            for element in elements]
This naive solution, however, has several problems. It will simply put all elements together. So imagine you have some list items:
<div id="complex-text">
  <ul>
    <li>List item 1</li>
    <li>List item 2</li>
  </ul>
</div>
The text of the two items ends up run together, without any space in between:
List item 1List item 2
Of course, you can pass a space as the delimiter, but that causes the opposite problem when you have inline tags such as <span> inside text where you don’t want the spaces added.
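To make the trade-off concrete, here is a small sketch of both failure modes. It again relies on xml.etree.ElementTree instead of Scrapy (an assumption, so the snippet runs with the standard library only); join_text is a hypothetical helper that mirrors the strip-and-join logic of innertext_quick, and the two HTML snippets are made up for illustration.

```python
import xml.etree.ElementTree as ET


def join_text(markup, delimiter=""):
    """Hypothetical stand-in for innertext_quick: strip every text node, then join."""
    root = ET.fromstring(markup)
    return delimiter.join(text.strip() for text in root.itertext())


LIST_HTML = '<div> <ul> <li>List item 1</li> <li>List item 2</li> </ul> </div>'
SPAN_HTML = '<p>un<span>believ</span>able</p>'

# An empty delimiter glues the list items together
print(join_text(LIST_HTML))       # List item 1List item 2

# A space delimiter fixes that, but tears apart words wrapped in inline tags
print(join_text(SPAN_HTML, " "))  # un believ able
print(join_text(SPAN_HTML))       # unbelievable
```

No single delimiter works for both inputs, because the join has no notion of which tags are block-level and which are inline.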
Better yet, use BeautifulSoup. It treats each HTML element as you would expect and concatenates the text into a single string. You can even control the stripping of elements you don’t want.
from bs4 import BeautifulSoup

def innertext(selector):
    html = selector.get()
    soup = BeautifulSoup(html, 'html.parser')
    return soup.get_text().strip()
Show me the code
If you’re curious to see other ways to do this, check out my full article on the topic over at Medium. I have also prepared a full repository showing the two solutions.