I was trying to pull a big description block out of a page in a recent scraping project. Of course, it contained all kinds of weird and wonderful HTML formatting, as it was probably built in a WYSIWYG editor.

I found that Scrapy doesn’t have a good way to handle even the simpler cases. Take this HTML, for example:

<div id="complex-text">
    <p>This div contains <i>complex</i> text</p>
    <ul>
        <li>List item 1</li>
        <li>List item 2</li>
    </ul>
    <blockquote>Including quotes</blockquote>
</div>

Try to get the text of this using Scrapy’s ::text pseudo-selector, like response.css('#complex-text::text').get(), and all you will get back is whitespace. Why?

The ::text pseudo-selector only returns the text nodes sitting directly inside the element you select, not the full innerText we would expect from the JavaScript innerText property. But I think that in most cases, except for really simple ones, we need the full innerText, styling ignored.

Join the elements

def innertext_quick(elements, delimiter=""):
    return [
        delimiter.join(text.strip() for text in element.css('*::text').getall())
        for element in elements
    ]

This naive solution, however, has several problems. It simply concatenates all the text nodes together. So imagine you have some list items:

<div id="complex-text">
  <ul>
    <li>List item 1</li>
    <li>List item 2</li>
  </ul>
</div>

This will end up non-delimited as List item 1List item 2, without any spaces.

Of course, you can pass a delimiter to add spacing in between, but that instead causes issues when you have <span> tags inside running text, where you don’t want spaces added mid-word.

Use bs4

Better yet, use BeautifulSoup. It treats each HTML element as you would expect and concatenates the text into a single string. You can even control the separator and the stripping of whitespace you don’t want.

from bs4 import BeautifulSoup

def innertext(selector):
    # serialise the selected subtree back to HTML, then let bs4 flatten it
    html = selector.get()
    soup = BeautifulSoup(html, 'html.parser')
    return soup.get_text().strip()

Show me the code

If you’re curious to see other alternatives, check out my full article over at Medium. I have also prepared a full repository showing the two solutions:

https://github.com/ddikman/scrapy-innertext

Happy scraping!