Scrapy is great, debugging Scrapy less so
Are you adding print statements and rerunning your scraper time and time again to get that one selector right? Do you have Chrome open in the background, using jQuery to test those selectors live on the website you are trying to scrape?
I know the feeling and I’m happy to tell you there is a better way! This saved me hours.
Python has an excellent command-line interpreter that I often use to check simple syntax or how things work. Run python in your shell and you are in. I use it to test small code snippets, and since it automatically evaluates and displays objects without printing, it's great for quickly checking results.
Disclaimer: if you are building anything larger than a small script, I highly recommend having a test suite. By running it on every file update, you get validation of your functionality and syntax without having to copy/paste or rewrite code.
Debugging your scraper
First up, let’s talk about how we can see what is going on inside our brand new scraper without relying on print statements. The solution is to call Scrapy’s inspect_response function inside your spider, which drops you into a Scrapy shell at that exact point in the crawl. This is an excellent way to pause execution and debug in-process.
Take for example this spider I’ve set up for my own blog:
Place this in a file and you can run it and inspect the response right away:
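Assuming the spider was saved as, say, blog_spider.py (the filename is my assumption), a single-file spider can be run without a full Scrapy project:

```shell
scrapy runspider blog_spider.py
```

As soon as parse runs, the inspect_response call drops you into the shell.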
You can use Ctrl+D to exit the shell and continue scraping, or quit() to abort.
Inspect a single page
Now for the best part. You don’t even have to have a scraper built. You can simply pass in your URL to Scrapy and run it as-is:
scrapy shell https://greycastle.se
Moving from jQuery and print statements to debugging in the shell like above cut the time it takes me to build scrapers by hours. Once you begin relying on helper utils and classes to scrape parts of your pages, though, it gets harder, since iterating on those in the shell is difficult.
I think it should be possible to exercise such helpers in unit tests, so I will try that sometime soon.
Hope this helps. Enjoy your scraping!