My other half and I are big fans of the late film critic Roger Ebert. We also share an Amazon Prime subscription.
I wondered: which of Roger Ebert’s favorite movies are available to watch for free on Prime? Since there are so many reviews by Roger Ebert, I had the perfect excuse to write a web scraper!
In this post, I will:
- Show my not-so-pretty scraping code
- Go over some roadblocks / gotchas I ran into along the way
- Show you the list of movies rated as great by Roger Ebert. That’s what you’re here for, right?
PS: If you just want to see the list of movies, jump to the end of this post.
Code Quality Warning: I hacked this together as fast as I could without much refactoring, so it’s not the most readable or optimized. But it mostly works… for now.
I hit a few roadblocks while working on this that I think are worth calling out, and they explain some of the choices I made in the implementation.
Performing a regular GET with an Accept: text/html header (which I believed was the default for the requests library, though requests actually sends Accept: */* unless told otherwise) against the URL assigned to the variable ebert_url will always return the first page of movies, regardless of what you set the page query parameter to. The Accept header field needs to be set to application/json for the server to return JSON containing the movies for that specific page.
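To make the header requirement concrete, here is a minimal sketch of a paginated request that asks for JSON. It uses only the standard library's urllib so it runs without extra installs; with the requests library the same header would go in the headers argument. The URL below is a placeholder (the real value of ebert_url is not reproduced here), and build_page_request and fetch_page are hypothetical helper names, not the functions from my script.

```python
import json
import urllib.parse
import urllib.request

# Placeholder only: stands in for the real value of ebert_url.
EBERT_URL = "https://example.com/reviews"


def build_page_request(base_url, page):
    """Build a GET request for one page of results, asking for JSON."""
    query = urllib.parse.urlencode({"page": page})
    return urllib.request.Request(
        f"{base_url}?{query}",
        # Without this header the server reportedly ignores `page`
        # and returns the first page every time.
        headers={"Accept": "application/json"},
    )


def fetch_page(base_url, page):
    """Send the request and decode the JSON body (needs network access)."""
    with urllib.request.urlopen(build_page_request(base_url, page)) as resp:
        return json.load(resp)


# Building the request does not touch the network, so it is safe to inspect.
req = build_page_request(EBERT_URL, 3)
print(req.full_url)
print(req.get_header("Accept"))
```

Looping fetch_page over increasing page numbers until the response comes back empty would be one way to walk the whole archive, assuming the server signals the end that way.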
No public API
First, there is no publicly available Amazon API for their catalog search. It looks like you could email them to get permission, but I didn’t want to waste my time doing that.
Not automation friendly
I started out using the requests library. It turns out that if you don’t set a proper browser user agent, you’ll get a 503 and a message about how automation isn’t welcome. If you do fake a proper user agent but you’re not sending back the cookies set by the server’s response, you’ll get:
Sorry, we just need to make sure you’re not a robot. For best results, please make sure your browser is accepting cookies.
I got frustrated and switched to using a more stateful HTTP tool: mechanize.
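For readers who want to see what "stateful" means here, the sketch below shows the two ingredients: a browser-like User-Agent and a cookie jar that replays whatever cookies the server sets. It deliberately swaps in the standard library (urllib plus http.cookiejar) instead of mechanize, so it is an illustration of the idea rather than the code I ran. The User-Agent string and the make_stateful_opener name are made up for the example, and I make no promise that this alone gets past Amazon's bot checks.

```python
import http.cookiejar
import urllib.request

# Hypothetical browser-like User-Agent; any realistic string would do.
BROWSER_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def make_stateful_opener(user_agent=BROWSER_UA):
    """Return an opener that keeps cookies between requests, plus its jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        # Stores cookies from responses and sends them on later requests.
        urllib.request.HTTPCookieProcessor(jar)
    )
    # Replace the default "Python-urllib/x.y" agent with a browser-like one.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar


opener, jar = make_stateful_opener()
# opener.open(url) would now send the User-Agent and any stored cookies.
```

Every request made through the same opener shares the jar, which is the behavior a one-off stateless GET lacks.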
Bad HTML …
You’ll notice that I’m using some regex in the function amazon_search to parse the movie title search results out of the page. The reason is that when I tried using the find_all function on the search result tags, I got nothing. My guess is that there’s some invalid HTML on the page that confused the html.parser parser, which isn’t very lenient.
It turns out that, rather than using regex, I could have switched to the html5lib parser, which is the most lenient parser, far more lenient than html.parser. So if I needed to make additional changes to this function, I’d refactor it to use that parser and remove the nasty-looking regex.
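A rough sketch of what that refactor could look like with Beautiful Soup is below. The sample markup is invented (it is not Amazon's real search page) and simply leaves its li tags unclosed to stand in for sloppy HTML; parse_leniently and extract_titles are hypothetical names. The sketch asks for html5lib and falls back to html.parser only if html5lib is not installed, which is a convenience for running the example, not something I'd rely on for the real page.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Invented sample markup with unclosed <li> tags; not Amazon's real HTML.
SAMPLE_HTML = """
<ul id="results">
  <li class="result"><a href="/a">Movie A</a>
  <li class="result"><a href="/b">Movie B</a>
</ul>
"""


def parse_leniently(html):
    """Parse with html5lib when available, else fall back to html.parser."""
    try:
        # html5lib repairs broken markup the way a browser would.
        return BeautifulSoup(html, "html5lib")
    except FeatureNotFound:
        # Raised when the html5lib package is not installed.
        return BeautifulSoup(html, "html.parser")


def extract_titles(html):
    """Collect the text of every link in the parsed document."""
    soup = parse_leniently(html)
    return [a.get_text(strip=True) for a in soup.find_all("a")]


titles = extract_titles(SAMPLE_HTML)
print(titles)
```

In the real function, the find_all call would target whatever tag and class the search results actually use, which is the part the regex is currently doing by hand.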
Without further ado, here are all the great movies that are included with Prime! I included the full list via Google Drive at the very end.
Here’s a FULL data set of movies (not available on Amazon, available but not free with Prime, and free with Prime): https://docs.google.com/spreadsheets/d/1XkdEqzXbhivEGhty_hVV8nNeJBhd4HKKSCSIM97MbjA/edit?usp=sharing.