Dealing with dodgy markup
Lately I've been working on scraping various parliamentary websites to collect all the data I need for the YVIH API.
I might write up how I've gone about this process in more detail in another post as there's plenty to discuss. However, today's post is about dealing with dodgy markup.
Of course it would be really great if Australian government websites adhered to W3C standards. I can assure you that, at the very least, both Queensland's and the ACT's 'current members' pages do not. In fact they both fail quite badly and are riddled with stray tags and other markup errors.
This is maddening. Government bodies need to be getting this sort of thing right. It's an accessibility issue for both computers and people using screen reading devices. If it isn't already, it should be a contractual requirement for anyone building a website for a government body that the work is standards compliant.
From my perspective this becomes an issue when trying to scrape data. Beautiful Soup does a remarkable job of interpreting HTML and has a relatively high degree of tolerance for malformed markup. However, sometimes it doesn't quite do what you want it to do. In my experience, Beautiful Soup tends to assume that a tag has closed when it hits something it shouldn't. The effect is that you can't find anything in the document or section from that point on.
After a bit of digging around, though, I found that Beautiful Soup supports a range of third-party parsers. One of them is html5lib, which parses pages the same way a web browser does and creates valid HTML5. Perfect.
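To see how much the choice of parser matters, here's a minimal sketch that feeds the same broken fragment to more than one parser and collects the tree each one builds. The helper name compare_parsers is made up for illustration, the fragment is the stray-tag example from the Beautiful Soup documentation, and the sketch assumes html5lib is already installed (any parser that isn't installed is simply reported as missing).

```python
from bs4 import BeautifulSoup, FeatureNotFound


def compare_parsers(markup, parsers=("html.parser", "html5lib")):
    """Parse the same markup with each named parser.

    Returns a dict mapping parser name to the serialised tree,
    or to None when that parser is not installed.
    (compare_parsers is a hypothetical helper, not part of Beautiful Soup.)
    """
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # Beautiful Soup raises this when the requested parser is missing.
            results[name] = None
    return results


# A stray closing </p> with no matching opening tag.
for name, tree in compare_parsers("<a></p>").items():
    print(name, "->", tree if tree is not None else "(not installed)")
```

Python's built-in html.parser just drops the stray closing tag, while html5lib repairs the fragment the way a browser would and wraps it in a full html/head/body document, which is why the two can disagree about what's findable in a messy page.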
So I installed html5lib using pip:
pip install html5lib
Then instead of processing the page using Beautiful Soup with the standard library's parser:
import requests
from bs4 import BeautifulSoup
page = requests.get(link).content
soup = BeautifulSoup(page, "html.parser")
I tell it to use html5lib:
page = requests.get(link).content
soup = BeautifulSoup(page, "html5lib")
It worked for me on the Queensland members page and I'm hopeful it will work for the ACT as well. If not, there are a couple of other parsers that I can try.
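Trying the other parsers could be automated along these lines. This is only a rough sketch, not what I've actually run: the function name parse_with_first_working, the parser order and the sample table fragment are all assumptions for illustration, and "working" here just means the parser found at least one of the tags I was looking for.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_with_first_working(markup, tag_name,
                             parsers=("html5lib", "lxml", "html.parser")):
    """Try each parser in turn and return (parser_name, matches) for the
    first one that finds at least one <tag_name> element.

    Returns (None, []) if no parser finds anything.
    (Hypothetical helper; the parser order is an assumption.)
    """
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            continue  # this parser isn't installed, move on to the next one
        matches = soup.find_all(tag_name)
        if matches:
            return name, matches
    return None, []


# Made-up fragment with unclosed <td> tags and no closing </tr>.
sample = "<table><tr><td>Jane Citizen<td>Brisbane</table>"
parser_used, cells = parse_with_first_working(sample, "td")
print(parser_used, len(cells))
```

Finding some matches doesn't prove the tree is right, so it's still worth eyeballing the results against the live page before trusting them.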