I spent a lot of time yesterday trying to figure out the title question, and decided that I might as well post the solution here as a helpful hack. The backstory is that I’m using the AdBlock Firefox add-on to block annoying banners while browsing. It can dynamically prevent entire portions of HTML pages from loading, but you have to configure this manually by specifying which portions of a page to block with CSS selectors. XPath is, sadly, not on the horizon yet.
Now, the tricky part is that, of course, web developers have caught on to the spreading use of AdBlock and, in order to keep their banner revenue, are attempting to counteract it by making their pages harder to parse automatically. I am not passing judgement upon anyone for doing this, but I personally prefer to keep my surfing ad-free, and therefore reserve the right to counteract their countermeasures.
Specifically, said countermeasures yesterday included:
- Putting banner content into unmarked <div> tags lacking intelligible IDs and classes. The CSS information seemed to be passed via direct attribute assignment, as well as by using temporary class names that are randomly generated for each page request. This precludes easy identification of the banner locations in the HTML by direct search.
- Putting the banner <div>s at a random position in a series of empty <div>s with similarly random IDs, along with–as an additional difficulty–<div>s containing actual useful content. This prevents easy identification of them by hierarchical HTML search.
With the simple search methods thus rendered infeasible, the solution lies in recalling the semantics of the page. Specifically, while the empty tags and their nonsensical IDs and class “names” are invisible to the user, the useful content and the banner content within them are visible, and in fact always appear in the same order on the screen. With that in mind, selecting the one <div> containing the banners boils down to what the title already gave away: selecting the second non-empty <div> child element with pure CSS.
After some tinkering with the CSS selectors, I’ve found a rather elegant solution (for my specific problem; also, this is a partial solution–a path relative to an arbitrary root):
div:nth-child(1) > div:not(:empty) ~ div:not(:empty)
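What it does is first select the <div> that is the first child of its parent (assuming you can only identify it by its position; otherwise this part can be replaced with more traditional selectors). Note also that CSS counts indexes from 1, not from 0 like most modern programming languages. The query then finds a non-empty <div> child of that element and moves on to the non-empty <div>s that follow it at the same DOM level (the tilde, which is the general sibling combinator). Strictly speaking, this matches every non-empty <div> that comes after the first one, not only the next one. It therefore selects exactly the second non-empty <div> child only as long as the container holds just two of them, which is why it works for my specific problem.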
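To make the matching behaviour concrete, here is a minimal sketch against a made-up page structure. The markup in the comment, the `x1`..`x5` IDs, and the `outline` declarations are all assumptions invented for illustration; they are not taken from the actual site. The second rule shows one way an ordinary stylesheet could undo the effect if a third non-empty <div> ever appeared.

```css
/*
 * Hypothetical markup (invented for illustration):
 *
 * <div>                          <- must be the first child of its parent
 *   <div id="x1"></div>          <- empty, skipped
 *   <div id="x2">article</div>   <- first non-empty <div>
 *   <div id="x3"></div>          <- empty, skipped
 *   <div id="x4">banner</div>    <- second non-empty <div>: matched
 *   <div id="x5"></div>          <- empty, skipped
 * </div>
 */

/* Matches #x4 in the markup above: a non-empty <div> that has an
   earlier non-empty <div> sibling inside a first-child <div>. */
div:nth-child(1) > div:not(:empty) ~ div:not(:empty) {
  outline: 2px solid red; /* visual check only; a blocker would hide it instead */
}

/* The rule above would also match a third, fourth, ... non-empty <div>.
   This more specific rule resets the style for those later matches,
   leaving only the second one outlined. */
div:nth-child(1) > div:not(:empty) ~ div:not(:empty) ~ div:not(:empty) {
  outline: none;
}
```

Keep in mind that `:empty` only matches elements with no children at all, so a <div> containing even a single space or line break counts as non-empty. The reset trick in the second rule relies on normal stylesheet cascading and may not carry over to an ad blocker's element-hiding filters, where the selector is probably only reliable while there are exactly two non-empty <div>s.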
This is not a completely foolproof method. I suspect the next step in the arms race between ad-supported businesses and ad-free activists will be filling the currently empty <div>s with nonsensical but equally empty dummy <div>s, but I welcome that challenge when the day comes.