For a long time I have been using my application to parse Microsoft Knowledge Base articles and store the content locally so I can display them. All of a sudden the parser stopped working. When I started debugging, I realized that Microsoft had made some changes to their Knowledge Base pages that altered the structure of the markup. When I first wrote the application, I had to save the page as HTML, open it in the Visual Studio editor, and format it to figure out the structure. Not any more. Now we have a handy-dandy tool called Firebug that I can fire up to easily figure out the structure of a page. You can do the same with the Internet Explorer Developer Toolbar. After that, my code to parse the page was reduced to the following snippet.
NodeFilter tableFilter = new NodeClassFilter(typeof(TableColumn));
NodeFilter obAttribFilter = new HasAttributeFilter("class", "listContainer");
NodeFilter andFilter = new AndFilter(tableFilter, obAttribFilter);
NodeList tblNodes = obParser.ExtractAllNodesThatMatch(andFilter);
I use HTMLParser.Net for all my web page parsing. That's what reduced the parsing to those four lines of code.