XML might be the next great hot silver bullet.
BUT, if you want to be heard, you will not thumb your nose at those who
use something besides Internet Explorer. Depending on which web site you
take your statistics from, you can get anywhere from 50% to 90% share for
But the original question was a bit confusing. Tom, if you are working on
a Web crawler, it seems to me you don't care that much about what's
"legal" HTML, and even less about "good" HTML. Instead, you want to start
examining some of the garbage generated by programs like FrontPage,
Composer, PageMill, GoLive, Microsoft Word, Fusion, AOLPress,
Arachnophilia, etc. Yes, even Microsoft Word can be and is being used to
generate Web pages. And it is the WORST tool for that purpose I have ever
seen. However, sometimes its pages can actually be rendered by a browser!
If you are crawling an "intranet" controlled by a company that doesn't
allow tools like these, that's another story.
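
For what it's worth, here is a minimal sketch of what "don't care about legal HTML" might look like in a crawler. It assumes Python and its standard-library html.parser, which does not reject malformed markup. The names TolerantLinkParser and scan are made up for illustration, and the sample page is invented, not real tool output. It pulls out links and, if present, the "generator" meta tag that many of these tools leave behind.

```python
from html.parser import HTMLParser


class TolerantLinkParser(HTMLParser):
    """Hypothetical helper: collect hrefs and any <meta name="generator">."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []
        self.generator = None

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased, so <A HREF=...> works too.
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "meta" and (attrs.get("name") or "").lower() == "generator":
            self.generator = attrs.get("content")


def scan(html):
    """Return (links, generator) for a page, however sloppy its markup."""
    parser = TolerantLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links, parser.generator


# Invented sample: upper-case tags, an unquoted attribute value, an unclosed
# <P>, and overlapping <B>/<I>. None of it is "legal", all of it is common.
messy = (
    '<HTML><META NAME="GENERATOR" CONTENT="Microsoft FrontPage 3.0">'
    '<BODY><P>Unclosed paragraph<A HREF=page2.htm>next<B>bold<I>overlap</B></I>'
    '<a href="http://example.com/">ex</a>'
)
links, generator = scan(messy)
print(links, generator)
```

The point of the design is that the parser only reacts to start tags, so it never needs the document to nest or close properly. A validating parser would give up on most of the pages these tools produce.
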
It's worth noting that the W3C itself offers a WYSIWYG HTML editor called
"Amaya", which is presented as a demo of "correct" HTML 4.0. I've never
examined its HTML output, but I can tell you that when used as a browser,
it is NOT compliant with the W3C's own specs. So it should be no surprise
that neither Netscape nor Explorer is compliant either.