DMOZ.org and RDF... When is XML not REALLY XML?
When it comes from DMOZ, that's when. I needed a list of categories for a project I'm working on and thought to myself, "Self" (says I), "I bet I could use the categories from DMOZ as a starting point." So I whipped up a simple XML parser and the saga began. It seems that the description element in the RDF contains actual HTML. Well, I can fix that. I'll just run it through a pre-processor first using some of the new regex stuff in JDK 1.4 and... oh wait... missed a case... try again... yikes, forgot that tag... one more shot...
This went on for a while when it suddenly dawned on me that I could use a MUCH simpler approach...

    do {
        try {
            line = reader.readLine();
        } catch ( IOException ioe ) {
            line = null;
        }
        if ( line != null ) {
            // see if we can match anything on this line
            int iPos = line.indexOf( "<narrow r:resource=" );
            if ( iPos != -1 ) {
                // skip the 19-character marker plus the opening quote
                line = line.substring( iPos + 20 );
                iPos = line.indexOf( "\"/>" );
                if ( iPos != -1 ) {
                    line = line.substring( 0, iPos );
                    System.err.println( "Category: " + line );
                }
            }
        }
    } while ( line != null );
This works perfectly well (though it is a bit brute-force-esque) and has a VERY low memory footprint, since it only ever holds one line at a time. All that's left to do is to massage all of these categories into my database schema and I'm done. I guess the point (yes, there really is a point to all of this) is that sometimes the easiest way to accomplish things is to stay simple. Either that, or get categories from a true XML provider, but that's a different topic.
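For anyone who wants to try the line-scanning idea on its own, here is a minimal self-contained sketch. The class name CategoryScanner, the scan method, and the sample lines in main are all made up for illustration and are not taken from the real DMOZ dump. It also uses newer Java syntax than JDK 1.4 offered and lets an IOException propagate instead of treating it as end of input.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical wrapper around the same line-by-line scan; not the original project code.
public class CategoryScanner {
    // Marker that precedes a category path, including the opening quote.
    private static final String PREFIX = "<narrow r:resource=\"";
    // Closing quote plus the end of the empty element.
    private static final String SUFFIX = "\"/>";

    // Reads the source one line at a time and collects every category path found.
    public static List<String> scan(Reader source) throws IOException {
        List<String> categories = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                int start = line.indexOf(PREFIX);
                if (start == -1) {
                    continue; // no category reference on this line
                }
                start += PREFIX.length();
                int end = line.indexOf(SUFFIX, start);
                if (end != -1) {
                    categories.add(line.substring(start, end));
                }
            }
        }
        return categories;
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample input, including a description line with embedded HTML.
        String sample =
              "<Topic r:id=\"Top/Example\">\n"
            + "  <d:Description>Some <b>embedded</b> HTML here</d:Description>\n"
            + "  <narrow r:resource=\"Top/Example/First\"/>\n"
            + "  <narrow r:resource=\"Top/Example/Second\"/>\n"
            + "</Topic>\n";
        for (String category : scan(new StringReader(sample))) {
            System.err.println("Category: " + category);
        }
    }
}
```

Because the scan never tries to parse the description element, the embedded HTML simply gets skipped. Like the loop above, it assumes each category reference sits on a single line and finds at most one per line.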
9:41:55 AM