Updated: 6/7/2002; 4:05:26 PM.
Java Geek
Steve Goyette - Self-described Java Geek at large.
        

Wednesday, May 29, 2002

DMOZ.org and RDF....
When is XML not REALLY XML?

When it comes from DMOZ, that's when.  I needed a list of categories for a project I'm working on and thought to myself, "Self," (says I) "I bet I could use the categories from DMOZ as a starting point."  So I whipped up a simple XML parser and the saga began.  It seems that the description element in the RDF contains actual HTML, which trips up the parser.  Well, I can fix that.  I just run it through a pre-processor first using some of the new regex stuff in JDK 1.4 and... oh wait... missed a case... try again... yikes, forgot that tag... one more shot...

This went on for a while until it suddenly dawned on me that I could use a MUCH simpler approach: skip the XML parsing altogether and just scan each line for the one tag I care about...

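         // 'reader' is a BufferedReader over the RDF file and 'line' is a
         // String, both declared earlier
         do {
            try {
               line = reader.readLine();
            } catch ( IOException ioe ) {
               // treat a read error the same as end of file
               line = null;
            }
            if ( line != null ) {
               // see if we can match anything on this line
               int iPos = line.indexOf( "<narrow r:resource=" );
               if ( iPos != -1 ) {
                  // skip the 19-character match plus the opening quote
                  line = line.substring( iPos + 20 );
                  // the category path ends at the closing quote of the tag
                  iPos = line.indexOf( "\"/>" );
                  if ( iPos != -1 ) {
                     line = line.substring( 0, iPos );
                     System.err.println( "Category: " + line );
                  }
               }
            }
         } while ( line != null );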

This works perfectly well (though it is a bit brute-force-esque) and has a VERY low memory footprint, since only one line is held in memory at a time.  All that's left to do is massage these categories into my database schema and I'm done.  I guess the point (yes, there really is a point to all of this) is that sometimes the easiest way to accomplish things is to stay simple.  Either that, or get categories from a true XML provider, but that's a different topic.
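
For anyone who wants to try the same trick, here is a minimal self-contained sketch of the line-scanning idea. It is not the exact code from my project: the class name CategoryScanner, the method names, and the sample input in main are all made up for illustration, and it uses newer Java syntax than JDK 1.4 offers. It also assumes each narrow tag sits entirely on one line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class CategoryScanner {
    // Literal text surrounding each category path (hypothetical constant names).
    private static final String PREFIX = "<narrow r:resource=\"";
    private static final String SUFFIX = "\"/>";

    /** Returns the category path found on the line, or null if there is none. */
    public static String extractCategory(String line) {
        int start = line.indexOf(PREFIX);
        if (start == -1) {
            return null;
        }
        start += PREFIX.length();              // skip the prefix, opening quote included
        int end = line.indexOf(SUFFIX, start); // closing quote of the same tag
        if (end == -1) {
            return null;
        }
        return line.substring(start, end);
    }

    /**
     * Reads the input one line at a time, so only a single input line is
     * buffered at once. The returned list grows with the number of matches.
     */
    public static List<String> scan(BufferedReader reader) throws IOException {
        List<String> categories = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            String category = extractCategory(line);
            if (category != null) {
                categories.add(category);
            }
        }
        return categories;
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample input, shaped like the tags the scanner looks for.
        String sample = "<Topic r:id=\"Top/Arts\">\n"
                + "  <narrow r:resource=\"Top/Arts/Movies\"/>\n"
                + "  <narrow r:resource=\"Top/Arts/Music\"/>\n"
                + "</Topic>\n";
        for (String c : scan(new BufferedReader(new StringReader(sample)))) {
            System.out.println("Category: " + c);
        }
    }
}
```

Pulling the matching into its own method makes it easy to check against a few sample lines before pointing it at the full file, and letting the IOException propagate means a read failure is not silently mistaken for end of file.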


9:41:55 AM


© Copyright 2002 Steve Goyette.
 