Patrick Beard
It's like the wild west, and everybody is sheriff... - r.v.

Tuesday, April 9, 2002
 

I'm working on a way to extract plain text from HTML documents. My first approach will be a scanner built as a DFA (deterministic finite automaton) that simply emits the non-markup tokens. This should work for most documents; the exception is documents containing script tags, whose contents I expect to account for the bulk of the noise. To remove those, I may have to resort to a full-blown HTML parser.
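
A minimal sketch of that kind of scanner might look like the following. It is written in Python purely for illustration (the post names no language), and the state names and the `extract_text` function are hypothetical. It assumes well-formed tags and does not handle comments, entities, or a bare `<` in text.

```python
from enum import Enum, auto


class State(Enum):
    TEXT = auto()    # outside any tag: characters are plain text
    TAG = auto()     # between '<' and '>': characters are markup
    DQUOTE = auto()  # inside a double-quoted attribute value
    SQUOTE = auto()  # inside a single-quoted attribute value


def extract_text(html: str) -> str:
    """Return the non-markup characters of `html` (hypothetical helper)."""
    state = State.TEXT
    out = []
    for ch in html:
        if state is State.TEXT:
            if ch == "<":
                state = State.TAG
                out.append(" ")  # keep words on either side of a tag apart
            else:
                out.append(ch)
        elif state is State.TAG:
            if ch == ">":
                state = State.TEXT
            elif ch == '"':
                state = State.DQUOTE
            elif ch == "'":
                state = State.SQUOTE
        elif state is State.DQUOTE:
            if ch == '"':
                state = State.TAG
        elif state is State.SQUOTE:
            if ch == "'":
                state = State.TAG
    # Collapse the runs of whitespace left behind by removed tags.
    return " ".join("".join(out).split())


print(extract_text("<p>Hello <b>world</b></p>"))
# Hello world
print(extract_text("<script>var x = 1;</script>text"))
# var x = 1; text
```

The second call shows the noise described above: the scanner drops the script tags themselves, but the script body passes straight through as if it were text.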
12:40:47 PM


© Copyright 2002 Patrick Beard.
Last update: 12/17/2002; 12:05:21 PM.