blogchalking
Spidering weblogs, looking for chalk.

 










Blogchalking



Blogchalk:
Portsmouth, NH, US




 

Tuesday, August 27, 2002

Communities of Scale (2)

Beyond the infrastructure, you have to think about how the content of all those weblogs scales. Sure, you can put 10-20k weblogs on a single static server and fit the 5-10 required servers in a rack. (Note: I don't know a whole lot about the resources required for this; I'm trusting John Robb's numbers here.) But how can you find what you're looking for over 10-20k weblogs?

It's very easy to throw hardware and bandwidth at sites and make them scale. The costs are more or less distributed depending on your architecture. What's difficult is building a scalable community and finding like-minded souls. And thus we have the Radio Community Server, the Blogging Ecosystem, blogdex and others, or the less-sophisticated GeoCities Member Pages directory.
[The Peanut Gallery]

This point can't be emphasized enough. Without some way to organize, search, or filter the information across a large number of weblogs, you end up with the chaos that is the Web, albeit on a smaller scale (and without Google!). So it is crucial to be able to organize those weblogs into some kind of useful structure.

I'm watching the action on the ecosystem/indexing front to see what happens. (I'm also participating on the blogchalk front, although its organizational scheme and the idea behind it are much less formal and less useful than, say, the Blogging Ecosystem.)

8:20:06 AM   #  

Tuesday, August 20, 2002

Ignorance

I don't know why people insist on submitting weblogs that aren't chalked. It's kind of dumb. Maybe it's my fault, maybe the page isn't marked explicitly enough. *sigh* Oh well, I'll keep pushing them into the database, with all the other 8000+ unchalked weblogs.

What really drives me nuts is the people who submit their weblog two or three times. Those just get blacklisted. I have enough other things to worry about without bothering with jerks.

5:17:33 PM   #  
categories: blogchalking

Thursday, August 15, 2002

Blogchalking Search now accepting submissions via web

The blogchalk search is now accepting weblog submissions via the web. Email had gotten to be too much.

There are currently over 250 chalked weblogs listed in the search and the number is growing daily.

11:32:15 AM   #  
categories: blogchalking

Tuesday, August 06, 2002

Blogchalking: Search

I wrote a simple search script. It is located at http://bstpierre.org/bc/. It is not very intelligent, but it works. Improvements will be made incrementally as I have time.

I am currently spidering over 5000 weblogs. About 200 are chalked. 154 of those appear in the search index (a keyword must be used by two or more weblogs to make the index).
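
The two-weblog threshold can be sketched roughly like this. It is only an illustration, not the actual search script: the function name build_index, the input shape, and the sample data are all made up for the example.

```python
from collections import defaultdict


def build_index(chalk_by_weblog, min_weblogs=2):
    """Map each keyword to the weblogs that use it, dropping rare keywords.

    chalk_by_weblog: dict of weblog URL -> iterable of blogchalk keywords.
    (Hypothetical input shape; the real database layout is not shown here.)
    """
    index = defaultdict(set)
    for weblog, keywords in chalk_by_weblog.items():
        for keyword in keywords:
            index[keyword].add(weblog)
    # Keep only keywords shared by at least `min_weblogs` distinct weblogs.
    return {k: sorted(v) for k, v in index.items() if len(v) >= min_weblogs}


# Made-up sample data for illustration only.
sample = {
    "http://a.example/": ["Portsmouth", "United States"],
    "http://b.example/": ["Austin", "United States"],
    "http://c.example/": ["Austin", "Texas"],
}
index = build_index(sample)
```

With this sample, "Portsmouth" and "Texas" each belong to a single weblog, so neither makes the index, which is why the index holds fewer weblogs than the number that are chalked.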

Suggestions, comments, and cash donations are welcome!

2:08:38 PM   #  
categories: blogchalking

Monday, August 05, 2002

Blogchalking Reindexed

I reindexed a bunch of weblogs last week, and the updated list is here. I haven't had the chance to get a searchable database working yet, though it has been started. Watch this space...

11:20:30 AM   #  
categories: blogchalking

Wednesday, July 24, 2002

Some Blogchalk Data

The phenomenon seems to be growing...

In the database currently:

  • there are 2821 total weblogs (this counts unique titles)
  • 1251 of those are as-yet unspidered

Of the 1570 that have been spidered:

  • 28 had parse failures (I'm using Python's htmllib).
  • 501 contain a META keywords tag
  • 122 contain a META keywords tag containing "blogchalk".
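
The per-page check behind those last two counts might look something like the sketch below. It is not the spider itself: htmllib was removed in Python 3, so this uses the standard library's html.parser instead, and the class name MetaKeywordsParser, the function classify, and the sample page are assumptions made up for the example.

```python
from html.parser import HTMLParser


class MetaKeywordsParser(HTMLParser):
    """Collect the keywords from every <meta name="keywords"> tag."""

    def __init__(self):
        super().__init__()
        self.found = False   # True once any META keywords tag is seen
        self.keywords = []   # individual comma-separated keywords

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Attribute values can be None when the attribute has no value.
        if (attr.get("name") or "").lower() != "keywords":
            return
        self.found = True
        content = attr.get("content") or ""
        self.keywords.extend(
            part.strip() for part in content.split(",") if part.strip()
        )


def classify(html_text):
    """Return (has_keywords_tag, is_chalked) for one page."""
    parser = MetaKeywordsParser()
    parser.feed(html_text)
    is_chalked = any("blogchalk" in k.lower() for k in parser.keywords)
    return parser.found, is_chalked


# Made-up page for illustration; real chalk strings may be laid out differently.
page = (
    '<html><head><meta name="keywords" '
    'content="blogchalk: United States, New Hampshire, Portsmouth">'
    "</head><body></body></html>"
)
result = classify(page)
```

Note that html.parser is lenient about malformed markup, so it would not necessarily reproduce the parse failures counted above.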

I exchanged some email today with Daniel Padua about a few things. Coming soon:

  • A search function.
  • An out-of-page data format (xml). This will make the database more capable. www.blogchalking.tk will probably generate the necessary code for you like it does now.

If anyone has experience writing search functionality and would like to lend a hand, please let me know. Otherwise, I'll just hack together something that works, no matter how ugly.

6:55:00 PM   #  
categories: blogchalking

Tuesday, July 23, 2002

Blogchalking

Ok, so I haven't chalked my blog yet. But I have written a script to search weblogs for chalk, and I have a small sample compiled here. The same list, sorted by keyword, is here. I have to clean up the script and fix a few things, and I'll release it within the next week or so.
11:45:00 AM   #  
categories: blogchalking

My Blogchalk

Here's my blogchalk: Portsmouth, NH, US.
1:35:18 PM   #  
categories: blogchalking

Blogchalker Lists, now in opml

I reformatted the "chalked blogs" list to opml. The rendered list is here, and the keyword-sorted list is here. The raw opml can also be found in my instantOutliner folder. I'm using Marc Barrot's activeRenderer tool, which is a great piece of software.
4:20:17 PM   #  
categories: blogchalking

Effective Blogchalking

Ok, last post about blogchalking for the day. One thing I've noticed in going over the raw data that makes up the lists is that certain categories are under-represented because of formatting differences. Since I only display categories that have at least two members, if a category is misspelled or abbreviated, it might not show up! Texas, for example, has two members in the raw data. But since one blogchalk says "TX" and the other says "Texas", they don't show up...

The convention seems to be to use full location names instead of abbreviations. In other words, use "Texas" instead of "TX", "United States" instead of "US", etc. You can generate the HTML to put into your weblog at this site. That helps to keep the format consistent.
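
On the spidering side, a small alias table could also fold the variants together before counting. The sketch below is just one possible approach and not something the current script does; the ALIASES table, the function names, and the sample list are hypothetical.

```python
from collections import Counter

# Hypothetical alias table; a real one would need many more entries.
ALIASES = {
    "tx": "Texas",
    "nh": "New Hampshire",
    "us": "United States",
    "usa": "United States",
}


def normalize(name):
    """Return the full location name for a known abbreviation."""
    cleaned = " ".join(name.split())  # collapse stray whitespace
    key = cleaned.lower().replace(".", "")  # so "U.S." matches "us"
    return ALIASES.get(key, cleaned)


def count_categories(chalks, min_members=2):
    """Count normalized categories, keeping those with enough members."""
    counts = Counter(normalize(c) for c in chalks)
    return {name: n for name, n in counts.items() if n >= min_members}


# Made-up raw data: without normalization, no category has two members.
raw = ["TX", "Texas", "US", "United States", "Portsmouth"]
categories = count_categories(raw)
```

Here "TX" and "Texas" end up in the same category, so Texas reaches the two-member threshold. An alias table will never cover every variant, though, so chalking with full names is still the safer bet.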

5:59:57 PM   #  
categories: blogchalking


© Copyright 2003 Brian St. Pierre.
Last update: 1/13/2003; 9:47:02 PM.