Tuesday, August 27, 2002
Beyond the infrastructure, you have to think about how the content of all those weblogs scales. Sure, you can put 10-20k weblogs on a single static server and fit the 5-10 required servers in a rack. (Note: I don't know a whole lot about the resources required for this; I'm trusting John Robb's numbers here.) But how can you find what you're looking for over 10-20k weblogs?
It's very easy to throw hardware and bandwidth at sites and make them scale. The costs are more or less distributed depending on your architecture. What's difficult is building a scalable community, one where you can find like-minded souls. And thus we have the Radio Community Server, the Blogging Ecosystem, blogdex and others, or the less-sophisticated GeoCities Member Pages directory. [The Peanut Gallery]
This point can't be emphasized enough. Without some way to organize, search, or filter the information on a large number of weblogs, you end up with the chaos that is the Web, albeit on a smaller scale (but without Google!). So it is crucial to be able to organize those weblogs into some kind of useful structure.
I'm watching the action on the ecosystem/indexing front to see what happens. (I'm also participating on the blogchalk front, although its organizational scheme and the idea behind it are much less formal or useful compared to, say, the Blogging Ecosystem.)
8:20:06 AM
#
Tuesday, August 20, 2002
I don't know why people insist on submitting weblogs that aren't chalked. It's kind of dumb. Maybe it's my fault, maybe the page isn't marked explicitly enough. *sigh* Oh well, I'll keep pushing them into the database, with all the other 8000+ unchalked weblogs.
The thing that really drives me nuts is the people who submit their weblog two or three times. Those just get blacklisted. I have enough other things to worry about without bothering with jerks.
5:17:33 PM
#
Thursday, August 15, 2002
The blogchalk search is now accepting weblog submissions via the web. Email had gotten to be too much.
There are currently over 250 chalked weblogs listed in the search and the number is growing daily.
11:32:15 AM
#
Tuesday, August 06, 2002
I wrote a simple search script. It is located at http://bstpierre.org/bc/. It is not very intelligent, but it works. Improvements will be made incrementally as I have time.
I am currently spidering over 5000 weblogs. About 200 are chalked. 154 of those appear in the search index (a keyword must be used by two or more weblogs to make the index).
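The "two or more weblogs" rule can be sketched roughly as follows. This is only an illustration, not the actual search script: the function name `build_index`, the `min_weblogs` parameter, and the shape of the input (a dict mapping each weblog URL to its list of keywords) are all assumptions.

```python
from collections import defaultdict


def build_index(weblog_keywords, min_weblogs=2):
    """Map each keyword to the weblogs that use it.

    weblog_keywords: dict of weblog URL -> list of keywords (assumed shape).
    Keywords used by fewer than min_weblogs distinct weblogs are dropped.
    """
    index = defaultdict(set)
    for url, keywords in weblog_keywords.items():
        for keyword in keywords:
            index[keyword].add(url)
    # Keep only keywords shared by enough weblogs to be worth listing.
    return {
        keyword: sorted(urls)
        for keyword, urls in index.items()
        if len(urls) >= min_weblogs
    }


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the real database.
    sample = {
        "http://example.org/a": ["Texas", "English"],
        "http://example.org/b": ["Texas"],
        "http://example.org/c": ["TX"],
    }
    print(build_index(sample))
```

A weblog whose keywords are all unique to it never appears in an index built this way, which would be one explanation for the gap between chalked weblogs and indexed weblogs.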
Suggestions, comments, and cash donations are welcome!
2:08:38 PM
#
Monday, August 05, 2002
I reindexed a bunch of weblogs last week, and the updated list is here. I haven't had the chance to get a searchable database working yet, though it has been started. Watch this space...
11:20:30 AM
#
Wednesday, July 24, 2002
The phenomenon seems to be growing...
In the database currently:
- there are 2821 total weblogs (this counts unique titles)
- 1251 of those are as-yet unspidered
Of the 1570 that have been spidered:
- 28 had parse failures (I'm using Python's htmllib).
- 501 contain a META keywords tag
- 122 contain a META keywords tag containing "blogchalk".
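For the curious, here is a minimal sketch of how a page could be sorted into those buckets. It is not the actual spider: it uses `html.parser` from the current Python standard library rather than htmllib (which no longer ships with Python 3), and the names `MetaKeywordsParser` and `classify` are made up for the example. It also assumes the chalk shows up as a comma-separated META keywords list with "blogchalk" somewhere in it.

```python
from html.parser import HTMLParser


class MetaKeywordsParser(HTMLParser):
    """Collects the contents of the first <meta name="keywords"> tag."""

    def __init__(self):
        super().__init__()
        self.keywords = None  # stays None if no keywords tag is seen

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.keywords is not None:
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "keywords":
            content = attr_map.get("content", "")
            self.keywords = [k.strip() for k in content.split(",") if k.strip()]


def classify(html_text):
    """Return (bucket, keywords) for one page of HTML."""
    parser = MetaKeywordsParser()
    parser.feed(html_text)
    parser.close()
    if parser.keywords is None:
        return "no-keywords", []
    if any("blogchalk" in k.lower() for k in parser.keywords):
        return "chalked", parser.keywords
    return "keywords-only", parser.keywords


if __name__ == "__main__":
    # Hypothetical page, not a real weblog.
    page = '<head><meta name="keywords" content="blogchalk, United States"></head>'
    print(classify(page))
```

Note that `html.parser` is fairly forgiving of bad markup, so a sketch like this would probably not reproduce the parse-failure count above.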
I exchanged some email today with Daniel Padua about a few things. Coming soon:
- A search function.
- An out-of-page data format (xml). This will make the database more capable. www.blogchalking.tk will probably generate the necessary code for you like it does now.
If anyone has experience writing search functionality and would like to lend a hand, please let me know. Otherwise, I'll just hack together something that works, no matter how ugly.
6:55:00 PM
#
Tuesday, July 23, 2002
Ok, so I haven't chalked my blog yet. But I have written a script to search weblogs for chalk and I have a small sample compiled here. The same list, sorted by keyword, is here. I have to clean up the script, fix a few things, and I'll release it within the next week or so.
11:45:00 AM
#
Here's my blogchalk: Portsmouth, NH, US.
1:35:18 PM
#
I reformatted the "chalked blogs" list to opml. The rendered list is here, keyword-sorted list is here. The raw opml can also be found in my instantOutliner folder. I'm using Marc Barrot's activeRenderer tool, which is a great piece of software.
4:20:17 PM
#
Ok, last post about blogchalking for the day. One thing I've noticed in going over the raw data that makes up the lists is that certain categories are under-represented because of formatting differences. Since I only display categories that have at least two members, if a category is misspelled or abbreviated, it might not show up! Texas, for example, has two members in the raw data. But since one blogchalk says "TX" and the other says "Texas", they don't show up...
The convention seems to be to use full location names instead of abbreviations. In other words, use "Texas" instead of "TX", "United States" instead of "US", etc. You can generate the HTML to put into your weblog at this site. That helps to keep the format consistent.
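On the indexing side, one possible workaround (a sketch only, not something the list generator does today) would be to fold known abbreviations into their full names before counting category members. The `ALIASES` table and both function names below are hypothetical, and the table would have to be maintained by hand.

```python
from collections import Counter

# Hypothetical alias table: lowercase abbreviation -> full location name.
ALIASES = {
    "tx": "Texas",
    "nh": "New Hampshire",
    "us": "United States",
    "usa": "United States",
}


def normalize(name):
    """Collapse whitespace and expand a known abbreviation, if any."""
    cleaned = " ".join(name.split())
    return ALIASES.get(cleaned.lower().rstrip("."), cleaned)


def count_categories(names, min_members=2):
    """Count members per normalized category, keeping categories with
    at least min_members entries (the display rule described above)."""
    counts = Counter(normalize(name) for name in names)
    return {name: n for name, n in counts.items() if n >= min_members}


if __name__ == "__main__":
    # "TX" and "Texas" now land in the same category.
    print(count_categories(["TX", "Texas", "Maine"]))
```

This only helps with abbreviations someone has thought to list; misspellings would still slip through, so sticking to the full-name convention remains the more reliable fix.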
5:59:57 PM
#
© Copyright 2003 Brian St. Pierre.
Last update: 1/13/2003; 9:47:02 PM.