So I’m running a YaCy node – a pretty awesome project that aims to build a search engine indexed “by the people, for the people.”
YaCy provides a Java servent that can index internal resources and external web pages. You get MANY controls over what it indexes, how it indexes it, and the resources allocated to the job. There are tons of built-in analytics and logging for the stats geek in you.
It’s still rough, but seems damned promising. A bonus – it uses jQuery and Solr.
I really like the idea of indexing all the content you care about and then offering that index to the world at large to search, but I have concerns about the long-term impact of more ’bots crawling the web. I would like to see YaCy figure out a way to minimize its impact at a global level – if every YaCy node independently crawls the same sites, the aggregate traffic could easily escalate into a DDoS-level problem. Perhaps they’re already working on this issue.
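To make the concern concrete: even a perfectly polite crawler that honors robots.txt only caps its own request rate, so a thousand polite nodes hitting the same site is still a thousandfold load. A minimal sketch of that per-node politeness check, using Python’s standard `urllib.robotparser` against a made-up robots.txt (the user-agent name and paths are illustrative assumptions, not YaCy’s actual behavior):

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt a site might serve; not from any real site.
robots_txt = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Each node can check permission and pacing before fetching...
print(rp.can_fetch("examplebot", "https://example.com/page"))       # True
print(rp.can_fetch("examplebot", "https://example.com/private/x"))  # False
print(rp.crawl_delay("examplebot"))  # 10 (seconds between requests)
```

The catch is that `Crawl-delay` is enforced per crawler; nothing here coordinates the fleet. Avoiding duplicate crawls would need something network-wide, like partitioning the URL space among peers.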
Along the same lines, whenever I see a decentralized service, I wonder how well its response time will scale with user numbers. If the service’s foundation is a decentralized net of nodes installed on home desktops, then don’t the low upload speeds of most home connections hamper the service’s ability to offer quick responses, compared to the likes of Facebook and Google, which have invested in fat upload pipes?
The pat answer is to advocate for more nodes (and hope that node additions outpace user adoption), but let’s face it: internet-wide decentralized networks will probably never perform as well as centralized ones – especially when the corpus is the internet itself.