Wake Up Web 2.0

Monday, August 14, 2006

Wake up Web 2.0 to the Challenge

Hello fellow Web 2.0, Semantic Web, and mashup mavens,

My intention is to examine one very large-scale implementation that incorporates the fundamentals behind what people are calling Web 2.0, and to show how it can be extrapolated to its full potential.

First, I want to point our attention to an interesting comment made by none other than the Google duo in their famous research paper on PageRank and their centralized repository.

The keywords would be "The Anatomy of a Large-Scale Hypertextual Web Search Engine" [1].

This was written at a more academic, idealistic time, when speculation and ideology were still delivered with the intent of forthcoming academic insight and predictive precision.

Interestingly enough, the paper closes with this thought about the limitations of Google's centralized indexing system and the next frontier of search, which they identify as distributed, decentralized search indexing.

Here is the final paragraph:

Of course a distributed systems like Gloss [Gravano 94] or Harvest will often be the most efficient and elegant technical solution for indexing, but it seems difficult to convince the world to use these systems because of the high administration costs of setting up large numbers of installations. Of course, it is quite likely that reducing the administration cost drastically is possible. If that happens, and everyone starts running a distributed indexing system, searching would certainly improve drastically.

-Sergey Brin and Lawrence "PageRank" Page


As we can see, the famous duo left us some wisdom and parting thoughts about the most important evolution in information retrieval.


A decentralized database is clearly a beautiful evolution.

It's already happened for blogs.

Each blog is just a mini-index of posts, totally searchable on its own.

But blogs have adopted standard document types to support RSS feeds, which let their content be aggregated, processed, and indexed elsewhere, available to other users and aggregators.

Blogs have demonstrated one possible way to make content shareable, indexable, and readable across various web services.
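
To make the mechanics concrete, here is a minimal sketch of that kind of cross-blog aggregation in Python. The feed URLs are placeholders, and a real aggregator would add caching, scheduling, and error handling:

    # Aggregate several blogs' RSS feeds into one index, using only the
    # Python standard library. The feed URLs below are hypothetical.
    import urllib.request
    import xml.etree.ElementTree as ET

    FEEDS = [
        "http://example-blog-a.com/rss.xml",
        "http://example-blog-b.com/rss.xml",
    ]

    def fetch_items(feed_url):
        """Download one RSS 2.0 feed and yield (title, link) per post."""
        with urllib.request.urlopen(feed_url) as resp:
            root = ET.parse(resp).getroot()
        for item in root.iter("item"):
            yield item.findtext("title"), item.findtext("link")

    # Each blog is its own mini-index; merging them yields the aggregate.
    aggregate = [entry for url in FEEDS for entry in fetch_items(url)]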



In fact, one of the most interesting evolutions in the social networking arena could be made possible by something as simple as "Friend Feeds" or "Profile Feeds", which would enable profiles to be aggregated elsewhere.

Some aggregators will focus on being comprehensive, covering as many major sources as possible.

Other aggregators will offer specialized databases of specific domains of knowledge.
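
No profile-feed standard existed at the time of writing, and none is assumed here; purely as an illustration, such a feed might look like the toy XML below, with a comprehensive aggregator keeping every profile and a specialized one filtering for its own domain:

    import xml.etree.ElementTree as ET

    # A made-up "profile feed" document; this is not a real standard.
    PROFILE_FEED = """
    <profiles>
      <profile id="alice">
        <interest>semantic web</interest>
        <friend ref="http://example.com/profiles/bob" />
      </profile>
      <profile id="carol">
        <interest>photography</interest>
      </profile>
    </profiles>
    """

    root = ET.fromstring(PROFILE_FEED)

    # A comprehensive aggregator keeps every profile it finds...
    all_profiles = [p.get("id") for p in root.iter("profile")]

    # ...while a specialized aggregator keeps only its domain of knowledge.
    semantic_web_profiles = [
        p.get("id") for p in root.iter("profile")
        if any(i.text == "semantic web" for i in p.iter("interest"))
    ]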

How easy would it be to build tools to model and process the friend-o-sphere and the web-index-o-sphere if an open, distributed, decentralized database were fully available to compliant warehousers willing to maintain the integrity of the system?

I would say that this would be nothing less than a true revolution in media reaching its full potential.



Can the social sites that want a piece of the MySpace pie afford NOT to make use of this extensible platform?


Can search engine underdogs afford not to commoditize the web index, to compete instead on processing, user behavior, and calculation, and thus turn the world of "search" upside down?


The new data warehouse model is not that of the "site"; it is that of the "service", or feed.

Sites will continue to be locations to access and process data from web services.

For more information on our Social Networking Initiatives, please visit

http://www.connectsociety.com


And for information on our web object repository framework, please join us at

http://www.linkassociation.com


Search is going to fragment into its disciplines: compilation, structuring, and retrieval.




Web 1.0 - going to my library, doing a keyword search for books at my library building.

Web 2.0 - going to your library, using the computer to search all libraries, or only the libraries relevant to you.

- Being able to make use of one or more supplemental classification systems for all or part of the library (such as Dewey Decimal, Library of Congress, or any other well-defined topical hierarchy or ontology).

- Being able to use any retrieval technique from any library in the world (not just the ones your library offers).

That's just the beginning. Once this rich data is available on demand, the software that figures out what you want will get even better.
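
As a toy sketch of that federated library model in Python (every class, method, and catalog entry here is hypothetical):

    # One query fanned out to many libraries, with the caller choosing
    # which libraries are relevant.
    class Library:
        def __init__(self, name, catalog):
            self.name = name
            self.catalog = catalog              # list of (title, dewey_class)

        def search(self, keyword):
            return [t for t, _ in self.catalog if keyword in t.lower()]

    def federated_search(keyword, libraries, relevant=None):
        """Search all libraries, or only the ones relevant to you."""
        targets = [l for l in libraries
                   if relevant is None or l.name in relevant]
        return {l.name: l.search(keyword) for l in targets}

    libraries = [
        Library("downtown", [("Weaving the Web", "025.04")]),
        Library("university", [("Modern Information Retrieval", "025.524")]),
    ]
    print(federated_search("web", libraries))                 # every library
    print(federated_search("web", libraries, {"downtown"}))   # just yours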


Snippets from a paper on distributed indexing (Danzig et al.) [2]:


This project describes an indexing scheme that will permit thorough yet efficient searches of millions of retrieval systems.

We report here an architecture which we call distributed indexing that could unite millions of autonomous, heterogeneous retrieval systems.

We can make sophisticated users out of naive (non-expert) users without employing artificial intelligence. We do this by making it easy to create autonomous databases that specialize in particular topics and types of queries.

We say such databases are precomputed because they contain the results of executing particular queries on thousands of other databases. Precomputation avoids the need for users to broadcast their queries to hundreds or thousands of databases. It eliminates work, reduces query response time, and helps locate obscure information.

The idea behind distributed indexing is to establish databases that summarize the holdings on particular topics of other databases.


(In the system:)

An index broker is a database that builds indices. [...] Index brokers can index the contents of a "primary database" and, in fact, other index brokers.

Primary Databases are today's single-site retrieval systems such as [search engines] and online catalogs.

Site brokers store the "generator queries" (resource descriptions) of all index brokers that index their associated database, and are responsible for keeping these index brokers' indices current.

Site brokers apply their generator queries against these updates and reliably forward appropriate changes to the index brokers that originally registered the generators.

Users of the index broker may have to contact the primary databases to retrieve the object itself.

[the primary object is not obligated to give the index broker a copy of itself, nor is it prevented from doing so.]
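
To make these three roles concrete, here is a toy in-memory model in Python; all class and method names are mine, not the paper's:

    class SiteBroker:
        """Holds the generator queries registered by index brokers and
        forwards matching updates to them."""
        def __init__(self, database):
            self.database = database
            self.generators = []                 # (predicate, index_broker)

        def register(self, predicate, index_broker):
            self.generators.append((predicate, index_broker))

        def on_update(self, obj):
            # Apply each generator query against the update and forward
            # matches to the index broker that registered it.
            for predicate, broker in self.generators:
                if predicate(obj):
                    broker.receive(obj, source=self.database)

    class PrimaryDatabase:
        """A single-site retrieval system, e.g. an online catalog."""
        def __init__(self):
            self.objects = []
            self.site_broker = SiteBroker(self)

        def add(self, obj):
            self.objects.append(obj)
            self.site_broker.on_update(obj)      # the site broker sees it

    class IndexBroker:
        """A precomputed, topic-specific index over many databases."""
        def __init__(self, predicate):
            self.predicate = predicate           # this broker's generator
            self.index = []                      # (summary, source) pairs

        def subscribe(self, database):
            database.site_broker.register(self.predicate, self)

        def receive(self, obj, source):
            # Keep a summary plus a pointer to the primary database;
            # users contact the source to retrieve the object itself.
            self.index.append((obj["title"], source))

    # Usage: a broker that indexes everything mentioning "search".
    db = PrimaryDatabase()
    broker = IndexBroker(lambda o: "search" in o["title"])
    broker.subscribe(db)
    db.add({"title": "distributed search"})      # flows to the broker

Precomputation falls out naturally in this picture: the broker's index is just the accumulated result of its generator query running against every update from the databases it subscribes to.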

Our architecture extends attribute-based naming to bibliographic databases.

Topic broker replication is implemented with a flooding algorithm.

Creating a new index broker entails describing its contents and selecting a set of primary databases and index brokers to index.

The broker's "generator", the query that it registers at the site brokers of the databases that it indexes, defines an index broker.

...topic broker, the one logically centralized component of the system (site brokers of primary databases also register a generator and an abstract with the topic broker.)

The topic broker that first registered an index broker informs the index broker when a pertinent new database or index broker is created.

Executing a query involves a sequence of steps that eventually identifies or retrieves a set of relevant objects. These steps are query specification, query translation, broker location, broker querying, primary database location, primary database querying, and object retrieval.

[end notes from Danzig]
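
Here is a sketch of that seven-step query path with tiny in-memory stubs standing in for real brokers and databases; again, every name is mine, not the paper's:

    class StubPrimary:
        """Stands in for a remote primary database."""
        def __init__(self, docs):
            self.docs = docs                     # {doc_id: full text}

        def search(self, q):                     # primary database querying
            return [i for i, text in self.docs.items() if q in text]

        def retrieve(self, doc_id):              # object retrieval
            return self.docs[doc_id]

    class StubBroker:
        """Stands in for a topic-specific index broker."""
        def __init__(self, topic, primaries):
            self.topic = topic
            self.primaries = primaries

        def search(self, q):                     # broker querying: summaries
            return self.primaries               # point back at primaries

    def execute_query(raw, brokers):
        q = raw.strip().lower()                  # query specification
        # (query translation is a no-op in this toy; a real system would
        # rewrite the query into each broker's own schema)
        hits = []
        for b in brokers:
            if b.topic in q:                     # broker location
                for db in b.search(q):           # primary database location
                    for doc_id in db.search(q):
                        hits.append(db.retrieve(doc_id))
        return hits

    web_db = StubPrimary({"d1": "distributed indexing of the web"})
    print(execute_query("Web", brokers=[StubBroker("web", [web_db])]))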




We need more than a few aspiring research mega-labs to be crunching this data.

The world yearns to breathe knowledge.

Toward Fellowship and Advancement,

MR




References:

1. Brin, S. and Page, L. "The Anatomy of a Large-Scale Hypertextual Web Search Engine." 1998.

2. Danzig, P., Ahn, J., Noll, J., and Obraczka, K. "Distributed Indexing: A Scalable Mechanism for Distributed Information Retrieval." (danzig@usc.edu)

