Wake Up Web 2.0

Wednesday, August 30, 2006

Roadmap to Realizing the Open Platform's Promise

We've seen what can happen when powerfully integrated technologies mash together complementary data and processing components from various independent sources, extending their functions into a larger set of capabilities.

We're watching web pages grow up to become services, or to use services. We're seeing web software operate anywhere.

Both the networking layer and the (meta)informational structures are developing rapidly.

Until now, there has not been serious discussion about implementing a completely open data store, along with its open components for serving, retrieving, analyzing, and publishing data from the tool set.

The time has arrived to support a myriad of open data stores and let them become a resource more vast, powerful, and leveling than anything the world has ever seen.





What is the Opportunity?

The opportunity is to combine recent developments in social computing and interoperable document exchange formats to create a scalable document and resource repository, making a large body of documents available to be structured, tagged, classified, and processed by users of the open public resource.

This technique combines the meta-data promised by the semantic web, the community tagging made popular by Web 2.0, the distribution model of the Handle System, and ontological classification systems such as the Web Ontology Language (OWL).
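To make that combination concrete, here is a rough sketch (in Python) of what a single record in such an open store might carry; the handle prefix, tags, and ontology URI are made-up placeholders, not a proposed schema:

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class OpenDocumentRecord:
        # Persistent, location-independent name (Handle-style identifier).
        handle: str
        # Where the bytes currently live; can change without changing the handle.
        locations: List[str] = field(default_factory=list)
        # Community tags contributed by users of the open resource.
        tags: List[str] = field(default_factory=list)
        # Classes from one or more published ontologies (OWL class URIs).
        ontology_classes: List[str] = field(default_factory=list)

    record = OpenDocumentRecord(
        handle="10.9999/example-doc-42",          # hypothetical handle prefix
        locations=["http://example.org/papers/doc42.html"],
        tags=["distributed-indexing", "metadata"],
        ontology_classes=["http://example.org/ontology#ResearchPaper"],
    )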


Most organizations won't aim to house all of the data that is available.

Even for the ones that attempt to 'grab it all', they won't be able to organize or structure it on their own.

That is why the distributed system for summarizing other resources is so important.

Some groups will attempt the current model of trying to grab everything (and doing very little human sorting and tagging on that unmanageable amount, which is especially unmanageable when it sits behind a black box to the rest of the world). Current data stores suffer from too much data and not enough knowledge.

Many other groups will find that there are many companies that are actually already pretty good at "getting everything".

They just aren't good at finding everything when you need it. Without structure, organization, or even an attempt at classification, it's really a tangled and unmanageable web.



New specialty "search engines" will find their true value-add by concentrating on storing only documents within certain domains of knowledge, or by indexing only portions of documents or the structural relationships among documents.

I hesitate to keep using the term "search engine". The term is dubious; a search engine actually consists of many components, including:

- Document / Resource Index
- Algorithm for Relevance *
- Algorithm for Sorting by Influence or Authority

* One great way to do relevance beyond keyword searching, in fact the only INTELLIGENT way to do it, is by using human-created, conceptually structured ontologies (the Web Ontology Language, OWL).
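To show the contrast with keyword density, here is a toy sketch of concept-based matching. The little hierarchy below stands in for a real OWL ontology; the point is that a query term gets expanded to the concepts beneath it before documents are scored:

    # Toy concept hierarchy: concept -> narrower concepts (a real system would load OWL).
    ONTOLOGY = {
        "vehicle": ["car", "truck", "bicycle"],
        "car": ["sedan", "hatchback"],
    }

    def expand(concept, ontology):
        """Return the concept plus everything narrower than it."""
        found = {concept}
        for child in ontology.get(concept, []):
            found |= expand(child, ontology)
        return found

    def concept_score(query_concept, doc_terms, ontology):
        """Count how many of the document's terms fall under the query concept."""
        concepts = expand(query_concept, ontology)
        return sum(1 for term in doc_terms if term in concepts)

    doc = ["sedan", "review", "hatchback", "price"]
    print(concept_score("vehicle", doc, ONTOLOGY))   # 2: both terms sit under "vehicle"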

The major problem today is that the index is locked down; only documents are given back.

The algorithm doesn't work with real conceptual structures; it deals mostly with keyword density.

Thus, authority is still very easy to fake, since it is determined probabilistically.

The engines never know whom to trust, what is what, or what belongs where - so they take votes. Often the crowd is wrong, and manipulators of the crowd can cheat the results, whether they see it that way or not.

The current "engines" treat the world's knowledge as nothing more than a string of symbols without any kind of anchor to a structured conceptual framework.

The influence algorithms will always remain a vital part of information retrieval technology, and companies such as Google will continue to thrive on their intelligent processing techniques.

The opportunity is to level the playing field for processing large document stores, enabling distributed ontologies to be created and applied to large bodies of important documents.

The new service doesn't promise to get every last document in the world. It promises to know which documents are important for which topics: where they belong, what they are about, and what documents are about them.

It also allows anyone to process as much of the raw data in the index as they wish, summarize it, and then offer their summaries as part of their own searchable resource.
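As a hedged illustration of that "summarize and re-offer" loop, the sketch below pulls a few raw records from an open index, applies one group's own topic classifier to them, and publishes the result as a new searchable summary (the fields and topic names are illustrative):

    from collections import defaultdict

    # Raw records pulled from the open index (illustrative fields).
    raw_index = [
        {"handle": "10.9999/a1", "terms": ["owl", "ontology"]},
        {"handle": "10.9999/a2", "terms": ["rss", "aggregation"]},
        {"handle": "10.9999/a3", "terms": ["ontology", "classification"]},
    ]

    # The group's own classifier: map a record to zero or more topics it curates.
    def classify(record):
        if "ontology" in record["terms"]:
            yield "knowledge-representation"
        if "rss" in record["terms"]:
            yield "syndication"

    # Build the summary resource: topic -> handles, offered back to other searchers.
    summary = defaultdict(list)
    for record in raw_index:
        for topic in classify(record):
            summary[topic].append(record["handle"])

    print(dict(summary))
    # {'knowledge-representation': ['10.9999/a1', '10.9999/a3'], 'syndication': ['10.9999/a2']}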




What are the limitations of the current system?

When Google crawls AltaVista, it treats it like every other web page, based on its keywords.

Never mind the fact that AltaVista happens to make billions of web pages available to anyone plugging in keywords.

It is treated only for how it appears as a "site" to a browser, not for how it offers its "information" to its users.


The same problem occurs with specialized search engines and resources.


Many times a document is not the best resource for a query, and a resource such as a specialized search engine is an entirely viable result. It may be best to think of these resources as complementary results that enrich rather than replace the current search tools.

Something like Google won't be able to handle this, because every resource is just another page to them.

Whether you can search 1 document or 10 billion, Google goes by the external pages only.

So you're missing out. You can't go where Google hasn't gone. Never mind the fact that Google never really knows where it's going. It doesn't know what its pages refer to: a page could be intended as fiction or fact, or it could be presented as fact but be otherwise. In any case, it's all just a list of symbols as far as Google is concerned.

"The wisdom of crowds...isn't."


The benefit is for other groups to query large engines with specialized queries, or to index certain components, and then summarize this data and make it available to their own searchers (and other data banks).

This can be thought of as pre-processed data.
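One way to picture pre-processed data is as a small cache of specialized queries that have already been run against the big engines, so nobody has to broadcast them again. A minimal sketch, with invented engine names and a stand-in for the real query call:

    import time

    # (engine, query) -> {"results": [...], "refreshed": timestamp}
    precomputed = {}

    def refresh(engine, query, run_query):
        """Run the specialized query once and store the summary for later searchers."""
        precomputed[(engine, query)] = {
            "results": run_query(engine, query),
            "refreshed": time.time(),
        }

    def lookup(engine, query):
        """Serve the stored summary instead of re-broadcasting the query."""
        return precomputed.get((engine, query))

    # A stand-in for a real call out to a large engine.
    def fake_engine(engine, query):
        return [f"{engine}:{query}:result-{i}" for i in range(3)]

    refresh("big-engine-a", "protein folding reviews", fake_engine)
    print(lookup("big-engine-a", "protein folding reviews")["results"])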


How would you like your information retrieval oracle to bring you this message?

"We've already search everywhere for what you want, and we've found that these are the places that have the most organized occurences in their query results. SEARCH HERE"

I don't know about you, but that's the kind of information I'm looking to retrieve.

It's the classification, stupid.





What allows this to happen now?

Well, every network is affected by itself as well as by other external networks and users.

The web industry has made real strides in software interoperability.

Most of the techniques that comprise the new "open platform" have been known for at least 5 to 10 years, if not much earlier in a more general form.

Hardware directly allows the information to scale, and hardware cost reduction has also allowed more users to come on board.

Someone could write a book on the emerging formats and hardware and user patterns, and what drove what.

I think it's not so important who is driving as where the destination is going to be.


Let's look at this further.


What are the components?


The first component is a general index of resources. A resource could be a single document, a small organized group of lists, or a vast index with various levels of structure.

The network architecture for systems to synchronize and communicate will be implemented with the Handle System, similar to the way DNS is managed in a non-central way.

Unlike DNS, there will be more layers than just the IP address and the domain name.

There will be a Topic Index Layer, various levels of Index Broker Layers, as well as the Site Broker Layer (what we currently think of as a "search engine"), and of course the documents.

This is basically a minimum of 4 tiers, with the index brokers having many levels of organization possible.

(Remember, the classification systems, known as ontologies, form another independent portion, as do the lexicon(s) used by the searchers and ontologists.)
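For anyone who thinks better in code, here is a deliberately tiny sketch of the four tiers wired together. Each class stands in for an independently operated service; none of this is a concrete API proposal:

    class SiteBroker:                       # today's "search engine": returns documents
        def __init__(self, documents):
            self.documents = documents
        def search(self, query):
            return [d for d in self.documents if query in d["text"]]

    class IndexBroker:                      # knows which site brokers cover a topic
        def __init__(self, site_brokers):
            self.site_brokers = site_brokers
        def search(self, query):
            results = []
            for broker in self.site_brokers:
                results.extend(broker.search(query))
            return results

    class TopicBroker:                      # the "oracle": routes a topic to index brokers
        def __init__(self, brokers_by_topic):
            self.brokers_by_topic = brokers_by_topic
        def route(self, topic):
            return self.brokers_by_topic.get(topic, [])

    # Wiring the tiers: topic broker -> index broker -> site broker -> documents.
    docs = [{"handle": "10.9999/d1", "text": "an introduction to ontologies"}]
    site = SiteBroker(docs)
    index = IndexBroker([site])
    topic = TopicBroker({"knowledge-representation": [index]})

    for broker in topic.route("knowledge-representation"):
        print(broker.search("ontologies"))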


The topic broker is the "oracle" that aims to point you in the right direction, even if that is only to another search engine of search engines. (Don't worry, it's going to be automatic.)


The index brokers will tell you what search engine(s) you should use.

The site brokers will act like a current search engine, but they will often be evaluated based on their own aggregate scores.

Think of search engines themselves being scored by their own keyword or PageRank-style scores, based on how good they are at finding certain things. This may include measures like relevant document density (similar to keyword density), as well as document structure, which ensures that a document is not a dangling page but part of a larger knowledge web.
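A rough sketch of that scoring, with made-up weights: a site broker's score combines how dense its results are in relevant documents with how well-connected (non-dangling) those documents are.

    def broker_score(results, is_relevant, weight_density=0.7, weight_structure=0.3):
        """Score a site broker by its result set, not by any single page.

        results      -- list of {"relevant_links": int, ...} records it returned
        is_relevant  -- predicate deciding whether a returned document is on-topic
        """
        if not results:
            return 0.0
        density = sum(1 for doc in results if is_relevant(doc)) / len(results)
        # "Structure": fraction of results that are not dangling pages,
        # i.e. they link into the larger knowledge web.
        structure = sum(1 for doc in results if doc["relevant_links"] > 0) / len(results)
        return weight_density * density + weight_structure * structure

    results = [
        {"topic": "ontology", "relevant_links": 4},
        {"topic": "ontology", "relevant_links": 0},
        {"topic": "celebrity gossip", "relevant_links": 2},
    ]
    print(broker_score(results, lambda d: d["topic"] == "ontology"))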

And finally, my personal favorite layer: the end-result documents we were looking for all along.


This may sound complex, but it's truly how the open platform has already been planned.




Who is going to do this?


The Open Information Platform will take many players working in concert to achieve its goals.


The first layer is one or more topic brokers.

These will be the ones who try to get everything, but unlike Google, they don't pretend to be an expert about everything.

If another index knows more than they do, they'll tell you, and actually bring you there automatically.



The index brokers will specialize in topics, or restricted indexing of specific components of web documents, either over a broad or narrow set of documents.

Other index brokers will be indexers of other indexers.

The site brokers act like today's modern search engine in that they return documents, but they operate with a DEFINITE ontology, with authority that may be custom configurable and come from yet another 3rd party (actually it would be a 6th party, but who's counting).

This allows these site brokers to return documents based on authoritative models of relevancy, not just keyword heuristics.


Then there are the documents.

I didn't want to leave these guys out.

These guys will interface with the ontologies in both explicit and implicit ways, both on and off their sites.

Parts of the new meta-data exchange will be enhanced by extra tagging, and "trackbacks" that let you know what places index, cache, warehouse, summarize or monitor that document.

This is primarily used so that the other engines get alerted when an update occurs, rather than having to check three times a week like Google does.
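A hedged sketch of the alert mechanism (the endpoint URLs are invented): the document keeps a list of the services that index or summarize it and pings each one when it changes, instead of waiting around to be re-crawled.

    # Services that have registered an interest in this document (invented URLs).
    subscribers = [
        "http://indexer.example.org/ping",
        "http://summarizer.example.org/ping",
    ]

    def notify_update(handle, subscribers, send):
        """Tell every registered indexer/cacher/summarizer that the document changed."""
        payload = {"handle": handle, "event": "updated"}
        for endpoint in subscribers:
            send(endpoint, payload)      # e.g. an HTTP POST in a real deployment

    # A stand-in transport so the sketch runs without a network.
    def print_send(endpoint, payload):
        print(f"POST {endpoint} {payload}")

    notify_update("10.9999/example-doc-42", subscribers, print_send)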

Documents get additional object-naming characteristics (documents become entities beyond their simple location).

A location can always be down, or moved, or abandoned.

The idea of a name or a document being "down" is just as absurd as the concept of "Alexander the Great" being "down" across all areas of the web, like Wikipedia and all the other engines.

Other ontological information related to the documents will be stored by third party ontology providers that work using the document names and extended properties.
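Here is a minimal sketch of that indirection, with invented identifiers: the persistent name is the stable key, while locations and third-party ontology statements hang off it and can change or accumulate without touching the name itself.

    # Name registry: persistent name -> current locations (can change or go empty).
    locations = {
        "doc:alexander-the-great-bio": ["http://mirror-a.example.org/atg.html"],
    }

    # Third-party ontology providers attach statements keyed by the same name.
    ontology_statements = {
        "doc:alexander-the-great-bio": [
            ("about", "http://example.org/ontology#HistoricalFigure"),
        ],
    }

    def resolve(name):
        """Return whatever is currently known about a named document."""
        return {
            "locations": locations.get(name, []),          # may be empty if all copies are down
            "statements": ontology_statements.get(name, []),
        }

    print(resolve("doc:alexander-the-great-bio"))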






So, to say it again, who are the players?

The architects. They make the standards and solutions, or at least the core system of protocols for communicating the objects of the repositories, their names, and the locations of their extended data and relationships.

Actually extending the data and the relationships among objects is the domain of the ontologists.

So, there are many layers, including the Network, the ontologies, as well as the interfaces, and various processing and sorting algorithms for relevance, influence and credibility.

The topic and index brokers. They go for everything within their domain, and specialize in redirecting people to the best resources, not the best documents.

As well, this new open platform will give rise to the potential to process, research, summarize, map, maintain and support the ever-increasing wealth this information can bring when made open.



Effects on the World :

This technology and its integration will do more than transform the world of "IT" or the "internet".

Its scope will ripple to reach every corner of the earth.

Time is inherently more valuable to someone having access to this toolset.

This is a large change in the relative value between time and capital, or other physical property.

Time will continue to leapfrog in value, not least for the underprivileged.

To put it simply, the implementation of this Idea is the Trickle Down theory of Knowledge promulgation.

Nothing, and I mean nothing will do more to close the digital divide, than this format.


Forget about putting computers in schools and villages.

When this open platform reaches its inflection point, the hardware will invest in itself.


This concept is truly the Reaganomics of philanthropy through growth into economies rich and poor.

No other media format will level the world as much as this.

And our society will never be quite the same.




Propagation Strategy.


So, who are the early adopters? How do we deal with issues of existing organizations not wanting to lose their competitive edge using their closed networks?


There are many groups who need to become more aware of how close we are to achieving this open system of Information Retrieval.

They are:

The Bloggers - leading the front line in Tech, Business and Politics. These are the scouts and rangers of the Movement.

The Academic world - This group has engineered the infrastructure and will process and shape its progeny. They are the research labs.

The Corporate World - The ones who brought all the components together, and will eventually adopt and integrate this system into their offerings, either as a server, a client, or a complex blend. (The prosumer, to use an Alvin Toffler word.)

Government - You guessed it, the State. No one can invest in infrastructure like the Government. And, no one stands to gain more from the industry this platform will incubate.

There are also other groups, such as non-profits, non-government organizations, philanthropic organizations, as well as political groups who can benefit from this framework.



Challenges -

The very naming of a concept this vast is highly problematic.

Saying that it is an upgrade to the "web" is quite demeaning, because the web only applies to HTML documents made primarily for display purposes.

Terms like web 2.0, semantic web, social computing, active networks, distributed indexing may talk about one benefit, or allude to an evolution of sorts, but they all fall short in describing it.

Neural Name Net ?

Hey, it's got something going for it.



Every one of the terms has something going for it, but misses the rest.

Other challenges will be dealing with resistance from existing providers of information retrieval technology, and large data warehouses like MySpace, Google, Amazon, and all the rest of the capitalist machines.



Personally, my latest codename for the project is S.C.O.P.E -

Social Computing Open Protocol Environment

It also has a meaning as in Scope, the tool that one uses to scope.

And, the act of scoping.

As well as the scope of what is being surveyed, and the grammatical idea of scope, which determines the reach of a modifier - this is similar to meta-data.






[I had originally made a reference to the parable of David and Goliath, and tried to change the story to fit my needs...

It turns out this was ill-conceived after a marathon blogging session.

The correct parable is the Tower of Babel, where humanity builds a tower to reach the heavens.]


On one side is the old way of communications, with his war chest of cash, his data, his algorithms, his designers, his mathematicians, his venture capital ties, his investment banking clout, and armies of shareholders behind him.

It would seem like nothing could stop this process.

But look, just next to every company is another company, with just as much cash, data, algos, and just as many people supporting his reign.

Now it seems there is a sea of companies, all seeking to take from one another or to buy each other out outright.

It would seem that nothing could get in the way of the mega-companies that combine the great Goliaths of the past.

But, in the distance, without anyone listening or paying attention, is the linguist, having nothing but his own language for cooperation.

He gives nothing but a whisper to the meekest of groups, and at that moment, every company will now either have to speak the new Language of Collaboration, or be history.





Let knowledge trickle down and let ye breathe wisdom.


MR

Monday, August 14, 2006

Wake up Web 2.0 to the Challenge

Hello fellow web 2.0, semantic web, mashup mavens,

My intention is to uncover one very large-scale implementation that incorporates the fundamentals behind what people are calling, web 2.0, and how it can be extrapolated to its full potential.

First, I want to point our attention to an interesting comment made by none other than the Google duo in their famous research paper on PageRank and their centralized repository.

The keywords would be "Anatomy of a Large-Scale Hypertextual Web Search Engine".

This was written at a more academic, idealistic time, when speculation and ideology were still delivered with the intent of forthcoming academic insight and predictive precision.

Interestingly enough, the paper closes with this thought about the limitations of Google's Centralized Indexing system, and the next frontier of search, which they identify as distributed, de-centralized search indexing.

Here is the final paragraph:

Of course a distributed systems like Gloss [Gravano 94] or Harvest will often be the most efficient and elegant technical solution for indexing, but it seems difficult to convince the world to use these systems because of the high administration costs of setting up large numbers of installations. Of course, it is quite likely that reducing the administration cost drastically is possible. If that happens, and everyone starts running a distributed indexing system, searching would certainly improve drastically.

-Sergey Brin and Lawrence "PageRank" Page


As we can see, the famous duo has given us some wisdom and parting thoughts as to the most important evolution in information retrieval.


A decentralized database is clearly a beautiful evolution.

It's already happened for blogs.

Each blog is just a mini index of posts. Totally searchable on its own.

But, blogs have adopted standard document types to facilitate RSS feeds that allow them to be aggregated, processed and indexed elsewhere to be available for other users and aggregators.

Blogs have easily demonstrated one possible way to make content sharable, and indexable and readable throughout various web services.
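A small sketch of that aggregation step, using nothing but the standard library. It parses an RSS 2.0 string instead of fetching a live feed, so the feed content here is illustrative:

    import xml.etree.ElementTree as ET

    rss = """<rss version="2.0"><channel>
      <title>Example Blog</title>
      <item><title>Post one</title><link>http://example.org/1</link></item>
      <item><title>Post two</title><link>http://example.org/2</link></item>
    </channel></rss>"""

    def index_feed(rss_text):
        """Turn one blog's feed into records any aggregator can merge and search."""
        channel = ET.fromstring(rss_text).find("channel")
        source = channel.findtext("title")
        return [
            {"source": source,
             "title": item.findtext("title"),
             "link": item.findtext("link")}
            for item in channel.findall("item")
        ]

    aggregate = []
    aggregate.extend(index_feed(rss))          # repeat for every feed you follow
    print(aggregate)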



In fact, one of the most interesting evolutions in the social networking arena can be made possible by something as simple as allowing "Friend Feeds", or "Profile Feeds", which enable the aggregation of profiles in other places.

Some aggregators will focus on being comprehensive and covering as many major sources as possible.

Other aggregators will offer specialized databases of specific domains of knowledge.

How easy would it be to make tools to model and process the friend-o-sphere, and the web-index-o-sphere, if an open, distributed, decentralized database were fully available to compliant warehousers willing to maintain the integrity of the system?
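To make the friend-o-sphere tangible, here is a toy sketch: profile feeds from different sites are reduced to (profile, friend) edges and merged into one graph that any compliant warehouser could process. The profile URLs are invented.

    from collections import defaultdict

    # Each site exposes a profile feed: profile -> friends it lists (invented URLs).
    site_a_feed = {"http://site-a.example/alice": ["http://site-b.example/bob"]}
    site_b_feed = {"http://site-b.example/bob": ["http://site-a.example/alice",
                                                 "http://site-b.example/carol"]}

    def merge_feeds(*feeds):
        """Merge per-site friend feeds into one graph of the friend-o-sphere."""
        graph = defaultdict(set)
        for feed in feeds:
            for profile, friends in feed.items():
                graph[profile].update(friends)
        return graph

    graph = merge_feeds(site_a_feed, site_b_feed)
    for profile, friends in graph.items():
        print(profile, "->", sorted(friends))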

I would say that this is nothing but a true revolution in media reaching its full potential.



Can the social sites that want a piece of the MySpace pie afford NOT to make use of this extensible platform?


Can search engine underdogs afford not to commoditize the web index, compete more on processing, user behavior, and calculation, and thus turn the world of "search" upside down?


The new data warehouse model is not that of the "site", it is of the "service", or feed.

Sites will continue to be locations to access and process data from web services.

For more information on our Social Networking Initiatives, please visit

http://www.connectsociety.com


And for information on our web object repository framework, please join us at

http://www.linkassociation.com


Search is going to fragment into its disciplines: compilation, structuring, and retrieval.





Web 1.0 - going to my library, doing a keyword search for books at my library building.

Web 2.0 - going to your library, using the computer to search all libraries, or only the libraries relevant to you.

- Being able to make use of one or more supplemental classification systems for all or part of the library (such as Dewey Decimal, Library of Congress, or any other well-defined topical hierarchy or ontology).

- Being able to use any retrieval technique from any library in the world (not just the ones your library offers).
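Carrying the library analogy into code, here is a hedged sketch in which every "library" is searchable on its own, more than one classification scheme can be laid over the shared records, and a single query fans out to whichever libraries you choose. The call numbers and scheme names are only illustrative:

    # Shared records, each carrying call numbers under more than one scheme.
    libraries = {
        "city-library": [
            {"title": "Ontology Basics", "dewey": "006.3", "lcc": "QA76.9"},
        ],
        "university-library": [
            {"title": "Distributed Indexing", "dewey": "025.04", "lcc": "Z699"},
        ],
    }

    def search(query_prefix, scheme, library_names):
        """Search the chosen libraries under whichever classification scheme you prefer."""
        hits = []
        for name in library_names:
            for record in libraries[name]:
                if record[scheme].startswith(query_prefix):
                    hits.append((name, record["title"]))
        return hits

    # Same collections, different classification lens.
    print(search("0", "dewey", ["city-library", "university-library"]))
    print(search("QA", "lcc", ["city-library", "university-library"]))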

That's just the beginning. Now that this rich data is so readily available on call, the software that figures out what you want will get even better.


Snippets from a paper on Distributed Indexing. (Danzig) [2]


This project describes an indexing scheme that will permit thorough yet efficient searches of millions of retrieval systems.

We report here an architecture which we call distributed indexing that could unite millions of autonomous, heterogeneous retrieval systems.

We can make sophisticated users out of naive (non-expert) users without employing artificial intelligence. We do this by making it easy to create autonomous databases that specialize in particular topics and types of queries.

We say such databases are precomputed because they contain the results of executing particular queries on thousands of other databases. Precomputation avoids the need for users to broadcast their queries to hundreds or thousands of databases. It eliminates work, reduces query response time, and helps locate obscure information.

The idea behind distributed indexing is to establish databases that summarize the holdings on particular topics of other databases.


(In the system:)

An index broker is a database that builds indices. [...] Index brokers can index the contents of a "primary database" and, in fact, other index brokers.

Primary Databases are today's single-site retrieval systems such as [search engines] and online catalogs.

Site brokers store the "generator queries" (resource descriptions) of all index brokers that index their associated database, and are responsible for keeping these index brokers' indices current.

Site brokers apply their generator queries against these updates and reliably forward appropriate changes to the index brokers that originally registered the generators.

Users of the index broker may have to contact the primary databases to retrieve the object itself.

[the primary object is not obligated to give the index broker a copy of itself, nor is it prevented from doing so.]

Our architecture extends attribute-based naming to bibliographic databases.

Topic broker replication is implemented with a flooding algorithm.

Creating a new index broker entails describing its contents and selecting a set of primary databases and index brokers to index.

The broker's "generator", the query that it registers at the site brokers of the databases that it indexes, defines an index broker.

...topic broker, the one logically centralized component of the system (site brokers of primary databases also register a generator and an abstract with the topic broker.)

The topic Broker that first registered an index broker informs the index broker when a pertinent new database or index broker is created.

Executing a query involves a sequence of steps that eventually identifies or retrieves a set of relevant objects. These steps are query specification, query translation, broker location, broker querying, primary database location, primary database querying, and object retrieval.

[end notes from Danzig]
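Reading Danzig literally, the site broker's role can be sketched in a few lines: it stores each index broker's generator query and, when its primary database changes, applies the generators to the update and forwards matches to the brokers that registered them. This is a toy model, with the generator reduced to a keyword test:

    class SiteBroker:
        """Sits next to one primary database and keeps index brokers current."""

        def __init__(self):
            self.generators = []          # (index_broker, generator_query) pairs

        def register(self, index_broker, generator_query):
            self.generators.append((index_broker, generator_query))

        def on_update(self, record):
            # Apply every registered generator query to the changed record and
            # forward it only to the index brokers whose generators match.
            for index_broker, generator in self.generators:
                if generator(record):
                    index_broker.receive(record)

    class IndexBroker:
        def __init__(self, name):
            self.name, self.index = name, []
        def receive(self, record):
            self.index.append(record)
            print(f"{self.name} indexed {record['title']}")

    site = SiteBroker()
    ontology_broker = IndexBroker("ontology-broker")
    site.register(ontology_broker, lambda r: "ontology" in r["keywords"])

    site.on_update({"title": "OWL in practice", "keywords": ["ontology", "owl"]})
    site.on_update({"title": "Cooking at home", "keywords": ["food"]})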




We need more than a few aspiring research mega-labs to be crunching this data.

The world yearns to breathe knowledge.

Toward Fellowship and Advancement,

MR




References:

1. The Anatomy of a Large-Scale Hypertextual Web Search Engine
Brin, Page. 1998

2. Distributed Indexing: A Scalable Mechanism for Distributed Information Retrieval
Danzig, Ahn, Noll, Obraczka. (danzig@usc.edu)
