Wake Up Web 2.0

Wednesday, August 30, 2006

Roadmap to Realizing the Open Platform's Promise


We've seen what can happen when well-integrated technologies mash together complementary data and processing components from independent sources, extending their functions into a larger set of capabilities.

We're watching web pages grow up to become services, or to use services. We're seeing web software operate anywhere.

Both the networking layer and the (meta)informational structures are under wide development.

Until now, there has been no serious discussion of implementing a completely open data store, along with its open components for serving, retrieving, analyzing, and publishing data from the tool set.

The time has arrived to support a myriad of open data stores, which together can become a resource more vast, powerful, and leveling than anything the world has ever seen.





What is the Opportunity?

The opportunity is to combine recent developments in social computing and interoperable document exchange formats into a scalable document and resource repository: a large body of documents made available to be structured, tagged, classified, and processed by the users of an open public resource.

This technique combines the metadata promised by the Semantic Web, the community tagging made popular by Web 2.0, the distribution model of the Handle System, and ontological classification systems such as the Web Ontology Language (OWL).
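As a rough illustration of how those pieces might sit together in a single record, here's a minimal sketch. The field names, the handle, and the URLs are all hypothetical, invented for this example:

```python
from dataclasses import dataclass, field

@dataclass
class RepositoryEntry:
    """One resource in the open repository (illustrative fields only)."""
    handle: str          # persistent name, in the Handle System's spirit
    locations: list      # current URLs where the resource can be fetched
    tags: set = field(default_factory=set)          # community ("Web 2.0") tags
    owl_classes: list = field(default_factory=list) # OWL class IRIs it instantiates

entry = RepositoryEntry(
    handle="10.9999/example-doc",                   # hypothetical handle
    locations=["http://example.org/papers/42.html"],
    tags={"semantic-web", "indexing"},
    owl_classes=["http://example.org/onto#ResearchNote"],
)
```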


Most organizations won't aim to house all of the data that is available.

Even for the ones that attempt to 'grab it all', they won't be able to organize or structure it on their own.

That is why the distributed system for summarizing other resources is so important.

Some groups will attempt the current model of trying to grab everything, while doing very little human sorting and tagging of that unmanageable amount (an amount that is especially unmanageable when it sits behind a black box, hidden from the rest of the world). Current data stores suffer from too much data and not enough knowledge.

Other groups will find that many companies are actually already pretty good at "getting everything".

They just aren't good at finding everything when you need it. Without structure, organization, or even an attempt at classification, it's really a tangled and unmanageable web.



New specialty "search engines" will find their true value-add by concentrating on storing only documents within certain domains of knowledge, or indexing only portions of documents, or the structural relationships among documents.

I hesitate to keep using the term "search engine". The term is dubious: a search engine consists of many components, including:

A document / resource index
An algorithm for relevance *
An algorithm for sorting by influence or authority

* One great way to get relevance beyond keyword searching, in fact the only INTELLIGENT way to do it, is to use human-created, conceptually structured ontologies (the Web Ontology Language, OWL).
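To make that decomposition concrete, here's a minimal Python sketch of the three components. This is a toy, not any real engine's design; all the structures are hypothetical:

```python
class DocumentIndex:
    """Component 1: the document / resource index, as a tiny inverted index."""
    def __init__(self):
        self.postings = {}                          # term -> set of document handles

    def add(self, doc_id, text):
        for term in text.lower().split():
            self.postings.setdefault(term, set()).add(doc_id)

    def candidates(self, query):
        sets = [self.postings.get(t, set()) for t in query.lower().split()]
        return set.intersection(*sets) if sets else set()

def relevance(doc_concepts, query_concepts):
    """Component 2: relevance beyond keywords, sketched as the overlap between
    the query's concepts and the document's concepts in a shared (OWL-style) ontology."""
    return len(doc_concepts & query_concepts)

def authority(doc_id, votes):
    """Component 3: influence / authority, a stand-in for link-based vote counting."""
    return votes.get(doc_id, 0)
```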

The major problem today is that the index is locked down; only documents are given out.

The algorithm doesn't work with real conceptual structures; it deals mostly with keyword density.

Thus, authority is still very easy to fake, since it is determined probabilistically.

The engines never know whom to trust, or what is what, or what belongs where - so they take votes. And often the crowd is wrong, and manipulators of the crowd can cheat the results, whether they see it that way or not.

The current "engines" treat the world's knowledge as nothing more than a string of symbols without any kind of anchor to a structured conceptual framework.

The influence algorithms will always remain a vital part of information retrieval technology, and companies such as Google will continue to thrive on their intelligent processing techniques.

The opportunity is to level the playing field in processing large document stores, thus enabling distributed ontologies to be created and applied to large bodies of important documents.

The new service doesn't promise to get every last document in the world. It promises to know which ones are important for which topic - where they belong, what they are about, and what documents are about them.

It also allows anyone to process as much of the raw data in the index as they wish, summarize it, and then offer their summaries as part of their own searchable resource.




What are the limitations of the current system?

When Google crawls AltaVista, it treats it like every other web page, based on its keywords.

Never mind the fact that AltaVista happens to make billions of pages available by plugging in keywords.

The site is treated only by how it appears as a "site" to a browser, not by how it offers its "information" to its users.


The same problem occurs with specialized search engines and resources.


Many times, a document is not the best resource for a query, and a resource such as a specialized search engine is an entirely viable return for a query. It may be best to think of these resources as complementary results that enrich rather than replace the current search tools.

Something like Google won't be able to handle this, because every resource is just another page to it.

Whether you can search 1 document or 10 billion, Google goes by the external pages only.

So, you're missing out. You can't go where Google hasn't gone. Never mind the fact that Google never really knows where it's going. It doesn't know what its pages refer to: a page could be intended as fiction, or fact, or it could be presented as fact but be otherwise. In any case, it's all just a list of symbols as far as Google is concerned.

"The wisdom of crowds...isn't."


The benefit is for other groups to query the large engines with specialized queries, or to index certain components, then summarize this data and make it available to their own searchers (and to other data banks).

This can be thought of as pre-processed data.
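A minimal sketch of that pre-processing flow follows. The functions, engine URL, and record fields are all hypothetical; a real broker would make a network call to an engine's open API where the stub is:

```python
def fetch_slice(engine_url, topic):
    """Pull raw documents for one topic from an open index.
    Stubbed with canned data in place of a real network call."""
    return [{"handle": "10.9999/%s-1" % topic, "topic": topic,
             "text": "Example raw document text about " + topic}]

def summarize(doc):
    """Reduce a raw document to a compact, searchable summary record."""
    return {"handle": doc["handle"], "topic": doc["topic"],
            "abstract": doc["text"][:280]}  # naive truncation stands in for real summarization

def build_summary_bank(engine_url, topics):
    """Query the big engine, summarize each slice, and keep the result
    as our own searchable resource: the 'pre-processed data'."""
    bank = []
    for topic in topics:
        bank.extend(summarize(d) for d in fetch_slice(engine_url, topic))
    return bank

bank = build_summary_bank("http://big-engine.example.com", ["ontology", "tagging"])
```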


How would you like your information retrieval oracle to bring you this message?

"We've already searched everywhere for what you want, and we've found that these are the places with the most organized occurrences in their query results. SEARCH HERE"

I don't know about you, but that's the kind of information I'm looking to retrieve.

It's the classification, stupid.





What allows this to happen now?

Well, every network is affected by itself as well as by other external networks and users.

The web industry has made real strides in software interoperability.

Most of the techniques that comprise the new "open platform" have been known for at least 5 to 10 years, if not much earlier in a more general form.

Hardware directly allows the information to scale, and hardware cost reduction has also allowed more users to come on board.

Someone could write a book on the emerging formats and hardware and user patterns, and what drove what.

I think it's not so important who is driving, but where the destination is going to be.


Let's look at this further.


What are the components?


The first component is a general index of resources. A resource could be a single document, a small organized group of lists, or a vast index with various levels of structure.

The network architecture for systems to synchronize and communicate will be implemented with the Handle System, similar to the way DNS servers are managed in a non-central way.

Unlike DNS, there will be more layers than just the IP address and domain name.

There will be a Topic Index Layer, various levels of Index Broker Layers, as well as the Site Broker Layer (what we currently think of as a "search engine"), and of course the documents.

This is basically a minimum of 4 tiers, with the index brokers having many levels of organization possible.

(Remember, the classification systems, known as ontologies, form another independent portion, as do the lexicon(s) used by searchers and ontologists.)
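To make the tiers concrete, here's a minimal Python sketch of the resolution path. The broker classes are hypothetical; in a real deployment each tier would be an independent networked service:

```python
class SiteBroker:
    """Bottom tier above the documents: what we call a 'search engine' today."""
    def __init__(self, docs):
        self.docs = docs                        # handle -> document text

    def search(self, query):
        return [h for h, text in self.docs.items() if query in text]

class IndexBroker:
    """Middle tier: knows which site brokers cover which topics."""
    def __init__(self, sites):
        self.sites = sites                      # topic -> SiteBroker

    def route(self, topic):
        return self.sites[topic]

class TopicBroker:
    """Top tier: the 'oracle' that points you toward the right index broker."""
    def __init__(self, indexes):
        self.indexes = indexes                  # domain -> IndexBroker

    def resolve(self, domain, topic, query):
        return self.indexes[domain].route(topic).search(query)

docs = SiteBroker({"10.9999/owl-intro": "an introduction to owl ontologies"})
oracle = TopicBroker({"computing": IndexBroker({"semantic-web": docs})})
print(oracle.resolve("computing", "semantic-web", "owl"))  # -> ['10.9999/owl-intro']
```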


The topic broker is the "oracle" that aims to point you in the right direction, even if that is only to another search engine of search engines. (Don't worry, it's going to be automatic.)


The index brokers will tell you what search engine(s) you should use.

The site brokers will act like a current search engine, but they will often be evaluated based on their own aggregate scores.

Think of search engines themselves being scored by their own keyword or PageRank-style scores, based on how good they are at finding certain things. This may include things like relevant document density (similar to keyword density), as well as document structure, which ensures that a document is not a dangling page but part of a larger knowledge web. (A sketch of this scoring idea follows the rundown below.)

And finally, my personal favorite layer: the end-result documents we were looking for all along.
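Here's the promised sketch of scoring a site broker the way individual pages are scored today. The weights and record fields are made up purely for illustration:

```python
def engine_score(results, relevant_handles):
    """Score a site broker on its own output: 'relevant document density' is
    the fraction of returned results judged on-topic, and results with links
    (part of a larger knowledge web, not dangling pages) count toward structure."""
    if not results:
        return 0.0
    density = sum(r["handle"] in relevant_handles for r in results) / len(results)
    structure = sum(bool(r.get("links")) for r in results) / len(results)
    return 0.7 * density + 0.3 * structure      # illustrative weighting only
```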


This may sound complex, but it's truly how the open platform has already been planned.




Who is going to do this?


The Open Information Platform will take many players working in concert to achieve its goals.


The first layer is one or more topic brokers.

These will be the ones who try to get everything, but unlike Google, they don't pretend to be an expert about everything.

If another index knows more than they do, they'll tell you, and actually bring you there automatically.



The index brokers will specialize in topics, or restricted indexing of specific components of web documents, either over a broad or narrow set of documents.

Other index brokers will be indexers of other indexers.

The site brokers act like today's modern search engines, in that they return documents, but they operate with a DEFINITE ontology, with authority that may be custom-configurable and come from yet another 3rd party (actually, it would be a 6th party, but who's counting).

This allows these site brokers to return documents based on authoritative models of relevancy, not just keyword heuristics.
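A minimal sketch of that separation of concerns: the site broker ranks with its definite ontology, while the authority score plugs in from an independent provider. Both interfaces are hypothetical:

```python
def rank(handles, concept_match, authority):
    """Order documents by ontology-grounded relevance, tie-broken by an
    authority score that can come from a configurable third party.
    concept_match and authority are callables: handle -> score."""
    return sorted(handles, key=lambda h: (concept_match(h), authority(h)),
                  reverse=True)
```

Because authority arrives as a plain callable, swapping authority providers never touches the relevance logic.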


Then there are the documents.

I didn't want to leave these guys out.

These guys will interface with the ontologies in both explicit and implicit ways, both on and off their sites.

Parts of the new meta-data exchange will be enhanced by extra tagging, and by "trackbacks" that let you know which places index, cache, warehouse, summarize, or monitor that document.

This is primarily used for the other engines to get alerted when an update occurs, rather than having to check 3 times a week like Google does.
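A minimal sketch of that push model, with class and method names invented for illustration:

```python
class Watcher:
    """An engine, cache, or summarizer that has registered interest."""
    def notify(self, handle):
        print("re-indexing %s after update ping" % handle)

class Document:
    def __init__(self, handle):
        self.handle = handle
        self.text = ""
        self.watchers = []                  # places that index, cache, or monitor us

    def register(self, watcher):
        self.watchers.append(watcher)

    def update(self, new_text):
        self.text = new_text
        for w in self.watchers:             # push the alert; nobody has to poll
            w.notify(self.handle)

doc = Document("10.9999/example-doc")
doc.register(Watcher())
doc.update("revised text")                  # every watcher is pinged immediately
```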

Documents also gain additional object-naming characteristics (documents become entities beyond their simple location).

A location can always be down, or moved, or abandoned.

The idea of a name or a document being "down" is just as absurd as the concept of "Alexander the Great" being "down" across all areas of the web, like Wikipedia and all the other engines.

Other ontological information related to the documents will be stored by third party ontology providers that work using the document names and extended properties.
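As a closing sketch of location-independent naming: a toy in-memory registry where the name stays constant while locations come and go, and third parties hang properties off the name rather than off any one URL. The real Handle System is a distributed service; these calls are invented:

```python
registry = {}                     # name -> {"locations": [...], "properties": {...}}

def bind(name, location):
    entry = registry.setdefault(name, {"locations": [], "properties": {}})
    entry["locations"].append(location)

def move(name, old, new):
    locs = registry[name]["locations"]
    locs[locs.index(old)] = new   # the name itself is never "down"

def annotate(name, key, value):
    registry[name]["properties"][key] = value  # third-party ontological data

bind("10.9999/alexander-the-great", "http://example.org/alexander.html")
move("10.9999/alexander-the-great", "http://example.org/alexander.html",
     "http://mirror.example.net/alexander.html")
annotate("10.9999/alexander-the-great", "owl:class", "http://example.org/onto#Person")
```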






So, to say it again, who are the players?

The architects. They make the standards and solutions. Or, at least the core system of protocols to communicate the objects of the repositories, their names, and the locations of their extended data and relationships.

Actually extending the data and the relationships among objects is the domain of the ontologists.

So, there are many layers, including the Network, the ontologies, as well as the interfaces, and various processing and sorting algorithms for relevance, influence and credibility.

The topic and index brokers. They go for everything within their domain, and specialize in redirecting people to the best resources, not the best documents.

As well, this new open platform will give rise to the potential to process, research, summarize, map, maintain and support the ever-increasing wealth this information can bring when made open.



Effects on the World :

This technology and its integration will do more than transform the world of "IT", or the "internet".

Its scope will ripple to reach every corner of the earth.

Time is inherently more valuable to someone having access to this toolset.

This is a large change in the relative value between time and capital, or other physical property.

Time will continue to leapfrog in value, not least for the underprivileged.

To put it simply, the implementation of this idea is the trickle-down theory of knowledge promulgation.

Nothing, and I mean nothing will do more to close the digital divide, than this format.


Forget about putting computers in schools and villages.

When this open platform reaches its inflection point, the hardware will invest in itself.


This concept is truly the Reaganomics of philanthropy, fueling growth in economies rich and poor.

No other media format will level the world as much as this.

And our society will never be quite the same.




Propagation Strategy.


So, who are the early adopters? How do we deal with issues of existing organizations not wanting to lose their competitive edge using their closed networks?


There are many groups who need to become more aware of how close we are to achieving this open system of Information Retrieval.

They are:

The Bloggers - leading the front line in Tech, Business and Politics. These are the scouts and rangers of the Movement.

The Academic world - This group has engineered the infrastructure and will process and shape its progeny. They are the research labs.

The Corporate World - The ones who brought all the components together, and who will eventually adopt and integrate this system into their offerings, either as a server, or a client, or a complex blend. (The prosumer, to use an Alvin Toffler word.)

Government - You guessed it, the State. No one can invest in infrastructure like the Government. And, no one stands to gain more from the industry this platform will incubate.

There are also other groups, such as non-profits, non-government organizations, philanthropic organizations, as well as political groups who can benefit from this framework.



Challenges -

The very naming of a concept this vast is highly problematic.

Saying that it is an upgrade to the "web" is quite demeaning, because the web only applies to HTML documents made primarily for display purposes.

Terms like web 2.0, semantic web, social computing, active networks, distributed indexing may talk about one benefit, or allude to an evolution of sorts, but they all fall short in describing it.

Neural Name Net ?

Hey, it's got something going for it.



Every one of the terms has something going for it, but misses the others.

Other challenges will be dealing with resistance from existing providers of information retrieval technology, and from large data warehouses like MySpace, Google, Amazon, and all the rest of the capitalist machines.



Personally, my latest codename for the project is S.C.O.P.E -

Social Computing Open Protocol Environment

It also has a meaning as in Scope, the tool that one uses to scope.

And, the act of scoping.

As well as the scope of what is being surveyed, and the grammatical idea of scope, which determines the reach of a modifier - this is similar to metadata.






[I had originally made a reference to the parable of David and Goliath, and tried to change the story to fit my needs.

It turns out this was ill-conceived after a marathon blogging session.

The correct parable is the Tower of Babel: humanity building a tower to reach the heavens.]


On one side is the old way of communications, with his war chest of cash, his data, his algorithms, his designers, his mathematicians, his venture capital ties, his investment banking clout, and armies of shareholders behind him.

It would seem like nothing could stop this process.

But look, just next to every company is another company, with just as much cash, data, algos, and just as many people supporting his reign.

Now it seems there is a sea of companies, all seeking to take from one another, or to buy each other out outright.

It would seem that nothing could get in the way of the mega-companies that combine the great Goliaths of the past.

But, in the distance, without anyone listening or paying attention, is the linguist, having nothing but his own language for cooperation.

He gives nothing but a whisper to the meekest of groups, and from that moment, every company will either have to speak the new Language of Collaboration, or be history.





Let knowledge trickle down and let ye breathe wisdom.


MR
