Wake Up Web 2.0

Friday, October 20, 2006

Google Needs a Classification System - such as Dewey Decimal

 
Google Needs a Dewey Decimal System
 
 
Google's algorithm is already very useful for what it does: it delivers 
likely matches for what might have been the intent of your 
search.
 
It will always be good at that, and continue to get better at doing 
JUST THAT.
 
But, without a Dewey-Decimal-like system, it falls short as an 
information retrieval ideology.
 
Google as ideology fails when it cannot realize that the world's 
library belongs to the world, not to private advertisers. Or information 
hogs. Or media giants. Or software companies.
 
In, Of, and By the Public this Library will be.
 
 
 
Benefits of a Distributed System
 
 
In a purely technical sense, the distributed system does not directly 
make search better, in and of itself.
 
 
It improves search indirectly, by allowing search filters and 
classification systems to be developed and shared by everyone in a 
standard, low-cost environment.
 
Aggregating resources is extremely easy with this system architecture. 
This will encourage many current web publishers to adopt rigorous niche 
classification systems that fit into larger-scope labeling systems.
 
 
 
It also lowers the cost of innovation in all aspects of the search 
process, by making crawling obsolete, and by making aggregation of 
resources a technologically trivial process (even if it is still a complex and 
multifarious process in terms of the information architecture, the 
"code work" and interfaces are all taken care of). There will be various 
competing providers of classification schemes, and the market will 
decide which ones thrive. The architecture allows all of these to be made 
available to the user.
 
In the exact same vein, it opens the market for enhanced providers of 
filters and algorithms, and various processes that sort, map, and 
predictively analyze the result sets imported from other tools.  
 
In this way, the distributed system greatly changes the economics of 
many areas of search engine development.
 
By isolating the various components of the "search engine", this new 
system will foster greater customization and innovation. The components 
are the aggregation, the relevance processes (symbolic, 
conceptual, authoritative, socially computed, or a blend), and the sorting and 
filtering algorithms.
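
To make the separation of components concrete, here is a minimal sketch in 
Python. Everything in it is an assumption made for illustration: the 
Document type, the term_count relevance function, the search function, and 
the example.org URLs are hypothetical names, not part of any existing 
system. The point is only that each stage can be swapped independently.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Document:
    url: str
    text: str


def term_count(doc: Document, query: str) -> float:
    """A purely symbolic relevance process: count occurrences of the term."""
    return float(doc.text.lower().split().count(query.lower()))


def search(
    query: str,
    aggregate: Callable[[], Iterable[Document]],
    relevance: Callable[[Document, str], float],
    keep: Callable[[Document, float], bool],
) -> List[Document]:
    """Run the three isolated stages: aggregate, score, then filter and sort."""
    scored = [(relevance(doc, query), doc) for doc in aggregate()]
    kept = [(score, doc) for score, doc in scored if keep(doc, score)]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in kept]


# Hypothetical shared index standing in for an aggregated resource.
SHARED_INDEX = [
    Document("example.org/cats", "the jaguar is a big cat and the jaguar hunts"),
    Document("example.org/cars", "a jaguar sedan review"),
    Document("example.org/misc", "nothing relevant here"),
]

results = search(
    "jaguar",
    aggregate=lambda: SHARED_INDEX,
    relevance=term_count,
    keep=lambda doc, score: score > 0,
)
print([doc.url for doc in results])
```

Any of the three callables could come from a different provider, which is 
the kind of market for filters and algorithms described above.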
 
The distributed system also changes the politics of search indexes, 
social networks, and any kind of structured resource.
 
 
The current system does not aid those who seek exhaustive, 
authoritative, definitive document sets rather than sets of possible matches.
 
 
 
When the search result set is complete, spam-free, and 
clear of off-topic content, then it can be seen as a datum in and of 
itself.
 
It becomes an index of a certain kind of query intent.
 
This datum can then be meaningfully combined, or summarized and 
combined, with other complementary or supplementary resources, so that this 
structured index can be used by other resources to provide greater depth 
and breadth to the search experience.
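
As a rough illustration of treating a clean result set as a datum, the 
sketch below models two curated result sets as Python frozensets and 
combines them. The set names and example.org URLs are invented for the 
example, and it assumes each set really is complete and spam-free, which 
is what makes the combination meaningful.

```python
# Two hypothetical curated result sets, each standing for one query intent.
big_cats = frozenset({
    "example.org/jaguar-habitat",
    "example.org/leopard-range",
})
south_american_wildlife = frozenset({
    "example.org/jaguar-habitat",
    "example.org/amazon-birds",
})

# Depth: documents that satisfy both intents at once.
both_intents = big_cats & south_american_wildlife

# Breadth: documents that satisfy either intent.
either_intent = big_cats | south_american_wildlife

print(sorted(both_intents))
print(len(either_intent))
```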
 
 
 
The other fundamental limitation of Google is its ignorance of the 
intent of the searcher.
 
When searching for "jaguar", it gives some possible document lists.
 
But the result of the "relevance" process and the sorting process 
cannot be considered anything but an attempt at possible relevance. 
 
Sorting the results as a SET in this case is informationally 
valueless, because the results are collected from various contexts of the 
term: the cat, the car, the sports club.
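
A small sketch of the alternative: if each result carried a context label, 
the mixed set could be split before any sorting happens. The labels here 
are assumed to come from people or from a shared classification scheme. 
The titles, labels, and the group_by_context function are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical results for "jaguar", each tagged with an assumed context label.
labeled_results = [
    ("Jaguar habitat and diet", "animal"),
    ("Jaguar XK road test", "car"),
    ("Jaguars season schedule", "sports club"),
    ("Jaguar conservation status", "animal"),
]


def group_by_context(results: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Split one mixed result set into one list per context of the term."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for title, context in results:
        groups[context].append(title)
    return dict(groups)


grouped = group_by_context(labeled_results)
for context, titles in grouped.items():
    print(context, titles)
```

Sorting within each group is then meaningful, because every document in 
the group answers the same intent.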
 
It's true that a subset of the results represents value to that searcher. 
But it is the laborious job of the searcher to define that subset for 
himself, unfortunately one document at a time. Worst of all, even 
after the searcher has essentially sorted the search results by 
relevance to his intent, that information is usually uncaptured or 
completely lost. Even the implicit, indirect clues in user behaviour 
that could be correlated with the user's intent and satisfaction 
with the results are hoarded by a company that can't benefit from their 
value, due to its antiquated, central strategic position as aggregator.
 
So, it is definitely the classification that is the largest improvement 
of all.
 
The distributed platform doesn't directly offer any solution itself to 
classify all objects by mechanical processes.
 
But, it does allow systems of communication by PEOPLE, who after all 
are the final judges of the merit of search results for a query intent.
 
The searcher is the final judge of relevancy; why not make them the 
primary judge?
 
After all, keyword density, PageRank, and anchor links are only 
CORRELATIONS with authority and usefulness; they do not represent these 
qualities themselves.
 
And, the more the patterns are known by manipulators, the more they can 
be gamed in that fashion.
 
Google suffers from an adversarial classification problem: there's always a 
way in, whether through spam or investment in white hat SEO.
 
 
 
 
By lowering the cost of building ontologies (the Dewey Decimal systems of 
tomorrow) and advanced search processes and filters, distributed 
indexing has an indirect effect on search relevance and cost.
 
 
Crawling the web is now a redundant process. It is like other cases 
where peer-to-peer technology makes a process obsolete, such as ripping 
your own CD when you can just download the MP3 from a p2p network. 
Storing for the first time is already done. It's just re-getting: finding 
the item from one of many shared resources. 
 
 
Looking beyond the page - 
 
References - Google only makes its final sorting decision based on 
popularity, not the degree to which a document references other 
authoritative sources.
 
That means a document that is highly referenced will have priority over 
those that are not.
 
All else being equal, this is a good technique for determining a 
correlation to authority.
 
 
Some searchers may only want to search sources that themselves are well 
documented, well-cited, and employ rich sets of references within their 
document.
 
 
Having references within a document to other authoritative 
documents correlates not only with relevance, but also with authority in the 
true sense of the word.
 
 
It is true that Google does give some benefit to those linking out to 
sites with good trust rank, and that outgoing links can affect SEO 
results, but this is more of a way to avoid the spam filter than 
something that will cause a surge in the rankings.
 
Basically, there is no way for the searcher to specify the degree of 
references in the documents they search. This is a limitation.
 
Not all queries require a document heavy in citations, but, with spam, 
shoddy content, and made-for-AdSense pages, the option to filter 
out documents without citations and references is a necessary one for true 
knowledge retrievers.
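
Such a filter could be as simple as the sketch below. It assumes each 
document exposes the list of sources it cites. The Doc type, the 
with_min_references function, and the sample data are hypothetical, and 
counting references is only a stand-in for judging how authoritative 
those references are.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Doc:
    title: str
    references: List[str] = field(default_factory=list)


def with_min_references(docs: List[Doc], minimum: int) -> List[Doc]:
    """Keep only documents that cite at least `minimum` sources."""
    return [doc for doc in docs if len(doc.references) >= minimum]


# Hypothetical result set; the searcher chooses the threshold.
docs = [
    Doc("Survey of big cats", ["ref-a", "ref-b", "ref-c"]),
    Doc("Made-for-ads page"),
    Doc("Field notes", ["ref-a"]),
]

well_cited = with_min_references(docs, minimum=2)
print([doc.title for doc in well_cited])
```

Setting the minimum to zero returns everything, so the filter stays an 
option rather than a requirement.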
 
 
 
 
 
Document Density (of Topics or Concepts)
 
 
Just as contemporary web "Page" based search engines use "keyword 
density" to estimate or predict the likely relevance of a resulting page, 
the new system of searching "resources" will be largely influenced by 
each resource's proportion of documents pertaining to the given topic.
 
For instance, one resource may have fewer overall results, but have 
results almost completely dedicated to your intended information.
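
As a worked example of this proportion, the sketch below ranks two 
resources by density rather than by raw result count. The resource names 
and document counts are made up, and topic_density is a hypothetical 
helper, not an established metric.

```python
def topic_density(on_topic: int, total: int) -> float:
    """Proportion of a resource's documents that pertain to the topic."""
    if total == 0:
        return 0.0
    return on_topic / total


# Hypothetical resources: (on-topic documents, total documents).
resources = {
    "general-portal": (5000, 1_000_000),
    "niche-archive": (180, 200),
}

ranked = sorted(
    resources,
    key=lambda name: topic_density(*resources[name]),
    reverse=True,
)
print(ranked)
```

The general portal has far more on-topic documents in absolute terms, yet 
the niche archive ranks first because nearly all of it is on topic.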
 
This is certainly a more time-efficient resource to begin with, and 
resources of this type make better resources for beginners and experts 
(perhaps not the same ones, but denser resources will be better in both 
instances).
 
Sparse resources will many times actually give more results.
 
Anyone who's gotten over 1 million Google results for a query probably 
hasn't made time to go through them all, and probably realizes that 
doing so would be useless.
 
Sparse results mean that although the resource contains results, since 
they are a minuscule part of its document set, it is unlikely to have 
well-structured indexes and supplemental indexes pertaining to the topic, 
and it is also likely to lack useful query-intent clarification procedures 
for your specific intent.
 
 
The sparse resource may have fewer results, or may have more results.
 
The additional results usually contain garbage, spam, and "lesser sources", 
ones that probably have little worthwhile informational content, but are 
on-topic nonetheless.
 
Even if the sparse collection truly contains more "good resources", the 
fact that they are essentially HIDDEN in the result set among other junk 
makes it a less valuable resource.
 
The dense sources have none of these distractions.
 
The sparse sources may have more 'good resources', but they don't have 
any classification system, or sub-indices, or query clarification 
tools to let a person search with tools that pertain to their domain.
 
In a sense, sparse sources are just one area of a tangled web. The 
larger tangled web isn't structured conceptually, and neither will the 
'good resource' page results be, however numerous they are.
