Sunday, April 22, 2007
Thursday, November 16, 2006
Semantic Web 2.0 Mashup
The online marketing world is finally realizing that social media is here to stay.
The first step was being amazed by how quickly leaders such as Myspace could tap into the market created by sites like Friendster, and promoted through existing legacy networks such as AOL Instant Messenger and its users.
The attention that big media has paid to these large networks was another big part of all the hype surrounding social networking.
Marketers and publishers are still trying to figure out what social networking is, and what makes it so popular. Why now?
Other trends include Web 2.0, application/data mashups, and the ‘semantic web’ envisioned by WWW creator Tim Berners-Lee.
The truth is that social networking is as old as the web. Newsgroups, USENET, forums, and bulletin board systems were all traditional ways of online social networking.
Most of the components that comprise a social network, such as user profiles, picture galleries, forums, groups, rate-me, etc., had already been around.
Three novelties that allowed the acceleration of the new generation of social networking sites were:
- One-stop shop for all “my” things: music, video, friends, work, info.
 
- Public information of who your friends are. 
 
- Viral marketing techniques to promote social networks on away messages.
Never before had all of these features been available in one place that could be at the same time available to everyone else.
The public buddy list allowed you to delve into social groups that you might not have been able to locate through keyword searches alone. By being able to see who is friends with whom, the user gains a vastly greater perspective on the interrelationships among friends.
Forums in the past focused more on the topics, and the postings of the users, not on the friends and the detailed life of the user.
Social networks shifted the focus away from the intent of the forum, and put the person and their social connections at the center of the service.
Social networks have also flourished through viral marketing campaigns on media such as away messages. These have invited their audience to sign up for a free page. Also, networks have strategically forced users to register for a profile if they want to see the photos or extended information of other users.
So, the social networking sites are here to stay. Now what?
What impact do these have on the entertainment / publishing universe?
What impact do they have on the current Web Search, Web advertising marketplace?
Social networks are all about decentralization. They enable groups to form and disseminate ideas and media without needing specialized web knowledge.
Social media has been around forever, but now it has crossed over to the mainstream.
It is now part of the young internet user’s everyday internet experience.
Myspace comments and mail messages are just another form of email. Myspace photo albums have replaced the other free digital photo gallery services.
The idea is all of a person’s digital files in one place, and every person in one place.
What are the possibilities with social networks?
What are the limitations of social networks as they currently are structured?
One thing that is clear is that social networks like openness.
AOL’s member profiles were 100% hidden from non-AOL members.
The result: AOL’s member profiles became irrelevant.
Large social networks may have much more openness in terms of free use and open information, but their owners will still prefer to restrict what kind of content they accept.
Originally, there was a large controversy when Myspace users were linking to videos on revver.com, a video sharing site. Myspace had a policy of deleting such links, apparently because it regarded the site as a competing service.
These knee-jerk reactions show the weaknesses and vulnerabilities of a business model such as Myspace's, which on one hand relies on open standards, open access, and the free flow of information, and on the other hand wants to maintain a degree of control over the services its users adopt.
No publishing format in web history has had so much success.
It is very reasonable to believe that the structures, relationships, features and tagging abilities that are either supported or supportable by these social networks will impel developers on the leading edge to develop new standards that expand the legacy WWW and HTML formats.
Social Media and Search:
Social media is great, first of all, because it presents an alternative to search.
Social media really offers another option between a long-term, challenging, unpredictable natural SEO campaign and an expensive, take-it-or-leave-it PPC pricing environment on the major engines.
Marketers can engage their pre-formed, pre-screened audience in a direct and non-intrusive way with their message, product or service.
Marketers can also obtain market research, demographic, psychographic, and behavioral information with the greatest of ease.
Social networks have become one of the largest and most diverse collections of focus groups.
How will social search affect standard search engine results?
Large scale search engines such as Google were designed to index and retrieve documents from HETEROGENEOUS document collections. This means that the documents have no common thread regarding topic, length, format, or components.
Some documents may contain general knowledge, others contain technical information. Some documents are lengthy, others are terse. Some documents are on one topic, others on another. There is no similarity among the documents other than that they use standard HTML markup and are served according to WWW standards.
Searching heterogeneous data collections is a unique problem, and one that large companies such as Google have made a great deal of progress on.
One of the most interesting aspects of social media is how it has created one of the largest data warehouses of HOMOGENEOUS documents.
Profiles and other sections on Myspace are homogeneous because each page contains certain structured information such as Name, Location, Favorite Music, List of Friends, Blog Article, etc.
The fact that the data collection is fully structured means that many of the generalities that drove engines such as Google do not apply, and that more specific, powerful, and relevant techniques for information retrieval can be applied to such datasets.
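To make the contrast concrete, here is a minimal sketch of what field-level retrieval over homogeneous profiles could look like. The field names follow the examples mentioned above (Name, Location, Favorite Music, List of Friends), but the `Profile` schema, the `find_profiles` helper, and the sample data are all assumptions made up for illustration; this is not how Myspace or any search engine actually stores or queries profiles.

```python
from dataclasses import dataclass, field


@dataclass
class Profile:
    # Hypothetical schema: field names mirror the examples in the text.
    name: str
    location: str
    favorite_music: list = field(default_factory=list)
    friends: list = field(default_factory=list)


def find_profiles(profiles, location=None, music=None):
    """Return the profiles that satisfy every field constraint that was given."""
    matches = []
    for p in profiles:
        if location is not None and p.location.lower() != location.lower():
            continue
        if music is not None and music.lower() not in [m.lower() for m in p.favorite_music]:
            continue
        matches.append(p)
    return matches


# Made-up sample data.
profiles = [
    Profile("Ana", "Austin", ["Jazz", "Indie Rock"], ["Ben"]),
    Profile("Ben", "Boston", ["Jazz"], ["Ana", "Cy"]),
    Profile("Cy", "Austin", ["Metal"], ["Ben"]),
]

austin_jazz = find_profiles(profiles, location="Austin", music="jazz")
print([p.name for p in austin_jazz])  # prints ['Ana']
```

Because every page exposes the same fields, the query can ask about a specific field ("location is Austin") instead of guessing relevance from keywords scattered through free text.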
Currently, it is believed that engines such as Google, Yahoo and MSN treat social networking pages (as well as social bookmarking pages) just like any other page.
Note: Myspace profile pages are indexable, but the other information is not.
Social networks also have the power to make a lot of the questions in search no longer relevant.
Who cares how to construct an algorithm to predict hard-to-find spam, when user behavior pins the tail right on the spam for you?
If distributed human computing and behavior is enough to predict what is relevant, spammy, and authoritative, then, who needs AI?
The vast availability of user data, correlated with search behavior, advertising behavior, shopping behavior, and even television viewing behavior, now becomes a rich source of knowledge.
New search engine AI will not be totally separated from the real human world; it will become intertwined with it.
The most challenging aspects of knowledge retrieval and discovery will take place with a combination of the best computational processes along with the most elegant introductions of the social behavior and human interaction with data sets.
What are the new strategies for search marketers? What are things to watch for?
Social networking allows the quick and easy dissemination of information. Any negative sentiment can be spread lightning fast to almost all the important venues online.
Instead of managing “branding”, managing the REAL brand experience is now of prime importance.
Companies will find that there is no shortcut, or way around the reality they create with their users and customers.
In the new world of social media it’s publish, or die.
Engage a community, or be forgotten.
Social Media and Video content.
What social media has done for online chatting, and some of its benefits for information retrieval through social tagging, are really only the beginning of its transformation of the media landscape.
Large companies get this trend. They see these as the new media networks.
Myspace is the modern day MTV, which dominates the youth audience with a new and improved format.
Social networks will continue to play a part as the network that offers original, syndicated, and linked content to its audience.
Mass entertainment, as well as the niche information and entertainment areas, is currently being developed by major media companies.
The other networks had been delaying their push into broadband video entertainment until being beaten over the head by Myspace and its success.
Myspace has begun to vertically integrate itself by diving head first into the media universe. They have a joint venture with Interscope Records, and have already released compilation and standard albums of their signed talent. They are also developing a movie production studio, a television studio, and various satellite radio stations.
Myspace: A place for friends.
The question every Web 2.0 idealist is asking is, where is the place for friend feeds?
Sunday, October 29, 2006
Forms of Thought
Topology of Thought
Every system with a top-down component of structure needs some rules by which it is governed.
The simple coherence and integrity of the system is based on its adoption of universal rules and structures.
Hence, the model incorporates the logic of both descriptive and prescriptive methods.
Both inductive creation as well as deductive determination are supported.
The bridge between the form of thought, and the content of language is crossed by symbolic representation.
This is a model of symbolic representation. Or, rather, a meta-model.
This model includes all systems of deductive and inductive logic, as well as conceptual spaces, interpreted by symbolic logical systems.
The classes of components of the system:
1. Logic - Grammars - These are the rules governing the logical systems. They define the process by which sentences, questions and proofs are interpreted.
Grammars represent well-formedness of conceptualization within the model. They also contain the logical process interpretations to evaluate truth statements or procedures.
The logic or grammar class falls into the Interpretation super-class of core classes. This is because it plays a clarifying or elucidating role.
Other synonyms that relate to the logical layer are: system, network, constraints, meta-model, graph, rules, game.
As a whole class, Logical Rule systems can describe collections, including themselves. Sub-classes of rule systems also include Processes and Procedures, which are NOT self-referential.
The core essence of a logic is that it is descriptive. The core property is well-formedness.
- The next two core classes fall into the type of "Identifiers", since they are involved in the process of naming, and identifying objects as they pertain to the platonic world of forms, or the material.
2. Concept
It represents some "space" projected from some model.
Concepts have internal topologies as well as external topologies and properties within their space.
Other notions of concepts are abstractions, simplifications, generalizations. Understandings and implications are a sub-class of concepts.
Concepts are used in a bottom-up fashion. They are flexible, unrestricted, ad-hoc, and interpretive.
Their creation process is one of identification. Their nature is determined by their properties.
Concepts refer to and may contain objects.
Concept creation is a process, and processes are concepts.
Symbols represent a bridge among concepts, to map and convey higher-order concepts.
Symbols as a technique represent the concept of conceptualization or abstraction.
Symbols as a usage, however, represent the sub-class concept of a grammar.
The concept is interpreted using a Model; a concept may or may not be a Model (Grammar). Some concepts are grammars.
Concepts have properties. Properties are founded on concepts.
The core process of a concept is formulation. Creation. Thought.
The core property is Intangibility. Abstraction. Idea.
3. Object
An Object's core property is its name or designation.
Objects serve the process of Identification, Encapsulation, Forming boundaries, Defining, Discriminating, Differentiating.
Objects are Concepts. Objects can be collections, or atomic. Collections are objects.
A concept is one kind of object.
An object relies on a concept for its meaning, but also refers to a specific name or designation to connect that meaning to a subspace of a conceptual space.
The object is a label on the map of meaning.
Words are objects. Objects use words as components to identify their scope.
Node, Edge.
In our model, a concept is an object, and an object is a concept.
4. Process - Procedures, that are NOT self-referential. Role as Verb. Sub-class of Logic.
Procedures are applications of rules.
Proof is a Process. (The proof symbol is a separate object)
Establishment of truth or a result. Returning an answer.
Answer.
Processes enable consistency, compliance.
Evaluation. Applications.
Processes fall into the super-class of "Interpretive Classes".
This is because their purpose is clarification, or elucidation.
Procedures are determinative.
They represent applications of rules to symbols that may or may not comply with those rules.
Processes take an abstract rule-based system, and apply it to a symbol, which is presumably mapped to concepts or objects.
Process is the Action that enables language.
Process: Verb. In grammar, a process applies to a subject. The subject of a process is typically a symbol or a space.
This depends on whether it is a logical, analytic, algebraic, or graphical process.
In design, Process belongs to the "Structural" super-class.
In usage, Process belongs to the "Interpretation" super-class.
5. Properties. Are Attributes.
Components.
Defining Characteristics.
Data Members
Members.
Fields.
Properties can be "natural kinds", or composite properties aggregating concepts.
Every other class has properties.
The concept of a property is that of characteristics.
The usage of a property is that of identification, or differentiation. Properties can be used to unify objects and concepts, or to divide them.
Properties are objects. (everything is an object)
Properties fall into the interpretation, or clarifying aspect.
Property names are symbols that are used to evaluate the scope of conceptual spaces.
6. Collection.
Resource. Data. Store. Set.
Simple ontology. List + document types.
Document Types are collections of properties.
Experienced through a frame of reference.
May be a conceptual collection which represents a notion.
Structure of Resources. Map of Resources.
Interpretations and Intuitions and Understanding are resources.
Collections are structural.
Collections are objects, contain objects, contain properties, may represent properties.
7. Symbol
Information. Small self-contained units of data. Composed of Nothing. Empty. Irreducible representations.
Representation, Data, Number, Abstraction.
Signal. Sign. Message. Word. Letter.
Names are symbols that designate objects.
Visualization.
Properties of symbols.
Symbols are objects, and make use of the concept of abstract representation.
Symbols have no components. Properties are interpreted, not intrinsic.
8. Relationships consist of a special kind of set or collection.
They consist of objects as well as a property.
They connect collections to one another and to properties.
Collections (of objects, concepts, symbols, properties, ...) can be related to one another, or to other, heterogeneous collections.
An object can be related to an object through a property.
And an object can be related to a property. (through another property)
They are used to signify a property of a collection.
They are enabled by the process of identification.
Connections. Links. Vertex.
Set of pairs with values.
Functions are a special kind of relation: each input is related to exactly one output.
Relationships are used for "structural intent" or "structural use"
And are composed and enabled through "structural process".
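One possible way to picture a relationship as a "set of pairs with values" is a set of (subject, property, object) triples, where the property is the value that connects the pair. The sketch below assumes that triple layout; the helper names `related` and `is_function` and the sample data are hypothetical and only illustrate the idea. It relies on the usual mathematical definition that a function is a relation linking each input to exactly one output.

```python
def related(relation, subject, prop):
    """Objects that `subject` is connected to through the property `prop`."""
    return {o for (s, p, o) in relation if s == subject and p == prop}


def is_function(relation, prop, domain):
    """True if `prop` links every element of `domain` to exactly one object."""
    return all(len(related(relation, s, prop)) == 1 for s in domain)


# Made-up example relation: (subject, property, object) triples.
links = {
    ("alice", "friend_of", "bob"),
    ("alice", "friend_of", "carol"),
    ("alice", "born_in", "1980"),
    ("bob", "born_in", "1975"),
}

print(related(links, "alice", "friend_of"))
print(is_function(links, "born_in", {"alice", "bob"}))    # prints True
print(is_function(links, "friend_of", {"alice", "bob"}))  # prints False
```

Here "born_in" behaves like a function (one value per person), while "friend_of" is a general relation: alice has two friends and bob, in this toy data, has none.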
9. Use. Meaning, Purpose, Context.
This refers to how an object from a class is used.
Use is functionally a representation for Meaning.
Behaviour type.
Part of Speech represents a use of a collection of symbols as it relates to the process of relating other symbols.
Action. Use is a type of process.
Purpose. Meaning. Context. Semantic Context.
Essence.
This concludes the core classes that have well-defined properties.
Another core class is that of the user of the language.
10. The User
The reader, listener, or audience.
Agent. Client. Speaker. Input. Frame. Boundary. Terminus. External Node. Uncertainty. Externality. Limitation. Environment.
Frame. Perspective. Dimension. Vector.
Other supplementary kinds of class types.
Operator - Conjunction.
And, or, if, if then, then if, if and only if, else, until.
Prepositions, Postpositions. Related to relationships between objects with respect to spatial and orientation metrics.
(applied to spatial dimensions.)
Punctuation - Document Type. Model of Well-formedness.
Word - Symbol
Pronoun - Pointer (special kind of symbol that points to a location in space.)
Article (a, the) - identifier that contains the property of context, scope, semantic meaning.
Adjective is a property.
Definition - Core property, Axiom.
Meaning, part of speech, use - Intent, Use
Relation - Set Function, Value
Sentence - Collection Notion thought Resource message
Participant - User
Adverbs - Modifier of process, or meaning, Descriptor of meaning.
that modify verbs - sub process, Meta-process- ex. Process Differently - or Process Descriptively.
that modify adjectives - sub property. meta-property. Ex. Identify Meta-descriptively
that modify adverbs - Purposefully Alternatively Descriptively Process
One can see that the adverb plays a special role, and needs further investigation.
Tense
Voice
Aspect
Implicature
Friday, October 20, 2006
Google Needs a Classification System - such as Dewey Decimal
Google Needs a Dewey Decimal System
Google's algorithm is already very useful for what it does: it delivers likely solutions to what might have been the intent of your search. It will always be good at that, and continue to get better at doing JUST THAT. But without a Dewey-Decimal-like system, it falls short as an information retrieval ideology. Google as ideology fails when it cannot realize that the world's library belongs to the world; not to private advertisers. Or information hogs. Or media giants. Or software companies. In, Of, and By the Public this Library will be.
Benefits of a Distributed System
In a purely technical sense, the distributed system does not directly make search better, in and of itself.
It allows search to become better indirectly, by letting search filters and classification systems be developed and shared by everyone in a standard, low-cost environment. Aggregating resources is extremely easy with this system architecture. This will encourage many current web publishers to adopt rigorous niche classification systems that fit into larger-scope labeling systems.
It also lowers the cost of innovation in all aspects of the search process, by making crawling obsolete and by making aggregation of resources a technologically trivial process (even if it is still a complex and multifarious process in terms of the information architecture, the "code work" and interfaces are all taken care of). There will be various competing providers of classification schemes, and the markets will decide which ones thrive. The architecture allows these to be made available to the user. In the exact same vein, it opens the market for enhanced providers of filters and algorithms, and various processes that sort, map, and predictively analyze the result sets imported from other tools.
In this way, the distributed system greatly changes the economics of the many search engine development areas.
By isolating the various components of the "search engine", this new system will foster greater customization and innovation. The components are the aggregation, the relevance processes (symbolic, conceptual, authoritative, socially computed, or a blend) and the sorting and filtering algorithms. The distributed system also changes the politics of search indexes, social networks, and any kind of structured resource.
The current system does not aid those who seek exhaustive, authoritative, definitive document sets... not sets of possible matches.
When the search result set is complete, spam-free, and clear of off-topic content, it can be seen as a datum in and of itself. It becomes an index of a certain kind of query intent. This datum can then be meaningfully combined, or summarized and combined, with other complementary or supplementary resources, allowing this structured index to be used by other resources to provide greater depth and breadth to the search experience.
The other fundamental limitation of Google is its ignorance of the intent of the searcher. When searching for "jaguar", it gives some possible document lists. But the result of the "relevance" process and the sorting process cannot be considered anything but an attempt at possible relevance. Sorting the results as a SET in this case is informationally valueless, because the results are collected from various contexts of the term: the cat, the car, the sports club. It's true that a subset of the results represents value to that searcher. But it is the laborious job of the searcher to define that subset for himself, unfortunately one document at a time.
And, worst of all, even after the searcher has essentially sorted the search results by relevance to his intent, that information usually goes uncaptured or is completely lost. Even the implicit, indirect clues lying in user behaviour, which could be correlated with the user's intent and satisfaction with the results, are hoarded by a company that can't benefit from their value, due to its antiquated strategic central position as aggregator. So it is definitely classification that is the largest improvement of all. The distributed platform doesn't directly offer any solution itself to classify all objects by mechanical processes. But it does allow systems of communication by PEOPLE, who after all are the final judge of the merit of search results for a query intent. The searcher is the final judge of relevancy, so why not make them the primary judge? After all, keyword density, PageRank, and anchor links are only CORRELATIONS with authority and usefulness; they do not represent these themselves. And the more the patterns are known by manipulators, the more they can be gamed in that fashion. Google suffers from an adversarial classification problem; there's always a way in: spam, or investment in white hat SEO.
By lowering the cost of building ontologies (the Dewey Decimal systems of tomorrow) and advanced search processes and filters, distributed indexing has an indirect effect on search relevance and cost.
Crawling the web is now a redundant process. It is just like other examples of peer-to-peer technology making processes obsolete, such as ripping your own CD when you can just download the MP3 from a p2p network. Storing for the first time is done. It's just re-getting, finding from one of many shared resources.
Looking beyond the page
References - Google only makes its final sorting decision based on popularity, not the degree to which a document references other authoritative sources.
That means a document that is highly referenced will have priority over those that are not. All else being equal, this is a good technique for determining a correlation with authority.
Some searchers may only want to search sources that themselves are well documented, well cited, and employ rich sets of references within their document.
Having references within a document to other authoritative documents correlates not only with relevance, but also with authority in the true sense of the word.
It is true that Google does give some benefit to those linking out to sites with good trust rank, and that outgoing links can affect the SEO results, but this is more of a way to avoid the spam filter than something that will cause a surge in the rankings. Basically, there is no way for the searcher to specify the degree of referencing in the documents they search. This is a limitation. Not all queries require a document heavy in citations, but with spam, shoddy content, and made-for-AdSense pages, the option to filter out documents without citations and references is a necessary one for true knowledge retrievers.
Document Density (of Topics or Concepts)
Just as contemporary web "page"-based search engines use "keyword density" to estimate or predict the likely relevance of a resulting page, the new system of searching "resources" will be largely influenced by a resource's proportion of documents pertaining to the given topic. For instance, one resource may have fewer overall results, but have results almost completely dedicated to your intended information. This is certainly a more time-efficient resource to begin with, and resources of this type make better resources for beginners and experts (perhaps not the same ones, but denser resources will be better in both instances). Sparse resources will many times actually give more results.
Anyone who's gotten over 1 million Google results for a query probably hasn't made time to go through them all, and probably realizes that doing so would be useless. Sparse results mean that although the resource contains results, since the topic is a minuscule part of its document set, it is unlikely to have well-structured indexes and supplemental indexes pertaining to the topic, and is also likely to lack useful query-intent clarification procedures for your specific intent.
The sparse resource may have fewer results, or may have more results. The additional results usually contain garbage, spam, and "lesser sources", ones that probably aren't worthy informational content but are on-topic nonetheless. Even if the sparse collection contains truly more "good resources", the fact that they are essentially HIDDEN in the result set with other junk makes it a less valuable resource. The dense sources have none of these distractions. The sparse sources may have more 'good resources', but they don't have any classification system, sub-indices, or query clarification tools to let a person search with tools that pertain to their domain. In a sense, sparse sources are just one area of a tangled web. The larger tangled web isn't structured conceptually, and neither will be the 'good resource' page results, however numerous they are.
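The document-density idea described above can be sketched in a few lines. This is only an illustration under stated assumptions: each resource is modelled as a list of documents, each document as a set of topic labels, and density as the fraction of documents carrying the topic. The `topic_density` function, the resource names, and the labels are all made up for the example.

```python
def topic_density(documents, topic):
    """Fraction of a resource's documents that are labelled with `topic`."""
    if not documents:
        return 0.0
    on_topic = sum(1 for labels in documents if topic in labels)
    return on_topic / len(documents)


# Hypothetical resources: each document is a set of topic labels.
resources = {
    "niche_archive": [
        {"jaguar-cat"}, {"jaguar-cat", "zoology"}, {"jaguar-cat"}, {"zoology"},
    ],
    "general_index": [
        {"jaguar-cat"}, {"jaguar-car"}, {"news"}, {"sports"}, {"jaguar-cat"},
        {"recipes"}, {"jaguar-cat"}, {"travel"}, {"finance"}, {"jaguar-cat"},
    ],
}

# Rank resources by density for the intended topic, densest first.
ranked = sorted(
    resources,
    key=lambda name: topic_density(resources[name], "jaguar-cat"),
    reverse=True,
)

for name in ranked:
    hits = sum(1 for labels in resources[name] if "jaguar-cat" in labels)
    print(name, hits, topic_density(resources[name], "jaguar-cat"))
```

In this toy data the general index returns more on-topic documents (4 versus 3), yet the niche archive ranks first because three quarters of its documents are on topic, which is the sparse-versus-dense trade-off the text describes.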
The SCOPE of Progress
The Public Venture: How public resources will revolutionize data markets and information retrieval.
I show how centralized sources, obscured by secrets and limitations, are not the future of finding information. I see technology allowing large data sets to be interoperable, to effectively merge, while I foresee the underlying business interests and markets on a course to fragmentation into various components of the info retrieval process (crawling, indexing, aggregating, categorizing, sorting, filters and interfaces). I give an idea of what the world might be like if social network profiles began to use a system of publishing now used by blogs. This would be a more 'open' and free form of mass publication, because large publishers would no longer have an exclusive role in controlling their medium.
I show how, in this world, the popular social networking activity could never be acquired or controlled. I no longer see the social network services delivered by the current giants remaining the "destination" for their users. They will be reduced to just a few of the many tools people use to reach and communicate with the larger outside databases - The SocialSphere. This doesn't mean that these companies won't be wildly successful as media companies; it means that their power as exclusive social networks is inherently unstable, and they will constantly have to redefine themselves. Myspace will always be a successful web destination, but its focus on core social networking will have to diminish (just like Google's focus on search). Google says its core is 70% search, but I think most of that is advertising research and optimization. This clearly has its place, but Google has realized that to improve their search DRASTICALLY would be to admit there is a DRASTIC problem with their current technique.
Myspace will be the new MTV, with pretty faces, new studio promotions and large brand atmosphere.
The users will remain, but will begin using more powerful tools to go beyond the Myspace experience alone.
There is one area of innovation that large players like Google and Myspace cannot even go near: becoming part of a larger world of information sharing.
Currently, each social network is a gated community. Even if membership is open, you can only search and see information from ONE site, ONE database. This means that if Myspace doesn't get around to implementing a feature, it won't happen, even if the feature takes less than 1 hour of computer coder time.
Google will lose relevance as far as aggregating documents and lose some prominence in search, but will remain highly successful at incubating other internet properties, primarily in web media, and probably not as much in web widgets. The reason is that anyone can make widgets, but not everyone can make large media acquisitions that position them as, or with, large publisher networks and advertiser demand. So Google will remain strong thanks to its cash position and media portfolio, as well as its technological eminence as a data processor. Google will have an exciting future as a company, more focused on advertising and media than on web document search.
Web search will continue to get more corporatized, with larger and more numerous advertising placements. If a paid placement program were ever to occur for indexing, as it does with Yahoo's paid inclusion, it would mean the end of the Google resource as we know it.
Evolving advertising trends suggest advertisers will begin to track the entire search experience, along with email, social networking and video viewing experience, to optimize campaigns just for YOU.
Top 12 things Google will never do.
1. Google won't ever know what it's searching.
- No conceptual understanding.
- No categorization scheme.
Google doesn’t discern between fiction, nonfiction, reference, age-appropriate material, ...2. Google will never understand what your intent was from your keyword. It can't understand the context or purpose of a search from just symbols alone. It has no conceputal framework.
Can't refine based on conceptual query intent, not keyword. Offers no way to refine search.  Does "baseball" mean the "game", the "league" (MLB), or the "ball"? It is used in all of these contexts interchangably.3. Can't learn from users, or let users use groups to filter. The goal of 'making the world's information accessible, may be noble, but unlessy ou open the system to
LET OTHER make the world's information structured and accessable, it's going to remain a artificially "intelligent" GAME. - Feeling lucky?  5. Ownership and Control         Censorship.          Advertising corrupting experience or results.         Market position inhibits integration, and possibilities.         Non-monopoly status of search, and publishing industries.  By contrast, the Benefits of an Open Platform - open data, open summaries, open relevance scores, sorts and filters.           Customization, Upgrades, Flexibility         More competitive markets         Interoperability         Non-monopoly status of search, and publishing industries.         Everyone gets to use the cool toys at the Googleplex.    6. Never complete search of all libraries. If Google can't find it, forget about it. They won't point you in the direction of another resource that is known to have good results
for your kind of query. Google doesn't summarize other rich resources. They try to be the one resource. They can't admit their limitations.
Even Yahoo would direct you to another engine if your query gave no results.
You need to admit when another resource has better results. Google is too centered around 'documents' and not enough on resources as a whole.
7. Always prone to spam, manipulation through data, manipulation with money, non-first-rate resources by design. There is bad spam, and then there is manipulation of results that is somehow approved, like buying web sites, buying links, doing PR stunts.
Does this lead to better results? Or just more "competitive" ones?
8. No authority, or authenticity, or certainty. Everything is just guesswork, probability, mass voting, mob rule, educated guesses. They give you up to millions of results because you AREN'T feeling lucky. I can't even pick which mob I want to rule my results. I'm stuck with the largest mob.

9. Advertising Banditry. Publishers don't know how much they make. Click fraud is rampant. The CPC model ensures advertisers pay the most, but the publisher sees little of this. AdSense inefficiencies.

10. Made only to search unstructured data; can't understand structured knowledge. Google's intent, and the very assumption it needs in order to work, is that it knows NOTHING about what it is searching in AND what it is searching for.
Google was made to sort the structureless web. It's great at that, but the structureless web sucks.
11. Only designed to crawl the outer web. The inner web is uncrawled.

Other applications of distributed computing: anti-manipulation and pro-accountability in areas of
- Accounting
- Elections
- Trade

Google is a tool, not an answer. Its initial results, in aggregate, are valueless without human sifting afterwards.

The environment that created Google was one of a total lack of cohesiveness on the web.
Google Needs a Dewey Decimal System

Google's algorithm is already very useful for what it does: it delivers likely solutions to what might have been the intent of your search. It will always be good at that, and continue to get better at doing JUST THAT. But without a Dewey-Decimal-like system, it falls short as an information retrieval ideology. Google as ideology fails when it cannot realize that the world's library belongs to the world; not to private advertisers or a small band of dorks. Or information hogs. Or media giants. Or software companies. In, Of, and By the Public this Library will be.

Benefits of a Distributed System

In a purely technical sense, the distributed system does not directly make search better, in and of itself. Rather, it allows search filters and classification systems to be developed and shared by everyone in a standard, low-cost environment. Aggregating resources is extremely easy with this system architecture. This will encourage many current web publishers to adopt rigorous niche classification systems that fit into larger-scope labeling systems.

It also lowers the cost of innovation in all aspects of the search process, by making crawling obsolete, and by making aggregation of resources a technologically trivial process (even if it is still a complex and multifarious process in terms of the taxonomy, the "code work" and interfaces are all taken care of). There will be various competing providers of classification schemes, and the markets will decide which ones thrive. The architecture allows these to be made available to the user. In the exact same vein, it opens the market for enhanced providers of filters and algorithms, and of the various processes that sort, map, and predictively analyze the result sets imported from other tools.

In this way, the distributed system greatly changes the economics of the many areas of search engine development.
By isolating the various components of the "search engine", this new system will foster greater customization and innovation. The components are the aggregation, the relevance processes (symbolic, conceptual, authoritative, socially computed, or a blend) and the sorting and filtering algorithms.

The distributed system also changes the politics of search indexes, social networks, and any kind of structured resource. The current system does not aid those who seek exhaustive, authoritative, definitive document sets, as opposed to sets of possible matches.

When the search result set is complete, spam-free, and clear of off-topic content, it can be seen as a datum in and of itself. It becomes an index of a certain kind of query intent. This datum can then be meaningfully combined, or summarized and combined, with other complementary or supplementary resources, so that this structured index can be used by other resources to give greater depth and breadth to the search experience.
As long as there is still noise and potentially bad results, Google's results don't represent knowledge.
The user has to turn that information into knowledge by going through and evaluating the results.
The other fundamental limitation of Google is its ignorance of the intent of the searcher. When searching for "jaguar", it gives some possible document lists. But the result of the "relevance" process and the sorting process cannot be considered anything but an attempt at possible relevance. Sorting the results as a SET in this case is informationally valueless, because the results are collected from various contexts of the term: the cat, the car, the sports club. It's true that a subset of the results represents value to that searcher. But it is the laborious job of the searcher to define that subset for himself, unfortunately one document at a time. And, worst of all, even after the searcher has essentially sorted the search results by relevancy to his intent, that information is usually uncaptured, or completely lost. Even the implicit, indirect clues lying in user behaviour, which could be correlated with the user's intent and satisfaction with the results, are hoarded by a company that can't benefit from their value, due to its antiquated strategic central position as aggregator.

So, it is definitely the classification that is the largest improvement of all. The distributed platform doesn't directly offer any solution of its own for classifying all objects by mechanical processes. But it does allow systems of communication by PEOPLE, who after all are the final judges of the merit of a search result for a query intent. The searcher is the final judge of relevancy, so why not make them the primary judge? After all, keyword density, PageRank and anchor links are only CORRELATIONS with authority and usefulness; they do not represent usefulness themselves. And the more the rigid rules are made apparent to manipulators, the more they can be gamed in that fashion. Google suffers from an adversarial classification problem: there's always a way in, through spam or through investment in white hat SEO.
By lowering the costs of building ontologies (the Dewey Decimal systems of tomorrow) and advanced search processes and filters, distributed indexing has an indirect but pronounced effect on search relevance and cost.

Crawling the web is now a redundant process. It is just like other examples of peer to peer technology making processes obsolete, such as ripping your own CD when you can just download the MP3 from a p2p network. Storing for the first time is done. It's just re-getting, finding from one of many shared resources.

Looking beyond the page -

References - Google only makes its final sorting decision based on popularity, not on the degree to which a document references other authoritative sources. That means a document that is highly referenced by others will have priority over those that are not. All else being equal, this is a good technique for determining a correlation to authority. But some searchers may only want to search sources that themselves are well documented, well-cited, and employ rich sets of references within their document. Having references within a document to other authoritative documents is a correlation not only with relevance, but also with authority in the true sense of the word. It is true that Google does give some benefit to those linking out to sites with good trust rank, and that outgoing links can affect the SEO results, but this is more of a way to avoid the spam filter than something that will cause a surge in the rankings. Basically, there is no way for the searcher to specify a preference for rich references in the documents they search. This is a limitation. Not all queries require a document heavy in citations, but with spam, shoddy content, and made-for-AdSense pages, the option to filter out documents without citations and references is a necessary one for true knowledge retrievers.
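As a rough illustration of the citation filter described above, here is a minimal Python sketch. It assumes each document already carries a list of the sources it cites; the `Document` class, the `outbound_citations` field and the `filter_by_citations` function are hypothetical names made up for this example, not part of any existing engine.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Document:
    # Hypothetical record: a title plus the identifiers of sources it cites.
    title: str
    outbound_citations: List[str] = field(default_factory=list)


def filter_by_citations(docs, min_citations=1):
    """Keep only documents that cite at least `min_citations` other sources."""
    return [d for d in docs if len(d.outbound_citations) >= min_citations]


docs = [
    Document("Well-sourced survey", ["ref-a", "ref-b", "ref-c"]),
    Document("Made-for-ads page"),
    Document("Short note", ["ref-a"]),
]

# The searcher, not the engine, chooses the threshold.
well_cited = filter_by_citations(docs, min_citations=2)
print([d.title for d in well_cited])
```

The point of the sketch is only that the threshold is a setting the searcher controls; how citations get extracted from real documents is a separate problem it does not address.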
Document Density (of Topics or Concepts)

Just as contemporary web "page" based search engines use "keyword density" to estimate or predict the likely relevance of a resulting page, the new system of searching "resources" will be largely influenced by each resource's proportion of documents pertaining to the given topic. For instance, one resource may have fewer overall results, but have results almost completely dedicated to your intended information. This is certainly a more time-efficient resource to begin with, and resources of this type make better resources for beginners and experts alike (perhaps not the same ones, but denser resources will be better in both instances).

Sparse resources will many times actually give more results. Anyone who's gotten over 1 million Google results for a query probably hasn't made time to go through them all, and probably realizes that this would be useless. A sparse result means that although the resource contains results, they are a minuscule part of its document set, so it is unlikely to have well-structured indexes and supplemental indexes pertaining to the topic, and is also likely to lack useful procedures for clarifying your specific query intent.

The sparse resource may have fewer results, or may have more. When it has more, they usually contain garbage, spam, and "lesser sources", ones that probably aren't worth much as informational content but are on-topic nonetheless. Even if the sparse collection truly contains more "good resources", the fact that they are essentially HIDDEN in the result set among other junk makes it a less valuable resource. The dense sources have none of these distractions. The sparse sources may have more "good resources", but they don't have any classification system, or sub-indices, or query clarification tools to let a person search with tools that pertain to their domain. In a sense, sparse sources are just one area of a tangled web.
The larger tangled web isn't structured conceptually, and neither will be the 'good resource' page results, however numerous they are.

Wednesday, August 30, 2006
Roadmap to Realizing the Open Platform's Promise
We've seen what can happen when powerfully integrated technologies mash together complementary data and processing components from various independent sources, and extend the functions to a larger set of capabilities.
We're watching web pages grow up to become services, or to use services. We're seeing web software operate anywhere.
The networking and (meta)informational structures are both experiencing wide development.
Until now, there has not been serious discussion related to implementing a completely open data store, along with its open components for serving, retrieving, analysing, and publishing data from the tool set.
The time has arrived for a myriad of open data stores to be supported, to become a resource more vast, powerful, and leveling than anything the world has ever seen.
What is the Opportunity?
The opportunity is to combine the recent developments in social computing and interoperable document exchange formats to create a scalable document and resource repository to make a large amount of documents available to be structured, tagged, classified and processed by users of the open public resource.
This technique combines the meta-data promised by the semantic web, the community tagging made popular by Web 2.0, the distribution system of the Handle System, and ontological classification systems such as the Web Ontology Language (OWL).
Most organizations won't aim to house all of the data that is available.
Even for the ones that attempt to 'grab it all', they won't be able to organize or structure it on their own.
That is why the distributed system for summarizing other resources is so important.
Some groups will attempt the current model of trying to grab everything, while doing very little human sorting and tagging with that unmanageable amount. (The amount is especially unmanageable when it sits behind a black box, closed to the rest of the world.) Current data stores suffer from too much data, and not enough knowledge.
Many other groups will find that there are many companies that are actually already pretty good at "getting everything".
They just aren't good at finding everything when you need it. Without structure, organization or even an attempt at classification, it's really a tangled and unmanageable web.
New specialty "search engines" will find their true value-add by concentrating on storing only documents within certain domains of knowledge, or only indexing portions of documents, or structural relationships among documents.
I hesitate to continue the usage of the term "search engine". The term is dubious, because a search engine consists of many components, including:
Document / Resource Index
*Algorithm for Relevance
Algorithm for Sorting by Influence or Authority
* One great way to get relevance beyond keyword searching, in fact the only INTELLIGENT way to do it, is by using human-created, conceptually structured ontologies (Web Ontology Language - OWL)
The major problem of today is that the index is locked down, only documents are given.
The algorithm doesn't work with real conceptual structures, it deals mostly with keyword density.
Thus, authority is still very easy to fake, since it is determined probabilistically.
The engines never know who to trust, or what is what, or what belongs where - so they take votes, and often, the crowd is wrong, and manipulators of the crowd can cheat the results, whether they see it that way or not.
The current "engines" treat the world's knowledge as nothing more than a string of symbols without any kind of anchor to a structured conceptual framework.
The influence algorithms will always remain a vital part of information retrieval technology, and companies such as Google will continue to thrive on their intelligent processing techniques.
The opportunity is to level the playing field for processing large document stores, thus enabling distributed ontologies to be created that can be applied to large amounts of important documents.
The new service doesn't promise to get every last document in the world. It promises to know which ones are important for which topic: where they belong, what they are about, and what documents are about them.
It also allows anyone to process as much of the raw data in the index as they wish, which they can summarize, and then offer their summaries as part of their own searchable resource.
What are the limitations of the current system?
When Google crawls AltaVista, it treats it like every other web page, based on its keywords.
Never mind the fact that AltaVista happens to contain billions of web sites available by plugging in keywords.
It is only treated for how it appears as a "site" to a browser, not for how it offers its "information" to its users.
The same problem occurs with specialized search engines and resources.
Many times, a document is not the best resource for a query, and a resource such as a specialized search engine is an entirely viable return for a query. It may be best to think of these resources as complementary results that enrich rather than replace the current search tools.
Something like Google won't be able to handle this, because every resource is just another page to them.
Whether you can search 1 document or 10 billion, Google goes by the external pages only.
So, you're missing out. You can't go where Google hasn't gone. Never mind the fact that Google never really knows where it's going. It doesn't know what its pages refer to: a page could be intended as fiction, or fact, or it could be presented as fact but be otherwise. In any case, it's all just a list of symbols as far as Google is concerned.
"The wisdom of crowds...isn't."
The benefit is for other groups to query large engines with specialized queries, or indexing certain components, and then summarize this data, and make it available to their own searchers (and other data banks).
This can be thought of as pre-processed data.
How would you like your information retrieval oracle to bring you this message?
"We've already searched everywhere for what you want, and we've found that these are the places that have the most organized occurrences in their query results. SEARCH HERE"
I don't know about you, but that's the kind of information I'm looking to retrieve.
It's the classification, stupid.
What allows this to happen now?
Well, every network is affected by itself as well as by other external networks and users.
The web industry has made real strides in software interoperability.
Most of the techniques that comprise the new "open platform" have been known for at least 5 to 10 years, if not much earlier in a more general form.
Hardware directly allows the information to scale, and hardware cost reduction has also allowed more users to come on board.
Someone could write a book on the emerging formats and hardware and user patterns, and what drove what.
I think what matters is not so much who is driving, but where the destination is going to be.
Let's look at this further.
What are the components?
The first component is a general index of resources. A resource could be a single document, a small organized group of lists, or a vast index with various levels of structure.
The network architecture that lets the system synchronize and communicate will be implemented with the Handle System, similar to the non-central way in which DNS servers are managed.
Unlike the DNS server, there will be more layers than just the IP address and domain name.
There will be a Topic Index Layer, various levels of Index Broker Layers, as well as the Site Broker Layer (what we currently think of as a "search engine"), and of course the documents.
This is basically a minimum of 4 tiers, with the index brokers having many levels of organization possible.
(Remember, the classification systems, known as ontologies form another independent portion, as well as the lexicon(s) used by the searcher and ontologists.)
The topic broker is the "oracle" that aims to point you in the right direction, even if that is only to another search engine of search engines. (Don't worry, it's going to be automatic.)
The index brokers will tell you what search engine(s) you should use.
The site brokers will act as a current search engine does, but they will often be evaluated based on their own aggregate scores.
Think of search engines themselves being scored by their own keyword or pagerank scores, based on how good they are at finding certain things. This may include things like relevant document density, (similar to keyword density) as well as document structure, which ensures that the document is not a dangling page, but part of a larger knowledge web.
And, finally, my personal favorite layer, the final end result documents we were looking for all along.
This may sound complex, but it's truly how the open platform has already been planned.
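To make the tiers a little more concrete, here is a minimal Python sketch of a query being routed from a topic broker, through an index broker, to site brokers ranked by relevant document density. Everything in it is an assumption made for illustration: the class names, the per-topic document counts, and the idea of ranking purely by density are hypothetical and do not describe any existing protocol or service.

```python
from dataclasses import dataclass


@dataclass
class SiteBroker:
    # Hypothetical stand-in for what we currently call a "search engine".
    name: str
    topic_counts: dict  # topic -> number of documents on that topic
    total_docs: int

    def density(self, topic):
        """Share of this broker's documents that pertain to the topic."""
        if self.total_docs == 0:
            return 0.0
        return self.topic_counts.get(topic, 0) / self.total_docs


@dataclass
class IndexBroker:
    # Tells you which site broker(s) to use for a topic.
    name: str
    site_brokers: list

    def best_site_brokers(self, topic, limit=2):
        candidates = [s for s in self.site_brokers if s.density(topic) > 0]
        ranked = sorted(candidates, key=lambda s: s.density(topic), reverse=True)
        return ranked[:limit]


@dataclass
class TopicBroker:
    # The "oracle": points a topic at the index broker that knows it best.
    routes: dict  # topic -> IndexBroker

    def route(self, topic):
        return self.routes.get(topic)


general = SiteBroker("general-engine", {"baseball": 1_000_000}, 10_000_000_000)
stats = SiteBroker("baseball-stats-archive", {"baseball": 45_000}, 50_000)
sports = IndexBroker("sports-index", [general, stats])
oracle = TopicBroker({"baseball": sports})

ranked = oracle.route("baseball").best_site_brokers("baseball")
print([s.name for s in ranked])
```

In this toy data the general engine holds far more baseball documents in absolute terms, yet the small archive is ranked first because nearly all of its documents are on topic. A real system would presumably blend density with other signals such as document structure and authority.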
Who is going to do this?
The Open Information Platform will take many players working in concert to achieve its goals.
The first layer is one or more topic brokers.
These will be the ones who try to get everything, but unlike Google, they don't pretend to be an expert about everything.
If another index knows more than they do, they'll tell you, and actually bring you there automatically.
The index brokers will specialize in topics, or restricted indexing of specific components of web documents, either over a broad or narrow set of documents.
Other index brokers will be indexers of other indexers.
The site brokers act like today's search engines, in that they return documents, but they operate with a DEFINITE ontology, with authority (which may be custom-configurable and come from yet another 3rd party; actually, it would be a 6th party, but who's counting).
This allows these site brokers to return documents based on authoritative models of relevancy, not just keyword heuristics.
Then there are the documents.
I didn't want to leave these guys out.
These guys will interface with the ontologies in both explicit and implicit ways, both on and off their sites.
Parts of the new meta-data exchange will be enhanced by extra tagging, and "trackbacks" that let you know what places index, cache, warehouse, summarize or monitor that document.
This is primarily used for the other engines to get alerted when an update occurs, rather than having to check 3 times a week like Google does.
Documents also gain additional object-naming characteristics: they become entities beyond their simple location.
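The difference between being alerted and re-checking on a schedule can be sketched as a simple publish and subscribe registry. This is only an illustration under assumed names: `UpdateRegistry`, the `doc:` style name and the callback signature are all hypothetical, and no real trackback or ping protocol is being reproduced here.

```python
from collections import defaultdict


class UpdateRegistry:
    """Tracks which indexers, caches or monitors follow a named document."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # document name -> callbacks

    def register(self, doc_name, callback):
        """An index, cache or summarizer declares interest in a document."""
        self._subscribers[doc_name].append(callback)

    def publish_update(self, doc_name, version):
        """Notify every registered party; return how many were alerted."""
        callbacks = self._subscribers.get(doc_name, [])
        for callback in callbacks:
            callback(doc_name, version)
        return len(callbacks)


received = []
registry = UpdateRegistry()
name = "doc:alexander-the-great"  # hypothetical document name
registry.register(name, lambda n, v: received.append(("index-a", n, v)))
registry.register(name, lambda n, v: received.append(("cache-b", n, v)))

notified = registry.publish_update(name, 2)
print(notified, received)
```

The same registry doubles as the "trackback" list: asking who is subscribed to a document tells you which places index, cache or monitor it.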
A location can always be down, or moved, or abandoned.
The idea of a name or a document being "down", is just as absurd as the concept of "Alexander the Great" being "down" on all areas of the web, like wikipedia and all other engines.
Other ontological information related to the documents will be stored by third party ontology providers that work using the document names and extended properties.
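The idea that a name outlives any one location can be sketched as a small resolver that maps a persistent name to several locations and returns the first one that is still reachable. This is a toy model with hypothetical names and example.org addresses; it is not the Handle System's actual protocol or API, and the reachability check is passed in as a plain function so the sketch stays self-contained.

```python
class NameResolver:
    """Maps a persistent document name to any number of locations."""

    def __init__(self):
        self._locations = {}  # name -> list of locations, in preference order

    def bind(self, name, location):
        self._locations.setdefault(name, []).append(location)

    def resolve(self, name, is_reachable):
        """Return the first reachable location for the name, or None."""
        for location in self._locations.get(name, []):
            if is_reachable(location):
                return location
        return None


resolver = NameResolver()
name = "doc:alexander-the-great"  # hypothetical persistent name
resolver.bind(name, "https://origin.example.org/alexander")
resolver.bind(name, "https://mirror.example.org/alexander")

# Pretend the original host has gone away; the name still resolves.
down = {"https://origin.example.org/alexander"}
print(resolver.resolve(name, lambda loc: loc not in down))
```

Third-party ontology providers would then attach their classifications to the name rather than to either address, so a move or outage at one location does not orphan the metadata.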
So, to say it again, who are the players?
The architects. They make the standards and solutions. Or, at least the core system of protocols to communicate the objects of the repositories, their names, and the locations of their extended data and relationships.
Actually extending the data and the relationships among objects is the domain of the ontologists.
So, there are many layers, including the Network, the ontologies, as well as the interfaces, and various processing and sorting algorithms for relevance, influence and credibility.
The topic and index brokers. They go for everything within their domain, and specialize in redirecting people to the best resources, not the best documents.
As well, this new open platform will give rise to the potential to process, research, summarize, map, maintain and support the ever-increasing wealth this information can bring when made open.
Effects on the World :
This technology and its integration will do more than transform the world of "IT", or the "internet".
Its scope will ripple to reach every corner of the earth.
Time is inherently more valuable to someone having access to this toolset.
This is a large change in the relative value between time and capital, or other physical property.
Time will continue to leapfrog in value, and not least for the underprivileged.
To put it simply, the implementation of this Idea is the Trickle Down theory of Knowledge promulgation.
Nothing, and I mean nothing will do more to close the digital divide, than this format.
Forget about putting computers in schools and villages.
When this open platform reaches its inflection point, the hardware will invest in itself.
This concept is truly the Reaganomics of philanthropy, through growth in economies rich and poor.
No other media format will level the world as much as this.
And our society will never be quite the same.
Propagation Strategy.
So, who are the early adopters? How do we deal with issues of existing organizations not wanting to lose their competitive edge using their closed networks?
There are many groups who need to become more aware of how close we are to achieving this open system of Information Retrieval.
They are:
The Bloggers - leading the front line in Tech, Business and Politics. These are the scouts and rangers of the Movement.
The Academic world - This group has engineered the infrastructure and will process and shape its progeny. They are the research labs.
The Corporate World - The ones who brought all the components together, and will eventually adopt and integrate this system into their offerings, either as a server, or client, or a complex blend. (The prosumer, to use an Alvin Toffler word.)
Government - You guessed it, the State. No one can invest in infrastructure like the Government. And, no one stands to gain more from the industry this platform will incubate.
There are also other groups, such as non-profits, non-government organizations, philanthropic organizations, as well as political groups who can benefit from this framework.
Challenges -
The very naming of a concept this vast is highly problematic.
Saying that it is an upgrade to the "web" is quite demeaning, because the web only applies to HTML documents made primarily for display purposes.
Terms like web 2.0, semantic web, social computing, active networks, distributed indexing may talk about one benefit, or allude to an evolution of sorts, but they all fall short in describing it.
Neural Name Net ?
Hey, it's got something going for it.
Every one of the terms has something going for it, but misses others.
Other challenges will be dealing with resistance from existing providers of information retrieval technology, and large data warehouses like Myspace, Google, Amazon, and all the rest of the capitalist machines.
Personally, my latest codename for the project is S.C.O.P.E -
Social Computing Open Protocol Environment
It also has a meaning as in Scope, the tool that one uses to scope.
And, the act of scoping.
As well as the scope of what is being surveyed, as well as the grammatical idea of scope, which determines the scope of a modifier - this is similar to meta-data.
[I had originally made a reference to the parable of David and Goliath, and tried to change the story to fit my needs.
It turns out this was ill-conceived after a marathon blogging session.
The correct parable is the Tower of Babel, for humanity to build a tower to reach the heavens.]
On one side is the old way of communications, with his war chest of cash, his data, his algorithms, his designers, his mathematicians, his venture capital ties, his investment banking clout, armies of shareholders behind him.
It would seem like nothing could stop this process.
But look, just next to every company is another company, with just as much cash, data, algos, and just as many people supporting his reign.
Now it seems there is a sea of companies, all seeking to take from one another, or to buy one another out outright.
It would seem that nothing could get in the way of the mega-companies that combine the great Goliaths of the past.
But, in the distance, without anyone listening or paying attention, is the linguist, having nothing but his own language for cooperation.
He gives nothing but a whisper to the meekest of groups, and at that moment, every company will now either have to speak the new Language of Collaboration, or be history.
Let knowledge trickle down and let ye breathe wisdom.
MR

