Recently asked on Yahoo! Answers:
How to map a keyword to a category?
I have a bunch of general categories, such as: Games, Modeling, Business, Finance, etc.
My question is, how can I take a keyword such as “Xbox 360” and automatically map it to a category such as “Games”? This is an easy example, and I did figure out ways to do this, but if I take a harder example such as “Tyra Banks”, I am unable to map it to a category of mine such as “Modeling” or “Tv Shows”.
I have been thinking of this for a very long time, and I can’t come up with a concrete solution. I have also searched the web and found nothing that would provide this service.
The reason we humans are far more efficient at this kind of categorization than computers are is that we more easily create connections between things, and more quickly process those connections, than computers can manage.
For example, suppose I give you three categories: colloid, coagulant, polymer. Now, place “chocolate milk” into the proper category.
The problem for most people is, although they’re 100 percent aware of what chocolate milk is, they don’t know what colloids, coagulants or polymers are. (If you do, pretend you don’t and bear with me.)
Chocolate milk is a colloid; that is, a uniform distribution of solids in a liquid.
Colloid is a word most people have never heard, so it has no semantic meaning associated with it in most people’s minds.
However, now that you know what a colloid is, you can very quickly come up with two more: Pepsi and coffee. They, too, are uniform distributions of solids inside a liquid base.
You can make this association because even if you don’t drink them regularly, you’ve at least seen Pepsi and coffee on thousands of occasions. You are intimately aware of their properties and you can easily define how they are similar.
So because you know literally everything there is to know about Pepsi, coffee and chocolate milk, when I tell you that chocolate milk is a colloid, you automatically know that Pepsi and coffee are, too.
However, you also know that chocolate milk, coffee and Pepsi are not exactly the same thing. For example, you know that Pepsi is most often served cold, coffee is most often served hot and chocolate milk, when not served cold, is often called “cocoa.”
These are the kinds of relationships that are brutally difficult for computers to make. And it is difficult for many programmers, such as the questioner above, to get computers to handle this depth of semantics, in large part due to our own inability to forget everything we know about everything.
The most common form of semantic understanding in computer programming is frequency. That is, “how many times does Jesus occur in the New Testament?” If it’s a lot, then clearly the New Testament is about Jesus.
This is the basic methodology employed by search engines: If the word or phrase you seek appears often, and in the context of grammatically correct sentences, on a Web page, then the Web page is probably about the word or phrase you entered; if the URL, title or description of the page contains that word or phrase, then it’s even more likely the page applies to your search term.
(There’s more to weighting a search query than that, but let’s keep it basic for simplicity’s sake.)
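To make the frequency-plus-weighting idea concrete, here is a minimal Python sketch. The field weights and the sample pages are invented purely for illustration; real engines use far more elaborate scoring than counting occurrences.

```python
# Naive frequency-based relevance scoring: count how often the query
# term appears in each field of a page, weighting URL/title/description
# matches more heavily than body matches. Weights are invented for
# illustration only.

def score_page(query, page):
    """Return a relevance score for `page`, a dict of text fields."""
    weights = {"url": 3.0, "title": 2.5, "description": 2.0, "body": 1.0}
    q = query.lower()
    score = 0.0
    for field, weight in weights.items():
        text = page.get(field, "").lower()
        score += weight * text.count(q)
    return score

pages = [
    {"url": "example.com/xbox", "title": "Xbox 360 reviews",
     "body": "The Xbox 360 is a games console. Xbox games abound."},
    {"url": "example.com/news", "title": "Daily news",
     "body": "One stray mention of Xbox here."},
]

ranked = sorted(pages, key=lambda p: score_page("xbox", p), reverse=True)
print(ranked[0]["url"])  # the Xbox-heavy page scores highest
```

Note what this sketch cannot do: it will never connect “Tyra Banks” to “Modeling,” because no amount of counting surfaces a relationship between words that never co-occur in the document.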
I’m no math whiz, so I can’t really explain (or fully understand) the methodologies used by, say, the full-text or natural-language query searches that most modern databases can perform.
The interesting thing is that as time has gone by, Google and Yahoo! are more likely to return Web pages that contain what you meant to find, rather than what you specifically asked to get; today’s search engine results are a far cry from the results you’d get not even five years ago.
Of course, some of that is better, faster machines and more experienced programmers. But a lot of it has to do with the copious amounts of data now available to make the connections between what I asked for, the links I got and what I wound up clicking.
In other words, because there are so many Web searches for so many different terms, and Google / Yahoo! can analyze what people click on once they get their results, they can “teach” their systems how to make better semantic connections between search words and what people meant when they used them.
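That feedback loop can be caricatured in a few lines of Python. The click log, page names, and the simple “add one per click” update rule are all invented for illustration; production systems use vastly more sophisticated models.

```python
# Toy click-feedback loop: each time a user searches a term and clicks
# a result, strengthen the association between that term and that page.
# Future rankings then promote pages with higher click-derived weights.
# All data and the update rule are invented for illustration.

from collections import defaultdict

click_weights = defaultdict(float)  # (query, page) -> learned weight

def record_click(query, page):
    """Strengthen the query -> page association after a click."""
    click_weights[(query, page)] += 1.0

def rank(query, candidates):
    """Order candidate pages by learned click weight, highest first."""
    return sorted(candidates,
                  key=lambda p: click_weights[(query, p)],
                  reverse=True)

# Simulated click log: users searching "tyra banks" mostly click the
# modeling page, even though the query itself never says "modeling".
for _ in range(5):
    record_click("tyra banks", "modeling-page")
record_click("tyra banks", "finance-page")

print(rank("tyra banks", ["finance-page", "modeling-page"]))
```

The semantic connection between “tyra banks” and the modeling page is never programmed in; it emerges from what users clicked, which is exactly why this approach needs the enormous query volumes that Google and Yahoo! have.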
Forget Everything You Know, Then Ignore Everything You Learn
The problem for the programmer is that almost always, any new leap into ontology begs the question.
That is, you can’t get past the fact that you already understand the relationship between the words you are trying to link. You can’t forget them or even ignore them, no matter how hard you try, because pretty much the only thing your brain does is semantics.
Because you know the relationship, trying to establish how to create it becomes exceedingly difficult. As is the case with our drinks example, relationships often become complex and depend on variables that are in flux or are matters of understanding or circumstances — understanding and circumstances that often are as difficult to define as the relationship you’re trying to create.
The benefit of having voluminous amounts of data, such as Yahoo! and Google have at their disposal, is that they can provide initially simplistic relationships, toss them into the lake, see which ones swim, and narrow the next set of results by promoting the ones that swam best.
With small amounts of data, such as this questioner has at his disposal, it’s nearly impossible to instruct a computer on how to make a relationship. The amount of data you’d need to use, in order for the computer to be able to come up with anything approaching reliable results, would be so overwhelming as to entirely defeat the purpose of writing the algorithm in the first place.