Describing ‘Red’ To A Blind Man: The Dilemma Of Ontology

Recently asked on Yahoo! Answers:

How to map a keyword to a category?
I have a bunch of general categories, such as: Games, Modeling, Business, Finance, etc.

My question is, how can I take a keyword such as “Xbox 360” and automatically map it to a category such as “Games”? This is an easy example, and I did figure out ways to do this, but if I take a harder example such as “Tyra Banks”, I am unable to map it to a category of mine such as “Modeling” or “Tv Shows”.

I have been thinking of this for a very long time, and I can’t come up with a concrete solution. I have also searched the web and found nothing that would provide this service.

Any ideas?

Basically, computers develop semantic understanding the same way we humans do, only far less efficiently: repetition of example. (In computer science, a formal model of concepts and the relationships between them is called an ontology.)

The reason we are far more efficient is that we more easily create connections between things, and more quickly process those connections, than computers can manage.

For example, suppose I give you three categories: colloid, coagulant, polymer. Now, place “chocolate milk” into the proper category.

The problem for most people is, although they’re 100 percent aware of what chocolate milk is, they don’t know what colloids, coagulants or polymers are. (If you do, pretend you don’t and bear with me.)

Chocolate milk is a colloid; that is, a uniform distribution of solids in a liquid.

Colloid is a word most people have never heard, so it has no meaning (semantic) associated with it in most people’s minds.

However, now that you know what a colloid is, you can very quickly come up with two more: Pepsi and coffee. They, too, are uniform distributions of solids inside a liquid base.

You can make this association because even if you don’t drink them regularly, you’ve at least seen Pepsi and coffee on thousands of occasions. You are intimately aware of their properties and you can easily define how they are similar.

So because you know literally everything there is to know about Pepsi, coffee and chocolate milk, when I tell you that chocolate milk is a colloid, you automatically know that Pepsi and coffee are, too.

However, you also know that chocolate milk, coffee and Pepsi are not exactly the same thing. For example, you know that Pepsi is most often served cold, coffee is most often served hot and chocolate milk, when not served cold, is often called “cocoa.”

These are the kinds of relationships that are brutally difficult for computers to make. And it is difficult for many programmers, such as the questioner above, to get computers to handle this depth of semantics, in large part because of our own inability to forget everything we know about everything.

Ontology Today

The most common form of semantic understanding in computer programming is frequency. That is, "how many times does the name Jesus occur in the New Testament?" If it occurs a lot, then clearly the New Testament is about Jesus.

This is the basic methodology employed by search engines: If the word or phrase you seek appears often, and in the context of grammatically correct sentences, on a Web page, then the Web page is probably about the word or phrase you entered; if the URL, title or description of the page contains that word or phrase, then it’s even more likely the page applies to your search term.

(There’s more to weighting a search query than that, but let’s keep it basic for simplicity’s sake.)
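The frequency-plus-context idea above can be sketched in a few lines. This is a toy illustration, not how any real search engine works: the page, the boost values, and the scoring scheme are all invented for the example.

```python
from collections import Counter
import re

def tokens(text):
    """Lowercase a string and split it into word-like tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def score(page, query):
    """Score a page for a query: body frequency, plus a crude boost
    when the term also appears in the title or the URL."""
    body_hits = Counter(tokens(page["body"]))[query.lower()]
    boost = 2 if query.lower() in tokens(page["title"]) else 0
    boost += 2 if query.lower() in page["url"].lower() else 0
    return body_hits + boost

# Invented example page.
page = {
    "url": "http://example.com/xbox-360-games",
    "title": "Xbox 360 Games",
    "body": "The Xbox 360 plays games. Games for the Xbox are popular.",
}
print(score(page, "games"))  # prints 6: two body hits + title boost + URL boost
```

Real engines weight dozens of signals this way and normalize for document length, but the shape of the computation is the same: count, then boost by context.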

I’m no math whiz, so I can’t really explain (or fully understand) the methodologies used by, say, the full-text or natural-language query searches that most modern databases can perform.

The interesting thing is that as time has gone by, Google and Yahoo! are more likely to return Web pages that contain what you meant to find, rather than what you specifically asked to get; today’s search engine results are a far cry from the results you’d get not even five years ago.

Of course, some of that is better, faster machines and more experienced programmers. But a lot of it has to do with the copious amounts of data now available to make the connections between what I asked for, the links I got and what I wound up clicking.

In other words, because there are so many Web searches for so many different terms, and Google / Yahoo! can analyze what people click on once they get their results, they can “teach” their systems how to make better semantic connections between search words and what people meant when they used them.
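That feedback loop is conceptually simple, even if the scale is not. A toy sketch, with an invented click log standing in for the billions of observations a real engine has:

```python
# Invented data: (query, url) -> number of clicks observed for that pairing.
click_log = {
    ("jaguar", "http://example.com/cars"): 40,
    ("jaguar", "http://example.com/cats"): 9,
}

def rerank(query, urls):
    """Order results by how often past searchers clicked them for this query.
    Unseen pairs score zero and sink to the bottom."""
    return sorted(urls, key=lambda u: click_log.get((query, u), 0), reverse=True)

results = ["http://example.com/cats", "http://example.com/cars"]
print(rerank("jaguar", results))  # the cars page rises to the top
```

The clicks are the "repetition of example" from earlier: nobody told the system what searchers mean by "jaguar"; their behavior taught it.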

Forget Everything You Know, Then Ignore Everything You Learn

The problem for the programmer is that almost always, any new leap into ontology begs the question.

That is, you can’t get past the fact that you already understand the relationship between the words you are trying to link. You can’t forget them or even ignore them, no matter how hard you try, because pretty much the only thing your brain does is semantics.

Because you know the relationship, trying to establish how to create it becomes exceedingly difficult. As is the case with our drinks example, relationships often become complex and depend on variables that are in flux or are matters of understanding or circumstances — understanding and circumstances that often are as difficult to define as the relationship you’re trying to create.

The benefit of having voluminous amounts of data, such as Yahoo! and Google have at their disposal, is they can provide initially simplistic relationships, toss them into the lake, see which ones swim, and narrow the next set of results by promoting the ones that swam best.

With small amounts of data, such as this questioner has at his disposal, it’s nearly impossible to instruct a computer on how to make a relationship. The amount of data you’d need to use, in order for the computer to be able to come up with anything approaching reliable results, would be so overwhelming as to entirely defeat the purpose of writing the algorithm in the first place.

4 Comments

  1. Thanks for taking the time to give your view on this puzzling question.

    I agree with what you are saying. But it seems like you have taken the question and turned it around. What you have explained is taking a category and mapping it to keywords. That is not really what I wanted. Sure, it'll be hard for the system to figure out that a colloid is in milk, Pepsi, Coca-Cola, and more. It seems like you've taken it a bit too far.

    What I was trying to do is figure out a way to take a keyword, and then find the most relevant category. And that IS possible, but it does require data. I should have probably asked: what kind of data found on the internet can we use in order to get the most accurate categories for the largest number of keywords? Or possibly what collection of data.

    My original idea worked, sort of. And it is what you sort of explain. The idea was to use data from search engines. Take a keyword such as “banking” and run it across the top 100 sites listed on Google. Since Google is a good source for relevant information, you can bank on the fact that they will provide you mainly with accurate top 100 results.

    By indexing all those top 100 documents, you’ll then need a separate technology that will get all the related keywords to your main keyword.

    So let’s say that you now got your top 10 related keywords which would be something like:

    Music
    Mp3
    Portable Device
    Computers
    Apple

    You'd be sure to get the top most repeated keywords / keyword phrases (and that would require yet another technology) that do not contain the word "IPOD".

    Now you would go back to your list of let’s say – 500 categories. Each category would need to have related keywords. So the category “Music” would need to have keywords that would be “music,mp3,rap,hiphop,pop,rock,etc”.

    My idea was to now run the related keywords that you found against the related keywords of the categories. The category that has the most matches would take the spot as being the most relevant category.

    The reason I liked this system was because if you have a new product that just came out, for instance let’s say IPOD was just announced yesterday – you now know that Google will definitely have it listed.

    And since the categories would contain terms that could be found in the English dictionary, then you don’t need to worry about having category related terms of newly released products, therefore the categories don’t need updates, or possibly only a few updates every few years.

    The only service you’d rely on is Google to get you the recent and relevant web pages for your initial search term.

    But at the end of the day, this is too hard. And once again, I fall into a problem where I will be able to map the keyword "Ipod" or "Ipod Devices" to the "Music" category. But, once again, how do I map something to the category "Recreation"? It's not the same ball game. And how can I map something like "United States" to the category "World"? I mean, yes I can map it, but that would now require me to make specific rules for countries / cities / states to map to a "LOCAL" category.

    The question is, how can we use the data that is out there to build such a system?

  2. Whoops, I didn’t mean keyword “banking” in that post, I meant “IPOD”. Not sure how I made that mistake. 😉

    And what I meant to say is that after indexing the top 100 documents, you’d scan the content of these documents to get your relevant keywords.

  3. Read the entry again. My intent was to explain the programming problem and expose why it’s so difficult to do what you want to do; not to provide a solution. In that sense, your follow-ups repeat my points.
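For what it's worth, the keyword-overlap matching the first comment describes is easy to sketch; the hard part, as the post argues, is getting the data. The category lists and extracted keywords below are invented for illustration.

```python
# Invented category -> related-keyword sets, as the comment describes
# ("Each category would need to have related keywords").
categories = {
    "Music": {"music", "mp3", "rap", "hiphop", "pop", "rock"},
    "Computers": {"computer", "software", "hardware", "apple"},
    "Finance": {"bank", "stock", "loan", "credit"},
}

def best_category(related_keywords):
    """Pick the category whose keyword set overlaps most with the
    keywords extracted from the top search results."""
    scores = {name: len(kw & set(related_keywords)) for name, kw in categories.items()}
    return max(scores, key=scores.get)

# Keywords hypothetically extracted from the top results for "IPOD".
extracted = ["music", "mp3", "portable", "device", "apple"]
print(best_category(extracted))  # prints Music (2 overlaps beat Computers' 1)
```

Note that ties break arbitrarily and a single overlap counts as much as ten; this is exactly the kind of brittle relationship the post says small datasets can't escape.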
