Token Heuristics: Notes on TagHelper

While working on TagHelper we ran into some interesting questions about how to generate useful statistics for the tokens we want to consider when suggesting tags for a document. Many of these come up in any kind of language processing, but I thought I would discuss them with document tagging in mind in case this is helpful to other people writing document parsers.

The first issue is discarding tokens that are too general to be likely to describe the document. Usually this is handled with a dictionary lookup of common words, which in English would include words like “a”, “the”, “is”, etc. Pulling the top 500 or so most commonly used English words and doing some manual filtering on that set can yield a decent starting point. Most words that are not going to be relevant to a document tend to be short, which also helps because it narrows the focus of the filter tremendously.

Problems start to arise, however, when considering words that contribute to the meaning of the document in some instances but not in others. One example of this (once again in English, as this is my first language) is the word “will”. If we use Wiktionary as a reference we see that “will” has two primary meanings. The first indicates intent; this is probably the most common usage, and as such the word would likely be filtered out of most documents rather than treated as a token relevant to tagging. On the other hand, it can also refer to a document used to convey the wishes of a deceased individual. This second meaning may be pertinent if the document is, say, an article on “Will Writing” or “Why did I get left out of my parent’s will?”. Obviously there are contextual clues that can help identify the meaning of a word, but storing enough of these associations to bootstrap a word association system can be unwieldy. One approach would be to build the capability to associate words with their context into your parsing engine. This would then allow you to determine the relevance of a word based on the other words found in the document. This can also be tricky, though, since an article on “Will Writing” may have words like “death”, “parents”, “money”, “funeral”, “grammar”, etc. and yet be an article about a guy named “Will” who happens to like to write about death.

Then again, maybe we shouldn’t care about those cases and should just give a probability that a word has been used in a predefined context (e.g. maybe the article is about Will writing about death, but it’s OK if we misinterpret that as an article about writing wills). Either way, you have to accept that you will be wrong some percentage of the time, and if you’re already willing to make that sacrifice, does it really matter whether you take context into account for this case at all? The simple answer is that context is going to be important more often than not, so you should plan for it in your analysis regardless of whether you use it initially. Any kind of interface that makes suggestions or recommendations to a real person is going to have flaws (at least at first), so it makes sense to focus on the parts that are most annoying to your users, as long as you don’t limit your possibilities for future extensibility. For this reason, our initial release did not take context into account; although it is capable of making contextual associations, it does not have any such rules at the moment.
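To make the filtering idea a bit more concrete, here is a minimal Python sketch of a stop-word filter with an optional context override. This is not TagHelper's code: the word list, the cue words for “will”, and the names `STOP_WORDS`, `CONTEXT_RULES`, `filter_tokens` and `min_cues` are all made up for illustration, and a real list would start from the 500 or so common words mentioned above.

```python
import re

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "the", "is", "of", "to", "and", "in", "i", "get", "will"}

# Hypothetical context rules: an ambiguous stop word is kept only when
# enough of its cue words also appear somewhere in the document.
CONTEXT_RULES = {
    "will": {"estate", "inheritance", "executor", "probate", "testament"},
}


def tokenize(text):
    """Lowercase the text and split it into simple alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def filter_tokens(text, min_cues=2):
    """Drop stop words, unless a context rule says the word is relevant."""
    tokens = tokenize(text)
    vocabulary = set(tokens)
    kept = []
    for token in tokens:
        if token in STOP_WORDS:
            cues = CONTEXT_RULES.get(token)
            # No rule for this word, or too few cues present: discard it.
            if cues is None or len(cues & vocabulary) < min_cues:
                continue
        kept.append(token)
    return kept


if __name__ == "__main__":
    print(filter_tokens("The will is a testament of the estate"))
    print(filter_tokens("I will get the book"))
```

Leaving `CONTEXT_RULES` empty gives plain stop-word filtering, which is roughly the situation described above: the hook for context exists, but no rules are defined yet. The article about a guy named Will would still fool this, of course.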

The next issue is determining what kind of statistics to collect on tokens. After filtering out innocuous tokens as discussed in the previous section, simple word frequency analysis should give a quick, rough estimate of the relevance of certain tokens in the document. If this is the only statistic collected, however, two important issues will come up fairly quickly. One is that you may end up with a lot of words that appear the same number of times on a page, and how do you rank those words? One idea is that words that appear between certain tags may be more important than others. For instance, words that appear in the title or headings, or that are bold or italic, may be more important than those that appear in a regular font. Often, words that appear toward the top of a page may be more relevant than words that appear at the bottom of a page. The problem with any approach that attempts to discern meaning from tag association or document placement is that it will need to keep context in mind based on how the document may have been generated (more on this below). The other problem with the single-statistic method is that the more tokens you use to infer a document’s intent or purpose, the more polluted your suggestions are going to become. This means you need a statistical method that tries to yield as few tokens as possible for generating suggestions, in order to keep your suggested tag list both manageable and relevant. For many documents, a simple token frequency statistic will generate a fairly uniform distribution, which may yield a useful rank for your tokens, but may make it difficult to determine which tokens are actually useful for generating suggested tags.
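As a rough illustration of weighting tokens by the tags around them, the following Python sketch uses the standard library's `html.parser` to score each token by the heaviest tag it sits inside, and breaks ties in favor of tokens that appear earlier on the page. The weights in `TAG_WEIGHTS` and the names `WeightedCounter` and `rank_tokens` are assumptions picked for the example, not values we actually use, and stop-word filtering is left out to keep it short.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical weights; tokens outside these tags count as 1.0.
TAG_WEIGHTS = {
    "title": 5.0, "h1": 4.0, "h2": 3.0, "h3": 2.0,
    "b": 1.5, "strong": 1.5, "i": 1.5, "em": 1.5,
}


class WeightedCounter(HTMLParser):
    """Accumulate a weighted frequency score for every token in a page."""

    def __init__(self):
        super().__init__()
        self.open_tags = []               # weighted tags currently open
        self.scores = defaultdict(float)  # token -> weighted frequency
        self.first_seen = {}              # token -> position of first use
        self.position = 0

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Close the most recently opened matching tag.
            index = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[index]

    def handle_data(self, data):
        # Use the heaviest enclosing tag, or 1.0 for plain text.
        weight = max((TAG_WEIGHTS[t] for t in self.open_tags), default=1.0)
        for token in re.findall(r"[a-z]+", data.lower()):
            self.scores[token] += weight
            self.first_seen.setdefault(token, self.position)
            self.position += 1


def rank_tokens(html, limit=None):
    """Return tokens by descending score; earlier tokens win ties."""
    parser = WeightedCounter()
    parser.feed(html)
    parser.close()
    ranked = sorted(
        parser.scores,
        key=lambda t: (-parser.scores[t], parser.first_seen[t]),
    )
    return ranked if limit is None else ranked[:limit]


if __name__ == "__main__":
    page = ("<html><head><title>Python parsing</title></head><body>"
            "<p>parsing text with <b>python</b> and text</p></body></html>")
    print(rank_tokens(page, limit=3))
```

The `limit` argument is the crude version of keeping the suggestion list small. It does nothing about the uniform distribution problem, and the sketch also counts text inside script and style elements, which a real parser would want to skip.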

Dynamic content generated by scripts, Flash, etc. can be very difficult to use unless you have a document rendering engine sitting on your server, which, unfortunately, we do not. Google appears to be able to make use of dynamic content quite effectively when indexing documents and would probably be able to build the best tag recommendation engine around using their database of information. For now, our system only makes use of data provided in the document itself until we get more funding to do something fancier :).

User-contributed content (such as in forums, blogs, etc.) can make it difficult to separate what the page is about from what the user’s contributed text is about. News sites are the prime example of this case. These sites have new content on a variety of subjects appearing quite frequently that, if parsed, may yield wildly different results for the suggested tags for that page. One way to sidestep this issue is to say that a news site is probably sufficiently generic so as to not require many tags (e.g. news, politics, blog). This is not an acceptable solution, however, since there are general news sites (e.g. CNN), and there are more specific news sites (e.g. Slashdot, which caters to a slightly more specialized audience). It would be useful to be able to distinguish the focus of the contributed content and not just gloss over it as random text with the potential to further pollute your target tag set. We found that in most cases analyzing contributed text alongside text from the rest of the page actually helped the suggestion engine. This greatly simplifies the problem but does not mean there won’t be cases where contributed text creates a lot of random noise that would be better to filter out.
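One simple way to picture analyzing contributed text alongside the rest of the page is to count tokens in both and merge the counts, with a knob for how much the contributed text should matter. The Python sketch below assumes you have already separated the two pieces of text somehow (which is the hard part); `combined_scores` and `contributed_weight` are hypothetical names, and a weight of 1.0 corresponds to treating all text equally.

```python
import re
from collections import Counter


def count_tokens(text):
    """Count simple alphabetic tokens in a piece of text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def combined_scores(page_text, contributed_text, contributed_weight=1.0):
    """Merge token counts, scaling the contributed text by a weight."""
    scores = Counter()
    for token, count in count_tokens(page_text).items():
        scores[token] += count
    for token, count in count_tokens(contributed_text).items():
        scores[token] += count * contributed_weight
    return scores


if __name__ == "__main__":
    scores = combined_scores(
        "linux kernel release",   # text from the page itself
        "kernel bug kernel",      # text from user comments
        contributed_weight=0.5,
    )
    print(scores.most_common(3))
```

Lowering the weight toward zero is one way to handle the pages where comments are mostly noise, without having to throw the contributed text away entirely.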

The biggest thing to keep in mind here is that you can’t just remove all the tags in a document in order to analyze the raw text. Context is an important part of the analysis, and often the tags can provide important indicators for interpretation even if true understanding is not the goal.
