Rover: Our Android Submission

We finished our submission for the Android Developer Challenge this past weekend. The project is called Rover, and you can read more about it at our new site dedicated to the project: http://www.roverproject.com. We were a bit disappointed that we could not include everything we had planned, but that’s the way it goes sometimes. You can see a few sample tags we uploaded from the emulator over on roverproject.com to give you an idea of where we can go with this project. We have lots of ideas but little time these days, so for now we’ll see what happens with the competition and proceed from there.

The mobile platform is definitely new and different from anything we’ve worked on in the past, and it will be exciting to see how well Google and the Open Handset Alliance are able to market it considering the level of competition they will face. In the end, though, the difference in price tag may make Android the platform of choice.

Posted at 04/16/08 15:19 | no comments | Filed Under: Uncategorized

Token Heuristics: Notes on TagHelper

While working on TagHelper we ran into some interesting issues around generating useful statistics for the tokens we consider when suggesting tags for a document. Many of these issues come up in any kind of language processing, but I thought I would discuss them with document tagging in mind in case this is helpful to other document parsers out there.

The first issue is discarding tokens that are too general to describe the document. Usually this is handled with a dictionary lookup of common words, which in English would include words like “a”, “the”, “is”, etc. Pulling the 500 or so most commonly used English words and doing some manual filtering on that set can yield a decent starting point. Most words that are not going to be relevant to a document also tend to be short, which helps narrow the focus of the filter tremendously.

Problems start to arise, however, with words that contribute nothing to the meaning of some documents but matter in others. One example of this (once again in English, as this is my first language) is the word “will”. If we use Wiktionary as a reference, we see that “will” has two primary meanings. The first indicates intent; this is probably the most common usage, and in this sense the word would be filtered out of most documents as irrelevant to tagging. On the other hand, it can also refer to a document used to convey the wishes of a deceased individual. This second meaning may be pertinent if the document is, say, an article on “Will Writing” or “Why did I get left out of my parent’s will?”.

Obviously there are contextual clues that can help identify the meaning of a word, but storing these associations in order to bootstrap a word association system can be unwieldy. One approach would be to build the capability to associate words with their context into your parsing engine. This would then allow you to determine the relevance of a word based on the other words found in the document. This can also be tricky, though, since an article on “Will Writing” may have words like “death”, “parents”, “money”, “funeral”, “grammar”, etc. and yet be an article about a guy named “Will” who happens to like to write about death.

Then again, maybe we shouldn’t care about those cases and should just give a probability that a word has been used in a predefined context (e.g. maybe the article is about Will writing about death, but it’s OK if we misinterpret that as an article about writing wills). Either way, you have to accept that you will be wrong some percentage of the time, and if you’re already willing to make that sacrifice, does it really matter if you take context into account for this case at all? The simple answer is that context is going to be important more often than not, so you should plan for it in your analysis whether or not you use it initially. Any kind of interface that makes suggestions or recommendations to a real person is going to have flaws (at least at first), so it makes sense to focus on the parts that are most annoying to your users as long as you don’t limit your possibilities for future extensibility. For this reason, our initial release did not take context into account; although it is capable of making contextual associations, it does not have any such rules at the moment.
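To make the stop-word and context discussion a little more concrete, here is a minimal Python sketch. It is not the TagHelper code: the stop list, the cue words for “will”, the `min_cues` threshold, and the function names are all assumptions made up for illustration. It drops common words, but keeps an ambiguous one when enough related words appear elsewhere in the text.

```python
import re

# Tiny illustrative stop list (assumption); a real one would start from
# a few hundred common words and be filtered by hand.
STOP_WORDS = {"a", "an", "the", "is", "of", "to", "and", "in", "i", "my", "will"}

# Hypothetical context rules: a stop word is kept when enough of its
# cue words also occur somewhere in the document.
CONTEXT_RULES = {
    "will": {"testament", "inheritance", "estate", "probate", "executor"},
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def candidate_tokens(text, min_cues=2):
    """Return tokens that survive the stop list, honoring context rules."""
    tokens = tokenize(text)
    vocabulary = set(tokens)
    kept = []
    for token in tokens:
        if token in STOP_WORDS:
            cues = CONTEXT_RULES.get(token, set())
            # Plain stop words have no cues, so they are always dropped.
            if len(cues & vocabulary) < min_cues:
                continue
        kept.append(token)
    return kept


if __name__ == "__main__":
    print(candidate_tokens("The executor read the will to settle the estate"))
    print(candidate_tokens("I will read the paper"))
```

As the “guy named Will” example shows, a rule like this will still guess wrong some of the time; the threshold only trades one kind of mistake for another.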

The next issue is determining what kind of statistics to collect on tokens. After filtering out uninformative tokens as discussed in the previous section, simple word frequency analysis should give a quick, rough estimate of the relevance of the tokens in the document. If this is the only statistic collected, however, two important issues will come up fairly quickly.

One is that you may end up with a lot of words that appear the same number of times on a page, so how do you rank them? One idea is that words that appear between certain tags may be more important than others. For instance, words that appear in the title or headings, or that are bold or italic, may be more important than those that appear in a regular font. Often, words that appear toward the top of a page may be more relevant than words that appear at the bottom. The problem with any approach that attempts to discern meaning from tag association or document placement is that it needs to keep in mind how the document may have been generated (more on this below).

The other problem with the single-statistic method is that the more tokens you use to infer a document’s intent or purpose, the more polluted your suggestions are going to become. This means you need a statistical method that yields as few tokens as possible for generating suggestions, in order to keep your suggested tag list both manageable and relevant. For many documents, a simple token frequency statistic will generate a fairly uniform distribution, which may yield a useful rank for your tokens but makes it difficult to determine which tokens are actually useful for generating suggested tags.
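The tag-weighting idea can be sketched roughly as follows, using Python’s standard `html.parser`. The weights in `TAG_WEIGHTS`, the `ratio` cutoff, and all of the names are assumptions chosen for illustration, not what TagHelper actually uses, and stop-word filtering is left out to keep the example short. Each token scores the weight of the heaviest enclosing tag, and only tokens close to the best score are kept, which is one simple way to keep the suggestion list short.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Assumed weights: text inside these tags counts more than body text.
TAG_WEIGHTS = {
    "title": 5.0, "h1": 4.0, "h2": 3.0, "h3": 2.0,
    "b": 1.5, "strong": 1.5, "i": 1.5, "em": 1.5,
}


class WeightedTokenCounter(HTMLParser):
    """Accumulates a score per token, weighted by the enclosing tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []      # weighted tags currently open
        self.scores = Counter()  # token -> accumulated score

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Drop the most recently opened matching tag, tolerating sloppy markup.
        for index in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[index] == tag:
                del self.open_tags[index]
                break

    def handle_data(self, data):
        # Body text outside any weighted tag gets the default weight of 1.0.
        weight = max((TAG_WEIGHTS[t] for t in self.open_tags), default=1.0)
        for token in re.findall(r"[a-z]+", data.lower()):
            self.scores[token] += weight


def score_tokens(html):
    """Parse an HTML string and return the weighted token scores."""
    parser = WeightedTokenCounter()
    parser.feed(html)
    parser.close()
    return parser.scores


def top_tokens(scores, ratio=0.5):
    """Keep only tokens scoring at least `ratio` of the best score."""
    if not scores:
        return []
    best = max(scores.values())
    return [token for token, score in scores.most_common() if score >= ratio * best]


if __name__ == "__main__":
    page = ("<html><head><title>Chart tools</title></head><body>"
            "<h1>Chart</h1><p>A chart and a <b>tool</b> for data data</p>"
            "</body></html>")
    print(top_tokens(score_tokens(page)))
```

A relative cutoff like `ratio` is only one option; when the scores are nearly uniform it will still let many tokens through, which is exactly the situation described above.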

Dynamic content generated by scripts, Flash, etc. can be very difficult to use unless you have a document rendering engine sitting on your server, which, unfortunately, we do not. Google appears to be able to make use of dynamic content quite effectively when indexing documents and would probably be able to build the best tag recommendation engine around using its database of information. For now, our system only makes use of data provided in the document itself until we get more funding to do something fancier :).

User-contributed content (such as in forums, blogs, etc.) can make it difficult to separate what the page is about from what the user’s contributed text is about. News sites are the prime example of this case. These sites have new content on a variety of subjects appearing quite frequently that, if parsed, may yield wildly different results for the suggested tags for that page. One way to sidestep this issue is to say that a news site is probably sufficiently generic so as to not require many tags (e.g. news, politics, blog). This is not an acceptable solution, however, since there are general news sites (e.g. CNN), and there are more specific news sites (e.g. Slashdot, which caters to a slightly more specialized audience). It would be useful to be able to distinguish the focus of the contributed content and not just gloss over it as random text with the potential to further pollute your target tag set. We found that in most cases analyzing contributed text alongside text from the rest of the page actually helped the suggestion engine. This greatly simplifies the problem but does not mean there won’t be cases where contributed text creates a lot of random noise that would be better to filter out.

The biggest thing to keep in mind here is that you can’t just remove all the tags in a document in order to analyze the raw text. Context is an important part of the analysis, and often the tags can provide important indicators for interpretation even if true understanding is not the goal.

Posted at 01/18/08 22:43 | no comments | Filed Under: Tagging

TagHelper: Jumpstart Your Tagging

TagHelper (http://www.taghelper.com) was recently launched as an experiment in helping document writers tag their documents. Sometimes tagging is obvious, but other times you need more than a thesaurus to adequately define your text in 10 words or fewer. TagHelper can pull a document from a provided URL, or it can tag text entered directly onto the page. Suggested tags are linked to del.icio.us in case you’re curious about popular articles that share the same tags as your document.

It can also be interesting just to see what TagHelper will come up with for random documents. Of course, it’s not always right, and sometimes it comes up with some odd suggestions, but that’s part of the learning process. We hope to refine TagHelper so that its suggestions are more effective and more helpful to document writers, so check it out at http://www.taghelper.com and let us know what you think by commenting below.

Posted at 01/17/08 22:13 | no comments | Filed Under: Products, Tagging

ChartMake: A simple way to create and store charts online

ChartMake (http://www.chartmake.com) gives users a simple interface to Google’s chart API using the gchart jQuery plugin. Charts can then be saved and accessed via a convenient URL mechanism (e.g. http://www.chartmake.com/chart/?id=5) with no login or password required! So browse on over to http://www.chartmake.com and start creating your charts!

Posted at 01/08/08 22:06 | no comments | Filed Under: Products

SnipQuote Launched

We just launched SnipQuote, a site that makes it easy to save and share your favorite quotations. Using the SnipQuote Firefox plugin or bookmarklet, you can simply highlight text on any web page, then click SnipQuote to tag and submit your quote. Your quotations will be instantly saved to your user page and be shared with the world!

SnipQuote allows you to subscribe to quotations using RSS, Facebook, or our Google Gadget. Check it out!

Posted at 12/18/07 22:12 | no comments | Filed Under: Products

About

Tragic Phantom Productions is a small software company that creates social web applications that are simple, well-designed, and useful.
