Thursday, October 15, 2009

Map/Reduce & the Mechanical Turk

So, I have a project that I have wanted to get off the ground for a long, long time, which involves people solving a problem that computers seem incapable of – making sense of the entertainment industry’s metadata mess. It’s a disaster, no one can match anything to anything with any degree of confidence, and even companies whose raison d'ĂȘtre is value-added metadata don’t seem capable of getting it right. I don’t entirely blame them, as having worked at the sharp end for a number of years I know how difficult it is.

Except that it isn’t really – at least not for humans. Computers can’t do it because IDs don’t match and there’s very little fixed structure. It’s a schema-less nightmare. The only significant effort I’ve seen at creating a universal schema was hopeless. (Unfortunately I was supposed to be managing it at the time!) And yet it’s quite easy to match assets to metadata across formats (digital, physical etc.) as a human. We can match images and sounds, we can do loose / fuzzy text matches, and above all we have common sense.

The problem for us humans is scale – tens of millions of assets need matching – which superficially appears best suited to a programmatic attack. So how can we reconcile the requirement for human intervention with a problem of this size?

This is a map/reduce problem at its heart – we need to spread the work across as many people as we can, and then aggregate the results. Is the Amazon Mechanical Turk the solution?
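The human map/reduce idea can be sketched in a few lines. This is purely illustrative – the function names and batch size are my own invention, not a real Mechanical Turk API: "map" splits the asset backlog into HIT-sized batches of work, and "reduce" aggregates the answers workers send back.

```python
def map_to_hits(assets, batch_size=10):
    """Map phase: split the full asset list into HIT-sized batches,
    one batch per Mechanical Turk HIT."""
    return [assets[i:i + batch_size] for i in range(0, len(assets), batch_size)]

def reduce_results(hit_results):
    """Reduce phase: merge per-HIT answers (asset -> metadata ID)
    into one combined mapping."""
    merged = {}
    for result in hit_results:
        merged.update(result)
    return merged

# 25 assets in batches of 10 -> 3 HITs (two of 10, one of 5)
batches = map_to_hits(list(range(25)), batch_size=10)
```

The aggregation step is where any quality control would live – more on that below.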

Add in the spice of having no fixed schema (what happens to your precious database when the music industry decides to create a new product type that looks a bit like an album, but different?) and it’s a problem for the NoSQL generation.

So here’s my solution – stick all the available data into a non-relational document store, index with a search engine, and then present a simple user interface to allow people to validate the metadata and to perform the all-important matching process. Finally, motivate people to do the work by paying them, and use the Mechanical Turk to manage the human map/reduce function.
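As a toy stand-in for the document store plus search engine, here’s a sketch of the candidate-generation step: schema-less records go in as plain dictionaries, and a loose, case-insensitive title search surfaces likely matches for a human to confirm or reject in the UI. The field names and cutoff are illustrative assumptions, and the real thing would use a proper search engine rather than the standard library’s `difflib`.

```python
import difflib

# Schema-less "document store": each record is just a dict, and
# different records can carry different fields.
documents = [
    {"title": "Abbey Road", "format": "CD"},
    {"title": "Abbey Road (Remastered)", "format": "digital"},
    {"title": "Let It Be", "format": "vinyl"},
]

def candidate_matches(query_title, docs, cutoff=0.5):
    """Return documents whose titles loosely match the query,
    best matches first, for a human to validate."""
    lowered = {d["title"].lower(): d for d in docs}
    close = difflib.get_close_matches(
        query_title.lower(), list(lowered), n=5, cutoff=cutoff
    )
    return [lowered[t] for t in close]

# A fuzzy query pulls back both Abbey Road variants but not Let It Be
matches = candidate_matches("abbey road", documents)
```

The point is that the machine only has to narrow the field; the actual matching judgement stays with the person doing the HIT.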

Some kind of validation is required to maintain data quality (only accept matches provided by multiple people?) – who knows, if enough people join, perhaps the labels / studios themselves might get involved to officially endorse the work (think Twitter verified accounts).
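The "only accept matches provided by multiple people" idea is just majority voting with a minimum-agreement threshold. A minimal sketch, with hypothetical names and an illustrative threshold of two:

```python
from collections import Counter

def accepted_matches(worker_answers, min_agreement=2):
    """worker_answers: list of (asset_id, metadata_id) pairs, one per
    worker answer. Returns {asset_id: metadata_id} containing only the
    matches where at least `min_agreement` workers agreed."""
    votes = {}
    for asset_id, metadata_id in worker_answers:
        votes.setdefault(asset_id, Counter())[metadata_id] += 1

    accepted = {}
    for asset_id, counter in votes.items():
        metadata_id, count = counter.most_common(1)[0]
        if count >= min_agreement:
            accepted[asset_id] = metadata_id
    return accepted

# Asset "x" gets two votes for m1 (accepted); "y" only one vote (rejected)
answers = [("x", "m1"), ("x", "m1"), ("x", "m2"), ("y", "m3")]
```

In practice you’d probably also want to track per-worker accuracy so that a trusted worker’s vote counts for more, but simple agreement is a reasonable starting point.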

All I need now is someone to pay to have it done…

[UPDATE]

I’ve just done my first couple of HITs (Human Intelligence Tasks) – looking up iTunes audiobook prices for someone – hopefully I now have $0.04 winging its way across to me. Here’s a screencast of me in action! http://screencast.com/t/GazhIehoEW
