Monday, 30 March 2009

The Unreasonable Effectiveness of Data

Interesting article by some Google folks about how access to a really, really huge text corpus changes machine translation. They argue that as the corpus of available documents gets bigger, you need less and less structure in your training data, and simpler algorithms to make use of it.

Every ontology - a formal structure representing a set of concepts - is a treaty between the people who have agreed to use it, and as its scope grows that agreement becomes ever more politically fraught, never mind the complexity issues. They reckon the semantic web (in which web content is marked up according to an ontology to make it more machine-readable) is likely to be beaten out of the starting gate by something 'dumber', but informed by a trillion-document corpus.

There's an entertaining bit at the end, where they posit that these statistical tools may actually help schema/ontology designers by offering semantic auto-complete. You start creating a table named CARS with columns MAKE and MODEL, and (because the tool knows these concepts tend to occur together in the corpus) it could auto-suggest columns such as YEAR and COLOUR.
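
To get a feel for how little machinery the auto-complete idea might need, here's a toy sketch using nothing but co-occurrence counts. This is my own illustration, not the article's method: the five-table CORPUS is made up, and the names build_cooccurrence and suggest_columns are hypothetical. A real version would count over millions of schemas and would probably want a smarter score than raw counts.

```python
from collections import Counter
from itertools import permutations

# Hypothetical toy corpus: each entry is the set of column names from one table.
# A real system would harvest these from a very large collection of tables.
CORPUS = [
    {"MAKE", "MODEL", "YEAR", "COLOUR"},
    {"MAKE", "MODEL", "YEAR", "PRICE"},
    {"MAKE", "MODEL", "COLOUR"},
    {"TITLE", "AUTHOR", "YEAR"},
    {"NAME", "EMAIL"},
]


def build_cooccurrence(corpus):
    """Count, for each ordered pair of columns, how many schemas contain both."""
    counts = Counter()
    for schema in corpus:
        for a, b in permutations(sorted(schema), 2):
            counts[(a, b)] += 1
    return counts


def suggest_columns(existing, counts, top_n=2):
    """Rank columns not yet in the table by total co-occurrence with existing ones."""
    scores = Counter()
    for (a, b), n in counts.items():
        if a in existing and b not in existing:
            scores[b] += n
    # Sort by score (highest first), then by name so ties come out in a fixed order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:top_n]]


counts = build_cooccurrence(CORPUS)
suggestions = suggest_columns({"MAKE", "MODEL"}, counts)
print(suggestions)  # ['COLOUR', 'YEAR']
```

With this made-up corpus, YEAR and COLOUR each turn up alongside MAKE and MODEL in two tables, so they outrank PRICE, which only turns up in one. No ontology required, just counting.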