Google released some research data that they have been using: a list of all of the words in all of the books that they have scanned. See it here: http://ngrams.googlelabs.com/

I am constantly faced with a major problem on Rhymebrain: because I get the results by importing text from all over the web, many of the words in Rhymebrain are not really words, and the site is filled with spelling mistakes.

There is no standard dictionary that contains all of the words of English. For one thing, people make up words all the time. They verbify nouns, and noun-ify verbs. Can you pinkify your wardrobe? Sure you can! If a signer signs a document, what does the document’s signee do? I have no idea, but people have used this word thousands of times in the last hundred years.

Google’s data has problems too. In particular, it is filled with errors from the scanning process. For example, the word cr6dit appears very often, because the letter e on a printed page sometimes looks like a 6 to dumb computer software that doesn’t know any better.
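To make that concrete, here is a rough sketch of how one might flag this kind of scanning error. It is only a heuristic that I am assuming for illustration (the function name is made up): any token that mixes letters and digits is treated as suspect.

```python
import re

# Hypothetical heuristic, not part of Google's data or tools:
# a token containing both a letter and a digit is probably a scan error.
_HAS_LETTER = re.compile(r"[^\W\d_]")  # any Unicode letter
_HAS_DIGIT = re.compile(r"\d")


def looks_like_scan_error(word: str) -> bool:
    """Return True if the word mixes letters and digits, like 'cr6dit'."""
    return bool(_HAS_LETTER.search(word)) and bool(_HAS_DIGIT.search(word))


if __name__ == "__main__":
    for token in ["credit", "cr6dit", "1984"]:
        print(token, looks_like_scan_error(token))
```

A check like this will also flag legitimate tokens such as 3rd, so it would need a whitelist or a second look before anything is thrown away.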

There is a bright side: the Google data has 3 billion words and a count of how many times each one occurs. Maybe the misspelled words will be eclipsed by the correct ones.
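Here is a minimal sketch of that idea: add up the counts for each word and keep only the words that occur often enough. I am assuming each line is tab-separated with the word first, the year second, and the occurrence count third; the cutoff value and the function names are made up for the example, and this is not necessarily how my own program works.

```python
from collections import Counter
from typing import Iterable, Set


def total_counts(lines: Iterable[str]) -> Counter:
    """Sum the occurrence count of each word across all of its rows."""
    totals: Counter = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Assumed layout: word, year, count, ...; skip anything else.
        if len(fields) < 3 or not fields[2].isdigit():
            continue
        totals[fields[0].lower()] += int(fields[2])
    return totals


def frequent_words(totals: Counter, min_count: int) -> Set[str]:
    """Keep only the words seen at least min_count times in total."""
    return {word for word, count in totals.items() if count >= min_count}


if __name__ == "__main__":
    # Made-up sample rows, not real figures from the data set.
    sample = [
        "credit\t1990\t500\t400\t300",
        "credit\t1991\t700\t600\t500",
        "cr6dit\t1990\t3\t3\t3",
    ]
    print(frequent_words(total_counts(sample), min_count=100))
```

The right cutoff is a judgment call: too low and the scanning errors creep back in, too high and rare but real words get dropped.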

Right now, I am running the data through a program that I wrote to try to figure out if I can use it to enhance the Rhymebrain results.
