I’ve finished my analysis of the Google Books N-grams raw data and incorporated 2.6 million words into RhymeBrain. That is a tenfold increase. (RhymeBrain Word List is here)

Most of the words are OCR garbage, which forced me to come up with a better algorithm for eliminating garbage words. With the Google data, the algorithm comes up with thousands of candidate rhymes for any given word (even “orange”). The list is now whittled down to 25 or so by taking into account both RhymeRank(TM) and log(frequency). The user can click a button to load up to 400 results.
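To make the ranking idea concrete, here is a minimal sketch in Python. It is not RhymeBrain's actual code: the post does not say how RhymeRank(TM) and log(frequency) are combined, so the weighted sum, the `freq_weight` parameter, the `rank_rhymes` name, and the score scale are all assumptions for illustration.

```python
import math


def rank_rhymes(candidates, limit=25, freq_weight=1.0):
    """Return the top `limit` words from (word, rhyme_score, frequency) tuples.

    Assumption: the two signals are combined as a simple weighted sum.
    The real combination used by RhymeBrain is not described in the post.
    """
    def combined(item):
        _word, rhyme_score, frequency = item
        # log(frequency + 1) keeps very common words from swamping the
        # rhyme score and avoids log(0) for words that were never counted.
        return rhyme_score + freq_weight * math.log(frequency + 1)

    ranked = sorted(candidates, key=combined, reverse=True)
    return [word for word, _score, _freq in ranked[:limit]]
```

With a rule like this, an OCR artifact that happens to rhyme well but appears only once in the corpus sinks below real words that rhyme slightly less well. Passing `limit=400` would correspond to the "load more" button.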

There is a trade-off. I collapsed the historical Google Books data across all years, so perfectly legitimate words like “shutterbug” end up with a very low frequency because they were coined only recently.

On the implementation side, the word tree grew to 90 MB, which is too much to load for each query. The tree is now mapped into memory using the mmap() system call, resulting in average response times of 60 ms on my sock-drawer data center, where rhymebrain.com is hosted.
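For readers who have not used memory mapping, the sketch below shows the general technique with Python's standard `mmap` module. The file layout (a flat array of little-endian 32-bit integers) and the names `open_tree` and `read_u32` are made up for illustration. The post does not describe the real tree format or the language the site is written in.

```python
import mmap
import struct


def open_tree(path):
    """Map a file read-only into memory.

    Nothing is read up front: the operating system pages data in lazily,
    and pages already in the page cache are shared between processes.
    """
    with open(path, "rb") as f:
        # Length 0 maps the whole file. The mapping remains valid after
        # the file object is closed.
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


def read_u32(mapped, index):
    """Read the index-th little-endian uint32 from the mapping.

    Hypothetical record layout, chosen only to keep the example small.
    """
    return struct.unpack_from("<I", mapped, index * 4)[0]
```

The point of this design is that a query touches only the handful of pages it actually walks through, rather than paying to read and parse the whole 90 MB file each time.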


A minor tweak was adding “Consider using these near-rhymes or slant-rhymes” to the result pages. This is hip-hop jargon, as I learned from the B-Rhymes blog: http://www.b-rhymes.com/2010/01/slant-rhymes-or-near-rhymes/
