Pre – Prologue
Months ago, there was a twitter account conducting a contest where a scrambled word was tweeted; challenge was to be the first to reply with the correct answer. A new word was posted every hour and to score more points, one had to be watching twitter all the time. It seemed a perfect task for automation(sigh!). It didn’t take much time in hacking code to poll, read and reply to the tweet but I needed a word unscrambler. Being a ‘Hacker’, I didn’t want to write code for that. There were websites that did it but lacked APIs. I plugged in SimpleHTMLdom which provides jQuery like syntax for ‘scraping’ data from websites and my hack was ready.
Few days back during lunch time chat, I asked a friend if he could build an app that could return the proper sentence when given a scrambled sentence as input. This reminded me of the ‘twitter reply’ hack I had written months ago and thought I could write code for ‘unscrambing’ words. Just then, I was re-discovering Google App Engine with python so chose it for solving(showing off?!) this.
Straight forward solution to this problem was finding all permutations of the characters in the scrambled word and then doing a dictionary match. There are lots of links on the web that provide word lists. I used one such word list for the dictionary. itertools.permutations and a dictionary-search loop later, I had the required results. This worked fine for short words but as the length increased, it took more and more time.
This solution wasn’t usable; waiting for seconds(and minutes) just to find all the possible permutations is simply stupid(more so if it’s a web application).
Next challenge was to bring down the time taken; somehow convert this exponential(?) problem to linear. Solution to this was simple. For a given word and all other words formed by scrambling the characters in it, the count(number of repetitions) of each character in the word remains the same.
For example: In both god and dog, the characters g,o,d are repeated once.
god => a=0; b=0; c=0; d=1; …. g=1;…. o:1;….
dog => a=0; b=0; c=0; d=1; …. g=1;…. o:1;….
All I had to do was create a data structure that held the character counts (of characters) for each word in the word list. This data structure went into a database and finding unscrambled words became as simple as firing a single query with a where clause to match the character counts. Thus time consuming permutation-generation part was ousted but at the price of storing this new data structure. Time complexity reduced at the cost of Space complexity.
word count integer representation
Fròm then on, it was straight forward coding, uploading and (long, tiresome, boring) waiting for the cron jobs to run for dataset generation+ insertion in GAE datastore. The app is called wordunscrambler.
As I was storing the character counts as integers in the datastore, it occupied lot of space. Though this wasn’t really a serious issue, I wanted to reduce this space complexity. Solution to this was based on an (very apt)assumption that the character count in a word would not be greater than 9 and it could be represented by a single decimal number(Could have also chosen hexadecimal but I love decimals). The character counts were now represented as a string; the first character of this string represented character count of character ‘a’, second character corresponded to ‘b’ and so on.
word count string representation(second character represents ‘a’ here)
26 integers(which are stored as long in GAE datastore) representing a word’s character-count were now replaced by a 26 character string(unicode). Thus, Space complexity reduced. The site as of now runs this code.
By the way, integer representation of character count has an advantage over string representation in that if we need to find all possible words(taking ‘any’ number of characters at a time) that can be formed fròm given characters, the query could use a ‘lesser than’ in the where clause(instead of character code matching).
PS1: As this was more of a let-me-find-a-solution exercise, google’s help was not taken.
PS2: You could also call this ‘Anagram finder’.