The text-fill problem

Some programmimg notes on filling a crossword grid with words from a dictionary

 

When I first started on this, I didn't realise that this is a familiar problem which has taxed the brains of many before me. My first efforts were therefore uninformed, but people might be interested how somebody with no prior knowledge approached the problem. The ideal algorithm, in my view, would have the following properties:

Here's a brief history of how different versions of XWord Studio tackled the problem.

Version One

Work out all the word-slots or 'spaces' - store in an array of Space data structures. Find longest, and fill with a random word from the dictionary. Find next longest - intersecting if possible, and try to fill. If you can just continue and fill them all! If you can't fill a space, go through all its intersecting spaces to find the most likely cause of a mismatch. This is based on a table of letter frequencies (if you've got a zed that's a likely culprit) but also try and leave longer words in situ. Erase this word and try again. This worked fine if a) you had a huge word-list and b) you didn't have too many 'white' areas, in which case it would get into a rut - a common problem.

Version Two

When I rewrote the program I thought a lot harder about the problem. I kept the one above calling it 'Random autofill' but did some testing. To my surprise, I got just as good results if I ignored several of the constraints, eg finding _intersecting_ spaces, using the frequency table and bias towards shorter words. I also wanted a more methodical approach to cater for grids with lots of white space (eg empty grids), hence the 'Logical' fill. This establishes a chain of spaces, inserts a provisional word, and only advances to the next if all this word's intersecting spaces could be filled. Otherwise it tries the next available word. If we run out of available words, backtrack. But although this would always find a solution were one possible, it could take an inconceivable length of time to find it - it just wasn't smart enough.

It was about this time I came across the 'Programmers' Challenge', a competition run by MacTech magazine, noting that one month this problem was the challenge. A description, and code for the winning entry is here (scroll down to 'previous winner'). This was a tremendously impressive piece of efficient C code. I managed to incorporate it into my program and test it, and as far as I can tell it does the job comprehensively. Unfortunately, for certain grids it was slower than my method, and for others it required very large amounts of RAM (tens of Mbs). I think this is because of the way he maintains many long lists of available words and recursive functions. Also at this time I discovered an entire thesis devoted to the subject. This was very interesting - I printed the whole thing out, but the method the author came up with wasn't actually very effective. So for Version 2.1 I went back to the logical fill but made it much smarter. Below there's a simplified version of the main loop.

Version 2.5

Although the Version 2 fill was pretty quick with grids containing not too much white space (as opposed to 'American' style grids) it still bothered me that the dictionary was enormous and threw up a lot of strange words. I abandoned the idea of working out the order of fill in advance, and instead tried to fill the most constrained space each time, ie the one with fewest matches. This worked far better, and because the order of fill was always varying, it was less likely to get into a rut. Also, I preloaded the lists I was going to need and made an array of pointers to them. The searching could still be much faster - next time I'm going to look into somehow indexing the dictionary so that you can immediately jump to the next match.

Back To Site

Here's a summary of the latest fill:

Version 2.5 fill

DoFill()
{
 
....GatherWords(); // get all the wordlists we need and randomise them
 
....thisSlot = firstSlot = GetMostConstrainedSlot(); // get slot with fewest matches
 
....while(thisSlot)
....{				
		
........wordPtr = MatchWord (thisSlot->word); // try to find a match
........if(wordPtr == 0) // no match
........{
............if(thisSlot == firstSlot)
................return kUnable;
............else
............{
................// Go back through the slots in order they were filled until you reach one that 
................// intersects and so affects the current slot. Blank that one and all in between
................BackTrack(&thisSlot);
 
................if(thisSlot == firstSlot)
....................thisSlot = StartAgain(); // reset everything, re-randomise words
............}
........}
........else // provisional match found...
........{
............FillInWord(thisSlot, wordPtr); // ...store so neighbours know about new letters...
			
............// Recalculate the number of possible matches of all slots affected by provisional 
............// new word. Return false if any would now have zero matches
............wordOK = UpdateNeighbourConstraints(thisSlot);
 
............if(wordOK)  // still OK, so draw it in the grid
............{
................DrawSlot(thisSlot);
................thisSlot = GetMostConstrainedSlot();
............}
............else
............{
................mWords->BlankSlot(thisSlot); 
............}
........}
....}
.... //no slots left to fill
....return kSucceeded;

}

(Ignore the full stops - my HTML editor doesn't like tabs!)


There's another site with code on this topic here from which I picked up a few tips, but didn't try maintaing a wordlist of possible matches for each space, because I can't believe that updating these wouldn't be too slow and memory-hungry.

Do please contribute any comments, links etc and I'll add them to this page.

Back To Site


Version 2.1 fill

short
FillGrid()
{ 
. Space* space = nil;
 
. firstSpace = MakeSpaceChain();
. space = firstSpace;
 
. while(space)
. { 
. . matchResult = FindDictionaryMatch(space); 
. . if(matchResult == kNoMatchFound)
. . {
. . . if(space == mFirstSpace)
. . . . return kUnable; 
. . . else
. . . . GetEarlierSectSpace(&space); 
. . }
. . else if (matchResult == kMatchFound)
. . { 
. . . DrawSpace(space);
. . . badLetterNum = CheckFutureMatches(space);
. . . if(badLetterNum == 0)
. . . {
. . . . do {
. . . . . space = space->next;
. . . . } while(space->state == alreadyFilled);
 
. . . . if(!space)
. . . . . break;
. . . . UpdateSpace(space); 
. . . }
. . . else
. . . { 
. . . . char badLetter = space->string[badLetterNum + 1];
. . . . letterMask = 1L << (badLetter - 'A');
. . . . mask[badLetterNum] &= ~letterMask;
. . . . EraseWord (space);
. . . . DrawSpace(space); 
. . . }
. . }
. . else
. . . return kUnable;
. }
. return kSucceeded; 
}

Brief explanation of functions:

MakeSpaceChain() 

Establish search order for working through spaces. The 'Space' struct includes pointers to 'previous' and 'next' spaces. Order is basically longest to shortest, but with bias towards intersecting spaces. The more I work on this, the more important I think this aspect is, and could be improved.

UpdateSpace() 

Checks what letters a space currently contains (might have been crossed by another since we last checked)

FindDictionaryMatch() 

Finds a match from the wordlist based on a 32 bit mask for each letter, in turn based on Space.string. If the bit is set, the letter is allowed. So as well as saying "must have a 'G' here" ie:

00000000 00000000 00000000 01000000
      ZY XWVUTSRQ PONMLKJI HGFEDCBA

can also say "Must NOT have 'X' or 'J' here":

00000011 01111111 11111101 11111111
      ZY XWVUTSRQ PONMLKJI HGFEDCBA
 
GetEarlierSectSpace() 

Works back through chain to find most recent space intersecting current problem space. Also erases intervening spaces if they intersected this found space.

CheckFutureMatches() 

Checks that all intersecting spaces of this provisional word COULD be filled. If not, return position of problem letter and update mask accordingly.

Back To Site