Improving The Performance of Spell Checking

Building a comprehensive, on-demand spell checker for investment research professionals

“If your determination is fixed, I do not counsel you to despair. Few things are impossible to diligence and skill. Great works are performed not by strength, but perseverance.”

Dr. Samuel Johnson (1709–1784)

A Little History

On 15th April 1755, Dr. Johnson published “A Dictionary of the English Language”. It took him seven years to finish and he completed it single-handedly. It is one of the most important books in the history of the English language and, until the completion of the Oxford English Dictionary 173 years later, Johnson’s dictionary was viewed as the pre-eminent English dictionary.

Scholars all over the world were compelled to reach for this weighty, multi-volume, leather-bound tome whenever they needed to know the meaning, pronunciation and correct spelling of a word. Eventually, in 1957, the first research was done into computer-based spell checkers. These first iterations were “verifiers” rather than “correctors”: they offered no suggestions for incorrectly spelled words.

The first non-research based spell checker, called Spell for the DEC PDP-10, was created by Ralph Gorin in 1971 at Stanford University.

This was a standalone program that was able to offer suggestions. Beginning in the mid-1980s, word processing applications such as Word, WordPerfect and WordStar began to incorporate such spell checkers.

Nowadays, spell checkers have become a lot more sophisticated. They’re now part of the operating system so, anywhere a user is able to enter text, be it a text field, a browser window, etc., spell checking is performed and suggestions are provided instantly.

The Bipsync Spell Checker

We’ve built a comprehensive, on-demand spell checker for Bipsync. Why did we do this when we get spell checking in the operating system for free? Well, the main problem was inconsistency.

Each operating system implements spell checking differently, as does each browser. We wanted a spell checker that works consistently across all devices.

Additionally, the built-in browser spell checkers have no knowledge of our universe of terms for investment management professionals, so a ticker name, for example, could be flagged as incorrect. Entering the AAPL, PYPL or ADBE tickers into some browsers would mark them as misspelled, and we didn’t want this.

The Bipsync spell checker, at its core, uses the powerful Hunspell engine. The user interface is intuitive and very simple to use.

When you create or edit a note, the note content is run through the spell checker and any words it thinks are incorrect are highlighted with a red underline. Right-clicking on a highlighted word then brings up a list of suggestions:

Bipsync’s spell checker in action.

The big difference between the Bipsync spell checker and most browser-based spell checkers is that the work is done on the server rather than on the client. As the user types a note we send its text to our servers via an AJAX call. When it arrives at a server, Hunspell’s dictionaries are unpacked and combined with the user’s custom dictionary to create a “master” dictionary. The text is checked against this in the Hunspell engine. Any suggestions are then returned to the browser and rendered in a context-sensitive menu when the user right-clicks an underlined word.

As a first incarnation, this was a simple and elegant solution.
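To make the server-side flow described above concrete, here is a minimal sketch in Python. It is not Bipsync's code and it does not use Hunspell: the function names `build_master_dictionary` and `check_text` are hypothetical, the dictionaries are plain word lists, and suggestions come from Python's `difflib` as a stand-in for Hunspell's suggestion engine.

```python
import re
from difflib import get_close_matches

# Very rough tokenizer: runs of letters and apostrophes count as words.
WORD_RE = re.compile(r"[A-Za-z']+")


def build_master_dictionary(base_words, custom_words):
    """Combine the shared dictionary with one user's custom words."""
    return {w.lower() for w in base_words} | {w.lower() for w in custom_words}


def check_text(text, dictionary, max_suggestions=3):
    """Return {unknown_word: [suggestions]} for words missing from the dictionary."""
    candidates = sorted(dictionary)
    results = {}
    for word in WORD_RE.findall(text):
        if word.lower() in dictionary or word in results:
            continue
        # difflib is only a stand-in here; Hunspell uses its own algorithm.
        results[word] = get_close_matches(
            word.lower(), candidates, n=max_suggestions, cutoff=0.7
        )
    return results
```

Calling `check_text("Buy sharse of AAPL", master)` with a master dictionary that contains the user's ticker `AAPL` would flag only `sharse`. Rebuilding that master dictionary on every request is the expensive part that the rest of this post is about.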

Think about this though: what if we had thousands of users editing thousands of notes, with each of these notes being run through the spell checker at the same time, causing dictionaries to be unpacked and suggestions found? How would Bipsync cope with such high demand on its spell checker service?

To be perfectly honest, it wasn’t coping. We were seeing a response time of about 6-8 seconds from the user entering a word to the suggestions being sent back, which was unacceptable. We needed a more performant solution that could cope with our growing number of clients.

Our Solution

We met as a team to discuss how we could improve the architecture. Next we prototyped a few solutions using various technologies, from Vert.x-based micro-services using the HunspellJNA library, to custom-built, stand-alone PHP TCP servers and background jobs.

Eventually we considered the question: “What do we need to do to get this to market as quickly as possible?”

Writing a micro-service would take time and it would introduce more technology into our already complex infrastructure. The quickest solution would leverage the tech we already have, so we went with the PHP TCP Server approach.

Our spell check service (which was a simple PHP script we ran on demand) was refactored into a TCP Server which is always running. This is monitored by Supervisord, which we already use to manage our background processes for email imports and data exports, among other things.

The TCP server receives words to check, starts up a Hunspell process and caches it in memory. If more words then come in from the same client, the Hunspell process is already waiting with its stdin and stdout pipes open. Not having to start the process up every time makes spell checking much quicker.

The Bipsync spell checker architecture.
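The process caching idea can be sketched as follows. This is an illustration in Python rather than the PHP we run, and the class name `ProcessCache` and the per-client keying are assumptions. The child process is a stand-in that upper-cases each line it receives. Hunspell's real pipe output format is not modelled here.

```python
import subprocess
import sys

# Stand-in child process (NOT Hunspell): echoes each input line upper-cased.
STAND_IN_COMMAND = [
    sys.executable, "-c",
    "import sys\nfor line in sys.stdin:\n    print(line.strip().upper(), flush=True)",
]


class ProcessCache:
    """Keep one long-running child process per client, with its pipes open."""

    def __init__(self, command):
        self.command = command
        self._processes = {}

    def get(self, client_id):
        proc = self._processes.get(client_id)
        # Start a child only if none is cached or the cached one has exited.
        if proc is None or proc.poll() is not None:
            proc = subprocess.Popen(
                self.command,
                stdin=subprocess.PIPE,
                stdout=subprocess.PIPE,
                text=True,
                bufsize=1,  # line-buffered pipes
            )
            self._processes[client_id] = proc
        return proc

    def ask(self, client_id, line):
        """Send one line to the cached child and read one line back."""
        proc = self.get(client_id)
        proc.stdin.write(line + "\n")
        proc.stdin.flush()
        return proc.stdout.readline().rstrip("\n")

    def close(self):
        """Shut down every cached child by closing its stdin."""
        for proc in self._processes.values():
            proc.stdin.close()
            proc.wait()
            proc.stdout.close()
        self._processes.clear()
```

The second and later calls to `ask` for the same client reuse the child that is already running, so the start-up cost is paid once per client rather than once per request.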

A large part of the delay in the original solution was unpacking the dictionary and adding more words to it. We removed this delay by creating a small PHP script that is executed by Cron every hour. This Cron job unpacks the dictionaries and adds words from the Bipsync universe (tickers, labels and such) as well as any user-defined words. We then give the TCP Server a “nudge”, which causes it to reload Hunspell’s dictionaries and to restart and re-cache its Hunspell processes.

Reloading Hunspell’s dictionaries.
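As a rough illustration of the dictionary-building step, the sketch below writes a word list in the Hunspell `.dic` layout, where the first line holds the word count and each following line holds one entry. It is written in Python rather than PHP, the function name `write_dic_file` is hypothetical, and the write-then-rename step is an assumption about how one might avoid a running checker seeing a half-written file. The “nudge” itself is left out because its mechanism is not described here.

```python
import os
import tempfile


def write_dic_file(path, words):
    """Write de-duplicated words to `path` in .dic layout; return the count."""
    unique = sorted({w.strip() for w in words if w.strip()})
    directory = os.path.dirname(os.path.abspath(path))

    # Write to a temporary file in the same directory first...
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(f"{len(unique)}\n")  # first line: number of entries
        for word in unique:
            handle.write(word + "\n")

    # ...then swap it into place in a single rename.
    os.replace(tmp_path, path)
    return len(unique)
```

An hourly job could call this with the base words, the ticker universe and the user-defined words concatenated into one list, then signal the server to pick up the new file.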

More Improvements

We also optimized the browser code for further performance gains. The original spell checker was taking all of the note content and running it through the spell checker regardless of what part of the note was visible in the browser. We decided to only consider the note content that was visible on screen. As the user scrolls through content and stops, spell checking is performed on the visible section of the note; fewer words to check means a quicker response.
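The visible-content filter amounts to an overlap test between each block of the note and the viewport. The sketch below shows that test in Python purely as an illustration. The real logic lives in browser code, and the `Block` type, its pixel fields and the function name `visible_text` are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One paragraph of a note, positioned in pixels from the top of the note."""
    top: int
    height: int
    text: str


def visible_text(blocks, scroll_top, viewport_height):
    """Return the text of every block that overlaps the visible viewport."""
    viewport_bottom = scroll_top + viewport_height
    return [
        block.text
        for block in blocks
        # Overlap: the block starts above the viewport's bottom edge
        # and ends below its top edge.
        if block.top < viewport_bottom and block.top + block.height > scroll_top
    ]
```

Only the returned text would be sent for checking once scrolling has stopped, which keeps the request small even for very long notes.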

Clicking on the note list in the left-hand navigation and displaying a note in the right-hand panel would also trigger spell checking, which was unnecessary. We changed that so spell checking begins only when a user starts editing a note.

Has The Perseverance Paid Off?

We did some extensive performance testing pre-release and we got consistent response times of around 200 milliseconds. We were a lot happier with that so we released this new version of the spell checker during the summer of 2017.

As Dr. Johnson said, “Great works are performed not by strength, but perseverance.”

Being determined to support thousands of simultaneous users, we persevered through several iterations until we were happy with the spell checker’s speed, stability and scalability.