2019-05-25 15:30

Title Casing Is Harder Than I Thought

You've probably noticed that many article titles use a stylistic convention called "title casing". Recently I wanted to add a titlecase method to the Msgtrail static blog engine. I quickly realized that title casing is harder than I thought!

Let's begin with a few examples to demonstrate some edge cases:

  • In: Small word at end is nothing to be afraid of
  • Out: Small Word at End Is Nothing to Be Afraid Of

Notice that:

  • Small words like at and to are not title-cased;
  • Is is title-cased;
  • Of is title-cased, but only because it's the last word.
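
The small-word rule above can be sketched in a few lines of Ruby. This is only a minimal illustration, not the implementation discussed in this post: the SMALL_WORDS list and the simple_titlecase name are assumptions made up for the example, and the real rule set covers many more cases.

```ruby
# Hypothetical, deliberately short list of small words (an assumption for
# this sketch; a complete rule set would be longer).
SMALL_WORDS = %w[a an and as at but by for in of on or the to].freeze

# Capitalize every word except small words, but always capitalize the
# first and the last word of the sentence.
def simple_titlecase(sentence)
  words = sentence.split(' ')
  last = words.length - 1
  words.each_with_index.map do |word, i|
    if SMALL_WORDS.include?(word.downcase) && i != 0 && i != last
      word.downcase
    else
      word[0].upcase + word[1..-1]
    end
  end.join(' ')
end

puts simple_titlecase('Small word at end is nothing to be afraid of')
```

Note that "is" is capitalized simply because it is not in the small-word list, while "of" is in the list and is capitalized only by the last-word check.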

Or take this example:

  • In: Never touch paths like /var/run before/after /boot
  • Out: Never Touch Paths Like /var/run Before/After /boot

Notice that:

  • /var/run and /boot remain untouched;
  • before/after becomes Before/After.
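
One way to get this behavior, sketched under the assumption that a path is any token starting with a slash (the helper names capitalize_token and titlecase_with_paths are hypothetical). The sketch ignores the small-word rule to keep the focus on slashes.

```ruby
# Capitalize a single whitespace-separated token.
# Tokens that start with "/" are treated as paths and returned unchanged;
# other tokens are split on "/" and each part gets an upper-case first letter.
def capitalize_token(token)
  return token if token.start_with?('/')

  # The -1 limit keeps trailing empty parts, so "a/" stays "A/".
  token.split('/', -1).map { |part| part.sub(/\A[a-z]/) { |c| c.upcase } }.join('/')
end

# Apply capitalize_token to every token of a sentence.
def titlecase_with_paths(sentence)
  sentence.split(' ').map { |token| capitalize_token(token) }.join(' ')
end

puts titlecase_with_paths('Never touch paths like /var/run before/after /boot')
```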

See here for additional test cases.

While researching the topic I ran into an article by John Gruber about title casing. His article points to a Perl script which he uses to title-case the articles of his (magnificent) blog. The article also points to implementations in other languages, including Ruby.

I looked at several implementations in order to understand the rule set:

  • Gruber's script is clever, but hard to read for a non-Perl coder like me. It's basically a set of regular expressions.
  • Aristotle Pagaltzis refactored it to make it more readable.
  • Sam Souder created a Ruby version in the form of a gem called "titlecase". Sam's version is succinct (about 30 lines of code) but fails 13 of Gruber's test cases (which is fine; the gem does not aim to support all edge cases).
  • Grant Hollingworth created another version in Ruby in the form of a gem called "titleize". It has about 50 lines of code and fails 3 of Gruber's test cases (again, this is fine).

I ended up writing my own implementation which has about 40 lines of code and passes all tests.

My implementation is a fraction faster than "titlecase" and about 3x faster than "titleize". Benchmarking a run on 10,000 English sentences yields:

  • My implementation: 0.8611196667 seconds (average over 3 runs).
  • Titlecase gem: 0.9050176667 seconds (average over 3 runs).
  • Titleize gem: 2.7170636667 seconds (average over 3 runs).
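
A timing like the ones above can be collected with Ruby's standard Benchmark module. The sketch below is an assumption about how such a measurement could be set up, not the script used for these numbers: average_runtime is a hypothetical helper, and String#capitalize merely stands in for the title-casing call under test.

```ruby
require 'benchmark'

# Average wall-clock time (in seconds) of the given block over `runs` runs.
def average_runtime(runs = 3)
  total = 0.0
  runs.times { total += Benchmark.realtime { yield } }
  total / runs
end

# Placeholder workload: 10,000 copies of one sentence. Swap in real
# sentences and the method being measured.
sentences = Array.new(10_000) { 'small word at end is nothing to be afraid of' }
seconds = average_runtime(3) { sentences.each { |s| s.capitalize } }
puts format('%.4f seconds (average over 3 runs)', seconds)
```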

It was fun to write this code because it was a challenge to make it fast, readable, and pass all test cases. I have since released a Ruby gem based on this implementation: https://github.com/evaneykelen/nicetitle.
