1. Conclusions

If you don’t want to stick around for the whole story:

Result #1

/#(\h\h\h\h\h\h)/ is about 10% faster than /#(\h{6})/

Result #2

The performance difference between /&#x(\h{2,5});/ and unrolling the first two matches as /&#x(\h\h\h{0,3});/ is maybe 5% in favor of unrolling.

Result #2a

[0-9a-fA-F] is about the same as matching \h.

2. Story time

2.1. Act 1

In my work flow for creating class time notes and handouts in AsciiDoc (using Asciidoctor), I occasionally wish to render a document to PDF to print as a handout. In that case, I then go through and add page breaks and tweak the layout some to fit the medium and intended flow of the activity.

The way this rendering has been done is to output the HTML, then render that page to PDF either by printing to the PDF driver or with a PhantomJS-based module that works in my make-based flow. This convoluted flow exists to handle floating images, which reduce unnecessary vertical space on the printed page.

This time, I checked in on the asciidoctor-pdf extension to see how it has improved for this use case. I immediately ran into the handling of special characters such as Ω and μ, used as electrical unit labels for Ohms and the SI prefix micro. These are handled in a set of attribute substitutions, include::subs.adoc[]ed at the top of most documents, which also insert a non-breaking space between the number and the unit. In a way, this ports a core feature of the siunitx LaTeX package to AsciiDoc sources.

Example 1. Example attribute usage for SI units

:kOhm: {nbsp}&Omega;

So then 4.7 Ω is nicely written as 4.7{kOhm}, which renders as 4.7 Ω

This strategy has worked well for the adoc→HTML and adoc→HTML→PDF flows, but causes the asciidoctor-pdf backend to choke because named entity references are not supported.

… reading the documentation …

Hit on Issue 969: HTML entities like &nbsp; inside block causing 'Failed to parse formatted text', then Issue 486: Add support for hexadecimal character references, to figure out that both decimal and hexadecimal character references are supported, with the appropriate font choice. Named references other than &nbsp; are right out, for good reasons.

Ok, no problem, just update my subs.adoc to use numeric references. But I still like the named references because they are easy to remember or guess, so I add a layer of substitution to make the file easier to read:

Updated substitutions
:Omega: &#x3A9;

:kOhm: {nbsp}{Omega}

Asciidoctor attributes are case-insensitive, so Omega and omega are treated the same. That is a potential problem, since these two Greek letters not only look different (Ω vs. ω) but also have different meanings in my electronics world. Note to self: we’ll deal with that distinction later if needed.

2.2. Act 2

… and the PDF backend still chokes :(

Try the decimal version (though the documentation indicates there shouldn’t be a difference).

It works! Hmmm

Experiment and try &#x003a9;, with a lowercase a, which also works fine.

Not accepting uppercase hex characters A-F throws a kink into my copy-paste trick, but a comment in the subs.adoc file would handle that well enough.

2.3. Act 3

Fortunately, I have the content of the handout sorted out for the next class time and therefore may be able to justify some time to drill down to why this happens.

Clone and ripgrep the source for the error message and find that the CharRefRx regex only matches [0-9a-f] instead of [0-9A-Fa-f], [[:xdigit:]], or \h.
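To see the bug concretely, here is a minimal sketch of my own (simplified stand-ins, not the actual CharRefRx pattern) showing how a lowercase-only class rejects uppercase hex digits while \h accepts them:

```ruby
# Simplified stand-ins for illustration; not the actual CharRefRx pattern.
lower_only = /&#x([0-9a-f]{1,5});/  # lowercase-only class, like the bug
full_hex   = /&#x(\h{1,5});/        # \h is equivalent to [0-9A-Fa-f]

p '&#x3a9;'.match?(lower_only)  # => true
p '&#x3A9;'.match?(lower_only)  # => false (uppercase A rejected)
p '&#x3A9;'.match?(full_hex)    # => true
```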

Can’t be too hard to fix up, despite not really knowing Ruby, can it?


The Contributing Code document looks easy to follow and has reasonable prerequisites, so also start working on the problem itself.

Nearly an hour of this was spent figuring out why the changes I was making to the source weren’t getting picked up when running the converter on a test file, much of which was learning how bundle exec works. Ultimately the core issue was that I wasn’t also modifying the parser, after which things made more sense!

Figure out the testing, add a failing test, make a commit that gets the test passing again, and end up with PR #1991.

Dan Allen (@mojavelinux), the project lead, responds quickly with nice and helpful comments about the Pull Request (thanks, fellow Dan!). One of them was that the regex \d\d\d{0,4} is faster than \d{2,6}, hence please keep the existing style. Ok, no problem; it’s usually my contribution that should morph to match the rest of the project anyway.
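To convince myself the two spellings differ only in speed, not in what they match, a quick check of my own (not part of the PR):

```ruby
# Both patterns should accept exactly 2 to 6 digits and nothing else.
counted  = /\A\d{2,6}\z/
unrolled = /\A\d\d\d{0,4}\z/

(0..8).each do |n|
  s = '7' * n
  raise 'patterns disagree' unless s.match?(counted) == s.match?(unrolled)
end
puts 'patterns agree for lengths 0..8'
```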

Get to the task of addressing his comments and trying to button-up the PR.

2.4. Act 4

[background music rises]

But, I’m also currently teaching “Embedded Microcontrollers 2” where the present topics are orbiting around assembly language, system (CPU) architecture, and performance. In fact, tomorrow’s activity is counting cycles.

and minutes later, some students stopped by for some capstone project advice, where I encouraged them to measure some things to help support and inform their decisions.

and I’m curious, thinking that:

  • Un-rolling a loop is a well-known speedup.

  • \d\d\d{0,4} "unrolls" two matches before getting to the actually-variable number of matches.

  • Other parts in the code use #\h{6} to match an HTML/CSS hexadecimal color (a count with only one value).

I wonder if expanding to #\h\h\h\h\h\h would also be faster, since I’m already addressing related comments in the PR?

immediately followed by: How do I benchmark a Ruby regex?

… and down another rabbit trail we go ;)
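My first crude attempt, using only the standard library's Benchmark module (a throwaway sketch of my own, not the setup I ended up with):

```ruby
require 'benchmark'

# Time many scans of the same text with each pattern
text = 'Colors #012345 and #abCDef and #000000 ' * 1_000

t_counted  = Benchmark.realtime { 20.times { text.scan(/#(\h{6})/) } }
t_unrolled = Benchmark.realtime { 20.times { text.scan(/#(\h\h\h\h\h\h)/) } }

printf "counted: %.4fs  unrolled: %.4fs\n", t_counted, t_unrolled
```

Single wall-clock timings like this are noisy, so a proper harness with statistical reporting is worth seeking out.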


Arrive at Maciek Rząsa’s Performance of Regular Expressions article, which:

  • Is in Ruby.

  • Has enough context for me to reproduce the environment on my machine and run the benchmark.

After some more curiosity-fueled experimentation, end up with the following setup:

Gemfile
source 'https://rubygems.org'

gem 'asciidoctor'
gem 'benchmark-ips'
gem 'rouge'
regex-count.rb
require "benchmark/ips"

# Sample text mixing hex character references and CSS-style colors,
# repeated to give the patterns something substantial to chew on.
PART = <<TEXT
Entities &#xaB158; and &#xF090; &#xF04; &#x27; !
Colors #012345 and #abCDef and #000000
TEXT
TEXT = PART * 100

# Variable-length references: fully counted, then progressively unrolled
COUNTED2 = /&#x(\h{2,5});/
COUNTED1 = /&#x(\h\h{1,4});/
COUNTED0 = /&#x(\h\h\h{0,3});/
COUNTED0a = /&#x([0-9a-fA-F][0-9a-fA-F][0-9a-fA-F]{0,3});/

# Fixed-length colors: counted vs. unrolled
SIXCOUNT = /#(\h{6})/
SIXUNROLL = /#(\h\h\h\h\h\h)/


def count(string, regex)
  string.scan(regex).count
end

# Require whitespace on both sides of each match
def compile_regex(regex)
  /(?<=\s)(#{regex})(?=\s)/
end

COUNTED2_WHOLE = compile_regex(COUNTED2)
COUNTED1_WHOLE = compile_regex(COUNTED1)
COUNTED0_WHOLE = compile_regex(COUNTED0)
COUNTED0a_WHOLE = compile_regex(COUNTED0a)

SIXCOUNT_WHOLE = compile_regex(SIXCOUNT)
SIXUNROLL_WHOLE = compile_regex(SIXUNROLL)

def measure(string, x)
  x.report('counted2') { count(string, COUNTED2_WHOLE) }
  x.report('counted1') { count(string, COUNTED1_WHOLE) }
  x.report('counted0') { count(string, COUNTED0_WHOLE) }
  x.report('counted0a') { count(string, COUNTED0a_WHOLE) }

  x.report('sixcount') { count(string, SIXCOUNT_WHOLE) }
  x.report('sixunroll') { count(string, SIXUNROLL_WHOLE) }

  x.compare!
end

def benchmark(string)
  Benchmark.ips do |x|
    measure(string, x)
  end
end

benchmark(TEXT)


There are two independent “research questions” in these measurements:

Question #1

six*: Is unrolling a fixed number of matches faster?

Question #2

counted*: Does unrolling the first matches of a variable count execute faster?

Question #2a

counted*: Are character ranges faster than \h for hex matching?
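Before comparing speeds, a quick sanity check of my own (separate from the benchmark script) that all the counted* variants agree on what they match:

```ruby
# All four patterns should find the same set of references in the sample text
sample = 'Entities &#xaB158; and &#xF090; &#xF04; &#x27; !'
variants = [
  /&#x(\h{2,5});/,                                # counted2
  /&#x(\h\h{1,4});/,                              # counted1
  /&#x(\h\h\h{0,3});/,                            # counted0
  /&#x([0-9a-fA-F][0-9a-fA-F][0-9a-fA-F]{0,3});/  # counted0a
]

counts = variants.map { |rx| sample.scan(rx).count }
raise 'variants disagree' unless counts.uniq.size == 1
```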

Run the following command a few times:
$ (echo; date -Is; bundle exec ruby regex-count.rb) | tee -a log.txt

then clean up the results:

Output from several runs
2022-01-14T22:23:07-06:00
Comparison:
           sixunroll:     9084.6 i/s
            sixcount:     8428.4 i/s - same-ish: difference falls within error
           counted0a:     7326.4 i/s - 1.24x  (± 0.00) slower
            counted0:     7224.1 i/s - 1.26x  (± 0.00) slower
            counted1:     6756.1 i/s - 1.34x  (± 0.00) slower
            counted2:     6714.6 i/s - 1.35x  (± 0.00) slower

2022-01-14T22:23:49-06:00
Comparison:
           sixunroll:     8967.5 i/s
            sixcount:     8109.3 i/s - 1.11x  (± 0.00) slower
           counted0a:     7187.8 i/s - 1.25x  (± 0.00) slower
            counted0:     7151.9 i/s - 1.25x  (± 0.00) slower
            counted1:     6947.7 i/s - 1.29x  (± 0.00) slower
            counted2:     6914.6 i/s - 1.30x  (± 0.00) slower

2022-01-14T22:24:32-06:00
Comparison:
           sixunroll:     8462.3 i/s
            sixcount:     7988.5 i/s - same-ish: difference falls within error
            counted0:     7235.9 i/s - 1.17x  (± 0.00) slower
           counted0a:     7187.3 i/s - 1.18x  (± 0.00) slower
            counted1:     7081.8 i/s - 1.19x  (± 0.00) slower
            counted2:     6963.3 i/s - 1.22x  (± 0.00) slower

2022-01-14T22:25:14-06:00
Comparison:
           sixunroll:     9179.8 i/s
            sixcount:     8468.7 i/s - same-ish: difference falls within error
           counted0a:     7353.3 i/s - 1.25x  (± 0.00) slower
            counted0:     7265.4 i/s - 1.26x  (± 0.00) slower
            counted1:     7153.0 i/s - 1.28x  (± 0.00) slower
            counted2:     6972.5 i/s - 1.32x  (± 0.00) slower

Considering these benchmarks, I come to the following conclusions:

Result #1

Repeating the character class explicitly instead of using {N} is slightly faster, about 10% in these tests.

Result #2

Explicitly unrolling the non-variable part of a match length may be about 5% faster.

Result #2a

No clear difference between [0-9a-fA-F] and \h.

2.5. Act 5

Having satisfied my curiosity, and with an idea of how much extra optimization to put into the original Pull Request, I write this up and table the work for a bit to switch back to some $DAYJOB tasks before the day ends.