1. Conclusions
If you don’t want to stick around for the whole story:
- Result #1: /#(\h\h\h\h\h\h)/ is about 10% faster than /#(\h{6})/.
- Result #2: The performance difference between /&#x(\h{2,5});/ and unrolling the first two matches, /&#x(\h\h\h{0,3});/, is maybe 5% in favor of unrolling.
- Result #2a: [0-9a-fA-F] is about the same as matching \h.
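A quick sanity check of my own (not part of the benchmark script later on) that each unrolled pattern accepts exactly the same strings as its counted counterpart, so the comparison really is only about how the engine executes them:

# My own check, not from the benchmark: counted vs. unrolled patterns
# accept the same strings; only their execution differs.
samples = %w[#012345 #abCDef #00000 &#xA0; &#x3A9; &#x1F600; &#xZZ;]
samples.each do |s|
  counted  = s.match?(/\A(?:#\h{6}|&#x\h{2,5};)\z/)
  unrolled = s.match?(/\A(?:#\h\h\h\h\h\h|&#x\h\h\h{0,3};)\z/)
  puts format('%-11s counted=%-5s unrolled=%s', s, counted, unrolled)
end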
2. Story time
2.1. Act 1
In my work flow for creating class time notes and handouts in AsciiDoc (using Asciidoctor), I occasionally wish to render a document to PDF to print as a handout. In that case, I then go through and add page breaks and tweak the layout some to fit the medium and intended flow of the activity.
The way this rendering has been done was to output HTML, then render that page to PDF, either by printing through a PDF driver or by using a PhantomJS-based module that fits into my make-based flow.
This convoluted flow exists to handle floating images and so reduce unnecessary vertical space on the printed page.
This time, I checked in on the asciidoctor-pdf extension to see how it has improved for this use case.
I immediately ran into the handling of special characters such as Ω and μ, used for the electrical unit ohm and the SI prefix micro.
These were handled by a set of attribute substitutions which are include::subs.adoc[]ed at the top of most documents and which also insert a non-breaking space between the number and the unit.
In a way, this ports a core feature of the siunitx LaTeX package to AsciiDoc sources.
:kOhm: {nbsp}&Omega;
So then 4.7 Ω is nicely written as 4.7{kOhm}, which renders as 4.7 Ω.
This strategy has worked well for the adoc→HTML and adoc→HTML→PDF flows, but it causes the asciidoctor-pdf backend to choke because named entity references are not supported.
… reading the documentation …
Hit on Issue 969: HTML entities like inside block causing 'Failed to parse formatted text'
then to Issue 486: Add support for hexadecimal character references to figure out that both decimal and hexadecimal character references are supported, with the appropriate font choice.
Named references (other than a small built-in set) are right out, for good reasons.
Ok, no problem, just update my subs.adoc to use numeric references.
But since I still like the named references because they are easy to remember or guess, add a layer of substitution to make the file easier to read:
:Omega: &#x3A9;
:kOhm: {nbsp}{Omega}
Asciidoctor attributes are case insensitive, so Omega and omega are treated the same.
That is a potential problem, since these two Greek letters not only look different (Ω vs. ω) but also have different meanings in my electronics world.
Note to self; we’ll deal with that distinction later if needed.
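To see the layering in action, here is a minimal sketch of my own (the snippet and resistor value are made up; the attribute lines mirror the ones above), using the Asciidoctor Ruby API to render to embedded HTML, the flow that works:

# Minimal sketch (not the real subs.adoc): attribute layering rendered to HTML.
require 'asciidoctor'

snippet = <<~ADOC
  :Omega: &#x3A9;
  :kOhm: {nbsp}{Omega}

  The pull-up resistor is 4.7{kOhm}.
ADOC

# Embedded conversion with the HTML backend, which handles the references fine.
puts Asciidoctor.convert(snippet, safe: :safe)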
2.2. Act 2
… and the PDF backend still chokes :(
- The entity was copy-pasted directly from my favorite reference, https://dev.w3.org/html5/html-author/charref
- No typos
- → Whuzzup?
Try the decimal version, &#937; (though the documentation indicates there shouldn’t be a difference).
It works! Hmmm.
Experiment and try &#x3a9;, with a lowercase a, which also works fine.
Not accepting the uppercase hex characters A-F throws a kink into my copy-paste trick, but a comment in the subs.adoc file will handle that well enough.
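A pattern that only accepts lowercase hex digits behaves exactly like this; a hypothetical illustration of my own (not the actual asciidoctor-pdf code):

# Hypothetical lowercase-only pattern, for illustration only.
LOWERCASE_ONLY = /&#x([0-9a-f]+);/

p '&#x3a9;'.match?(LOWERCASE_ONLY)  # => true
p '&#x3A9;'.match?(LOWERCASE_ONLY)  # => false -- the 'Failed to parse formatted text' case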
2.3. Act 3
Fortunately, I have the content of the handout sorted out for the next class time and therefore may be able to justify some time to drill down to why this happens.
Clone the repository and ripgrep the source for the error message, finding that the CharRefRx regex only matches [0-9a-f] instead of [0-9A-Fa-f], [[:xdigit:]], or \h.
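As a quick check of my own (not from the asciidoctor-pdf sources), the candidate spellings agree on plain ASCII input:

# \h, [[:xdigit:]], and [0-9A-Fa-f] agree for ASCII characters.
%w[0 9 a f A F g].each do |ch|
  puts format('%s  \h=%-5s  [[:xdigit:]]=%-5s  [0-9A-Fa-f]=%s',
              ch, ch.match?(/\h/), ch.match?(/[[:xdigit:]]/), ch.match?(/[0-9A-Fa-f]/))
end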
Can’t be too hard to fix up, despite not really knowing Ruby, can it?
Start filing asciidoctor-pdf Issue #1990.
The Contributing Code document looks easy to follow and has reasonable prerequisites, so also start working on the problem itself.
Nearly an hour of this was spent figuring out why the changes I was making to the source weren’t getting picked up when running the converter on a test file.
Much of that was figuring out how bundle exec works.
Ultimately the core issue was that I was not also modifying the parser, after which things made more sense!
Figure out the testing, add a failing test, then a commit that makes the test pass again, and end up with PR #1991.
Dan Allen (@mojavelinux), the project lead, responds back quickly with nice and helpful comments about the Pull Request (thanks fellow Dan!).
One of which was that the regex \d\d\d{0,4} is faster than \d{2,6}, hence please keep the existing style.
Ok, no problem; it’s usually my contribution that should morph to match the rest of the project anyway.
Get to the task of addressing his comments and trying to button-up the PR.
2.4. Act 4
[background music rises]
But, I’m also currently teaching “Embedded Microcontrollers 2” where the present topics are orbiting around assembly language, system (CPU) architecture, and performance. In fact, tomorrow’s activity is counting cycles.
and minutes later, some students stopped by with some capstone project advice, wherein I encouraged them to measure some things to help support and inform their decisions.
and I’m curious, thinking that:
- Un-rolling a loop is a well-known speedup.
- \d\d\d{0,4} "unrolls" two matches before getting to the actually-variable number of matches.
- Other parts of the code use #\h{6} to match an HTML/CSS hexadecimal color (a count with only one value).
I wonder if expanding to #\h\h\h\h\h\h would also be faster, since I’m already addressing related comments in the PR?
immediately followed by: How do I benchmark a Ruby regex?
… and down another rabbit trail we go ;)
Arrive at Maciek Rząsa’s Performance of Regular Expressions article, which:
- Is in Ruby.
- Has enough context for me to reproduce the environment on my machine and run the benchmark.
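The basic shape of a benchmark-ips run turns out to be small; a throwaway sketch of my own (labels and patterns made up, not the final script):

# Throwaway benchmark-ips sketch, not the final regex-count.rb.
require 'benchmark/ips'

Benchmark.ips do |x|
  x.report('counted')  { '#abc123' =~ /#\h{6}/ }
  x.report('unrolled') { '#abc123' =~ /#\h\h\h\h\h\h/ }
  x.compare!
end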
After some more curiosity-fueled experimentation, end up with the following setup:
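# Gemfile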
source 'https://rubygems.org'
gem 'asciidoctor'
gem 'benchmark-ips'
gem 'rouge'
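
# regex-count.rb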
require "benchmark/ips"
# Test text: hexadecimal character references (2-5 hex digits) and CSS hex colors.
PART = <<TEXT
Entities &#x2B158; and &#xA0; &#xF04; &#x2019; &#x21;
Colors #012345 and #abCDef and #000000
TEXT
TEXT = PART * 100
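# COUNTEDn: entity pattern with the first (2 - n) hex digits unrolled, leaving a {n,n+3} quantifier.
# COUNTED0a: same as COUNTED0 but with explicit ranges instead of \h. SIX*: CSS color, counted vs. unrolled.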
COUNTED2 = /&#x(\h{2,5});/
COUNTED1 = /&#x(\h\h{1,4});/
COUNTED0 = /&#x(\h\h\h{0,3});/
COUNTED0a = /&#x([0-9a-fA-F][0-9a-fA-F][0-9a-fA-F]{0,3});/
SIXCOUNT = /#(\h{6})/
SIXUNROLL = /#(\h\h\h\h\h\h)/
def count(string, regex)
string.scan(regex).count
end
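# Wrap a pattern so it only matches when delimited by whitespace.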
def compile_regex(regex)
/(?<=\s)(#{regex})(?=\s)/
end
COUNTED2_WHOLE = compile_regex(COUNTED2)
COUNTED1_WHOLE = compile_regex(COUNTED1)
COUNTED0_WHOLE = compile_regex(COUNTED0)
COUNTED0a_WHOLE = compile_regex(COUNTED0a)
SIXCOUNT_WHOLE = compile_regex(SIXCOUNT)
SIXUNROLL_WHOLE = compile_regex(SIXUNROLL)
def measure(string, x)
x.report('counted2') { count(string, COUNTED2_WHOLE) }
x.report('counted1') { count(string, COUNTED1_WHOLE) }
x.report('counted0') { count(string, COUNTED0_WHOLE) }
x.report('counted0a') { count(string, COUNTED0a_WHOLE) }
x.report('sixcount') { count(string, SIXCOUNT_WHOLE) }
x.report('sixunroll') { count(string, SIXUNROLL_WHOLE) }
x.compare!
end
def benchmark(string)
Benchmark.ips do |x|
measure(string, x)
end
end
benchmark(TEXT)
There are two independent “research questions” in these measurements:
- Question #1: six*: Is unrolling a fixed number of matches faster?
- Question #2: counted*: Does unrolling the first matches of a variable count execute faster?
- Question #2a: counted*: Are character ranges faster than \h for hex matching?
Run the following command a few times:
$ (echo; date -Is; bundle exec ruby regex-count.rb) | tee -a log.txt
then clean up the results:
2022-01-14T22:23:07-06:00
Comparison:
           sixunroll:     9084.6 i/s
            sixcount:     8428.4 i/s - same-ish: difference falls within error
           counted0a:     7326.4 i/s - 1.24x  (± 0.00) slower
            counted0:     7224.1 i/s - 1.26x  (± 0.00) slower
            counted1:     6756.1 i/s - 1.34x  (± 0.00) slower
            counted2:     6714.6 i/s - 1.35x  (± 0.00) slower

2022-01-14T22:23:49-06:00
Comparison:
           sixunroll:     8967.5 i/s
            sixcount:     8109.3 i/s - 1.11x  (± 0.00) slower
           counted0a:     7187.8 i/s - 1.25x  (± 0.00) slower
            counted0:     7151.9 i/s - 1.25x  (± 0.00) slower
            counted1:     6947.7 i/s - 1.29x  (± 0.00) slower
            counted2:     6914.6 i/s - 1.30x  (± 0.00) slower

2022-01-14T22:24:32-06:00
Comparison:
           sixunroll:     8462.3 i/s
            sixcount:     7988.5 i/s - same-ish: difference falls within error
            counted0:     7235.9 i/s - 1.17x  (± 0.00) slower
           counted0a:     7187.3 i/s - 1.18x  (± 0.00) slower
            counted1:     7081.8 i/s - 1.19x  (± 0.00) slower
            counted2:     6963.3 i/s - 1.22x  (± 0.00) slower

2022-01-14T22:25:14-06:00
Comparison:
           sixunroll:     9179.8 i/s
            sixcount:     8468.7 i/s - same-ish: difference falls within error
           counted0a:     7353.3 i/s - 1.25x  (± 0.00) slower
            counted0:     7265.4 i/s - 1.26x  (± 0.00) slower
            counted1:     7153.0 i/s - 1.28x  (± 0.00) slower
            counted2:     6972.5 i/s - 1.32x  (± 0.00) slower
Considering these benchmarks, I come to the following conclusions:
- Result #1: Repeating the character class explicitly instead of using {N} is slightly faster, about 10% in these tests.
- Result #2: Explicitly unrolling the non-variable part of a match length may be about 5% faster.
- Result #2a: No clear difference between [0-9a-fA-F] and \h.