Refactor to clarify the core algorithms and isolate the
forward/reverse differences to one section.
This also lets the insertion order be set by Attribute.
In benchmarks, this version appears to be marginally faster than
Alexander Krotov's original version. It's not clear why; it
should be marginally slower.