Saved some cycles over the previous version by improving input deconstruction and Painter Wheel throughput. The downside to this design is that it requires an extra move to output. I didn't shave off nearly as many cycles as I would have liked, but I do see some ways to improve it further.