I was pretty happy with my period-6 pipeline, which assembles a product with minimal latency, and I readily copied it twice in different directions. Getting the assembled products back to the output, though, is a real head-scratcher. I think a saner way to reduce cycles would be to build two period-4 pipelines that form products much closer together, but I couldn't figure that out, and I've already spent too long on this solution.