Update: I feel this double half access layout would do better if the metric were GCA instead of GAC; it seems really efficient. There must be a layout that does more intermediate work with atom arms and careful juggling and ends up with less area, but I'm not bothering to find it. Yet another layout with a tiny change: -2a.