It's certainly within reach to get fairly close to GMP's speed if you are willing to write the low-level loops in assembly. I think parity in the simple cases is fairly straightforward to reach, but once you need to tune the thresholds at which you switch between algorithms, it takes much more testing to find the optimal values [1]. It is also easy to overlook how well optimized GMP is across a wide range of less common architectures and chips, and I wouldn't be surprised if my particular implementation lost a bit of ground on other architectures such as ARM (that would be a good thing to test).
[1] https://gmplib.org/devel/thres/
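
To make the threshold point concrete, here is a toy Python sketch of the idea. It is not GMP's tuning code and not the implementation discussed above. It switches from schoolbook to Karatsuba multiplication at a limb-count crossover, then times a few candidate crossovers to pick the fastest on the current machine. All names (`LIMB_BITS`, `schoolbook_mul`, `karatsuba_mul`, `tune_threshold`) and the 32-bit limb size are assumptions made for illustration.

```python
import random
import time

LIMB_BITS = 32                 # assumed limb size, for illustration only
MASK = (1 << LIMB_BITS) - 1


def to_limbs(n):
    """Split a non-negative int into little-endian LIMB_BITS-bit limbs."""
    limbs = []
    while n:
        limbs.append(n & MASK)
        n >>= LIMB_BITS
    return limbs or [0]


def from_limbs(limbs):
    """Reassemble little-endian limbs (each < 2**LIMB_BITS) into an int."""
    n = 0
    for limb in reversed(limbs):
        n = (n << LIMB_BITS) | limb
    return n


def schoolbook_mul(a, b):
    """Quadratic limb-by-limb product of two non-negative ints."""
    a_limbs, b_limbs = to_limbs(a), to_limbs(b)
    out = [0] * (len(a_limbs) + len(b_limbs))
    for i, x in enumerate(a_limbs):
        carry = 0
        for j, y in enumerate(b_limbs):
            t = out[i + j] + x * y + carry
            out[i + j] = t & MASK
            carry = t >> LIMB_BITS
        out[i + len(b_limbs)] = carry
    return from_limbs(out)


def karatsuba_mul(a, b, threshold):
    """Karatsuba that falls back to schoolbook at or below `threshold` limbs."""
    assert threshold >= 1, "a threshold below one limb would never terminate"
    bits = max(a.bit_length(), b.bit_length())
    if bits <= threshold * LIMB_BITS:
        return schoolbook_mul(a, b)
    half = bits // 2
    low_mask = (1 << half) - 1
    a_lo, a_hi = a & low_mask, a >> half
    b_lo, b_hi = b & low_mask, b >> half
    lo = karatsuba_mul(a_lo, b_lo, threshold)
    hi = karatsuba_mul(a_hi, b_hi, threshold)
    # Three recursive products instead of four: the cross terms are
    # recovered as (a_lo + a_hi)(b_lo + b_hi) - lo - hi.
    mid = karatsuba_mul(a_lo + a_hi, b_lo + b_hi, threshold) - lo - hi
    return (hi << (2 * half)) + (mid << half) + lo


def tune_threshold(candidates, limbs=256, trials=3):
    """Time each candidate crossover on one random input pair; return the
    fastest candidate and the raw timings (machine-dependent)."""
    rng = random.Random(0)
    a = rng.getrandbits(limbs * LIMB_BITS)
    b = rng.getrandbits(limbs * LIMB_BITS)
    timings = {}
    for t in candidates:
        start = time.perf_counter()
        for _ in range(trials):
            karatsuba_mul(a, b, t)
        timings[t] = time.perf_counter() - start
    return min(timings, key=timings.get), timings


if __name__ == "__main__":
    best, timings = tune_threshold([4, 8, 16, 32, 64])
    print("best crossover (limbs):", best)
```

The winning value depends on the machine and on how fast each routine's inner loop is, which is why the optimum has to be re-measured per chip. A real library has many such crossovers (Karatsuba, the higher Toom variants, FFT, and similar ones for division), and they interact, so the search is far larger than this single-parameter sweep.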