Even worse than an error message is an incorrect result. I worked on the OpenCL neural net evaluation backend used in the Leela Zero (Go) and lc0 (chess) bots. We had reports of several OpenCL drivers being so broken that they silently gave incorrect results: everything appeared to work and no error was ever reported. Intel integrated GPUs on Apple were the worst offenders, and it looks like the drivers are never going to get fixed. Some older AMD cards had similar issues. We had to add a check that the GPU NN evaluation matches a CPU reference to catch these broken drivers.
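To give an idea of what such a check can look like, here is a minimal sketch in Python. It is not the actual Leela Zero or lc0 code: the names `outputs_match` and `self_check`, the evaluator callables, and the tolerance values are all made up for illustration. The idea is just to run the same inputs through a trusted CPU path and the GPU path, and to refuse the GPU backend if the outputs differ by more than floating-point noise.

```python
import math
from typing import Callable, Iterable, Sequence

# Hypothetical type: an evaluator maps an input vector to an output vector.
Evaluator = Callable[[Sequence[float]], Sequence[float]]


def outputs_match(reference: Sequence[float], candidate: Sequence[float],
                  rel_tol: float = 1e-2, abs_tol: float = 1e-3) -> bool:
    """Return True if candidate agrees with reference within tolerance.

    The tolerances are illustrative assumptions; GPU and CPU floating-point
    results normally differ slightly, so exact equality is too strict.
    """
    if len(reference) != len(candidate):
        return False
    return all(
        # Reject NaN/inf explicitly: a broken driver may produce them.
        math.isfinite(c) and math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
        for r, c in zip(reference, candidate)
    )


def self_check(cpu_eval: Evaluator, gpu_eval: Evaluator,
               test_inputs: Iterable[Sequence[float]]) -> bool:
    """Return True only if gpu_eval matches cpu_eval on every test input."""
    return all(outputs_match(cpu_eval(x), gpu_eval(x)) for x in test_inputs)
```

A caller would run `self_check` once at startup, and perhaps again on a small fraction of evaluations during play, and fall back to the CPU path or abort with a clear message when it returns `False`.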