Hey, just going to throw this hack out there.
On some of my machines I probably have a hardware issue, or an overheating GPU, or something (I can't be bothered to debug it).
Sometimes it gets into an infinite loop of throwing errors and just never recovers.
I run supervisor on Linux, so I ended up just throwing an exit(0) into the error-check define:
#define checkCudaErrors(x) \
{ \
    cudaGetLastError(); /* clear any stale error first */ \
    x; \
    cudaError_t err = cudaGetLastError(); \
    if (err != cudaSuccess) { \
        applog(LOG_ERR, "GPU #%d: cudaError %d (%s) calling '%s' (%s line %d)\n", \
               device_map[thr_id], err, cudaGetErrorString(err), #x, __FILENAME__, __LINE__); \
        exit(0); /* bail out so supervisor restarts the process */ \
    } \
}
It seems to work: every time it gets into the loop it just bails and supervisor restarts it. For whatever reason it then works for a few hours before bailing again, but it runs at pretty much 98% efficiency instead of being totally dead.
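For anyone who wants something a bit less trigger-happy, here is a rough sketch of a variation. It is not what I run, just an illustration: it only bails after a few failures in a row, and it exits with a nonzero code. Everything in it (should_bail, check_status, CHECK_STATUS, MAX_CONSECUTIVE_ERRORS, STATUS_OK) is a made-up name, and the int status stands in for cudaError_t so it builds without the CUDA toolkit. As far as I understand supervisor, its autorestart=unexpected mode only restarts on exit codes not listed in exitcodes, so a nonzero code is probably the safer signal, but check your own config.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical threshold: how many failed calls in a row we tolerate. */
#define MAX_CONSECUTIVE_ERRORS 3

/* Stand-in for cudaSuccess so the sketch builds without CUDA. */
#define STATUS_OK 0

/*
 * Update the running count of consecutive failures.
 * Returns 1 once the streak reaches the limit, 0 otherwise.
 */
int should_bail(int status, int *consecutive_errors)
{
    if (status == STATUS_OK) {
        *consecutive_errors = 0;   /* a good call breaks the streak */
        return 0;
    }
    *consecutive_errors += 1;
    return *consecutive_errors >= MAX_CONSECUTIVE_ERRORS;
}

/*
 * Log a failed call, then exit with a nonzero status if the streak of
 * failures has reached the limit. Returns normally otherwise.
 */
void check_status(int status, const char *call, const char *file,
                  int line, int *consecutive_errors)
{
    if (status != STATUS_OK)
        fprintf(stderr, "error %d calling '%s' (%s line %d)\n",
                status, call, file, line);
    if (should_bail(status, consecutive_errors))
        exit(EXIT_FAILURE);
}

/* Convenience wrapper that records the call text and source location. */
#define CHECK_STATUS(x, counter) \
    check_status((x), #x, __FILE__, __LINE__, (counter))
```

You would keep one int counter per GPU thread, initialised to 0, and pass its address on every check, e.g. CHECK_STATUS(some_call(), &streak), where some_call is whatever returns the status. A single glitch then gets logged and ignored, while a real error loop still kills the process within a few calls.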
So, just throwing this out there (yes, I know this is a super lazy hack).
