My GPU was SICK for several days and I missed it because my monitoring script, which reads the API on port 4028, reported OK. Is there a way to detect a SICK card through the API?
cgminer 3.2.2 reported:
[2013-07-28 11:29:54] Stratum from pool 1 detected new block
[2013-07-28 11:29:55] Pool 1 stale share detected, discarding
[2013-07-28 11:29:56] Accepted 876fa488 Diff 4/2 GPU 0 pool 1
[2013-07-28 11:31:28] Stratum connection to pool 1 interrupted
[2013-07-28 11:31:28] Lost 517 shares due to stratum disconnect on pool 1
[2013-07-28 11:31:30] Pool 1 stratum share submission failure
[2013-07-28 11:32:00] Pool 1 communication resumed, submitting work
[2013-07-28 11:32:00] Rejected acc5c400 Diff 3/2 GPU 0 pool 1
[2013-07-28 11:32:32] GPU0: Idle for more than 60 seconds, declaring SICK!
[2013-07-28 11:32:32] GPU0: Attempting to restart
[2013-07-28 11:32:32] Thread 0 still exists, killing it off
[2013-07-28 11:32:32] Thread 0 restarted
"devs" report for the SICK card:
echo '{"command" : "devs"}' | nc localhost 4028 | tr -d '\0' | python -mjson.tool
{
    "DEVS": [
        {
            "Accepted": 694192,
            "Diff1 Work": 2380127,
            "Difficulty Accepted": 1360131.0,
            "Difficulty Rejected": 436120.0,
            "Enabled": "Y",
            "Fan Percent": 56,
            "Fan Speed": -1,
            "GPU": 0,
            "GPU Activity": 0,
            "GPU Clock": 157,
            "GPU Voltage": 1.1,
            "Hardware Errors": 0,
            "Intensity": "18",
            "Last Share Difficulty": 2.0,
            "Last Share Pool": 1,
            "Last Share Time": 1375003796,
            "Last Valid Work": 1375003890,
            "MHS 5s": 0.0,
            "MHS av": 0.17,
            "Memory Clock": 300,
            "Powertune": 0,
            "Rejected": 222954,
            "Status": "Alive",
            "Temperature": 40.0,
            "Total MH": 156195.3567,
            "Utility": 44.85
        }
    ],
    "STATUS": [
        {
            "Code": 9,
            "Description": "cgminer 3.2.2",
            "Msg": "1 GPU(s) - ",
            "STATUS": "S",
            "When": 1375361725
        }
    ],
    "id": 1
}
I am not sure, but this could be a bug. I can try to detect the SICK state from several parameters (MHS 5s, GPU Activity, Temperature), but is that the correct way? If it is, which parameter should be used for detection?
BTW, the reported "MHS av" of 0.17 looks wrong to me; the actual rate was 0.00, because the card had been sick for several days...
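For reference, here is a rough sketch of the multi-parameter check I have in mind. It is only a heuristic, not an official cgminer method. The function names `query_devs` and `stalled_gpus` and the 300-second threshold are my own assumptions. I am also assuming that "Last Valid Work" and "When" are Unix timestamps that can be compared, and that the API closes the connection after sending its reply. In the dump above, "When" minus "Last Valid Work" is 357835 seconds (about four days) while "Status" still says "Alive", so this check would have caught it.

```python
import json
import socket

# Assumed threshold: how long a GPU may go without valid work before
# it is flagged. Tune this for your own pool and difficulty.
STALE_SECONDS = 300


def query_devs(host="localhost", port=4028):
    """Send the "devs" command to the API and return the parsed reply."""
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(json.dumps({"command": "devs"}).encode())
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # assumes the miner closes the socket when done
                break
            chunks.append(data)
    # The reply ends with a NUL byte, same as the `tr -d '\0'` step above.
    return json.loads(b"".join(chunks).replace(b"\0", b"").decode())


def stalled_gpus(reply, stale_seconds=STALE_SECONDS):
    """Return a list of (gpu_id, reasons) for GPUs that look stalled."""
    now = reply["STATUS"][0]["When"]
    flagged = []
    for dev in reply.get("DEVS", []):
        # Skip non-GPU devices and GPUs that were disabled on purpose.
        if "GPU" not in dev or dev.get("Enabled") != "Y":
            continue
        reasons = []
        if dev.get("Status") != "Alive":
            reasons.append("Status is %s" % dev.get("Status"))
        idle = now - dev.get("Last Valid Work", now)
        # Zero short-term hashrate alone can be a brief pool outage,
        # so it is combined with the age of the last valid work.
        if dev.get("MHS 5s", 0.0) == 0.0 and idle > stale_seconds:
            reasons.append("no valid work for %d s" % idle)
        if reasons:
            flagged.append((dev["GPU"], reasons))
    return flagged
```

A monitoring script could call `stalled_gpus(query_devs())` and raise an alert whenever the returned list is not empty. I left out "GPU Activity" and "Temperature" because they depend on the driver reporting them, but they could be added as extra reasons in the same way.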