I made some changes to set up a new Telegram alert system.
The 30-minute Telegram notification is useful, but it gets annoying over time and I had to mute it.
So I created a new telegram-alert script that passes me alerts from IAmNotAJeep_and_Maxximus007_WATCHDOG.
First, create a new bot as explained in the OP and call it Mining Alerts (or whatever you like).
Then make a new telegram-alert file in the m1 home directory and put the following inside:
telegram-alert:
#!/bin/bash
# Telegram Info Script
# By BaliMiner et al...
# for nvOC by fullzero
# ref: http://bernaerts.dyndns.org/linux/75-debian/351-debian-send-telegram-notification
#
source ~/1bash
CHATID=$TELEGRAM_CHATID
APIKEY=$TELEGRAM_ALERT_APIKEY
SYSTEM_BOOT_TIME=$(uptime -s)
# Count GPUs with the same query the watchdog uses; this also works with 10 or more GPUs
GPU_COUNT=$(nvidia-smi --query-gpu=count --format=csv,noheader,nounits | tail -n 1)
STARTING_MINER=$(tail -n 50 /home/m1/5_restartlog | grep Starting | tail -n 1)
LOST_GPU=$(tail -n 50 /home/m1/5_restartlog | grep Lost | tail -n 1)
REBOOT_ALERT=$(tail -n 50 /home/m1/5_restartlog | grep 'reboot in' | tail -n 1)
UTILIZATION_LOW_REBOOTING=$(tail -n 50 /home/m1/5_restartlog | grep 'low: reviving' | tail -n 1)
UTILIZATION_LOW_RESTART_3MAIN=$(tail -n 50 /home/m1/5_restartlog | grep 'low: restart' | tail -n 1)
LOW_UTILIZATION=$(tail -n 50 /home/m1/5_restartlog | grep 'Low Utilization' | tail -n 1)
FAILURES_REINIT=$(tail -n 50 /home/m1/5_restartlog | grep 'Before reinit')
SYSTEM_UP_TIME=$(uptime -p)
REBOOT_REQUIRED=$(/home/m1/reboot-required)
GPU_UTILIZATIONS=$(tail -n 30 5_restartlog | grep 'GPU UTILIZATION' | awk '{gsub(/GPU UTILIZATION: /,"")}1' | tail -n 1)
TEMP=$(/usr/bin/nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader)
PD=$(/usr/bin/nvidia-smi --query-gpu=power.draw --format=csv,noheader)
FAN=$(/usr/bin/nvidia-smi --query-gpu=fan.speed --format=csv,noheader)
TEMP_FAN_POWER=$(tail -n 30 6_autotemplog | grep GPU | awk '{gsub(/:/,": ")}1' |tail -n $GPU_COUNT)
LF=$'\n'
MSG=" Worker: $WORKERNAME
Boot Time: $SYSTEM_BOOT_TIME
GPU Count: $GPU_COUNT
GPU Utilization:
$GPU_UTILIZATIONS
$STARTING_MINER
$LOW_UTILIZATION
$FAILURES_REINIT
$UTILIZATION_LOW_RESTART_3MAIN
$LOST_GPU
$REBOOT_ALERT
$UTILIZATION_LOW_REBOOTING
"
/usr/bin/curl -m 5 -s -X POST --output /dev/null "https://api.telegram.org/bot${APIKEY}/sendMessage" -d "text=${MSG}" -d "chat_id=${CHATID}"
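A possible tidy-up for the script above: the same tail | grep | tail pattern is repeated for every alert line, so it could be folded into one small helper. This is only a sketch. The name last_match is made up (it is not part of nvOC), and the demo writes a throwaway log with made-up lines so it runs without a rig.

```shell
#!/bin/bash
# Hypothetical helper (not part of nvOC): print the newest line that contains
# a fixed string, looking only at the last N lines of a log file (default 50).
last_match() {
    local file="$1" pattern="$2" lines="${3:-50}"
    # grep -F matches the pattern as a plain string; tail -n 1 keeps the newest hit
    tail -n "$lines" "$file" | grep -F -- "$pattern" | tail -n 1
}

# Demo on a temporary log with invented entries
demo_log=$(mktemp)
printf '%s\n' \
    "Mon Sep 4 15:00:00 - Starting miner restart script." \
    "Mon Sep 4 15:03:53 - GPU under threshold found" \
    "Mon Sep 4 15:04:03 - GPU under threshold found again" > "$demo_log"
last_match "$demo_log" "Starting"
last_match "$demo_log" "under threshold"
rm -f "$demo_log"
```

With something like this, a line such as LOST_GPU could become LOST_GPU=$(last_match /home/m1/5_restartlog Lost). When nothing matches, the helper prints nothing, so the variable stays empty just as in the script above.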
In your 1bash, add a line below APIKEY=$TELEGRAM_APIKEY and set your new alert API key there:
TELEGRAM_ALERT_APIKEY="aaaaaaaaaaaaaaaaaaa:bbbbbbbbbbbb-cccccccccccccccccccccccc"
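If either TELEGRAM_CHATID or TELEGRAM_ALERT_APIKEY is left empty in 1bash, the alert is never delivered and nothing tells you, because curl runs silently. A small guard could catch that before sending. This is only a sketch; telegram_config_ok is a made-up name and the values in the demo are placeholders, not real credentials.

```shell
#!/bin/bash
# Hypothetical guard (not part of nvOC): succeed only when both settings are non-empty.
telegram_config_ok() {
    local chat_id="$1" api_key="$2"
    [ -n "$chat_id" ] && [ -n "$api_key" ]
}

# Placeholder values for illustration only
if telegram_config_ok "123456789" "aaaa:bbbb-cccc"; then
    echo "Telegram alert settings look complete"
else
    echo "TELEGRAM_CHATID or TELEGRAM_ALERT_APIKEY is empty" >&2
fi
```

In telegram-alert this could be called as telegram_config_ok "$CHATID" "$APIKEY" right after sourcing 1bash, skipping the curl call when it fails.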
And here is my modified IAmNotAJeep_and_Maxximus007_WATCHDOG for telegram alerts:
#!/bin/bash
# Miner restart script V002
# By Maxximus007 && IAmNotAJeep
# for nvOC by fullzero
# Modified by papampi for telegram-alerts
export DISPLAY=:0
# Creating a log file to record restarts
LOG_FILE="/home/m1/5_restartlog"
if [ -e "$LOG_FILE" ] ; then
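#Limit the logfile, just keep the last 2K lines
LASTLOG=$(tail -n 2K "$LOG_FILE")
# Write the trimmed log back; the quotes keep one entry per line
echo "$LASTLOG" > "$LOG_FILE"
echo "" >> "$LOG_FILE"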
fi
echo "$(date) - Starting miner restart script." | tee -a ${LOG_FILE}
echo "" | tee -a ${LOG_FILE}
bash '/home/m1/telegram-alert'
# Give oneBash time to start to prevent reboot
echo "$(date) - waiting 70 seconds before going 'on watch'" | tee -a ${LOG_FILE}
sleep 60
THRESHOLD=90
RESTART=0
GPU_COUNT=$(nvidia-smi --query-gpu=count --format=csv,noheader,nounits | tail -1)
COUNT=$((6 * $GPU_COUNT))
while true
do
sleep 10 # sleep 60
#IAmNotAJeep MOD from V002
JEEP=0
#IAmNotAJeep MOD from V002
GPU=0
REBOOTRESET=$(($REBOOTRESET + 1))
#IAmNotAJeep MOD from V002
echo ""
echo " GPU_COUNT: " $GPU_COUNT | tee -a ${LOG_FILE}
#IAmNotAJeep MOD from V002
UTILIZATIONS=$(nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits)
echo ""
echo "GPU UTILIZATION: " $UTILIZATIONS | tee -a ${LOG_FILE}
echo "" | tee -a ${LOG_FILE}
numtest='^[0-9]+$'
for UTIL in $UTILIZATIONS
do
if ! [[ $UTIL =~ $numtest ]]
then
# Not numeric so: Help we've lost a GPU, so reboot
echo "$(date) - Lost GPU so restarting system. Found GPU's:" | tee -a ${LOG_FILE}
echo "" | tee -a ${LOG_FILE}
bash '/home/m1/telegram-alert'
#Hope PCI BUS info will help find the faulty GPU
nvidia-smi --query-gpu=gpu_bus_id --format=csv | tee -a ${LOG_FILE}
echo "reboot in 10 seconds" | tee -a ${LOG_FILE}
echo ""| tee -a ${LOG_FILE}
bash '/home/m1/telegram-alert'
sleep 10
sudo reboot
fi
# If utilization is lower than threshold count them:
if [ $UTIL -lt $THRESHOLD ]
then
echo "$(date) - GPU under threshold found" | tee -a ${LOG_FILE}
echo "" | tee -a ${LOG_FILE}
bash '/home/m1/telegram-alert'
COUNT=$(($COUNT - 1))
#IAmNotAJeep MOD from V002
JEEP=$(($JEEP + 1))
#IAmNotAJeep MOD from V002
fi
GPU=$(($GPU + 1))
done
#IAmNotAJeep MOD from V002
if [ $JEEP -gt 0 ]
then
if [ $COUNT -le 0 ]
then
INTERNET_IS_GO=0
if nc -vzw1 google.com 443;
#if nc -vzw1 $POOL 80;
then
INTERNET_IS_GO=1
fi
echo ""
if [[ $RESTART -gt 4 && $INTERNET_IS_GO == 1 ]]
then
echo "$(date) - Utilization is too low: reviving did not work so restarting system in 10 seconds" | tee -a ${LOG_FILE}
echo "" | tee -a ${LOG_FILE}
bash '/home/m1/telegram-alert'
sleep 10
sudo reboot
fi
echo "$(date) - Utilization is too low: restart 3main" | tee -a ${LOG_FILE}
# If miner runs in screen 'miner' kill the screen to be sure it's gone
pkill -e miner
bash '/home/m1/telegram-alert'
# Best to restart oneBash - settings might be adjusted already
target=$(ps -ef | awk '$NF~"3main" {print $2}')
kill $target | tee -a ${LOG_FILE}
echo "" | tee -a ${LOG_FILE}
RESTART=$(($RESTART + 1))
REBOOTRESET=0
COUNT=$GPU_COUNT
# Give oneBash time to restart to prevent reboot
sleep 60
#fi
else
echo "$(date) - Low Utilization Detected: 3main will reinit if there are 6 consecutive failures" | tee -a ${LOG_FILE}
echo ""
echo " "$COUNT "Failures Before reinit" | tee -a ${LOG_FILE}
bash '/home/m1/telegram-alert'
#IAmNotAJeep MOD from V002
fi
else
#IAmNotAJeep MOD from V002
COUNT=$((6 * $GPU_COUNT))
echo "$(date) - 5 by 5: REMEMBER TO THANK IAmNotAJeep and Maxximus007"
#IAmNotAJeep MOD from V002
fi
# No need for a reboot after a while
if [ $REBOOTRESET -gt 5 ]
then
RESTART=0
REBOOTRESET=0
fi
done
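For anyone who wants to try the threshold check without touching a rig, the core of the for loop above can be pulled out into a small function. This is only a sketch; count_low_util is a made-up name, and unlike the watchdog it simply skips non-numeric readings instead of rebooting.

```shell
#!/bin/bash
# Hypothetical helper (not part of the watchdog): count how many utilization
# readings are below the threshold. The first argument is the threshold,
# the rest are readings as printed by nvidia-smi with nounits.
count_low_util() {
    local threshold="$1"
    shift
    local low=0 util
    for util in "$@"; do
        # Skip anything that is not a plain number (the watchdog reboots here instead)
        case "$util" in
            ''|*[!0-9]*) continue ;;
        esac
        if [ "$util" -lt "$threshold" ]; then
            low=$((low + 1))
        fi
    done
    echo "$low"
}

# Readings from a 7 GPU rig with one card at 83 percent and THRESHOLD=90
count_low_util 90 99 83 99 99 98 99 99
```

Each pass of the watchdog lowers COUNT by this number, so one slow card drains the counter much more slowly than a rig where every card is idle.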
This is what it looks like when a GPU is under threshold:
Worker: nv102
Boot Time: 2017-09-04 12:17:17
Miner Uptime: 02:45:07
GPU Count: 7
GPU Utilization:
99 83 99 99 98 99 99
Mon Sep 4 15:03:53 IRDT 2017 - GPU under threshold found
Mon Sep 4 15:03:56 IRDT 2017 - Low Utilization Detected: 3main will reinit if there are 6 consecutive failures
41 Failures Before reinit
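The 41 in that message comes straight from the watchdog's counter: COUNT starts at 6 * GPU_COUNT and drops by one for every GPU found under threshold in a pass. A quick check with the numbers from the sample (7 GPUs, one card at 83), assuming this was the first low reading since the counter was last reset:

```shell
#!/bin/bash
# Values taken from the sample message
GPU_COUNT=7
COUNT=$((6 * GPU_COUNT))   # starting budget set by the watchdog: 42
COUNT=$((COUNT - 1))       # one GPU under threshold in one pass
echo "$COUNT Failures Before reinit"
```

So with a single slow card the "6 consecutive failures" in the log text really means up to 42 passes of the loop. It is only 6 passes when every GPU is under threshold at once.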
Hopefully we get more/better integration for alerts from fullzero, Maxximus and IAmNotAJeep.
Thanks all.