Tag Archives: GPU

Cluster GPU Thermal Monitoring

The research group has been writing some simple monitoring scripts for handling the clusters. The focus is mostly on montitoring NAK (page in serious need of update), which has always had thermal irregularities with it’s GPUs. Some of the (poorly designed) GPU coolers have recently finished cooking their fans, and the “repair” has been to remove the cowling and mount an 80mm fan in the case to blow across the heatsink — this produces comparable temperatures to the vendor solution, whch is pathetic. This thermal instability requires that the system temperatures be periodically checked, and we have written variety of colorful scripts both for users and for the displays in the front of the machine room. The one I wrote for my own use is a simple combination of bash and AWK, which produces nice colorized one-line summaires for each machine when run with something like “mpirun –hostfile ~/nakhosts ./pstatc.sh | sort” where nakhosts is a standard MPI-friedly list of hosts, and ~/bin/ has nvidia-smi (a little tool for handling nivida GPUs from the command line) exported to the nodes. Script attached here for perusal (and so I can find it later). Possibly the best part is that it made me referesh my memory on using ANSI Color Escapes, which has been on my list of skills to touch up for a while – That foray also lead to souping up the script Hank was working on to use background colored spaces for ghetto bargraphs to keep the displays in the windows of the machine room interesting until we are set up to drive them with something else. One of these days I really should learn to use ncurses, or at least get better with one of the GUI libraries…

Posted in Computers, DIY, General, School | Tagged , , | Leave a comment