Gross updates coming soon
We have a ton of updates on the TODO list for Gross. In the meantime, we have brought seven nodes back online, and all nodes are currently up and ready for work.
- compute-0-17: System was hung with "grub" message (failed to restart). Reinstalled and brought back online.
- compute-0-18: New power supply arrived from TeamHPC and was installed.
- compute-0-19: System was hung with "grub" message (failed to restart). Reinstalled and brought back online.
- compute-0-21: System was hung with "grub" message (failed to restart). Reinstalled and brought back online.
- compute-0-24: System was hung at "restarting system" message. Reinstalled and brought back online.
- compute-0-25: System was hung with "grub" message (failed to restart). Reinstalled and brought back online.
- compute-0-33: System hit a kernel panic with the message "CPU 2: Machine Check Exception 4 Bank 4 Kernel Panic...". This system may have some bad memory modules. It has been reinstalled and is back up, but it should be monitored. Hopefully a pattern will emerge.
Now that we have things up, we'd like to continue the improvements. Here is a list of things that we expect to be doing in the next month or so. The December holidays are fast approaching, so a lot of this will be pushed into January.
- compute-0-33: We need to keep an eye on compute-0-33 for potential memory issues.
- InfiniBand switch: It would be nice to get the Topspin firmware upgraded. I'm going to poke, once again, the people who I think should be able to get us somewhere.
- Host names and locations: During the latest Rocks install, the node host names got out of sync with their physical positions in the rack. I'd like to correct this to help keep track of hardware problems.
- Issue database: We're starting to maintain an issues database for problems related to Gross. We're doing this in MySQL currently, and hopefully we will have an interface on our website.
- Power/switches: We need to get the power and network switches under the control of Rocks. We also need to get remote control of the power outlets working again.
- Condor: Remove the Condor roll from the compute nodes, since it was installed initially by mistake.
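
As a rough illustration of what the issues database could look like, here is a minimal Python sketch that records the node problems listed above and counts them by symptom. Everything about the layout is an assumption: the table name `issues`, its columns, and the short symptom labels are made up for this example, and it uses an in-memory SQLite database as a stand-in so it runs anywhere, whereas the real database is in MySQL.

```python
import sqlite3

# Hypothetical schema: the table and column names are invented for this sketch.
SCHEMA = """
CREATE TABLE issues (
    id      INTEGER PRIMARY KEY,
    node    TEXT NOT NULL,
    symptom TEXT NOT NULL,
    action  TEXT NOT NULL
)
"""

# Incidents taken from the node list in this post; the symptom labels are
# shortened by hand for grouping.
INCIDENTS = [
    ("compute-0-17", "hung at grub", "reinstalled"),
    ("compute-0-18", "power supply", "installed new power supply"),
    ("compute-0-19", "hung at grub", "reinstalled"),
    ("compute-0-21", "hung at grub", "reinstalled"),
    ("compute-0-24", "hung at restarting system", "reinstalled"),
    ("compute-0-25", "hung at grub", "reinstalled"),
    ("compute-0-33", "machine check exception", "reinstalled"),
]


def build_db():
    """Create an in-memory database and load the incidents into it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT INTO issues (node, symptom, action) VALUES (?, ?, ?)",
        INCIDENTS,
    )
    conn.commit()
    return conn


def symptom_counts(conn):
    """Return (symptom, count) rows, most frequent symptom first."""
    return conn.execute(
        "SELECT symptom, COUNT(*) FROM issues "
        "GROUP BY symptom ORDER BY COUNT(*) DESC, symptom"
    ).fetchall()


if __name__ == "__main__":
    db = build_db()
    for symptom, count in symptom_counts(db):
        print(f"{count} x {symptom}")
```

Grouping by symptom like this is one simple way to watch for a pattern across nodes; a real version would probably also want a date column and free-form notes.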