Sun Grid Engine (SGE) jobs stuck in qw state

Recently I was pulling my hair out over an HPC cluster running SGE (Sun Grid Engine) whose jobs were getting stuck in the "qw" (queued, waiting) state. The cluster had plenty of open slots, so a lack of room to run was not the problem.

In this case, the nodes having trouble had been newly added to an existing, working cluster. I started by running the usual suspects:

qstat -f
qstat -j [job id]

Neither of these returned anything out of the ordinary, and qstat -j did not list any errors.

Then I remembered to check the actual queue states:

qstat -f|grep 'E'

AH HA! The queues all had an 'E' in the state column. This means they were being held in an error state, so no jobs would be dispatched to them no matter what. The following URL has a great write-up going into more detail about why this happens (http://gridengine.info).

The short answer is to clear the error state with the following command:
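A bare grep for 'E' also matches any other line that happens to contain that letter (a job name, for instance), so a narrower filter can be handy. The sketch below assumes the usual six-column `qstat -f` layout, where queue instances look like queue@host and the state flags are the sixth field; error_queues is a hypothetical helper name, and the column position should be checked against your own output.

```shell
# Print only the queue instances whose states column contains E.
# Assumption: qstat -f prints "queuename qtype used/tot. load_avg arch states",
# so the state flags (when present) are field 6 and queue names contain "@".
error_queues() {
    awk '$1 ~ /@/ && NF >= 6 && $6 ~ /E/ { print $1 }'
}

# Usage:
#   qstat -f | error_queues
```

On the SGE 6.x releases I am aware of, qstat -f -explain E should also print the reason each queue instance went into the error state.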

qmod -c '*'

Running the job again, I now got an Eqw state. Checking the error, it complained: can't find directory "active_jobs/..."

Hmm. The spool directory was present, so I checked the directory permissions, and they looked right. Then I checked the UID and GID. Bingo!

SGE is picky: a user's UID and GID must be the same on the node and on the submit host. On Linux, you can look up the UID and GID in /etc/passwd on the submit host and make sure they match the values on the node in question.

After you change the UID or GID, you will need to reset the ownership of the user's home directory and also of the $SGE_ROOT directory:
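Rather than reading /etc/passwd by eye on each machine, the comparison can be scripted with id. This is only a rough sketch: node01 and username are placeholders, the three function names are made up for illustration, and remote_ids assumes you can ssh to the node without a password prompt.

```shell
# Print "uid:gid" for a user on the local (submit) host.
local_ids() {
    printf '%s:%s\n' "$(id -u "$1")" "$(id -g "$1")"
}

# Print "uid:gid" for a user as seen on a remote node.
# Assumption: passwordless ssh to the node works.
remote_ids() {
    printf '%s:%s\n' "$(ssh "$1" id -u "$2")" "$(ssh "$1" id -g "$2")"
}

# Compare two "uid:gid" strings and report the result.
# Returns non-zero on a mismatch so it can be used in scripts.
ids_match() {
    if [ "$1" = "$2" ]; then
        echo "match: $1"
    else
        echo "MISMATCH: submit host has $1, node has $2"
        return 1
    fi
}

# Example (placeholders, adjust before running):
#   ids_match "$(local_ids username)" "$(remote_ids node01 username)"
```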

chown -R username /opt/sge_path
chown -R username /home/username
chgrp -R username /home/username
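To confirm the recursive change caught everything, find can list whatever is still owned by someone else. A small sketch, with not_owned_by as a hypothetical helper name and the same placeholder user and paths as above; no output means the ownership is consistent.

```shell
# List entries under a directory that are NOT owned by the given
# user and group (names or numeric IDs both work with find).
not_owned_by() {
    find "$1" \( ! -user "$2" -o ! -group "$3" \) -print
}

# Example (placeholders, adjust before running):
#   not_owned_by /home/username username username
```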