
Re: [Condor-users] jobs stuck in queue



Mr. Cannini,

You're receiving these errors because Condor is trying to be cautious with the power you give it.  "With great power comes great responsibility."  Root processes have the power to change their effective user and group IDs while they are running.  So, even though Condor is started as root, it only uses that power when it needs it.  When Condor is doing normal Condor work that doesn't need the extra permissions, it changes its effective user and group IDs to 'condor'.  That is why, when you check the Condor processes with ps or top, they are almost always listed as being owned by the 'condor' user and group.  When Condor needs the extra permissions, it changes its effective user ID to root, then changes back to 'condor' when it's done doing the dangerous stuff.
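
To make the ID switching concrete, here is a minimal Python sketch of the general technique. It is not Condor's actual code; the helper name effective_ids is made up for illustration, and it only uses the standard os.seteuid/os.setegid calls.

```python
import os
from contextlib import contextmanager


@contextmanager
def effective_ids(uid, gid):
    """Temporarily switch the effective user/group IDs, then switch back.

    Hypothetical helper for illustration only. Switching to IDs other than
    the process's own requires that the process was started as root.
    """
    saved_uid, saved_gid = os.geteuid(), os.getegid()
    try:
        # Change the group first: once the effective UID is no longer root,
        # the process may not be allowed to change its group anymore.
        os.setegid(gid)
        os.seteuid(uid)
        yield
    finally:
        # Restore in the opposite order: regain the original UID first,
        # then the original group.
        os.seteuid(saved_uid)
        os.setegid(saved_gid)
```

A root process could wrap its ordinary work in something like `with effective_ids(104, 104): ...`, where 104 is the 'condor' user ID mentioned in the quoted message below; using 104 as the group ID as well is an assumption.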

Because of this, perhaps the '/var/spool/condor/' directory or one of its subdirectories needs to be owned by root:root.  I have mine owned by condor:condor, though, so I don't know why this is a problem.  Try chowning it to 'root:root' and see if that helps.
For a similar reason, perhaps '/var/lib/condor/execute/' needs to be owned by root:root.  (Root-squashing usually refers to a shared filesystem, such as NFS, treating a client's local 'root' user as an unprivileged user so that root gets no special permissions there, I think.)  Why does this directory have the sticky bit set, though?  (That is what the "t" in the "drwx-rwx-rwt" permissions means.)  Try unsetting the sticky bit on '/var/lib/condor/execute/' by running 'chmod -t /var/lib/condor/execute' as root.  My execute directory doesn't have the sticky bit set, so I think it's safe to unset it (I don't think it's set by default, that is).
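
If you would rather check the ownership and sticky bit from a script than read the 'ls -ld' output by eye, something like the following Python sketch should work. The function names inspect_dir and clear_sticky are made up for illustration; clear_sticky is meant to do the same thing as 'chmod -t'.

```python
import os
import stat


def inspect_dir(path):
    """Report the numeric owner, group, mode string, and sticky bit of a path."""
    st = os.stat(path)
    return {
        "uid": st.st_uid,
        "gid": st.st_gid,
        "mode": stat.filemode(st.st_mode),           # same style as 'ls -l'
        "sticky": bool(st.st_mode & stat.S_ISVTX),   # the trailing 't'
    }


def clear_sticky(path):
    """Remove only the sticky bit, leaving the other permission bits alone."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode & ~stat.S_ISVTX)
```

For example, inspect_dir('/var/lib/condor/execute') would show whether the directory is owned by UID 104 and whether the sticky bit is set; clear_sticky on that path needs to be run as root or as the directory's owner.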

Hopefully, this will fix your problems or at least get you that much closer to figuring it all out once and for all.  I don't know why the job stays stuck in the queue.  Unfortunately, I'm not yet familiar with the parallel universe.  What I can suggest is this: after you make these changes and correct the most recent errors in your log files, restart Condor and try again.  If the jobs still stay in the queue, run 'condor_q -better-analyze' to see if you get more information this time.  Before, it mentioned that your job didn't match any resource constraints, which tells me that the Requirements of the job and the capabilities of the machine don't quite match up.  Look through the log files I mentioned again to see if you get any new errors.  If 'condor_q -better-analyze' and the log files don't help, give me the output of 'condor_q -long' for the appropriate cluster/job and 'condor_status -long' for the appropriate machines (node-01 and node-02?).

Best Regards,
 ~ Garrett K.
condor.cs.wlu.edu

On Aug 23, 2011, at 7:09 PM, Fabricio Cannini wrote:

> On Monday, 22 August 2011, at 15:16:28, Koller, Garrett wrote:
>> Mr. Cannini,
>> 
>> Oh, I think I'm beginning to see the problem.  Look at the StartLog and note
>> the authentication errors:
>>> 08/19/11 17:21:30 PERMISSION DENIED to unauthenticated@unmapped from host
>>> 172.17.8.121 for command 442 (REQUEST_CLAIM), access level DAEMON:
>>> reason: DAEMON authorization policy contains no matching ALLOW entry for
>>> this request; identifiers used
>> 
>> The "unauthenticated@unmapped" part means that you simply do not have
>> authentication configured correctly.  First of all, what forms of
>> authentication are you trying to use?  Run 'condor_config_val -v
>> SEC_CLIENT_AUTHENTICATION_METHODS' and 'condor_config_val -v
>> SEC_DEFAULT_AUTHENTICATION_METHODS' to find out. The typical forms are FS,
>> FS_REMOTE, and PASSWORD.  To learn more about how they work, look at
>> http://servo.cs.wlu.edu/dokuwiki/doku.php/condor/administration/authentication .
>> Look at http://condor.cs.wlu.edu/condor/config/condor_config_global
>> for an example Condor configuration that uses authentication (Ctrl-F and
>> search for "Authentication").
>> 
>> Once you have authentication correctly configured, the authentication will
>> allow daemons to identify themselves to Condor as "<username>@<hostname>".
>> If Condor runs as the user 'condor' (or as 'root' pretending to be
>> 'condor') on the computer 'condor.cs.wlu.edu', for example, then that
>> means that you need to add "condor@xxxxxxxxxxxxxxxxx" to the ALLOW_DAEMON
>> configuration variable to let the daemons communicate.
>> 
>> Does this make sense?  If so, does this help?
> 
> Hi.
> 
> I've completely disabled auth and negotiation on both master and nodes with 
> 'SEC_DEFAULT_AUTHENTICATION=NEVER' and 'SEC_DEFAULT_NEGOTIATION=NEVER' 
> temporarily, and voilà, the "PERMISSION DENIED" messages stopped.
> 
> Now I've found the following message in the master's SchedLog:
> ===============================
> 08/23/11 18:35:25 (pid:8028) Inserting new attribute Scheduler into non-active 
> cluster cid=41 acid=-1
> 08/23/11 18:37:45 (pid:8028) Attempting to chown 
> '/var/spool/condor/41/0/cluster41.proc0.subproc0' from 1000 to 0.0, but the 
> path was unexpectedly owned by 104
> ===============================
> 
> The daemons on both the master and the nodes are configured to run as
> root:root, so this conflict seems strange. 104 is the 'condor' user id, btw.
> 
> 
> Also, on node1 StartLog, there is this weird message:
> ===============================
> 08/23/11 19:27:15 WARNING: /var/lib/condor/execute root-squashed or not 
> condor-owned: requiring world-writability
> ===============================
> 
> But the '/var/lib/condor/execute/' permissions are 'condor:condor drwx-rwx-rwt'.
> Weird, huh?
> 
> And still, any job that I submit using 'universe = parallel' keeps getting 
> stuck in the queue.
> 
> TIA