
Re: [Condor-users] Problem running Grid jobs using Condor.



Found the solution:

It had nothing to do with DNS resolution or with running the job as nobody. The problem was that, even though the home dir (/home/research/bala) existed on both the execute and submit machines, it was not cross-mounted between them. So '/home/research/bala/.globus/job/vulcan.txcorp.com/9128.1239817731/stdout' existed on the submit machine but was not accessible from the execute machine.
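For anyone hitting the same hold reason, here is a rough sketch of a check that can be run on the execute machine to see whether an output path from the submit side is actually reachable there. The function name check_output_path is made up for illustration and is not part of Condor or Globus; the check only reflects the permissions of whoever runs the script, so it should be run as the account the job is expected to use.

```python
import os
import sys


def check_output_path(path):
    """Report whether `path` could plausibly be opened for writing here.

    Hypothetical helper for illustration only. It looks at the parent
    directory, because opening a new file for output fails with errno 2
    when the directory that should contain it does not exist locally.
    """
    parent = os.path.dirname(path.rstrip("/")) or "/"
    return {
        "path": path,
        "parent": parent,
        "parent_exists": os.path.isdir(parent),
        # os.access checks against the real uid/gid of the current process.
        "parent_writable": os.access(parent, os.W_OK),
        "file_exists": os.path.exists(path),
    }


if __name__ == "__main__":
    # Pass one or more paths taken from the HoldReason message.
    for p in sys.argv[1:]:
        print(check_output_path(p))
```

If parent_exists comes back False on the execute machine but True on the submit machine, the directory is not shared between the two, which is what happened here.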

I tweaked condor.pm so that the output files are written to /tmp, and the jobs then ran to completion. That leaves me wondering why Condor did not create '/.globus/job/vulcan.txcorp.com/9128.1239817731/stdout' on the execute machine, even though the condor master on the execute machine was started as root?

Cheers!
.Bala.

Balamurali Ananthan wrote:
Hello,

I am trying to run a job in the condor system submitted through the Globus Gatekeeper.

But the jobs are being held for this reason:

HoldReason = "Error from starter on slot1@xxxxxxxxxxxxxxxxxxx: Failed to open '/home/research/bala/.globus/job/vulcan.txcorp.com/9128.1239817731/stdout' as standard output: No such file or directory (errno 2)"

Here is what I already did:
1. Started the execute machine's master daemon as root.

2. Set the UID_DOMAIN in the condor_config on the execute machine to txcorp.com

3. Set the TRUST_UID_DOMAIN = TRUE on the execute machine

4. The account the job is supposed to run as on the execute machine is not in the /etc/passwd file, so SOFT_UID_DOMAIN = TRUE is also set on the execute machine.

However, the execute machine (10.0.0.2) cannot do DNS lookups. So there is no way for the execute machine to resolve 10.0.0.105 to vulcan.txcorp.com (the submit machine) through DNS, although /etc/hosts can be used to map 10.0.0.105 to vulcan.txcorp.com.
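A small sketch for checking what the execute machine's system resolver returns for the submit machine's address. This uses Python's standard socket.gethostbyaddr, which goes through the operating system's resolver; on a typical Linux setup that resolver also consults /etc/hosts, depending on how name service lookup is configured. The helper name reverse_lookup is made up for illustration, and this says nothing about how Condor itself performs the lookup.

```python
import socket


def reverse_lookup(ip):
    """Return the primary host name the system resolver gives for `ip`.

    Hypothetical helper for illustration only. Returns None when the
    address cannot be resolved to a name.
    """
    try:
        name, _aliases, _addresses = socket.gethostbyaddr(ip)
        return name
    except OSError:
        # socket.herror and socket.gaierror are both subclasses of OSError.
        return None


if __name__ == "__main__":
    # 10.0.0.105 is the submit machine's address from the message above.
    print(reverse_lookup("10.0.0.105"))
```

If this prints None on the execute machine, neither DNS nor /etc/hosts is supplying a name for that address.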

Questions:
1. Does the execute machine depend only on DNS to resolve the IP address to its name? And if that fails, does it run the job as nobody?

2. How do I see which account the job is being run as? I'm guessing that the job is run as nobody when it is supposed to be running as bala. How do I check that?

Thanks much!



--
Balamurali Ananthan (bala@xxxxxxxxxx) (720.974.1843)	
Tech-X Corp, 5621 Arapahoe Ave, Suite A, Boulder, CO 80303