
Re: [Condor-users] condor_run vs. condor_submit and non-nfs directories



On 5/30/07, Christoph Spielmann <cspielma@xxxxxxxxxx> wrote:

 Erik Paulson wrote:
 On 5/30/07, Christoph Spielmann <cspielma@xxxxxxxxxx> wrote:


 Hi everybody!

We use Condor on one of our Linux clusters here. The installation seems
to be okay, but when I try to submit a job to Condor from a non-NFS
directory it fails with the famous condor_shadow (condor_SHADOW)
EXITING WITH STATUS 112 error message. The detailed error message is:

5/30 12:15:37 (2203.0) (29451): Job 2203.0 going into Hold state (code
6,2): Error from starter on vm2@xxxxxxxxxxxxxxxxxxxxxxxx: Failed to
execute '/tmp/.condor_run.29439': No such file or directory
5/30 12:15:37 (2203.0) (29451): ZKM: setting default map to (null)
5/30 12:15:37 (2203.0) (29451): **** condor_shadow (condor_SHADOW)
EXITING WITH STATUS 112

I searched the mailing-list archives and found quite a lot of people with
the same problem, but none of the proposed solutions worked for us. We
tried versions 6.8.5 and 6.9.2, both dynamically linked. The
problem shows up with both versions. Sometimes it does work, but in 99% of
the trial runs it doesn't.

The funny thing is that it doesn't work when I use condor_run in
combination with a shell command like /bin/hostname or /bin/date, but
when I write a simple hello-world C program and a submit description file
for it, and submit the description file with condor_submit, it
works as expected. Even in non-NFS directories!


 condor_run does not use file transfer. You must have a shared
filesystem to use condor_run, or at least have the executable in the
same place on every machine. (That is why /bin/hostname works.)

I'd bet the reason it works on a few occasions is that every now and
then your job runs on the submit machine, and can find the executable.

-Erik
_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to
condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/


 Hello Erik!

 Well, I just checked, and both hostname AND date are in the same place
(/bin) on all machines, so that's not the problem. Actually, the root
filesystem of the nodes is mounted via NFS; only machine-specific things
like /tmp, /etc... are mounted separately on each machine. But they are
all mounted in the same place, of course...


But that's still your problem - if you use condor_run from /tmp,
Condor assumes that /tmp is shared between all machines. However, /tmp
is private to each machine, which is why Condor on the remote machine
complains:

Error from starter on vm2@xxxxxxxxxxxxxxxxxxxxxxxx: Failed to execute
'/tmp/.condor_run.29439': No such file or directory

You need to use condor_run in a directory that's shared between all
hosts (like /home or /users, or however you might have your machines
set up).

Or, just don't use condor_run, and just use condor_submit with file transfer.

-Erik
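
For reference, here is a minimal sketch of the condor_submit route
suggested above. It assumes a vanilla-universe job whose executable is a
locally built program named hello; that name and the file names
hello.submit, hello.out, hello.err and hello.log are hypothetical
placeholders, not taken from the thread. With file transfer enabled,
Condor copies the executable to the execute machine, so the submit
directory does not need to be shared.

```shell
# Write a submit description file that turns on Condor's file transfer.
# All file names here are illustrative assumptions; adjust to your job.
cat > hello.submit <<'EOF'
universe                = vanilla
executable              = hello
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
output                  = hello.out
error                   = hello.err
log                     = hello.log
queue
EOF

# Submit it from any directory, shared or not:
#   condor_submit hello.submit
```

The condor_submit line is left as a comment so the snippet can be tried
on a machine without Condor installed; uncomment it on a submit host.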


