
Re: [Condor-users] How To TroubleShoot Flocking



Does the /usr/local/condor/home/execute/dir_2586/condor_exec.exe file
exist? If so, what are its permissions and ownership?

I can't see what is going wrong. The line "File transfer completed successfully."
implies that files have been transferred successfully.
Having said that, you have no input file specified (maybe you could try
adding an "input = empty.txt" to make sure that gets transferred OK), and you
tell it not to transfer the executable. Since it is successfully transferring
nothing, that line doesn't guarantee that any firewalls are being traversed.
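
Something like the following would set that test up. It is only a sketch: it reuses the submit file from your message, and the names empty.txt and hostname_input_test.* are placeholders I made up.

```shell
# Create an empty file to serve as the job's stdin (hypothetical name).
: > empty.txt

# Write a variant of the original submit file that names an input file,
# so there is at least one file for Condor to transfer to the exec machine.
cat > hostname_input_test.sub <<'EOF'
Executable              = /bin/hostname
Requirements            = UidDomain == "condor.calumet.purdue.edu" && Arch == "X86_64"
Universe                = vanilla
transfer_executable     = NO
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
Input                   = empty.txt
Output                  = hostname_input_test.out
Log                     = hostname_input_test.log
Queue
EOF

echo "wrote hostname_input_test.sub"
```

Submit it with "condor_submit hostname_input_test.sub" and compare the StarterLog with the one from the failing job.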

You have a fixed UIDDOMAIN. Do you have an account for the submitting
user on every machine?

What are UIDDOMAIN, Arch, OpSys on 
a) submit machine
b) exec machine
[sorry if this was in previous post]
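
One way to collect these, assuming condor_config_val is on the PATH of each machine. Note the configuration macro is spelled UID_DOMAIN, while the ClassAd attribute used in Requirements is UidDomain. The function name is just mine.

```shell
# Print UID_DOMAIN, ARCH and OPSYS as seen by the local Condor configuration.
# Run this once on the submit machine and once on the exec machine.
show_condor_identity() {
    # Guard so a machine without the Condor tools on PATH gives a clear message.
    if ! command -v condor_config_val >/dev/null 2>&1; then
        echo "condor_config_val not found on PATH"
        return 0
    fi
    for name in UID_DOMAIN ARCH OPSYS; do
        printf '%s = ' "$name"
        condor_config_val "$name"
    done
}

show_condor_identity
```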

Can you run
ls -l /bin/hostname
on the exec machine,

and also on
/usr/local/condor/home/execute
and
/usr/local/condor/home/execute/dir_2586

[maybe it is a file ownership or perms problem]
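
To do all three in one go, a small loop like this would work (a sketch only; report_path is my own helper name). It uses ls -ld so that a directory shows its own owner and mode rather than its contents. The dir_2586 scratch directory is per-job, so it may well be gone by the time you look.

```shell
# Show owner and permissions for one path, or say that it is missing.
report_path() {
    if [ -e "$1" ]; then
        ls -ld "$1"
    else
        echo "missing: $1"
    fi
}

for p in /bin/hostname \
         /usr/local/condor/home/execute \
         /usr/local/condor/home/execute/dir_2586; do
    report_path "$p"
done
```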

Do your condor daemons run as root?
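
A quick way to check on the exec machine, assuming a ps that accepts -e and -o (as on Linux); the function name is mine:

```shell
# List the owning user of every running condor_* process.
# Prints a fallback message if no such process is found.
condor_daemon_owners() {
    ps -eo user,pid,comm | grep 'condor_' || echo "no condor_* processes found"
}

condor_daemon_owners
```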

[sorry, most of the above are arbitrary questions, but there are many straws to
clutch at and I am trying several]

cheers

JK
-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx]On Behalf Of John Alberts
Sent: Friday, July 07, 2006 9:15 PM
To: Condor-Users Mail List
Subject: RE: [Condor-users] How To TroubleShoot Flocking


Again, thanks everyone for trying to help.
Here is what I have done:
I changed the submit file to the following:
  Executable     = /bin/hostname
  Requirements    = UidDomain == "condor.calumet.purdue.edu" && Arch == "X86_64"
  Universe       = vanilla
  transfer_executable = NO
  should_transfer_files = YES
  when_to_transfer_output = ON_EXIT
  Output         = hostname3.out
  Log            = hostname3.log
  Queue
I also set SHADOW_DEBUG = FULL_DEBUG on the server, which is running all the daemons, including: condor_master, condor_collector, condor_negotiator, condor_startd, and condor_schedd.

The above job failed to execute, just like the previous jobs.  Here are the contents of StarterLog.vm1 from the server that should be executing the job.
  7/7 15:09:19 Communicating with shadow <x.x.x.x:41587>
  7/7 15:09:19 Submitting machine is "radon.rcac.purdue.edu"
  7/7 15:09:19 "
  7/7 15:09:20 Starting a VANILLA universe job with ID: 252376.0
  7/7 15:09:20 IWD: /usr/local/condor/home/execute/dir_2586
  7/7 15:09:20 Output file: /usr/local/condor/home/execute/dir_2586/hostname3.out
  7/7 15:09:20 About to exec /usr/local/condor/home/execute/dir_2586/condor_exec.exe condor_exec.exe
  7/7 15:09:20 Create_Process: child failed with errno 2 (No such file or directory) before exec()
  7/7 15:09:20 ERROR "Create_Process(/usr/local/condor/home/execute/dir_2586/condor_exec.exe,condor_exec.exe, ...) failed" at line 387 in file os_proc.C
  7/7 15:09:20 ShutdownFast all jobs.

Does this help at all?

Thanks




John Alberts
Technical Assistant for EMS
alberts@xxxxxxxxxxxxxxxxxx
219-989-2083
CLO 332
http://public.xdi.org/=john.alberts



From: condor-users-bounces@xxxxxxxxxxx on behalf of Kewley, J (John)
Sent: Fri 7/7/2006 9:46 AM
To: Condor-Users Mail List
Subject: Re: [Condor-users] How To TroubleShoot Flocking


> ...
> >According to 6.7 manual :
> >  should_transfer_files = <YES | NO | IF_NEEDED >
> >
> >Is True a valid alternative?
> > 
> >
>
> Ah, good question.  I had to dig into the source code to find
> out.  The
> answer is that should_transfer_files=True is equivalent to
> should_transfer_files=Yes.  I didn't realize my
> recommendation relied on
> an undocumented feature!

I suspected it was valid, but had never seen it before. All examples seem to
use YES or IF_NEEDED.

The documented values are described in section 2.5.4 and in the section on
condor_submit (at least in 6.7).

The choice of YES/NO over TRUE/FALSE is a good one. When people see
T/F they immediately think of a 2-valued logic, so having something other than
T/F is good when other values are allowed.

I believe these values are case insensitive (but I don't have the luxury of
the code to check!)

I have still never quite worked out why

"NOTE: The combination of:

  should_transfer_files = IF_NEEDED
  when_to_transfer_output = ON_EXIT_OR_EVICT

 would produce undefined file access semantics. Therefore, this combination is
 prohibited by condor_submit."

It is obviously something obvious, but I haven't twigged it yet.

If you are on a system where some machines are in the same FileSystemDomain and some aren't,
you may still want (possibly incomplete) files to be returned regardless of how the job
finished.

JK
