
[Condor-users] Condor_exec fails, shadow fails



Hi,
 
I am trying to start a parallel job. The log files indicate that condor_exec is failing.
 
Execute machine's Starter log contains
--------------------------------------
3/22 16:49:41 Starting a PARALLEL universe job with ID: 19.0
3/22 16:49:41 IWD: D:\condor-6.8.4/execute\dir_784
3/22 16:49:41 Output file: D:\condor-6.8.4/execute\dir_784\foo.out.0
3/22 16:49:41 Error file: D:\condor-6.8.4/execute\dir_784\foo.err.0
3/22 16:49:41 Renice expr "10" evaluated to 10
3/22 16:49:41 About to exec D:\condor-6.8.4\execute\dir_784\condor_exec.exe \\indplly1\userdirs\JeffreySJ\Condor_Jobs\cpilog_minimal.exe
3/22 16:49:41 ERROR: D:\condor-6.8.4\execute\dir_784\condor_exec.exe is not a valid Windows executable
3/22 16:49:41 ERROR "Create_Process(D:\condor-6.8.4\execute\dir_784\condor_exec.exe,\\indplly1\userdirs\JeffreySJ\Condor_Jobs\cpilog_minimal.exe, ...) failed" at line 393 in file ..\src\condor_starter.V6.1\os_proc.C
3/22 16:49:41 ShutdownFast all jobs
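
The starter log above rejects condor_exec.exe as "not a valid Windows executable". Native Windows executables begin with the two bytes "MZ", so one quick check is whether the file named as Executable in the submit file carries that header. The following is a minimal Python sketch under that assumption; the function name has_mz_header is hypothetical, and the log does not show what check Condor itself performs.

```python
import sys


def has_mz_header(path):
    """Return True if the file starts with the 'MZ' magic bytes.

    This only inspects the first two bytes (the DOS header signature that
    native Windows executables start with). It is an illustrative check,
    not the test that Condor itself applies.
    """
    with open(path, "rb") as f:
        return f.read(2) == b"MZ"


if __name__ == "__main__":
    # Paths are supplied on the command line, e.g. the submit file's Executable.
    for candidate in sys.argv[1:]:
        verdict = "has MZ header" if has_mz_header(candidate) else "no MZ header"
        print(candidate, "->", verdict)
```

Run against the file given as Executable, this would show whether that file is a native binary or something else, such as a script.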
 

The shadow process also appears to be dying (it exits with status 4).
 
Central manager's sched log contains:
-------------------------------------
3/22 16:49:32 (pid:2556) Activity on stashed negotiator socket
3/22 16:49:32 (pid:2556) Negotiating for owner: DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx
3/22 16:49:32 (pid:2556) Out of requests - 1 reqs matched, 0 reqs idle
3/22 16:49:33 (pid:2556) Activity on stashed negotiator socket
3/22 16:49:33 (pid:2556) Negotiating for owner: DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx
3/22 16:49:33 (pid:2556) Out of requests - 0 reqs matched, 0 reqs idle
3/22 16:49:35 (pid:2556) Inserting new attribute Scheduler into non-active cluster cid=19 acid=-1
3/22 16:49:37 (pid:2556) Starting add_shadow_birthdate(19.0)
3/22 16:49:37 (pid:2556) Started shadow for job 19.0 on "<131.242.63.162:1349>", (shadow pid = 2900)
3/22 16:49:37 (pid:2556) Sent ad to central manager for jeffreysj@xxxxxxxxxxxxxxx
3/22 16:49:37 (pid:2556) Sent ad to 1 collectors for jeffreysj@xxxxxxxxxxxxxxx
3/22 16:49:38 (pid:2556) DaemonCore: Command received via TCP from host <131.242.63.124:2733>
3/22 16:49:38 (pid:2556) DaemonCore: received command 71003 (GIVE_MATCHES), calling handler (DedicatedScheduler::giveMatches)
3/22 16:49:40 (pid:2556) DaemonCore: Command received via UDP from host <131.242.63.124:2735>
3/22 16:49:40 (pid:2556) DaemonCore: received command 60011 (DC_NOP), calling handler (handle_nop())
3/22 16:49:40 (pid:2556) In DedicatedScheduler::reaper pid 2900 has status 4
3/22 16:49:40 (pid:2556) Shadow pid 2900 exited with status 4
3/22 16:49:40 (pid:2556) ERROR: Shadow exited with job exception code!
3/22 16:49:40 (pid:2556) DedicatedScheduler::deallocMatchRec
3/22 16:49:40 (pid:2556) DedicatedScheduler::deallocMatchRec
 
The central manager's shadow log also reports the error:
------------------------------------
3/22 16:49:37 DaemonCore: Command Socket at <131.242.63.124:2722>
3/22 16:49:37 Initializing a PARALLEL shadow for job 19.0
3/22 16:49:38 (19.0) (2900): Request to run on <131.242.63.162:1349> was ACCEPTED
3/22 16:49:40 (19.0) (2900): ERROR "Error from starter on nes15300.lands.resnet.qg: Create_Process(D:\condor-6.8.4\execute\dir_784\condor_exec.exe,\\indplly1\userdirs\JeffreySJ\Condor_Jobs\cpilog_minimal.exe, ...) failed" at line 643 in file ..\src\condor_shadow.V6.1\pseudo_ops.C
 
The execute machine's Start log also contains some TCP "connection refused" errors. Error 10061 (WSAECONNREFUSED)
means "No connection could be made because the target machine actively refused it."
I don't think this is the cause, because I have tested some simple TCP client/server code
between the central manager and the execute machine and it works fine.
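
For reference, a connectivity test of the kind described above can be sketched in a few lines of Python. This is only an illustration, not the test code that was actually run; the function name probe_tcp and its return values are hypothetical. It tells an actively refused connection (the condition that error 10061 describes) apart from other failures such as timeouts.

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Attempt one TCP connection and classify the outcome.

    Returns "open" if the connection succeeds, "refused" if the target
    actively refuses it, and "unreachable" for any other socket error
    (timeout, no route, name lookup failure, ...).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except OSError:
        return "unreachable"


if __name__ == "__main__":
    # Example only: host and port are placeholders to be replaced.
    print(probe_tcp("127.0.0.1", 9))
```

A test like this only shows whether the probed port accepts connections at the moment of the test; it says nothing about other ports or about a port whose listening process has since exited.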
 
Execute machine's Start log contains:
-------------------------------------
3/22 16:49:39 Got universe "PARALLEL" (11) from request classad
3/22 16:49:39 State change: claim-activation protocol successful
3/22 16:49:39 Changing activity: Idle -> Busy
3/22 16:49:41 DaemonCore: Command received via TCP from host <131.242.63.124:2736>
3/22 16:49:41 DaemonCore: received command 403 (DEACTIVATE_CLAIM), calling handler (command_handler)
3/22 16:49:41 Called deactivate_claim()
3/22 16:49:41 attempt to connect to <131.242.63.162:1373> failed: connect errno = 10061 connection refused.
3/22 16:49:41 ERROR: SECMAN:2003:TCP auth connection to <131.242.63.162:1373> failed
3/22 16:49:41 Send_Signal: ERROR Connect to <131.242.63.162:1373> failed.
3/22 16:49:41 Error sending signal to starter, errno = 0 (No error)
3/22 16:49:41 attempt to connect to <131.242.63.162:1373> failed: connect errno = 10061 connection refused.
3/22 16:49:41 ERROR: SECMAN:2003:TCP auth connection to <131.242.63.162:1373> failed
3/22 16:49:41 Send_Signal: ERROR Connect to <131.242.63.162:1373> failed.
3/22 16:49:41 DaemonCore: Command received via UDP from host <131.242.63.162:1383>
3/22 16:49:41 DaemonCore: received command 60011 (DC_NOP), calling handler (handle_nop())
3/22 16:49:41 Starter pid 784 exited with status 0
3/22 16:49:41 State change: starter exited
3/22 16:49:41 Changing activity: Busy -> Idle
 
 
 
My submit file (which runs mp1script) is:
-----------------------------------------
universe = parallel
Executable = H:\Condor_Jobs\mp1script
machine_count = 1
Output = foo.out.$(NODE)
log = foo.log.$(CLUSTER)
error = foo.err.$(NODE)
arguments =  H:\Condor_Jobs\cpilog_minimal.exe
should_transfer_files = YES
transfer_input_files =  H:\Condor_Jobs\cpilog_minimal.exe
WhenToTransferOutput = ON_EXIT_OR_EVICT
queue 1
 
I have tried both relative and absolute paths for the files specified in the submit script, e.g.
   mp1script
   H:\Condor_Jobs\mp1script
   \\indplly1\userdirs\JeffreySJ\Condor_Jobs\mp1script
but with no success.
 
I can manually run the job on the execute machine:
mpirun -np 1 cpilog_minimal.exe
so I don't think the problem is in the MPI application itself.
 
cheers
steve
