[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Condor-users] problem returning output from multiple nodes of parallel universe, mpich2 on Windows 7



I'm trying to setup a Condor 7.6 pool to be used with MPICH2 1.3.2.p1 (using the parallel universe) on Windows 7.

I followed the setup listed here:

https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigureMpiOnWindows


As a test, I'd like to get an mpi example working on multiple cpus on the same machine (this machine is my central collector, dedicated scheduler, submit and execute node).  I can get a single cpu version (machine_count = 1) of this working, but I have problems when attempting 2 cpus (machine_count = 2).

Node 0 appears correct, and has its output/error files copied back, but Node 1 doesn't have its output/error files copied back. 

From the "work.#pArAlLeLnOdE#" log file, it seems like Node 1 starts executing, but there is no trace of the actual execution of Node 1.  I'm also including relevant output from other logs in case it is helpful.


Any guidance or help is appreciated.




work.#pArAlLeLnOdE#.log:

:

...
000 (006.000.000) 04/28 14:24:01 Job submitted from host: <10.171.2.232:49161>
...
014 (006.000.000) 04/28 14:24:25 Node 0 executing on host: <10.171.2.232:49162>
...
014 (006.000.001) 04/28 14:24:26 Node 1 executing on host: <10.171.2.232:49162>
...
001 (006.000.000) 04/28 14:24:26 Job executing on host: MPI_job
...
015 (006.000.000) 04/28 14:24:28 Node 0 terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
593  -  Run Bytes Sent By Node
86409  -  Run Bytes Received By Node
593  -  Total Bytes Sent By Node
86409  -  Total Bytes Received By Node
...
005 (006.000.000) 04/28 14:24:28 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
593  -  Run Bytes Sent By Job
172818  -  Run Bytes Received By Job
593  -  Total Bytes Sent By Job
172818  -  Total Bytes Received By Job
...


SchedLog:

04/28/11 14:24:21 (pid:2060) Negotiating for owner: DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxx
04/28/11 14:24:21 (pid:2060) Finished negotiating for DedicatedScheduler in local pool: 1 matched, 1 rejected
04/28/11 14:24:21 (pid:2060) Activity on stashed negotiator socket: <10.171.2.232:49325>
04/28/11 14:24:21 (pid:2060) Negotiating for owner: DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxx
04/28/11 14:24:21 (pid:2060) Completed REQUEST_CLAIM to startd slot1@xxxxxxxxxxxxxxxxxxxxxxx <10.171.2.232:49162> for DedicatedScheduler
04/28/11 14:24:21 (pid:2060) Inserting new attribute Scheduler into non-active cluster cid=6 acid=-1
04/28/11 14:24:21 (pid:2060) Finished negotiating for DedicatedScheduler in local pool: 1 matched, 0 rejected
04/28/11 14:24:21 (pid:2060) Completed REQUEST_CLAIM to startd slot2@xxxxxxxxxxxxxxxxxxxxxxx <10.171.2.232:49162> for DedicatedScheduler
04/28/11 14:24:21 (pid:2060) Inserting new attribute Scheduler into non-active cluster cid=6 acid=-1
04/28/11 14:24:21 (pid:2060) Starting add_shadow_birthdate(6.0)
04/28/11 14:24:21 (pid:2060) Started shadow for job 6.0 on slot2@xxxxxxxxxxxxxxxxxxxxxxx <10.171.2.232:49162> for DedicatedScheduler, (shadow pid = 4960)
04/28/11 14:24:25 (pid:2060) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
04/28/11 14:24:25 (pid:2060) Sent ad to central manager for mgrauer@*.kitwarein.com
04/28/11 14:24:25 (pid:2060) Sent ad to 1 collectors for mgrauer@*.kitwarein.com
04/28/11 14:24:28 (pid:2060) In DedicatedScheduler::reaper pid 4960 has status 100
04/28/11 14:24:28 (pid:2060) Shadow pid 4960 exited with status 100
04/28/11 14:24:28 (pid:2060) DedicatedScheduler::deallocMatchRec
04/28/11 14:24:28 (pid:2060) DedicatedScheduler::deallocMatchRec
04/28/11 14:24:28 (pid:2060) DedicatedScheduler::deallocMatchRec
04/28/11 14:24:28 (pid:2060) DedicatedScheduler::deallocMatchRec
04/28/11 14:24:28 (pid:2060) DedicatedScheduler::deallocMatchRec

StarterLog.slot1:

04/28/11 14:24:26 Job 6.0 set to execute immediately
04/28/11 14:24:26 Starting a PARALLEL universe job with ID: 6.0
04/28/11 14:24:26 Tracking process family by login "condor-reuse-slot1"
04/28/11 14:24:26 IWD: C:\condor\execute\dir_5072
04/28/11 14:24:26 Input file: C:\condor\execute\dir_5072\cpiinput.txt
04/28/11 14:24:26 Output file: C:\condor\execute\dir_5072\out.1.log
04/28/11 14:24:26 Error file: C:\condor\execute\dir_5072\error.1.log
04/28/11 14:24:26 Renice expr "10" evaluated to 10
04/28/11 14:24:26 About to exec C:\condor\execute\dir_5072\condor_exec.bat cpi.exe
04/28/11 14:24:26 Executable is a batch file, running: "C:\Windows\system32\cmd.exe" /Q /C "C:\condor\execute\dir_5072\condor_exec.bat" cpi.exe
04/28/11 14:24:27 Create_Process succeeded, pid=4544
04/28/11 14:24:28 Got SIGQUIT.  Performing fast shutdown.
04/28/11 14:24:28 ShutdownFast all jobs.
04/28/11 14:24:28 Got SIGTERM. Performing graceful shutdown.
04/28/11 14:24:28 ShutdownGraceful all jobs.
04/28/11 14:24:28 Process exited, pid=4544, status=-1073741510
04/28/11 14:24:29 condor_write(): Socket closed when trying to write 297 bytes to <10.171.2.232:49339>, fd is 540, errno=10054 
04/28/11 14:24:29 Buf::write(): condor_write() failed
04/28/11 14:24:29 condor_write(): Socket closed when trying to write 93 bytes to <10.171.2.232:49339>, fd is 540, errno=10054 
04/28/11 14:24:29 Buf::write(): condor_write() failed
04/28/11 14:24:29 Failed to send job exit status to shadow
04/28/11 14:24:29 JobExit() failed, waiting for job lease to expire or for a reconnect attempt
04/28/11 14:24:29 Returning from CStarter::JobReaper()


StarterLog.slot2:

04/28/11 14:24:24 Initialized IO Proxy.
04/28/11 14:24:24 File transfer completed successfully.
04/28/11 14:24:25 Job 6.0 set to execute immediately
04/28/11 14:24:25 Starting a PARALLEL universe job with ID: 6.0
04/28/11 14:24:25 Tracking process family by login "condor-reuse-slot2"
04/28/11 14:24:25 IWD: C:\condor\execute\dir_4984
04/28/11 14:24:25 Input file: C:\condor\execute\dir_4984\cpiinput.txt
04/28/11 14:24:25 Output file: C:\condor\execute\dir_4984\out.0.log
04/28/11 14:24:25 Error file: C:\condor\execute\dir_4984\error.0.log
04/28/11 14:24:25 Renice expr "10" evaluated to 10
04/28/11 14:24:25 About to exec C:\condor\execute\dir_4984\condor_exec.bat cpi.exe
04/28/11 14:24:25 Executable is a batch file, running: "C:\Windows\system32\cmd.exe" /Q /C "C:\condor\execute\dir_4984\condor_exec.bat" cpi.exe
04/28/11 14:24:27 Create_Process succeeded, pid=4716
04/28/11 14:24:27 Process exited, pid=4716, status=0
04/28/11 14:24:28 Got SIGQUIT.  Performing fast shutdown.
04/28/11 14:24:28 ShutdownFast all jobs.
04/28/11 14:24:28 condor_read() failed: recv() returned -1, errno = 10054 , reading 5 bytes from <10.171.2.232:49336>.
04/28/11 14:24:28 IO: Failed to read packet header
04/28/11 14:24:28 **** condor_starter (condor_STARTER) pid 4984 EXITING WITH STATUS 0


StartLog:

04/28/11 14:24:23 slot2: Remote job ID is 6.0
04/28/11 14:24:24 slot2: Got universe "PARALLEL" (11) from request classad
04/28/11 14:24:24 slot2: State change: claim-activation protocol successful
04/28/11 14:24:24 slot2: Changing activity: Idle -> Busy
04/28/11 14:24:24 slot1: Got activate_claim request from shadow (<10.171.2.232:49339>)
04/28/11 14:24:24 slot1: Remote job ID is 6.0
04/28/11 14:24:24 slot1: Got universe "PARALLEL" (11) from request classad
04/28/11 14:24:24 slot1: State change: claim-activation protocol successful
04/28/11 14:24:24 slot1: Changing activity: Idle -> Busy
04/28/11 14:24:28 slot1: Called deactivate_claim_forcibly()
04/28/11 14:24:28 slot2: Called deactivate_claim_forcibly()
04/28/11 14:24:28 slot2: Called deactivate_claim()
04/28/11 14:24:28 slot1: Called deactivate_claim()
04/28/11 14:24:28 Starter pid 4984 exited with status 0
04/28/11 14:24:29 slot2: State change: starter exited
04/28/11 14:24:29 slot2: Changing activity: Busy -> Idle
04/28/11 14:24:58 slot1: starter (pid 5072) is not responding to the request to hardkill its job.  The startd will now directly hard kill the starter and all its decendents.
04/28/11 14:25:00 Starter pid 5072 exited with status 0
04/28/11 14:25:00 slot1: State change: starter exited
04/28/11 14:25:00 slot1: Changing activity: Busy -> Idle

ShadowLog:

04/28/11 14:24:22 Initializing a PARALLEL shadow for job 6.0
04/28/11 14:24:23 (6.0) (4960): Request to run on slot2@xxxxxxxxxxxxxxxxxxxxxxx <10.171.2.232:49162> was ACCEPTED
04/28/11 14:24:24 (6.0) (4960): Request to run on <10.171.2.232:49162> <10.171.2.232:49162> was ACCEPTED
04/28/11 14:24:28 (6.0) (4960): Job 6.0 terminated: exited with status 0
04/28/11 14:24:28 (6.0) (4960): Reporting job exit reason 100 and attempting to fetch new job.
04/28/11 14:24:28 (6.0) (4960): **** condor_shadow (condor_SHADOW) pid 4960 EXITING WITH STATUS 100




--
Thanks,
Michael Grauer