[Condor-users] problem returning output from multiple nodes of parallel universe, mpich2 on Windows 7

Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

...

000 (006.000.000) 04/28 14:24:01 Job submitted from host: <10.171.2.232:49161>

...

014 (006.000.000) 04/28 14:24:25 Node 0 executing on host: <10.171.2.232:49162>

...

014 (006.000.001) 04/28 14:24:26 Node 1 executing on host: <10.171.2.232:49162>

...

001 (006.000.000) 04/28 14:24:26 Job executing on host: MPI_job

...

015 (006.000.000) 04/28 14:24:28 Node 0 terminated.

(1) Normal termination (return value 0)

Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage

Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage

Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage

Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage

593 - Run Bytes Sent By Node

86409 - Run Bytes Received By Node

593 - Total Bytes Sent By Node

86409 - Total Bytes Received By Node

...

005 (006.000.000) 04/28 14:24:28 Job terminated.

(1) Normal termination (return value 0)

Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage

Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage

Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage

Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage

593 - Run Bytes Sent By Job

172818 - Run Bytes Received By Job

593 - Total Bytes Sent By Job

172818 - Total Bytes Received By Job

...

04/28/11 14:24:21 (pid:2060) Negotiating for owner: DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxx

04/28/11 14:24:21 (pid:2060) Finished negotiating for DedicatedScheduler in local pool: 1 matched, 1 rejected

04/28/11 14:24:21 (pid:2060) Activity on stashed negotiator socket: <10.171.2.232:49325>

04/28/11 14:24:21 (pid:2060) Negotiating for owner: DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxx

04/28/11 14:24:21 (pid:2060) Completed REQUEST_CLAIM to startd slot1@xxxxxxxxxxxxxxxxxxxxxxx <10.171.2.232:49162> for DedicatedScheduler

04/28/11 14:24:21 (pid:2060) Inserting new attribute Scheduler into non-active cluster cid=6 acid=-1

04/28/11 14:24:21 (pid:2060) Finished negotiating for DedicatedScheduler in local pool: 1 matched, 0 rejected

04/28/11 14:24:21 (pid:2060) Completed REQUEST_CLAIM to startd slot2@xxxxxxxxxxxxxxxxxxxxxxx <10.171.2.232:49162> for DedicatedScheduler

04/28/11 14:24:21 (pid:2060) Inserting new attribute Scheduler into non-active cluster cid=6 acid=-1

04/28/11 14:24:21 (pid:2060) Starting add_shadow_birthdate(6.0)

04/28/11 14:24:21 (pid:2060) Started shadow for job 6.0 on slot2@xxxxxxxxxxxxxxxxxxxxxxx <10.171.2.232:49162> for DedicatedScheduler, (shadow pid = 4960)

04/28/11 14:24:25 (pid:2060) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s

04/28/11 14:24:25 (pid:2060) Sent ad to central manager for mgrauer@*.kitwarein.com

04/28/11 14:24:25 (pid:2060) Sent ad to 1 collectors for mgrauer@*.kitwarein.com

04/28/11 14:24:28 (pid:2060) In DedicatedScheduler::reaper pid 4960 has status 100

04/28/11 14:24:28 (pid:2060) Shadow pid 4960 exited with status 100

04/28/11 14:24:28 (pid:2060) DedicatedScheduler::deallocMatchRec

04/28/11 14:24:26 Job 6.0 set to execute immediately

04/28/11 14:24:26 Starting a PARALLEL universe job with ID: 6.0

04/28/11 14:24:26 Tracking process family by login "condor-reuse-slot1"

04/28/11 14:24:26 IWD: C:\condor\execute\dir_5072

04/28/11 14:24:26 Input file: C:\condor\execute\dir_5072\cpiinput.txt

04/28/11 14:24:26 Output file: C:\condor\execute\dir_5072\out.1.log

04/28/11 14:24:26 Error file: C:\condor\execute\dir_5072\error.1.log

04/28/11 14:24:26 Renice expr "10" evaluated to 10

04/28/11 14:24:26 About to exec C:\condor\execute\dir_5072\condor_exec.bat cpi.exe

04/28/11 14:24:26 Executable is a batch file, running: "C:\Windows\system32\cmd.exe" /Q /C "C:\condor\execute\dir_5072\condor_exec.bat" cpi.exe

04/28/11 14:24:27 Create_Process succeeded, pid=4544

04/28/11 14:24:28 Got SIGQUIT. Performing fast shutdown.

04/28/11 14:24:28 ShutdownFast all jobs.

04/28/11 14:24:28 Got SIGTERM. Performing graceful shutdown.

04/28/11 14:24:28 ShutdownGraceful all jobs.

04/28/11 14:24:28 Process exited, pid=4544, status=-1073741510

04/28/11 14:24:29 condor_write(): Socket closed when trying to write 297 bytes to <10.171.2.232:49339>, fd is 540, errno=10054

04/28/11 14:24:29 Buf::write(): condor_write() failed

04/28/11 14:24:29 condor_write(): Socket closed when trying to write 93 bytes to <10.171.2.232:49339>, fd is 540, errno=10054

04/28/11 14:24:29 Buf::write(): condor_write() failed

04/28/11 14:24:29 Failed to send job exit status to shadow

04/28/11 14:24:29 JobExit() failed, waiting for job lease to expire or for a reconnect attempt

04/28/11 14:24:29 Returning from CStarter::JobReaper()

04/28/11 14:24:24 Initialized IO Proxy.

04/28/11 14:24:24 File transfer completed successfully.

04/28/11 14:24:25 Job 6.0 set to execute immediately

04/28/11 14:24:25 Starting a PARALLEL universe job with ID: 6.0

04/28/11 14:24:25 Tracking process family by login "condor-reuse-slot2"

04/28/11 14:24:25 IWD: C:\condor\execute\dir_4984

04/28/11 14:24:25 Input file: C:\condor\execute\dir_4984\cpiinput.txt

04/28/11 14:24:25 Output file: C:\condor\execute\dir_4984\out.0.log

04/28/11 14:24:25 Error file: C:\condor\execute\dir_4984\error.0.log

04/28/11 14:24:25 Renice expr "10" evaluated to 10

04/28/11 14:24:25 About to exec C:\condor\execute\dir_4984\condor_exec.bat cpi.exe

04/28/11 14:24:25 Executable is a batch file, running: "C:\Windows\system32\cmd.exe" /Q /C "C:\condor\execute\dir_4984\condor_exec.bat" cpi.exe

04/28/11 14:24:27 Create_Process succeeded, pid=4716

04/28/11 14:24:27 Process exited, pid=4716, status=0

04/28/11 14:24:28 Got SIGQUIT. Performing fast shutdown.

04/28/11 14:24:28 ShutdownFast all jobs.

04/28/11 14:24:28 condor_read() failed: recv() returned -1, errno = 10054 , reading 5 bytes from <10.171.2.232:49336>.

04/28/11 14:24:28 IO: Failed to read packet header

04/28/11 14:24:28 **** condor_starter (condor_STARTER) pid 4984 EXITING WITH STATUS 0

04/28/11 14:24:23 slot2: Remote job ID is 6.0

04/28/11 14:24:24 slot2: Got universe "PARALLEL" (11) from request classad

04/28/11 14:24:24 slot2: State change: claim-activation protocol successful

04/28/11 14:24:24 slot2: Changing activity: Idle -> Busy

04/28/11 14:24:24 slot1: Got activate_claim request from shadow (<10.171.2.232:49339>)

04/28/11 14:24:24 slot1: Remote job ID is 6.0

04/28/11 14:24:24 slot1: Got universe "PARALLEL" (11) from request classad

04/28/11 14:24:24 slot1: State change: claim-activation protocol successful

04/28/11 14:24:24 slot1: Changing activity: Idle -> Busy

04/28/11 14:24:28 slot1: Called deactivate_claim_forcibly()

04/28/11 14:24:28 slot2: Called deactivate_claim_forcibly()

04/28/11 14:24:28 slot2: Called deactivate_claim()

04/28/11 14:24:28 slot1: Called deactivate_claim()

04/28/11 14:24:28 Starter pid 4984 exited with status 0

04/28/11 14:24:29 slot2: State change: starter exited

04/28/11 14:24:29 slot2: Changing activity: Busy -> Idle

04/28/11 14:24:58 slot1: starter (pid 5072) is not responding to the request to hardkill its job. The startd will now directly hard kill the starter and all its decendents.

04/28/11 14:25:00 Starter pid 5072 exited with status 0

04/28/11 14:25:00 slot1: State change: starter exited

04/28/11 14:25:00 slot1: Changing activity: Busy -> Idle

04/28/11 14:24:22 Initializing a PARALLEL shadow for job 6.0

04/28/11 14:24:23 (6.0) (4960): Request to run on slot2@xxxxxxxxxxxxxxxxxxxxxxx <10.171.2.232:49162> was ACCEPTED

04/28/11 14:24:24 (6.0) (4960): Request to run on <10.171.2.232:49162> <10.171.2.232:49162> was ACCEPTED

04/28/11 14:24:28 (6.0) (4960): Job 6.0 terminated: exited with status 0

04/28/11 14:24:28 (6.0) (4960): Reporting job exit reason 100 and attempting to fetch new job.

04/28/11 14:24:28 (6.0) (4960): **** condor_shadow (condor_SHADOW) pid 4960 EXITING WITH STATUS 100

Mailing List Archives

Public Access

[Condor-users] problem returning output from multiple nodes of parallel universe, mpich2 on Windows 7