
Re: [Condor-users] HYPHY parallel MPI implementation on Condor




Hi Javier,


You set:
        export P4_RSHCOMMAND=$CONDOR_SSH
So when mpirun is used, condor_ssh is executed to connect to the remote processes.

Since your job runs fine when launched directly, i.e. without Condor, one possibility is that the error occurs when the condor_ssh script is executed.

It could be that the options passed to P4_RSHCOMMAND by HYPHYMPI are not supported by condor_ssh. You could back up your current condor_ssh script and modify it for debugging purposes if necessary.
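
For instance (just a sketch, untested): rather than editing condor_ssh in
place, you could point P4_RSHCOMMAND at a small wrapper that logs each call
before forwarding it to the real condor_ssh, so any unsupported option shows
up in the log. The wrapper name and log path below are arbitrary; the
condor_ssh path is the one used in your mp1script:

#!/bin/sh
# condor_ssh_debug -- hypothetical wrapper; set P4_RSHCOMMAND to this script.
# It logs the exact arguments mpirun passes via P4_RSHCOMMAND, then forwards
# the call unchanged to the real condor_ssh.
LOG=/tmp/condor_ssh_args.log
echo "`date`: condor_ssh called with: $*" >> "$LOG"
exec /usr/local/condor/libexec/condor_ssh "$@"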

I hope this helps,


Best regards,

--
Cecile GARROS

IT Service Group
Advanced Technologies Division

--------------------------------
ARGO GRAPHICS INC.
www.argo-graph.co.jp
tel: 03(5641)2025
fax: 03(5641)2011
--------------------------------



Javier Forment Millet <jforment@xxxxxxxxxxxx>
Sent by: condor-users-bounces@xxxxxxxxxxx

2008/10/07 19:39
Please reply to Condor-Users Mail List

        To:        condor-users@xxxxxxxxxxx
        cc:
        Subject:        [Condor-users] HYPHY parallel MPI implementation on Condor



Hi, all... Is anyone running HYPHYMPI (http://www.hyphy.org) in a Condor
cluster? I have a small Linux-Ubuntu cluster with MPI installed. I installed
the MPI version of HYPHY (HYPHYMPI). It runs fine when launched as a regular
non-Condor MPI process:

mpirun -v -np 10 -machinefile machines.list HYPHYMPI MPITest.bf

But, when using Condor, it gives the following error:

(...)
running /home/condor/execute/dir_25957/HYPHYMPI on 10 LINUX ch_p4 processors
Created /home/condor/execute/dir_25957/PI26038
[: 68: ==: unexpected operator
[: 101: ==: unexpected operator
Could not chdir to home directory /nonexistent: No such file or directory
sh: -l: not found
p0_26227:  p4_error: Child process exited while making connection to remote
process on xeon2: 0
p0_26227: (18.265625) net_send: could not write to fd=4, errno = 32



Condor runs MPI jobs through the mp1script wrapper (see below), and the error
arises when the last command, 'mpirun ...', is executed.

The problem seems to be specific to running HYPHYMPI in the Condor environment,
since other Condor jobs run fine, and running just the 'mpirun ...' command
with the MPITest.bf file shipped with the HYPHY distribution also works fine.
In particular, the problem could be related to the fact that Condor runs
processes in execution directories to which it transfers the needed executables
and input files, and where it writes the output files that are transferred back
when execution finishes.
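
A possible check (not tried yet) would be to add a few debug lines to mp1script
just before the 'mpirun' call, to record the directories the head-node process
actually sees (a rough sketch; the file name debug_env.txt is arbitrary):

# sketch: record the working directory, HOME and the Condor scratch directory
# seen by the head-node process
{
    echo "PWD=`pwd`"
    echo "HOME=$HOME"
    echo "SCRATCH=$_CONDOR_SCRATCH_DIR"
} > debug_env.txt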

Maybe there is a point in the HYPHYMPI code where it tries to 'chdir' away from
the execution directory, and the error is raised because it cannot do so? What
does the error message 'Could not chdir to home directory /nonexistent: No such
file or directory' mean? Is it trying to change to a directory specified by an
empty string? Does the error come from not being able to change directory, or
from a wrong definition of the directory name to change to? I ran
'grep -r -i 'chdir' ./*' from the HYPHY_Source folder and got 'Binary file
./HYPHYMPI matches', but none of the source code files of the distribution
match, so maybe the call comes from some library linked in at compile time.
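
Another thing I could try (assuming strace is available on the node) is to run
the working non-Condor command under strace and see whether HYPHYMPI or one of
its libraries actually calls chdir (a sketch; trace.out is an arbitrary name):

# sketch: follow child processes (-f) and trace only chdir calls, writing the
# trace to trace.out, which can then be inspected
strace -f -e trace=chdir -o trace.out \
    mpirun -v -np 10 -machinefile machines.list HYPHYMPI MPITest.bf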

It could also be a problem with the communication between the nodes executing
the MPI parallel processes and the node running the Condor process that
launches them on the different execution nodes. I can see some messages in the
Condor log files that could be related to this (see below).
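
A quick check I could add to mp1script after the 'machines' file is built
(a sketch; it assumes condor_ssh accepts 'host command' arguments the way rsh
does, which is how mpirun uses it here):

# sketch: try to reach every node listed in the machines file through
# condor_ssh and print the remote hostname (or the error) for each one
while read node; do
    echo "condor_ssh test for $node:"
    $CONDOR_SSH "$node" hostname
done < machines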

HYPHYMPI was compiled on Ubuntu 7.04 (server version) boxes, using libcurl3
(version 7.15.5-1ubuntu2.1), libssl0.9.8, libcrypto++5.2c2a, and libmpich1.0c2
(version 1.2.7-2).

Any idea?

Thanks in advance, and best wishes,

Javier.


##############
# testMPI.cmd Condor submit file
##############

universe = parallel
executable = mp1script
arguments = HYPHYMPI MPITest.bf
run_as_owner = true
machine_count = 10
should_transfer_files = yes
when_to_transfer_output = on_exit
transfer_input_files = /home/jforment/install/HYPHY_Source/BatchFiles/MPITest.bf
log = log.$(Process).txt
output = out.$(NODE).txt
error = out.$(NODE).txt
+WantIOProxy = true
queue



#############
# mp1script #
#############

#!/bin/sh

_CONDOR_PROCNO=$_CONDOR_PROCNO
_CONDOR_NPROCS=$_CONDOR_NPROCS

echo "nprocs: $_CONDOR_NPROCS"
echo "procno: $_CONDOR_PROCNO"

# sshd.sh sets up passwordless ssh using ssh-keygen
CONDOR_LIBEXEC="/usr/local/condor/libexec"
CONDOR_SSH=$CONDOR_LIBEXEC/condor_ssh
SSHD_SH=$CONDOR_LIBEXEC/sshd.sh

. $SSHD_SH $_CONDOR_PROCNO $_CONDOR_NPROCS
#$SSHD_SH $_CONDOR_PROCNO $_CONDOR_NPROCS

# If not the head node, just sleep forever, to let the
# sshds run
if [ $_CONDOR_PROCNO -ne 0 ]
then
              wait
              sshd_cleanup
              exit 0
fi

EXECUTABLE=$1
shift

# The binary is copied, but the executable flag is cleared,
# so the script has to take care of this.
fullexepath="/usr/local/bin/$EXECUTABLE"
cp $fullexepath $_CONDOR_SCRATCH_DIR/
chmod +x $EXECUTABLE

# Set this to the bin directory of MPICH installation
MPDIR=/usr/lib/mpich/bin
PATH=$MPDIR:.:$PATH
export PATH

export P4_RSHCOMMAND=$CONDOR_SSH

CONDOR_CONTACT_FILE=$_CONDOR_SCRATCH_DIR/contact
export CONDOR_CONTACT_FILE

# The second field in the contact file is the machine name
# that condor_ssh knows how to use
sort -n +0 < $CONDOR_CONTACT_FILE | awk '{print $2}' > machines

## run the actual mpijob
echo "nprocs: $_CONDOR_NPROCS"
mpirun -v -np $_CONDOR_NPROCS -machinefile machines $EXECUTABLE $@

sshd_cleanup

rm $EXECUTABLE

exit $?


#######################
# SchedLog extract (at the submit node, not an execution node)
#######################

(...)
10/6 13:09:16 (pid:6354) Sent ad to central manager for
jforment@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
10/6 13:09:16 (pid:6354) Sent ad to 1 collectors for
jforment@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
10/6 13:13:19 (pid:6354) DaemonCore: Command received via UDP from host
<192.168.1.21:47086>
10/6 13:13:19 (pid:6354) DaemonCore: received command 421 (RESCHEDULE), calling
handler (reschedule_negotiator)
10/6 13:13:19 (pid:6354) Sent ad to central manager for
jforment@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
10/6 13:13:19 (pid:6354) Sent ad to 1 collectors for
jforment@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
10/6 13:13:19 (pid:6354) Called reschedule_negotiator()
10/6 13:13:21 (pid:6354) Inserting new attribute Scheduler into non-active
cluster cid=2221 acid=-1
10/6 13:18:19 (pid:6354) Sent ad to central manager for
jforment@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
10/6 13:18:19 (pid:6354) Sent ad to 1 collectors for
jforment@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
10/6 13:18:20 (pid:6354) Activity on stashed negotiator socket
10/6 13:18:20 (pid:6354) Negotiating for owner:
DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
10/6 13:18:20 (pid:6354) Out of requests - 10 reqs matched, 0 reqs idle
10/6 13:18:21 (pid:6354) Inserting new attribute Scheduler into non-active
cluster cid=2221 acid=-1
10/6 13:18:23 (pid:6354) Starting add_shadow_birthdate(2221.0)
10/6 13:18:23 (pid:6354) Started shadow for job 2221.0 on
"<192.168.1.25:33961>", (shadow pid = 15015)
10/6 13:18:24 (pid:6354) Sent ad to central manager for
jforment@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
10/6 13:18:24 (pid:6354) Sent ad to 1 collectors for
jforment@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
10/6 13:18:24 (pid:6354) DaemonCore: Command received via TCP from host
<192.168.1.21:49410>
10/6 13:18:24 (pid:6354) DaemonCore: received command 71003 (GIVE_MATCHES),
calling handler (DedicatedScheduler::giveMatches)
10/6 13:18:51 (pid:6354) In DedicatedScheduler::reaper pid 15015 has status
25600
10/6 13:18:51 (pid:6354) Shadow pid 15015 exited with status 100
10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
10/6 13:23:24 (pid:6354) Sent ad to central manager for
jforment@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
10/6 13:23:24 (pid:6354) Sent ad to 1 collectors for
jforment@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
10/6 13:28:24 (pid:6354) Sent ad to central manager for
jforment@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
10/6 13:28:24 (pid:6354) Sent ad to 1 collectors for
jforment@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
10/6 13:28:51 (pid:6354) Resource vm1@xxxxxxxxxxxxxxxxxxxxxxxxxx has been unused
for 600 seconds, limit is 600, releasing
10/6 13:28:51 (pid:6354) Resource vm2@xxxxxxxxxxxxxxxxxxxxxxxxxx has been unused
for 600 seconds, limit is 600, releasing
10/6 13:28:51 (pid:6354) Resource vm1@xxxxxxxxxxxxxxxxxxxxxxxxxx has been unused
for 600 seconds, limit is 600, releasing
10/6 13:28:51 (pid:6354) Resource vm2@xxxxxxxxxxxxxxxxxxxxxxxxxx has been unused
for 600 seconds, limit is 600, releasing
10/6 13:28:51 (pid:6354) Resource vm2@xxxxxxxxxxxxxxxxxxxxxxxxxx has been unused
for 600 seconds, limit is 600, releasing
10/6 13:28:51 (pid:6354) Resource vm2@xxxxxxxxxxxxxxxxxxxxxxxxxx has been unused
for 600 seconds, limit is 600, releasing
10/6 13:28:51 (pid:6354) Resource vm2@xxxxxxxxxxxxxxxxxxxxxxxxxx has been unused
for 600 seconds, limit is 600, releasing
10/6 13:28:51 (pid:6354) Resource vm2@xxxxxxxxxxxxxxxxxxxxxxxxxx has been unused
for 600 seconds, limit is 600, releasing
10/6 13:28:51 (pid:6354) Resource vm2@xxxxxxxxxxxxxxxxxxxxxxxxxx has been unused
for 600 seconds, limit is 600, releasing
10/6 13:28:51 (pid:6354) Resource vm1@xxxxxxxxxxxxxxxxxxxxxxxxxx has been unused
for 600 seconds, limit is 600, releasing
10/6 13:28:51 (pid:6354) DaemonCore: Command received via TCP from host
<192.168.1.23:36660>
10/6 13:28:51 (pid:6354) DaemonCore: received command 443 (VACATE_SERVICE),
calling handler (vacate_service)
10/6 13:28:51 (pid:6354) Got VACATE_SERVICE from <192.168.1.23:36660>
10/6 13:28:51 (pid:6354) DaemonCore: Command received via TCP from host
<192.168.1.22:48284>
10/6 13:28:51 (pid:6354) DaemonCore: received command 443 (VACATE_SERVICE),
calling handler (vacate_service)
10/6 13:28:51 (pid:6354) Got VACATE_SERVICE from <192.168.1.22:48284>
10/6 13:28:51 (pid:6354) DaemonCore: Command received via TCP from host
<192.168.1.24:37331>
10/6 13:28:51 (pid:6354) DaemonCore: received command 443 (VACATE_SERVICE),
calling handler (vacate_service)
10/6 13:28:51 (pid:6354) Got VACATE_SERVICE from <192.168.1.24:37331>
10/6 13:28:51 (pid:6354) DaemonCore: Command received via TCP from host
<192.168.1.23:56530>
10/6 13:28:51 (pid:6354) DaemonCore: received command 443 (VACATE_SERVICE),
calling handler (vacate_service)
10/6 13:28:51 (pid:6354) Got VACATE_SERVICE from <192.168.1.23:56530>
10/6 13:28:51 (pid:6354) DaemonCore: Command received via TCP from host
<192.168.1.26:36217>
10/6 13:28:51 (pid:6354) DaemonCore: received command 443 (VACATE_SERVICE),
calling handler (vacate_service)
10/6 13:28:51 (pid:6354) Got VACATE_SERVICE from <192.168.1.26:36217>
10/6 13:28:51 (pid:6354) DaemonCore: Command received via TCP from host
<192.168.1.27:39719>
10/6 13:28:51 (pid:6354) DaemonCore: received command 443 (VACATE_SERVICE),
calling handler (vacate_service)
10/6 13:28:51 (pid:6354) Got VACATE_SERVICE from <192.168.1.27:39719>
10/6 13:28:51 (pid:6354) DaemonCore: Command received via TCP from host
<192.168.1.28:58522>
10/6 13:28:51 (pid:6354) DaemonCore: received command 443 (VACATE_SERVICE),
calling handler (vacate_service)
10/6 13:28:51 (pid:6354) Got VACATE_SERVICE from <192.168.1.28:58522>
10/6 13:28:51 (pid:6354) DaemonCore: Command received via TCP from host
<192.168.1.24:59889>
10/6 13:28:51 (pid:6354) DaemonCore: received command 443 (VACATE_SERVICE),
calling handler (vacate_service)
10/6 13:28:51 (pid:6354) Got VACATE_SERVICE from <192.168.1.24:59889>
10/6 13:28:51 (pid:6354) DaemonCore: Command received via TCP from host
<192.168.1.22:34057>
10/6 13:28:51 (pid:6354) DaemonCore: received command 443 (VACATE_SERVICE),
calling handler (vacate_service)
10/6 13:28:51 (pid:6354) Got VACATE_SERVICE from <192.168.1.22:34057>
10/6 13:28:51 (pid:6354) DaemonCore: Command received via TCP from host
<192.168.1.25:43167>
10/6 13:28:51 (pid:6354) DaemonCore: received command 443 (VACATE_SERVICE),
calling handler (vacate_service)
10/6 13:28:51 (pid:6354) Got VACATE_SERVICE from <192.168.1.25:43167>
10/6 13:33:24 (pid:6354) Sent ad to central manager for
jforment@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
10/6 13:33:24 (pid:6354) Sent ad to 1 collectors for
jforment@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
10/6 13:38:24 (pid:6354) Sent ad to central manager for
jforment@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
10/6 13:38:24 (pid:6354) Sent ad to 1 collectors for
jforment@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
(...)


####################
# ShadowLog extract (at the submit node, not an execution node)
####################

(...)
10/6 13:18:23 ******************************************************
10/6 13:18:23 ** condor_shadow (CONDOR_SHADOW) STARTING UP
10/6 13:18:23 ** /usr/local/condor-6.8.6/sbin/condor_shadow
10/6 13:18:23 ** $CondorVersion: 6.8.6 Sep 13 2007 $
10/6 13:18:23 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
10/6 13:18:23 ** PID = 15015
10/6 13:18:23 ** Log last touched 10/4 18:59:18
10/6 13:18:23 ******************************************************
10/6 13:18:23 Using config source: /home/condor/condor_config
10/6 13:18:23 Using local config sources:
10/6 13:18:23    /home/condor/condor_config.local
10/6 13:18:23 DaemonCore: Command Socket at <192.168.1.21:45720>
10/6 13:18:23 Initializing a PARALLEL shadow for job 2221.0
10/6 13:18:24 (2221.0) (15015): Request to run on <192.168.1.25:33961> was
ACCEPTED
10/6 13:18:24 (2221.0) (15015): Request to run on <192.168.1.22:57556> was
ACCEPTED
10/6 13:18:24 (2221.0) (15015): Request to run on <192.168.1.22:57556> was
ACCEPTED
10/6 13:18:25 (2221.0) (15015): Request to run on <192.168.1.23:51800> was
ACCEPTED
10/6 13:18:25 (2221.0) (15015): Request to run on <192.168.1.23:51800> was
ACCEPTED
10/6 13:18:25 (2221.0) (15015): Request to run on <192.168.1.24:35810> was
ACCEPTED
10/6 13:18:25 (2221.0) (15015): Request to run on <192.168.1.24:35810> was
ACCEPTED
10/6 13:18:25 (2221.0) (15015): Request to run on <192.168.1.28:50572> was
ACCEPTED
10/6 13:18:25 (2221.0) (15015): Request to run on <192.168.1.26:40791> was
ACCEPTED
10/6 13:18:25 (2221.0) (15015): Request to run on <192.168.1.27:48119> was
ACCEPTED
10/6 13:18:51 (2221.0) (15015): Job 2221.0 terminated: exited with status 0
10/6 13:18:51 (2221.0) (15015): **** condor_shadow (condor_SHADOW) EXITING WITH
STATUS 100


###################
# StartLog extract (at the node running the main Condor process, that is
# 'mp1script', and launching the MPI processes to be run on the other nodes)
###################

(...)
10/6 10:56:45 vm1: Changing activity: Benchmarking -> Idle
10/6 13:24:24 DaemonCore: Command received via UDP from host
<192.168.1.22:37956>
10/6 13:24:24 DaemonCore: received command 440 (MATCH_INFO), calling handler
(command_match_info)
10/6 13:24:24 vm2: match_info called
10/6 13:24:24 vm2: Received match <192.168.1.25:33961>#1222442286#232#...
10/6 13:24:24 vm2: State change: match notification protocol successful
10/6 13:24:24 vm2: Changing state: Unclaimed -> Matched
10/6 13:24:24 DaemonCore: Command received via TCP from host
<192.168.1.21:51134>
10/6 13:24:24 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler
(command_request_claim)
10/6 13:24:24 vm2: Request accepted.
10/6 13:24:24 vm2: Remote owner is
DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
10/6 13:24:24 vm2: State change: claiming protocol successful
10/6 13:24:24 vm2: Changing state: Matched -> Claimed
10/6 13:24:29 DaemonCore: Command received via TCP from host
<192.168.1.21:44154>
10/6 13:24:29 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler
(command_activate_claim)
10/6 13:24:29 vm2: Got activate_claim request from shadow (<192.168.1.21:44154>)
10/6 13:24:29 vm2: Remote job ID is 2221.0
10/6 13:24:29 vm2: Got universe "PARALLEL" (11) from request classad
10/6 13:24:29 vm2: State change: claim-activation protocol successful
10/6 13:24:29 vm2: Changing activity: Idle -> Busy
10/6 13:24:55 DaemonCore: Command received via TCP from host
<192.168.1.21:43016>
10/6 13:24:55 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY),
calling handler (command_handler)
10/6 13:24:55 vm2: Called deactivate_claim_forcibly()
10/6 13:24:55 Starter pid 25957 exited with status 0
10/6 13:24:55 vm2: State change: starter exited
10/6 13:24:55 vm2: Changing activity: Busy -> Idle
10/6 13:24:55 DaemonCore: Command received via TCP from host
<192.168.1.21:59999>
10/6 13:24:55 DaemonCore: received command 403 (DEACTIVATE_CLAIM), calling
handler (command_handler)
10/6 13:24:55 vm2: Called deactivate_claim()
10/6 13:34:55 DaemonCore: Command received via TCP from host
<192.168.1.21:40275>
10/6 13:34:55 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler
(command_release_claim)
10/6 13:34:55 vm2: State change: received RELEASE_CLAIM command
10/6 13:34:55 vm2: Changing state and activity: Claimed/Idle ->
Preempting/Vacating
10/6 13:34:55 vm2: State change: No preempting claim, returning to owner
10/6 13:34:55 vm2: Changing state and activity: Preempting/Vacating ->
Owner/Idle
10/6 13:34:55 vm2: State change: IS_OWNER is false
10/6 13:34:55 vm2: Changing state: Owner -> Unclaimed
10/6 14:56:45 State change: RunBenchmarks is TRUE
10/6 14:56:45 vm1: Changing activity: Idle -> Benchmarking
10/6 14:56:49 State change: benchmarks completed
10/6 14:56:49 vm1: Changing activity: Benchmarking -> Idle
(...)

##########################
# StarterLog.vm2 extract (at virtual machine 2 in the node running the main
# Condor process, that is 'mp1script', and launching the MPI processes to be
# run on the other nodes)
##########################

(...)
10/6 13:24:29 ******************************************************
10/6 13:24:29 ** condor_starter (CONDOR_STARTER) STARTING UP
10/6 13:24:29 ** /usr/local/condor-6.8.6/sbin/condor_starter
10/6 13:24:29 ** $CondorVersion: 6.8.6 Sep 13 2007 $
10/6 13:24:29 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
10/6 13:24:29 ** PID = 25957
10/6 13:24:29 ** Log last touched 10/4 18:08:06
10/6 13:24:29 ******************************************************
10/6 13:24:29 Using config source: /home/condor/condor_config
10/6 13:24:29 Using local config sources:
10/6 13:24:29    /home/condor/condor_config.local
10/6 13:24:29 DaemonCore: Command Socket at <192.168.1.25:54234>
10/6 13:24:29 Done setting resource limits
10/6 13:24:29 Communicating with shadow <192.168.1.21:45720>
10/6 13:24:29 Submitting machine is "bioxeon.ibmcp-cluster.upv.es"
10/6 13:24:29 Job has WantIOProxy=true
10/6 13:24:29 Initialized IO Proxy.
10/6 13:24:29 File transfer completed successfully.
10/6 13:24:30 Starting a PARALLEL universe job with ID: 2221.0
10/6 13:24:30 IWD: /home/condor/execute/dir_25957
10/6 13:24:30 Output file: /home/condor/execute/dir_25957/out.0.txt
10/6 13:24:30 Error file: /home/condor/execute/dir_25957/out.0.txt
10/6 13:24:30 About to exec /home/condor/execute/dir_25957/condor_exec.exe
HYPHYMPI MPITest.bf
10/6 13:24:30 Create_Process succeeded, pid=25959
10/6 13:24:31 IOProxy: accepting connection from 192.168.1.25
10/6 13:24:31 IOProxyHandler: closing connection to 192.168.1.25
10/6 13:24:34 IOProxy: accepting connection from 192.168.1.25
10/6 13:24:34 IOProxyHandler: closing connection to 192.168.1.25
10/6 13:24:34 IOProxy: accepting connection from 192.168.1.25
10/6 13:24:34 IOProxyHandler: closing connection to 192.168.1.25
10/6 13:24:35 IOProxy: accepting connection from 192.168.1.25
10/6 13:24:35 IOProxyHandler: closing connection to 192.168.1.25
10/6 13:24:36 IOProxy: accepting connection from 192.168.1.25
10/6 13:24:36 IOProxyHandler: closing connection to 192.168.1.25
10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
10/6 13:24:55 Process exited, pid=25959, status=0
10/6 13:24:55 Got SIGQUIT.  Performing fast shutdown.
10/6 13:24:55 ShutdownFast all jobs.
10/6 13:24:55 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0


###############
# StartLog (at a node executing one of the MPI parallel processes)
###############

10/6 13:18:01 DaemonCore: Command received via UDP from host
<192.168.1.22:37956>
10/6 13:18:01 DaemonCore: received command 440 (MATCH_INFO), calling handler
(command_match_info)
10/6 13:18:01 vm2: match_info called
10/6 13:18:01 vm2: Received match <192.168.1.22:57556>#1222440814#310#...
10/6 13:18:01 vm2: State change: match notification protocol successful
10/6 13:18:01 vm2: Changing state: Unclaimed -> Matched
10/6 13:18:01 DaemonCore: Command received via UDP from host
<192.168.1.22:37956>
10/6 13:18:01 DaemonCore: received command 440 (MATCH_INFO), calling handler
(command_match_info)
10/6 13:18:01 vm1: match_info called
10/6 13:18:01 vm1: Received match <192.168.1.22:57556>#1222440814#308#...
10/6 13:18:01 vm1: State change: match notification protocol successful
10/6 13:18:01 vm1: Changing state: Unclaimed -> Matched
10/6 13:18:01 DaemonCore: Command received via TCP from host
<192.168.1.21:46001>
10/6 13:18:01 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler
(command_request_claim)
10/6 13:18:01 vm2: Request accepted.
10/6 13:18:01 vm2: Remote owner is
DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
10/6 13:18:01 vm2: State change: claiming protocol successful
10/6 13:18:01 vm2: Changing state: Matched -> Claimed
10/6 13:18:01 DaemonCore: Command received via TCP from host
<192.168.1.21:60616>
10/6 13:18:01 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler
(command_request_claim)
10/6 13:18:01 vm1: Request accepted.
10/6 13:18:01 vm1: Remote owner is
DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
10/6 13:18:01 vm1: State change: claiming protocol successful
10/6 13:18:01 vm1: Changing state: Matched -> Claimed
10/6 13:18:05 DaemonCore: Command received via TCP from host
<192.168.1.21:42541>
10/6 13:18:05 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler
(command_activate_claim)
10/6 13:18:05 vm1: Got activate_claim request from shadow (<192.168.1.21:42541>)
10/6 13:18:05 vm1: Remote job ID is 2221.0
10/6 13:18:05 vm1: Got universe "PARALLEL" (11) from request classad
10/6 13:18:05 vm1: State change: claim-activation protocol successful
10/6 13:18:05 vm1: Changing activity: Idle -> Busy
10/6 13:18:05 DaemonCore: Command received via TCP from host
<192.168.1.21:52511>
10/6 13:18:05 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler
(command_activate_claim)
10/6 13:18:05 vm2: Got activate_claim request from shadow (<192.168.1.21:52511>)
10/6 13:18:05 vm2: Remote job ID is 2221.0
10/6 13:18:05 vm2: Got universe "PARALLEL" (11) from request classad
10/6 13:18:05 vm2: State change: claim-activation protocol successful
10/6 13:18:05 vm2: Changing activity: Idle -> Busy
10/6 13:18:32 DaemonCore: Command received via TCP from host
<192.168.1.21:34804>
10/6 13:18:32 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY),
calling handler (command_handler)
10/6 13:18:32 vm1: Called deactivate_claim_forcibly()
10/6 13:18:32 DaemonCore: Command received via TCP from host
<192.168.1.21:42458>
10/6 13:18:32 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY),
calling handler (command_handler)
10/6 13:18:32 vm2: Called deactivate_claim_forcibly()
10/6 13:18:32 Starter pid 25175 exited with status 0
10/6 13:18:32 vm1: State change: starter exited
10/6 13:18:32 vm1: Changing activity: Busy -> Idle
10/6 13:18:32 Starter pid 25176 exited with status 0
10/6 13:18:32 vm2: State change: starter exited
10/6 13:18:32 vm2: Changing activity: Busy -> Idle
10/6 13:18:32 DaemonCore: Command received via TCP from host
<192.168.1.21:33267>
10/6 13:18:32 DaemonCore: received command 403 (DEACTIVATE_CLAIM), calling
handler (command_handler)
10/6 13:18:32 vm1: Called deactivate_claim()
10/6 13:18:32 DaemonCore: Command received via TCP from host
<192.168.1.21:37421>
10/6 13:18:32 DaemonCore: received command 403 (DEACTIVATE_CLAIM), calling
handler (command_handler)
10/6 13:18:32 vm2: Called deactivate_claim()
10/6 13:28:31 DaemonCore: Command received via TCP from host
<192.168.1.21:43553>
10/6 13:28:31 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler
(command_release_claim)
10/6 13:28:31 vm2: State change: received RELEASE_CLAIM command
10/6 13:28:31 vm2: Changing state and activity: Claimed/Idle ->
Preempting/Vacating
10/6 13:28:31 vm2: State change: No preempting claim, returning to owner
10/6 13:28:31 vm2: Changing state and activity: Preempting/Vacating ->
Owner/Idle
10/6 13:28:31 vm2: State change: IS_OWNER is false
10/6 13:28:31 vm2: Changing state: Owner -> Unclaimed
10/6 13:28:31 DaemonCore: Command received via TCP from host
<192.168.1.21:48518>
10/6 13:28:31 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler
(command_release_claim)
10/6 13:28:31 vm1: State change: received RELEASE_CLAIM command
10/6 13:28:31 vm1: Changing state and activity: Claimed/Idle ->
Preempting/Vacating
10/6 13:28:31 vm1: State change: No preempting claim, returning to owner
10/6 13:28:31 vm1: Changing state and activity: Preempting/Vacating ->
Owner/Idle

10/6 13:28:31 vm1: State change: IS_OWNER is false
10/6 13:28:31 vm1: Changing state: Owner -> Unclaimed
(...)


##################
# StarterLog.vm2 (at virtual machine 2 of a node executing one of the MPI
# parallel processes)
##################

10/4 18:59:01 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0
10/6 13:18:05 ******************************************************
10/6 13:18:05 ** condor_starter (CONDOR_STARTER) STARTING UP
10/6 13:18:05 ** /usr/local/condor-6.8.6/sbin/condor_starter
10/6 13:18:05 ** $CondorVersion: 6.8.6 Sep 13 2007 $
10/6 13:18:05 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
10/6 13:18:05 ** PID = 25176
10/6 13:18:05 ** Log last touched 10/4 18:59:01
10/6 13:18:05 ******************************************************
10/6 13:18:05 Using config source: /home/condor/condor_config
10/6 13:18:05 Using local config sources:
10/6 13:18:05    /home/condor/condor_config.local
10/6 13:18:05 DaemonCore: Command Socket at <192.168.1.22:42909>
10/6 13:18:05 Done setting resource limits
10/6 13:18:05 Communicating with shadow <192.168.1.21:45720>
10/6 13:18:05 Submitting machine is "bioxeon.ibmcp-cluster.upv.es"
10/6 13:18:05 Job has WantIOProxy=true
10/6 13:18:05 Initialized IO Proxy.
10/6 13:18:05 File transfer completed successfully.
10/6 13:18:06 Starting a PARALLEL universe job with ID: 2221.0
10/6 13:18:06 IWD: /home/condor/execute/dir_25176
10/6 13:18:06 Output file: /home/condor/execute/dir_25176/out.2.txt
10/6 13:18:06 Error file: /home/condor/execute/dir_25176/out.2.txt
10/6 13:18:06 About to exec /home/condor/execute/dir_25176/condor_exec.exe
HYPHYMPI MPITest.bf
10/6 13:18:06 Create_Process succeeded, pid=25180
10/6 13:18:08 IOProxy: accepting connection from 192.168.1.22
10/6 13:18:08 IOProxyHandler: closing connection to 192.168.1.22
10/6 13:18:12 IOProxy: accepting connection from 192.168.1.22
10/6 13:18:12 IOProxyHandler: closing connection to 192.168.1.22
10/6 13:18:32 Got SIGQUIT.  Performing fast shutdown.
10/6 13:18:32 ShutdownFast all jobs.
10/6 13:18:32 Process exited, pid=25180, status=0
10/6 13:18:32 condor_write(): Socket closed when trying to write 260 bytes to
<192.168.1.21:52511>, fd is 5
10/6 13:18:32 Buf::write(): condor_write() failed
10/6 13:18:32 Failed to send job exit status to shadow
10/6 13:18:32 JobExit() failed, waiting for job lease to expire or for a
reconnect attempt
10/6 13:18:32 Last process exited, now Starter is exiting
10/6 13:18:32 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0




--
Javier Forment Millet
Instituto de Biología Celular y Molecular de Plantas (IBMCP) CSIC-UPV
Ciudad Politécnica de la Innovación (CPI) Edificio 8 E, Escalera 7 Puerta E
Calle Ing. Fausto Elio s/n. 46022 Valencia, Spain
Tlf.:+34-96-3877858
FAX: +34-96-3877859
jforment@xxxxxxxxxxxx

_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/