
Re: [Condor-users] HYPHY parallel MPI implementation on Condor




Hi Javier,


You set:
        export P4_RSHCOMMAND=$CONDOR_SSH
So when mpirun is used, condor_ssh is executed to connect to the remote processes.

Since your job runs fine when launched directly, i.e. without Condor, one possibility is that the error occurs when the condor_ssh script is executed.

It could be that the options passed to P4_RSHCOMMAND by HYPHYMPI are not supported by condor_ssh. You could back up your current condor_ssh script and modify it for debugging purposes if necessary.
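
For instance (just a sketch, untested): rather than editing condor_ssh in
place, you could point P4_RSHCOMMAND at a small wrapper that logs each call
before forwarding it to the real condor_ssh, so any unsupported option shows
up in the log. The wrapper name and log path below are arbitrary; the
condor_ssh path is the one used in your mp1script:

#!/bin/sh
# condor_ssh_debug -- hypothetical wrapper; set P4_RSHCOMMAND to this script.
# It logs the exact arguments mpirun passes via P4_RSHCOMMAND, then forwards
# the call unchanged to the real condor_ssh.
LOG=/tmp/condor_ssh_args.log
echo "`date`: condor_ssh called with: $*" >> "$LOG"
exec /usr/local/condor/libexec/condor_ssh "$@"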

I hope this helps,


Best regards,

--
Cecile GARROS

IT Service Group
Advanced Technologies Division

--------------------------------
ARGO GRAPHICS INC.
www.argo-graph.co.jp
tel: 03(5641)2025
fax: 03(5641)2011
--------------------------------



Javier Forment Millet <jforment@xxxxxxxxxxxx>
Sent by: condor-users-bounces@xxxxxxxxxxx

2008/10/07 19:39
Please reply to Condor-Users Mail List

        To:        condor-users@xxxxxxxxxxx
        cc:
        Subject:        [Condor-users] HYPHY parallel MPI implementation on Condor



Hi, all... Is anyone running HYPHYMPI (http://www.hyphy.org) in a Condor
cluster? I have a small Linux-Ubuntu cluster with MPI installed. I installed
the MPI version of HYPHY (HYPHYMPI). It runs fine when launched as a regular
non-Condor MPI process:

mpirun -v -np 10 -machinefile machines.list HYPHYMPI MPITest.bf

But, when using Condor, it gives the following error:

(...)
running /home/condor/execute/dir_25957/HYPHYMPI on 10 LINUX ch_p4 processors
Created /home/condor/execute/dir_25957/PI26038
[: 68: ==: unexpected operator
[: 101: ==: unexpected operator
Could not chdir to home directory /nonexistent: No such file or directory
sh: -l: not found
p0_26227:  p4_error: Child process exited while making connection to remote
process on xeon2: 0
p0_26227: (18.265625) net_send: could not write to fd=4, errno = 32



Condor runs MPI jobs through the mp1script wrapper (see below), and the error
arises when the last command, 'mpirun ...', is executed.

The problem seems to be specific to running HYPHYMPI in the Condor environment,
since other Condor jobs run fine, and running just the 'mpirun ...' command
with the MPITest.bf file shipped with the HYPHY distribution also works fine.
In particular, the problem could be related to the fact that Condor runs
processes in execution directories to which it transfers the needed executables
and input files, and where it writes the output files that are transferred back
when execution finishes.
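
A possible check (not tried yet) would be to add a few debug lines to mp1script
just before the 'mpirun' call, to record the directories the head-node process
actually sees (a rough sketch; the file name debug_env.txt is arbitrary):

# sketch: record the working directory, HOME and the Condor scratch directory
# seen by the head-node process
{
    echo "PWD=`pwd`"
    echo "HOME=$HOME"
    echo "SCRATCH=$_CONDOR_SCRATCH_DIR"
} > debug_env.txt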

Maybe there is a point in the HYPHYMPI code where it tries to 'chdir' away from
the execution directory, and the error is raised because it cannot do so? What
does the error message 'Could not chdir to home directory /nonexistent: No such
file or directory' mean? Is it trying to change to a directory specified by an
empty string? Does the error come from not being able to change directory, or
from a wrong definition of the directory name to change to? I ran
'grep -r -i 'chdir' ./*' from the HYPHY_Source folder and got 'Binary file
./HYPHYMPI matches', but none of the source code files of the distribution
match, so maybe the call comes from some library linked in at compile time.
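
Another thing I could try (assuming strace is available on the node) is to run
the working non-Condor command under strace and see whether HYPHYMPI or one of
its libraries actually calls chdir (a sketch; trace.out is an arbitrary name):

# sketch: follow child processes (-f) and trace only chdir calls, writing the
# trace to trace.out, which can then be inspected
strace -f -e trace=chdir -o trace.out \
    mpirun -v -np 10 -machinefile machines.list HYPHYMPI MPITest.bf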

It could also be a problem with the communication between the nodes executing
the MPI parallel processes and the node running the Condor process that
launches them on the different execution nodes. I can see some messages in the
Condor log files that could be related to this (see below).
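
A quick check I could add to mp1script after the 'machines' file is built
(a sketch; it assumes condor_ssh accepts 'host command' arguments the way rsh
does, which is how mpirun uses it here):

# sketch: try to reach every node listed in the machines file through
# condor_ssh and print the remote hostname (or the error) for each one
while read node; do
    echo "condor_ssh test for $node:"
    $CONDOR_SSH "$node" hostname
done < machines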

HYPHYMPI was compiled on Ubuntu 7.04 (server version) boxes, using libcurl3
(version 7.15.5-1ubuntu2.1), libssl0.9.8, libcrypto++5.2c2a, and libmpich1.0c2
(version 1.2.7-2).

Any idea?

Thanks in advance, and best wishes,

Javier.


##############
# testMPI.cmd Condor submit file
##############

universe = parallel
executable = mp1script
arguments = HYPHYMPI MPITest.bf
run_as_owner = true
machine_count = 10
should_transfer_files = yes
when_to_transfer_output = on_exit
transfer_input_files = /home/jforment/install/HYPHY_Source/BatchFiles/MPITest.bf
log = log.$(Process).txt
output = out.$(NODE).txt
error = out.$(NODE).txt
+WantIOProxy = true
queue



#############
# mp1script #
#############

#!/bin/sh

_CONDOR_PROCNO=$_CONDOR_PROCNO
_CONDOR_NPROCS=$_CONDOR_NPROCS

echo "nprocs: $_CONDOR_NPROCS"
echo "procno: $_CONDOR_PROCNO"

# sshd.sh sets up passwordless ssh using ssh-keygen
CONDOR_LIBEXEC="/usr/local/condor/libexec"
CONDOR_SSH=$CONDOR_LIBEXEC/condor_ssh
SSHD_SH=$CONDOR_LIBEXEC/sshd.sh

. $SSHD_SH $_CONDOR_PROCNO $_CONDOR_NPROCS
#$SSHD_SH $_CONDOR_PROCNO $_CONDOR_NPROCS

# If not the head node, just sleep forever, to let the
# sshds run
if [ $_CONDOR_PROCNO -ne 0 ]
then
              wait
              sshd_cleanup
              exit 0
fi

EXECUTABLE=$1
shift

# The binary is copied, but the executable flag is cleared,
# so the script has to take care of this.
fullexepath="/usr/local/bin/$EXECUTABLE"
cp $fullexepath $_CONDOR_SCRATCH_DIR/
chmod +x $EXECUTABLE

# Set this to the bin directory of MPICH installation
MPDIR=/usr/lib/mpich/bin
PATH=$MPDIR:.:$PATH
export PATH

export P4_RSHCOMMAND=$CONDOR_SSH

CONDOR_CONTACT_FILE=$_CONDOR_SCRATCH_DIR/contact
export CONDOR_CONTACT_FILE

# The second field in the contact file is the machine name
# that condor_ssh knows how to use
sort -n +0 < $CONDOR_CONTACT_FILE | awk '{print $2}' > machines

## run the actual mpijob
echo "nprocs: $_CONDOR_NPROCS"
mpirun -v -np $_CONDOR_NPROCS -machinefile machines $EXECUTABLE $@

sshd_cleanup

rm $EXECUTABLE

exit $?


#######################
# SchedLog extract (at the submit node, not an execution node)
#######################

(...)
10/6 13:09:16 (pid:6354) Sent ad to central manager for
jforment@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
10/6 13:09:16 (pid:6354) Sent ad to 1 collectors for
jforment@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
10/6 13:13:19 (pid:6354) DaemonCore: Command received via UDP from host
<192.168.1.21:47086>
10/6 13:13:19 (pid:6354) DaemonCore: received command 421 (RESCHEDULE), calling
handler (reschedule_negotiator)
10/6 13:13:19 (pid:6354) Sent ad to central manager for
jforment@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
10/6 13:13:19 (pid:6354) Sent ad to 1 collectors for
jforment@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
10/6 13:13:19 (pid:6354) Called reschedule_negotiator()
10/6 13:13:21 (pid:6354) Inserting new attribute Scheduler into non-active
cluster cid=2221 acid=-1
10/6 13:18:19 (pid:6354) Sent ad to central manager for
jforment@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
10/6 13:18:19 (pid:6354) Sent ad to 1 collectors for
jforment@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
10/6 13:18:20 (pid:6354) Activity on stashed negotiator socket
10/6 13:18:20 (pid:6354) Negotiating for owner:
DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
10/6 13:18:20 (pid:6354) Out of requests - 10 reqs matched, 0 reqs idle
10/6 13:18:21 (pid:6354) Inserting new attribute Scheduler into non-active
cluster cid=2221 acid=-1
10/6 13:18:23 (pid:6354) Starting add_shadow_birthdate(2221.0)
10/6 13:18:23 (pid:6354) Started shadow for job 2221.0 on
"<192.168.1.25:33961>", (shadow pid = 15015)
10/6 13:18:24 (pid:6354) Sent ad to central manager for
jforment@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
10/6 13:18:24 (pid:6354) Sent ad to 1 collectors for
jforment@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
10/6 13:18:24 (pid:6354) DaemonCore: Command received via TCP from host
<192.168.1.21:49410>
10/6 13:18:24 (pid:6354) DaemonCore: received command 71003 (GIVE_MATCHES),
calling handler (DedicatedScheduler::giveMatches)
10/6 13:18:51 (pid:6354) In DedicatedScheduler::reaper pid 15015 has status
25600
10/6 13:18:51 (pid:6354) Shadow pid 15015 exited with status 100
10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
10/6 13:23:24 (pid:6354) Sent ad to central manager for
jforment@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
10/6 13:23:24 (pid:6354) Sent ad to 1 collectors for
jforment@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
10/6 13:28:24 (pid:6354) Sent ad to central manager for
jforment@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
10/6 13:28:24 (pid:6354) Sent ad to 1 collectors for
jforment@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
10/6 13:28:51 (pid:6354) Resource vm1@xxxxxxxxxxxxxxxxxxxxxxxxxx has been unused
for 600 seconds, limit is 600, releasing
10/6 13:28:51 (pid:6354) Resource vm2@xxxxxxxxxxxxxxxxxxxxxxxxxx has been unused
for 600 seconds, limit is 600, releasing
10/6 13:28:51 (pid:6354) Resource vm1@xxxxxxxxxxxxxxxxxxxxxxxxxx has been unused
for 600 seconds, limit is 600, releasing
10/6 13:28:51 (pid:6354) Resource vm2@xxxxxxxxxxxxxxxxxxxxxxxxxx has been unused
for 600 seconds, limit is 600, releasing
10/6 13:28:51 (pid:6354) Resource vm2@xxxxxxxxxxxxxxxxxxxxxxxxxx has been unused
for 600 seconds, limit is 600, releasing
10/6 13:28:51 (pid:6354) Resource vm2@xxxxxxxxxxxxxxxxxxxxxxxxxx has been unused
for 600 seconds, limit is 600, releasing
10/6 13:28:51 (pid:6354) Resource vm2@xxxxxxxxxxxxxxxxxxxxxxxxxx has been unused
for 600 seconds, limit is 600, releasing
10/6 13:28:51 (pid:6354) Resource vm2@xxxxxxxxxxxxxxxxxxxxxxxxxx has been unused
for 600 seconds, limit is 600, releasing
10/6 13:28:51 (pid:6354) Resource vm2@xxxxxxxxxxxxxxxxxxxxxxxxxx has been unused
for 600 seconds, limit is 600, releasing
10/6 13:28:51 (pid:6354) Resource vm1@xxxxxxxxxxxxxxxxxxxxxxxxxx has been unused
for 600 seconds, limit is 600, releasing
10/6 13:28:51 (pid:6354) DaemonCore: Command received via TCP from host
<192.168.1.23:36660>
10/6 13:28:51 (pid:6354) DaemonCore: received command 443 (VACATE_SERVICE),
calling handler (vacate_service)
10/6 13:28:51 (pid:6354) Got VACATE_SERVICE from <192.168.1.23:36660>
10/6 13:28:51 (pid:6354) DaemonCore: Command received via TCP from host
<192.168.1.22:48284>
10/6 13:28:51 (pid:6354) DaemonCore: received command 443 (VACATE_SERVICE),
calling handler (vacate_service)
10/6 13:28:51 (pid:6354) Got VACATE_SERVICE from <192.168.1.22:48284>
10/6 13:28:51 (pid:6354) DaemonCore: Command received via TCP from host
<192.168.1.24:37331>
10/6 13:28:51 (pid:6354) DaemonCore: received command 443 (VACATE_SERVICE),
calling handler (vacate_service)
10/6 13:28:51 (pid:6354) Got VACATE_SERVICE from <192.168.1.24:37331>
10/6 13:28:51 (pid:6354) DaemonCore: Command received via TCP from host
<192.168.1.23:56530>
10/6 13:28:51 (pid:6354) DaemonCore: received command 443 (VACATE_SERVICE),
calling handler (vacate_service)
10/6 13:28:51 (pid:6354) Got VACATE_SERVICE from <192.168.1.23:56530>
10/6 13:28:51 (pid:6354) DaemonCore: Command received via TCP from host
<192.168.1.26:36217>
10/6 13:28:51 (pid:6354) DaemonCore: received command 443 (VACATE_SERVICE),
calling handler (vacate_service)
10/6 13:28:51 (pid:6354) Got VACATE_SERVICE from <192.168.1.26:36217>
10/6 13:28:51 (pid:6354) DaemonCore: Command received via TCP from host
<192.168.1.27:39719>
10/6 13:28:51 (pid:6354) DaemonCore: received command 443 (VACATE_SERVICE),
calling handler (vacate_service)
10/6 13:28:51 (pid:6354) Got VACATE_SERVICE from <192.168.1.27:39719>
10/6 13:28:51 (pid:6354) DaemonCore: Command received via TCP from host
<192.168.1.28:58522>
10/6 13:28:51 (pid:6354) DaemonCore: received command 443 (VACATE_SERVICE),
calling handler (vacate_service)
10/6 13:28:51 (pid:6354) Got VACATE_SERVICE from <192.168.1.28:58522>
10/6 13:28:51 (pid:6354) DaemonCore: Command received via TCP from host
<192.168.1.24:59889>
10/6 13:28:51 (pid:6354) DaemonCore: received command 443 (VACATE_SERVICE),
calling handler (vacate_service)
10/6 13:28:51 (pid:6354) Got VACATE_SERVICE from <192.168.1.24:59889>
10/6 13:28:51 (pid:6354) DaemonCore: Command received via TCP from host
<192.168.1.22:34057>
10/6 13:28:51 (pid:6354) DaemonCore: received command 443 (VACATE_SERVICE),
calling handler (vacate_service)
10/6 13:28:51 (pid:6354) Got VACATE_SERVICE from <192.168.1.22:34057>
10/6 13:28:51 (pid:6354) DaemonCore: Command received via TCP from host
<192.168.1.25:43167>
10/6 13:28:51 (pid:6354) DaemonCore: received command 443 (VACATE_SERVICE),
calling handler (vacate_service)
10/6 13:28:51 (pid:6354) Got VACATE_SERVICE from <192.168.1.25:43167>
10/6 13:33:24 (pid:6354) Sent ad to central manager for
jforment@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
10/6 13:33:24 (pid:6354) Sent ad to 1 collectors for
jforment@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
10/6 13:38:24 (pid:6354) Sent ad to central manager for
jforment@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
10/6 13:38:24 (pid:6354) Sent ad to 1 collectors for
jforment@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
(...)


####################
# ShadowLog extract (at the submit node, not an execution node)
####################

(...)
10/6 13:18:23 ******************************************************
10/6 13:18:23 ** condor_shadow (CONDOR_SHADOW) STARTING UP
10/6 13:18:23 ** /usr/local/condor-6.8.6/sbin/condor_shadow
10/6 13:18:23 ** $CondorVersion: 6.8.6 Sep 13 2007 $
10/6 13:18:23 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
10/6 13:18:23 ** PID = 15015
10/6 13:18:23 ** Log last touched 10/4 18:59:18
10/6 13:18:23 ******************************************************
10/6 13:18:23 Using config source: /home/condor/condor_config
10/6 13:18:23 Using local config sources:
10/6 13:18:23    /home/condor/condor_config.local
10/6 13:18:23 DaemonCore: Command Socket at <192.168.1.21:45720>
10/6 13:18:23 Initializing a PARALLEL shadow for job 2221.0
10/6 13:18:24 (2221.0) (15015): Request to run on <192.168.1.25:33961> was
ACCEPTED
10/6 13:18:24 (2221.0) (15015): Request to run on <192.168.1.22:57556> was
ACCEPTED
10/6 13:18:24 (2221.0) (15015): Request to run on <192.168.1.22:57556> was
ACCEPTED
10/6 13:18:25 (2221.0) (15015): Request to run on <192.168.1.23:51800> was
ACCEPTED
10/6 13:18:25 (2221.0) (15015): Request to run on <192.168.1.23:51800> was
ACCEPTED
10/6 13:18:25 (2221.0) (15015): Request to run on <192.168.1.24:35810> was
ACCEPTED
10/6 13:18:25 (2221.0) (15015): Request to run on <192.168.1.24:35810> was
ACCEPTED
10/6 13:18:25 (2221.0) (15015): Request to run on <192.168.1.28:50572> was
ACCEPTED
10/6 13:18:25 (2221.0) (15015): Request to run on <192.168.1.26:40791> was
ACCEPTED
10/6 13:18:25 (2221.0) (15015): Request to run on <192.168.1.27:48119> was
ACCEPTED
10/6 13:18:51 (2221.0) (15015): Job 2221.0 terminated: exited with status 0
10/6 13:18:51 (2221.0) (15015): **** condor_shadow (condor_SHADOW) EXITING WITH
STATUS 100


###################
# StartLog extract (at the node running the main Condor process, that is
# 'mp1script', and launching the MPI processes to be run on the other nodes)
###################

(...)
10/6 10:56:45 vm1: Changing activity: Benchmarking -> Idle
10/6 13:24:24 DaemonCore: Command received via UDP from host
<192.168.1.22:37956>
10/6 13:24:24 DaemonCore: received command 440 (MATCH_INFO), calling handler
(command_match_info)
10/6 13:24:24 vm2: match_info called
10/6 13:24:24 vm2: Received match <192.168.1.25:33961>#1222442286#232#...
10/6 13:24:24 vm2: State change: match notification protocol successful
10/6 13:24:24 vm2: Changing state: Unclaimed -> Matched
10/6 13:24:24 DaemonCore: Command received via TCP from host
<192.168.1.21:51134>
10/6 13:24:24 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler
(command_request_claim)
10/6 13:24:24 vm2: Request accepted.
10/6 13:24:24 vm2: Remote owner is
DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
10/6 13:24:24 vm2: State change: claiming protocol successful
10/6 13:24:24 vm2: Changing state: Matched -> Claimed
10/6 13:24:29 DaemonCore: Command received via TCP from host
<192.168.1.21:44154>
10/6 13:24:29 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler
(command_activate_claim)
10/6 13:24:29 vm2: Got activate_claim request from shadow (<192.168.1.21:44154>)
10/6 13:24:29 vm2: Remote job ID is 2221.0
10/6 13:24:29 vm2: Got universe "PARALLEL" (11) from request classad
10/6 13:24:29 vm2: State change: claim-activation protocol successful
10/6 13:24:29 vm2: Changing activity: Idle -> Busy
10/6 13:24:55 DaemonCore: Command received via TCP from host
<192.168.1.21:43016>
10/6 13:24:55 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY),
calling handler (command_handler)
10/6 13:24:55 vm2: Called deactivate_claim_forcibly()
10/6 13:24:55 Starter pid 25957 exited with status 0
10/6 13:24:55 vm2: State change: starter exited
10/6 13:24:55 vm2: Changing activity: Busy -> Idle
10/6 13:24:55 DaemonCore: Command received via TCP from host
<192.168.1.21:59999>
10/6 13:24:55 DaemonCore: received command 403 (DEACTIVATE_CLAIM), calling
handler (command_handler)
10/6 13:24:55 vm2: Called deactivate_claim()
10/6 13:34:55 DaemonCore: Command received via TCP from host
<192.168.1.21:40275>
10/6 13:34:55 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler
(command_release_claim)
10/6 13:34:55 vm2: State change: received RELEASE_CLAIM command
10/6 13:34:55 vm2: Changing state and activity: Claimed/Idle ->
Preempting/Vacating
10/6 13:34:55 vm2: State change: No preempting claim, returning to owner
10/6 13:34:55 vm2: Changing state and activity: Preempting/Vacating ->
Owner/Idle
10/6 13:34:55 vm2: State change: IS_OWNER is false
10/6 13:34:55 vm2: Changing state: Owner -> Unclaimed
10/6 14:56:45 State change: RunBenchmarks is TRUE
10/6 14:56:45 vm1: Changing activity: Idle -> Benchmarking
10/6 14:56:49 State change: benchmarks completed
10/6 14:56:49 vm1: Changing activity: Benchmarking -> Idle
(...)

##########################
# StarterLog.vm2 extract (at virtual machine 2 in the node running the main
# Condor process, that is 'mp1script', and launching the MPI processes to be
# run on the other nodes)
##########################

(...)
10/6 13:24:29 ******************************************************
10/6 13:24:29 ** condor_starter (CONDOR_STARTER) STARTING UP
10/6 13:24:29 ** /usr/local/condor-6.8.6/sbin/condor_starter
10/6 13:24:29 ** $CondorVersion: 6.8.6 Sep 13 2007 $
10/6 13:24:29 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
10/6 13:24:29 ** PID = 25957
10/6 13:24:29 ** Log last touched 10/4 18:08:06
10/6 13:24:29 ******************************************************
10/6 13:24:29 Using config source: /home/condor/condor_config
10/6 13:24:29 Using local config sources:
10/6 13:24:29    /home/condor/condor_config.local
10/6 13:24:29 DaemonCore: Command Socket at <192.168.1.25:54234>
10/6 13:24:29 Done setting resource limits
10/6 13:24:29 Communicating with shadow <192.168.1.21:45720>
10/6 13:24:29 Submitting machine is "bioxeon.ibmcp-cluster.upv.es"
10/6 13:24:29 Job has WantIOProxy=true
10/6 13:24:29 Initialized IO Proxy.
10/6 13:24:29 File transfer completed successfully.
10/6 13:24:30 Starting a PARALLEL universe job with ID: 2221.0
10/6 13:24:30 IWD: /home/condor/execute/dir_25957
10/6 13:24:30 Output file: /home/condor/execute/dir_25957/out.0.txt
10/6 13:24:30 Error file: /home/condor/execute/dir_25957/out.0.txt
10/6 13:24:30 About to exec /home/condor/execute/dir_25957/condor_exec.exe
HYPHYMPI MPITest.bf
10/6 13:24:30 Create_Process succeeded, pid=25959
10/6 13:24:31 IOProxy: accepting connection from 192.168.1.25
10/6 13:24:31 IOProxyHandler: closing connection to 192.168.1.25
10/6 13:24:34 IOProxy: accepting connection from 192.168.1.25
10/6 13:24:34 IOProxyHandler: closing connection to 192.168.1.25
10/6 13:24:34 IOProxy: accepting connection from 192.168.1.25
10/6 13:24:34 IOProxyHandler: closing connection to 192.168.1.25
10/6 13:24:35 IOProxy: accepting connection from 192.168.1.25
10/6 13:24:35 IOProxyHandler: closing connection to 192.168.1.25
10/6 13:24:36 IOProxy: accepting connection from 192.168.1.25
10/6 13:24:36 IOProxyHandler: closing connection to 192.168.1.25
10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
10/6 13:24:55 Process exited, pid=25959, status=0
10/6 13:24:55 Got SIGQUIT.  Performing fast shutdown.
10/6 13:24:55 ShutdownFast all jobs.
10/6 13:24:55 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0


###############
# StartLog (at a node executing one of the MPI parallel processes)
###############

10/6 13:18:01 DaemonCore: Command received via UDP from host
<192.168.1.22:37956>
10/6 13:18:01 DaemonCore: received command 440 (MATCH_INFO), calling handler
(command_match_info)
10/6 13:18:01 vm2: match_info called
10/6 13:18:01 vm2: Received match <192.168.1.22:57556>#1222440814#310#...
10/6 13:18:01 vm2: State change: match notification protocol successful
10/6 13:18:01 vm2: Changing state: Unclaimed -> Matched
10/6 13:18:01 DaemonCore: Command received via UDP from host
<192.168.1.22:37956>
10/6 13:18:01 DaemonCore: received command 440 (MATCH_INFO), calling handler
(command_match_info)
10/6 13:18:01 vm1: match_info called
10/6 13:18:01 vm1: Received match <192.168.1.22:57556>#1222440814#308#...
10/6 13:18:01 vm1: State change: match notification protocol successful
10/6 13:18:01 vm1: Changing state: Unclaimed -> Matched
10/6 13:18:01 DaemonCore: Command received via TCP from host
<192.168.1.21:46001>
10/6 13:18:01 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler
(command_request_claim)
10/6 13:18:01 vm2: Request accepted.
10/6 13:18:01 vm2: Remote owner is
DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
10/6 13:18:01 vm2: State change: claiming protocol successful
10/6 13:18:01 vm2: Changing state: Matched -> Claimed
10/6 13:18:01 DaemonCore: Command received via TCP from host
<192.168.1.21:60616>
10/6 13:18:01 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler
(command_request_claim)
10/6 13:18:01 vm1: Request accepted.
10/6 13:18:01 vm1: Remote owner is
DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
10/6 13:18:01 vm1: State change: claiming protocol successful
10/6 13:18:01 vm1: Changing state: Matched -> Claimed
10/6 13:18:05 DaemonCore: Command received via TCP from host
<192.168.1.21:42541>
10/6 13:18:05 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler
(command_activate_claim)
10/6 13:18:05 vm1: Got activate_claim request from shadow (<192.168.1.21:42541>)
10/6 13:18:05 vm1: Remote job ID is 2221.0
10/6 13:18:05 vm1: Got universe "PARALLEL" (11) from request classad
10/6 13:18:05 vm1: State change: claim-activation protocol successful
10/6 13:18:05 vm1: Changing activity: Idle -> Busy
10/6 13:18:05 DaemonCore: Command received via TCP from host
<192.168.1.21:52511>
10/6 13:18:05 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler
(command_activate_claim)
10/6 13:18:05 vm2: Got activate_claim request from shadow (<192.168.1.21:52511>)
10/6 13:18:05 vm2: Remote job ID is 2221.0
10/6 13:18:05 vm2: Got universe "PARALLEL" (11) from request classad
10/6 13:18:05 vm2: State change: claim-activation protocol successful
10/6 13:18:05 vm2: Changing activity: Idle -> Busy
10/6 13:18:32 DaemonCore: Command received via TCP from host
<192.168.1.21:34804>
10/6 13:18:32 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY),
calling handler (command_handler)
10/6 13:18:32 vm1: Called deactivate_claim_forcibly()
10/6 13:18:32 DaemonCore: Command received via TCP from host
<192.168.1.21:42458>
10/6 13:18:32 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY),
calling handler (command_handler)
10/6 13:18:32 vm2: Called deactivate_claim_forcibly()
10/6 13:18:32 Starter pid 25175 exited with status 0
10/6 13:18:32 vm1: State change: starter exited
10/6 13:18:32 vm1: Changing activity: Busy -> Idle
10/6 13:18:32 Starter pid 25176 exited with status 0
10/6 13:18:32 vm2: State change: starter exited
10/6 13:18:32 vm2: Changing activity: Busy -> Idle
10/6 13:18:32 DaemonCore: Command received via TCP from host
<192.168.1.21:33267>
10/6 13:18:32 DaemonCore: received command 403 (DEACTIVATE_CLAIM), calling
handler (command_handler)
10/6 13:18:32 vm1: Called deactivate_claim()
10/6 13:18:32 DaemonCore: Command received via TCP from host
<192.168.1.21:37421>
10/6 13:18:32 DaemonCore: received command 403 (DEACTIVATE_CLAIM), calling
handler (command_handler)
10/6 13:18:32 vm2: Called deactivate_claim()
10/6 13:28:31 DaemonCore: Command received via TCP from host
<192.168.1.21:43553>
10/6 13:28:31 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler
(command_release_claim)
10/6 13:28:31 vm2: State change: received RELEASE_CLAIM command
10/6 13:28:31 vm2: Changing state and activity: Claimed/Idle ->
Preempting/Vacating
10/6 13:28:31 vm2: State change: No preempting claim, returning to owner
10/6 13:28:31 vm2: Changing state and activity: Preempting/Vacating ->
Owner/Idle
10/6 13:28:31 vm2: State change: IS_OWNER is false
10/6 13:28:31 vm2: Changing state: Owner -> Unclaimed
10/6 13:28:31 DaemonCore: Command received via TCP from host
<192.168.1.21:48518>
10/6 13:28:31 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler
(command_release_claim)
10/6 13:28:31 vm1: State change: received RELEASE_CLAIM command
10/6 13:28:31 vm1: Changing state and activity: Claimed/Idle ->
Preempting/Vacating
10/6 13:28:31 vm1: State change: No preempting claim, returning to owner
10/6 13:28:31 vm1: Changing state and activity: Preempting/Vacating ->
Owner/Idle

10/6 13:28:31 vm1: State change: IS_OWNER is false
10/6 13:28:31 vm1: Changing state: Owner -> Unclaimed
(...)


##################
# StarterLog.vm2 (at virtual machine 2 of a node executing one of the MPI
# parallel processes)
##################

10/4 18:59:01 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0
10/6 13:18:05 ******************************************************
10/6 13:18:05 ** condor_starter (CONDOR_STARTER) STARTING UP
10/6 13:18:05 ** /usr/local/condor-6.8.6/sbin/condor_starter
10/6 13:18:05 ** $CondorVersion: 6.8.6 Sep 13 2007 $
10/6 13:18:05 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
10/6 13:18:05 ** PID = 25176
10/6 13:18:05 ** Log last touched 10/4 18:59:01
10/6 13:18:05 ******************************************************
10/6 13:18:05 Using config source: /home/condor/condor_config
10/6 13:18:05 Using local config sources:
10/6 13:18:05    /home/condor/condor_config.local
10/6 13:18:05 DaemonCore: Command Socket at <192.168.1.22:42909>
10/6 13:18:05 Done setting resource limits
10/6 13:18:05 Communicating with shadow <192.168.1.21:45720>
10/6 13:18:05 Submitting machine is "bioxeon.ibmcp-cluster.upv.es"
10/6 13:18:05 Job has WantIOProxy=true
10/6 13:18:05 Initialized IO Proxy.
10/6 13:18:05 File transfer completed successfully.
10/6 13:18:06 Starting a PARALLEL universe job with ID: 2221.0
10/6 13:18:06 IWD: /home/condor/execute/dir_25176
10/6 13:18:06 Output file: /home/condor/execute/dir_25176/out.2.txt
10/6 13:18:06 Error file: /home/condor/execute/dir_25176/out.2.txt
10/6 13:18:06 About to exec /home/condor/execute/dir_25176/condor_exec.exe
HYPHYMPI MPITest.bf
10/6 13:18:06 Create_Process succeeded, pid=25180
10/6 13:18:08 IOProxy: accepting connection from 192.168.1.22
10/6 13:18:08 IOProxyHandler: closing connection to 192.168.1.22
10/6 13:18:12 IOProxy: accepting connection from 192.168.1.22
10/6 13:18:12 IOProxyHandler: closing connection to 192.168.1.22
10/6 13:18:32 Got SIGQUIT.  Performing fast shutdown.
10/6 13:18:32 ShutdownFast all jobs.
10/6 13:18:32 Process exited, pid=25180, status=0
10/6 13:18:32 condor_write(): Socket closed when trying to write 260 bytes to
<192.168.1.21:52511>, fd is 5
10/6 13:18:32 Buf::write(): condor_write() failed
10/6 13:18:32 Failed to send job exit status to shadow
10/6 13:18:32 JobExit() failed, waiting for job lease to expire or for a
reconnect attempt
10/6 13:18:32 Last process exited, now Starter is exiting
10/6 13:18:32 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0




--
Javier Forment Millet
Instituto de Biología Celular y Molecular de Plantas (IBMCP) CSIC-UPV
Ciudad Politécnica de la Innovación (CPI) Edificio 8 E, Escalera 7 Puerta E
Calle Ing. Fausto Elio s/n. 46022 Valencia, Spain
Tlf.:+34-96-3877858
FAX: +34-96-3877859
jforment@xxxxxxxxxxxx

_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/