
Re: [Condor-users] HYPHY parallel MPI implementation on Condor



I've found the problem: the condor_ssh script uses '==' for testing strings in
variables, which is not portable and produces the following error messages:

> [: 68: ==: unexpected operator
> [: 101: ==: unexpected operator

And this causes further errors.

Simply changing '==' to '=' in the condor_ssh script fixes the problem. Condor
is now running MPI jobs nicely.
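
For anyone hitting the same thing, here is a minimal sketch of the portability
issue (the variable name is made up; these are not the actual condor_ssh
lines). Ubuntu's default /bin/sh is dash, whose '[' builtin only implements
the POSIX '=' string comparison; '==' is a bash extension:

#!/bin/sh
# Hypothetical illustration of the kind of test that breaks under dash.
proto=ssh
[ "$proto" == "ssh" ] && echo match   # dash: "[: ==: unexpected operator"
[ "$proto" = "ssh" ] && echo match    # portable POSIX form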

Maybe it is a bug already fixed in newer Condor versions, or maybe it is specific
to my cluster configuration. I'm running Condor version 6.8.6 (platform
X86_64-LINUX_RHEL3) on boxes with dual 64-bit Intel Xeon processors running
Ubuntu 7.10 (server version) with Linux kernel 2.6.22-15-generic. Should I
report the problem on the Condor-developers mailing list, or in some other way?

Thanks a lot,

Javier.


Quoting Cecile_Garros@xxxxxxxxxxxxxxxxxxxxx:

> Hi Javier,
>
>
> You set:
>         export P4_RSHCOMMAND=$CONDOR_SSH
> So when mpirun is used, condor_ssh is executed to connect to the remote
> processes.
>
> Since your job is running fine when launched directly, i.e. without using
> Condor, a possibility is that the error occurs when the condor_ssh script
> is executed.
>
> It could be that the options HYPHYMPI passes to P4_RSHCOMMAND are not
> supported by condor_ssh. I guess you can back up your current condor_ssh
> script and modify it for debugging purposes if necessary.
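>
> A minimal sketch of one way to do that (hypothetical lines, not part of the
> shipped condor_ssh; they would go just after the script's shebang line):
>
> # Trace every command as condor_ssh executes it; the trace goes to the
> # job's stderr, i.e. the error file named in the submit description.
> set -x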
>
> I hope this helps,
>
>
> Best regards,
>
> --
> Cecile GARROS
>
> IT Service Group
> Advanced Technologies Division
>
> --------------------------------
> ARGO GRAPHICS INC.
> www.argo-graph.co.jp
> tel: 03(5641)2025
> fax: 03(5641)2011
> --------------------------------
>
>
>
>
> Javier Forment Millet <jforment@xxxxxxxxxxxx>
> Sender: condor-users-bounces@xxxxxxxxxxx
> 2008/10/07 19:39
> Please reply to Condor-Users Mail List
>
>         To:      condor-users@xxxxxxxxxxx
>         cc:
>         Subject: [Condor-users] HYPHY parallel MPI implementation on
> Condor
>
>
> Hi, all... Is anyone running HYPHYMPI (http://www.hyphy.org) in a Condor
> cluster? I have a small Linux/Ubuntu cluster with MPI installed. I
> installed the MPI version of HYPHY (HYPHYMPI). It runs fine when launched
> as a regular non-Condor MPI process:
>
> mpirun -v -np 10 -machinefile machines.list HYPHYMPI MPITest.bf
>
> But, when using Condor, it gives the following error:
>
> (...)
> running /home/condor/execute/dir_25957/HYPHYMPI on 10 LINUX ch_p4
> processors
> Created /home/condor/execute/dir_25957/PI26038
> [: 68: ==: unexpected operator
> [: 101: ==: unexpected operator
> Could not chdir to home directory /nonexistent: No such file or directory
> sh: -l: not found
> p0_26227:  p4_error: Child process exited while making connection to
> remote
> process on xeon2: 0
> p0_26227: (18.265625) net_send: could not write to fd=4, errno = 32
>
>
>
> Condor runs MPI jobs through the mp1script script (see below), and the
> error arises when the last line, 'mpirun ...', is executed.
>
> The problem seems to be related to running HYPHYMPI in the Condor
> environment, since other Condor jobs run fine, and running just the
> 'mpirun ...' command with the MPITest.bf file shipped with the HYPHY
> distribution also works fine. Specifically, the problem could be related
> to the fact that Condor runs processes in execution directories where it
> transfers the executables and input files needed, and where it writes the
> output files that will be transferred back when execution finishes.
>
> Maybe there is a point in the HYPHYMPI code where it tries to do a 'chdir'
> from the execution directory and raises the error because it cannot? What
> does the error message 'Could not chdir to home directory /nonexistent: No
> such file or directory' mean? Is it trying to change to a directory
> specified by an empty string? Does the error come from not being able to
> change directory, or from a wrong definition of the directory name to
> change to? I've run 'grep -r -i 'chdir' ./*' from the HYPHY_Source folder
> and I get 'Binary file ./HYPHYMPI matches', but it does not match any of
> the source code files of the distribution, so maybe it comes from a
> library linked in at compile time.
>
> It could also be a problem with the communication between the nodes
> executing the parallel MPI processes and the node running the Condor
> process that launches them on the different execution nodes. I can see
> some messages in the Condor log files which could be related to this (see
> below).
>
> HYPHYMPI was compiled on Ubuntu 7.04 (server version) boxes, using
> libcurl3 (version 7.15.5-1ubuntu2.1), libssl0.9.8, libcrypto++5.2c2a, and
> libmpich1.0c2 (version 1.2.7-2).
>
> Any idea?
>
> Thanks in advance, and best wishes,
>
> Javier.
>
>
> ##############
> # testMPI.cmd Condor submit file
> ##############
>
> universe = parallel
> executable = mp1script
> arguments = HYPHYMPI MPITest.bf
> run_as_owner = true
> machine_count = 10
> should_transfer_files = yes
> when_to_transfer_output = on_exit
> transfer_input_files =
> /home/jforment/install/HYPHY_Source/BatchFiles/MPITest.bf
> log = log.$(Process).txt
> output = out.$(NODE).txt
> error = out.$(NODE).txt
> +WantIOProxy = true
> queue
>
>
>
> #############
> # mp1script #
> #############
>
> #!/bin/sh
>
> _CONDOR_PROCNO=$_CONDOR_PROCNO
> _CONDOR_NPROCS=$_CONDOR_NPROCS
>
> echo "nprocs: $_CONDOR_NPROCS"
> echo "procno: $_CONDOR_PROCNO"
>
> # sshd.sh sets ssh without password using ssh-keygen
> CONDOR_LIBEXEC="/usr/local/condor/libexec"
> CONDOR_SSH=$CONDOR_LIBEXEC/condor_ssh
> SSHD_SH=$CONDOR_LIBEXEC/sshd.sh
>
> . $SSHD_SH $_CONDOR_PROCNO $_CONDOR_NPROCS
> #$SSHD_SH $_CONDOR_PROCNO $_CONDOR_NPROCS
>
> # If not the head node, just sleep forever, to let the
> # sshds run
> if [ $_CONDOR_PROCNO -ne 0 ]
> then
>                wait
>                sshd_cleanup
>                exit 0
> fi
>
> EXECUTABLE=$1
> shift
>
> # The binary is copied, but the executable flag is cleared,
> # so the script has to take care of this.
> fullexepath="/usr/local/bin/$EXECUTABLE"
> cp $fullexepath $_CONDOR_SCRATCH_DIR/
> chmod +x $EXECUTABLE
>
> # Set this to the bin directory of MPICH installation
> MPDIR=/usr/lib/mpich/bin
> PATH=$MPDIR:.:$PATH
> export PATH
>
> export P4_RSHCOMMAND=$CONDOR_SSH
>
> CONDOR_CONTACT_FILE=$_CONDOR_SCRATCH_DIR/contact
> export CONDOR_CONTACT_FILE
>
> # The second field in the contact file is the machine name
> # that condor_ssh knows how to use
> sort -n +0 < $CONDOR_CONTACT_FILE | awk '{print $2}' > machines
>
> ## run the actual mpijob
> echo "nprocs: $_CONDOR_NPROCS"
> mpirun -v -np $_CONDOR_NPROCS -machinefile machines $EXECUTABLE $@
>
> sshd_cleanup
>
> rm $EXECUTABLE
>
> exit $?
>
>
> #######################
> # SchedLog extract (at the submit node, not an execution node)
> #######################
>
> (...)
> 10/6 13:09:16 (pid:6354) Sent ad to central manager for
> jforment@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
> 10/6 13:09:16 (pid:6354) Sent ad to 1 collectors for
> jforment@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
> 10/6 13:13:19 (pid:6354) DaemonCore: Command received via UDP from host
> <192.168.1.21:47086>
> 10/6 13:13:19 (pid:6354) DaemonCore: received command 421 (RESCHEDULE),
> calling
> handler (reschedule_negotiator)
> 10/6 13:13:19 (pid:6354) Sent ad to central manager for
> jforment@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
> 10/6 13:13:19 (pid:6354) Sent ad to 1 collectors for
> jforment@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
> 10/6 13:13:19 (pid:6354) Called reschedule_negotiator()
> 10/6 13:13:21 (pid:6354) Inserting new attribute Scheduler into non-active
> cluster cid=2221 acid=-1
> 10/6 13:18:19 (pid:6354) Sent ad to central manager for
> jforment@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
> 10/6 13:18:19 (pid:6354) Sent ad to 1 collectors for
> jforment@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
> 10/6 13:18:20 (pid:6354) Activity on stashed negotiator socket
> 10/6 13:18:20 (pid:6354) Negotiating for owner:
> DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
> 10/6 13:18:20 (pid:6354) Out of requests - 10 reqs matched, 0 reqs idle
> 10/6 13:18:21 (pid:6354) Inserting new attribute Scheduler into non-active
> cluster cid=2221 acid=-1
> 10/6 13:18:23 (pid:6354) Starting add_shadow_birthdate(2221.0)
> 10/6 13:18:23 (pid:6354) Started shadow for job 2221.0 on
> "<192.168.1.25:33961>", (shadow pid = 15015)
> 10/6 13:18:24 (pid:6354) Sent ad to central manager for
> jforment@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
> 10/6 13:18:24 (pid:6354) Sent ad to 1 collectors for
> jforment@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
> 10/6 13:18:24 (pid:6354) DaemonCore: Command received via TCP from host
> <192.168.1.21:49410>
> 10/6 13:18:24 (pid:6354) DaemonCore: received command 71003
> (GIVE_MATCHES),
> calling handler (DedicatedScheduler::giveMatches)
> 10/6 13:18:51 (pid:6354) In DedicatedScheduler::reaper pid 15015 has
> status
> 25600
> 10/6 13:18:51 (pid:6354) Shadow pid 15015 exited with status 100
> 10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
> 10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
> 10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
> 10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
> 10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
> 10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
> 10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
> 10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
> 10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
> 10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
> 10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
> 10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
> 10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
> 10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
> 10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
> 10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
> 10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
> 10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
> 10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
> 10/6 13:18:51 (pid:6354) DedicatedScheduler::deallocMatchRec
> 10/6 13:23:24 (pid:6354) Sent ad to central manager for
> jforment@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
> 10/6 13:23:24 (pid:6354) Sent ad to 1 collectors for
> jforment@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
> 10/6 13:28:24 (pid:6354) Sent ad to central manager for
> jforment@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
> 10/6 13:28:24 (pid:6354) Sent ad to 1 collectors for
> jforment@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
> 10/6 13:28:51 (pid:6354) Resource vm1@xxxxxxxxxxxxxxxxxxxxxxxxxx has been
> unused
> for 600 seconds, limit is 600, releasing
> 10/6 13:28:51 (pid:6354) Resource vm2@xxxxxxxxxxxxxxxxxxxxxxxxxx has been
> unused
> for 600 seconds, limit is 600, releasing
> 10/6 13:28:51 (pid:6354) Resource vm1@xxxxxxxxxxxxxxxxxxxxxxxxxx has been
> unused
> for 600 seconds, limit is 600, releasing
> 10/6 13:28:51 (pid:6354) Resource vm2@xxxxxxxxxxxxxxxxxxxxxxxxxx has been
> unused
> for 600 seconds, limit is 600, releasing
> 10/6 13:28:51 (pid:6354) Resource vm2@xxxxxxxxxxxxxxxxxxxxxxxxxx has been
> unused
> for 600 seconds, limit is 600, releasing
> 10/6 13:28:51 (pid:6354) Resource vm2@xxxxxxxxxxxxxxxxxxxxxxxxxx has been
> unused
> for 600 seconds, limit is 600, releasing
> 10/6 13:28:51 (pid:6354) Resource vm2@xxxxxxxxxxxxxxxxxxxxxxxxxx has been
> unused
> for 600 seconds, limit is 600, releasing
> 10/6 13:28:51 (pid:6354) Resource vm2@xxxxxxxxxxxxxxxxxxxxxxxxxx has been
> unused
> for 600 seconds, limit is 600, releasing
> 10/6 13:28:51 (pid:6354) Resource vm2@xxxxxxxxxxxxxxxxxxxxxxxxxx has been
> unused
> for 600 seconds, limit is 600, releasing
> 10/6 13:28:51 (pid:6354) Resource vm1@xxxxxxxxxxxxxxxxxxxxxxxxxx has been
> unused
> for 600 seconds, limit is 600, releasing
> 10/6 13:28:51 (pid:6354) DaemonCore: Command received via TCP from host
> <192.168.1.23:36660>
> 10/6 13:28:51 (pid:6354) DaemonCore: received command 443
> (VACATE_SERVICE),
> calling handler (vacate_service)
> 10/6 13:28:51 (pid:6354) Got VACATE_SERVICE from <192.168.1.23:36660>
> 10/6 13:28:51 (pid:6354) DaemonCore: Command received via TCP from host
> <192.168.1.22:48284>
> 10/6 13:28:51 (pid:6354) DaemonCore: received command 443
> (VACATE_SERVICE),
> calling handler (vacate_service)
> 10/6 13:28:51 (pid:6354) Got VACATE_SERVICE from <192.168.1.22:48284>
> 10/6 13:28:51 (pid:6354) DaemonCore: Command received via TCP from host
> <192.168.1.24:37331>
> 10/6 13:28:51 (pid:6354) DaemonCore: received command 443
> (VACATE_SERVICE),
> calling handler (vacate_service)
> 10/6 13:28:51 (pid:6354) Got VACATE_SERVICE from <192.168.1.24:37331>
> 10/6 13:28:51 (pid:6354) DaemonCore: Command received via TCP from host
> <192.168.1.23:56530>
> 10/6 13:28:51 (pid:6354) DaemonCore: received command 443
> (VACATE_SERVICE),
> calling handler (vacate_service)
> 10/6 13:28:51 (pid:6354) Got VACATE_SERVICE from <192.168.1.23:56530>
> 10/6 13:28:51 (pid:6354) DaemonCore: Command received via TCP from host
> <192.168.1.26:36217>
> 10/6 13:28:51 (pid:6354) DaemonCore: received command 443
> (VACATE_SERVICE),
> calling handler (vacate_service)
> 10/6 13:28:51 (pid:6354) Got VACATE_SERVICE from <192.168.1.26:36217>
> 10/6 13:28:51 (pid:6354) DaemonCore: Command received via TCP from host
> <192.168.1.27:39719>
> 10/6 13:28:51 (pid:6354) DaemonCore: received command 443
> (VACATE_SERVICE),
> calling handler (vacate_service)
> 10/6 13:28:51 (pid:6354) Got VACATE_SERVICE from <192.168.1.27:39719>
> 10/6 13:28:51 (pid:6354) DaemonCore: Command received via TCP from host
> <192.168.1.28:58522>
> 10/6 13:28:51 (pid:6354) DaemonCore: received command 443
> (VACATE_SERVICE),
> calling handler (vacate_service)
> 10/6 13:28:51 (pid:6354) Got VACATE_SERVICE from <192.168.1.28:58522>
> 10/6 13:28:51 (pid:6354) DaemonCore: Command received via TCP from host
> <192.168.1.24:59889>
> 10/6 13:28:51 (pid:6354) DaemonCore: received command 443
> (VACATE_SERVICE),
> calling handler (vacate_service)
> 10/6 13:28:51 (pid:6354) Got VACATE_SERVICE from <192.168.1.24:59889>
> 10/6 13:28:51 (pid:6354) DaemonCore: Command received via TCP from host
> <192.168.1.22:34057>
> 10/6 13:28:51 (pid:6354) DaemonCore: received command 443
> (VACATE_SERVICE),
> calling handler (vacate_service)
> 10/6 13:28:51 (pid:6354) Got VACATE_SERVICE from <192.168.1.22:34057>
> 10/6 13:28:51 (pid:6354) DaemonCore: Command received via TCP from host
> <192.168.1.25:43167>
> 10/6 13:28:51 (pid:6354) DaemonCore: received command 443
> (VACATE_SERVICE),
> calling handler (vacate_service)
> 10/6 13:28:51 (pid:6354) Got VACATE_SERVICE from <192.168.1.25:43167>
> 10/6 13:33:24 (pid:6354) Sent ad to central manager for
> jforment@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
> 10/6 13:33:24 (pid:6354) Sent ad to 1 collectors for
> jforment@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
> 10/6 13:38:24 (pid:6354) Sent ad to central manager for
> jforment@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
> 10/6 13:38:24 (pid:6354) Sent ad to 1 collectors for
> jforment@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
> (...)
>
>
> ####################
> # ShadowLog extract (at submit node, not an execution node)
> ####################
>
> (...)
> 10/6 13:18:23 ******************************************************
> 10/6 13:18:23 ** condor_shadow (CONDOR_SHADOW) STARTING UP
> 10/6 13:18:23 ** /usr/local/condor-6.8.6/sbin/condor_shadow
> 10/6 13:18:23 ** $CondorVersion: 6.8.6 Sep 13 2007 $
> 10/6 13:18:23 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
> 10/6 13:18:23 ** PID = 15015
> 10/6 13:18:23 ** Log last touched 10/4 18:59:18
> 10/6 13:18:23 ******************************************************
> 10/6 13:18:23 Using config source: /home/condor/condor_config
> 10/6 13:18:23 Using local config sources:
> 10/6 13:18:23    /home/condor/condor_config.local
> 10/6 13:18:23 DaemonCore: Command Socket at <192.168.1.21:45720>
> 10/6 13:18:23 Initializing a PARALLEL shadow for job 2221.0
> 10/6 13:18:24 (2221.0) (15015): Request to run on <192.168.1.25:33961> was
> ACCEPTED
> 10/6 13:18:24 (2221.0) (15015): Request to run on <192.168.1.22:57556> was
> ACCEPTED
> 10/6 13:18:24 (2221.0) (15015): Request to run on <192.168.1.22:57556> was
> ACCEPTED
> 10/6 13:18:25 (2221.0) (15015): Request to run on <192.168.1.23:51800> was
> ACCEPTED
> 10/6 13:18:25 (2221.0) (15015): Request to run on <192.168.1.23:51800> was
> ACCEPTED
> 10/6 13:18:25 (2221.0) (15015): Request to run on <192.168.1.24:35810> was
> ACCEPTED
> 10/6 13:18:25 (2221.0) (15015): Request to run on <192.168.1.24:35810> was
> ACCEPTED
> 10/6 13:18:25 (2221.0) (15015): Request to run on <192.168.1.28:50572> was
> ACCEPTED
> 10/6 13:18:25 (2221.0) (15015): Request to run on <192.168.1.26:40791> was
> ACCEPTED
> 10/6 13:18:25 (2221.0) (15015): Request to run on <192.168.1.27:48119> was
> ACCEPTED
> 10/6 13:18:51 (2221.0) (15015): Job 2221.0 terminated: exited with status
> 0
> 10/6 13:18:51 (2221.0) (15015): **** condor_shadow (condor_SHADOW) EXITING
> WITH
> STATUS 100
>
>
> ###################
> # StartLog extract (at the node running the main Condor process, i.e.
> 'mp1script', which launches the MPI processes on the other nodes)
> ###################
>
> (...)
> 10/6 10:56:45 vm1: Changing activity: Benchmarking -> Idle
> 10/6 13:24:24 DaemonCore: Command received via UDP from host
> <192.168.1.22:37956>
> 10/6 13:24:24 DaemonCore: received command 440 (MATCH_INFO), calling
> handler
> (command_match_info)
> 10/6 13:24:24 vm2: match_info called
> 10/6 13:24:24 vm2: Received match <192.168.1.25:33961>#1222442286#232#...
> 10/6 13:24:24 vm2: State change: match notification protocol successful
> 10/6 13:24:24 vm2: Changing state: Unclaimed -> Matched
> 10/6 13:24:24 DaemonCore: Command received via TCP from host
> <192.168.1.21:51134>
> 10/6 13:24:24 DaemonCore: received command 442 (REQUEST_CLAIM), calling
> handler
> (command_request_claim)
> 10/6 13:24:24 vm2: Request accepted.
> 10/6 13:24:24 vm2: Remote owner is
> DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
> 10/6 13:24:24 vm2: State change: claiming protocol successful
> 10/6 13:24:24 vm2: Changing state: Matched -> Claimed
> 10/6 13:24:29 DaemonCore: Command received via TCP from host
> <192.168.1.21:44154>
> 10/6 13:24:29 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling
> handler
> (command_activate_claim)
> 10/6 13:24:29 vm2: Got activate_claim request from shadow
> (<192.168.1.21:44154>)
> 10/6 13:24:29 vm2: Remote job ID is 2221.0
> 10/6 13:24:29 vm2: Got universe "PARALLEL" (11) from request classad
> 10/6 13:24:29 vm2: State change: claim-activation protocol successful
> 10/6 13:24:29 vm2: Changing activity: Idle -> Busy
> 10/6 13:24:55 DaemonCore: Command received via TCP from host
> <192.168.1.21:43016>
> 10/6 13:24:55 DaemonCore: received command 404
> (DEACTIVATE_CLAIM_FORCIBLY),
> calling handler (command_handler)
> 10/6 13:24:55 vm2: Called deactivate_claim_forcibly()
> 10/6 13:24:55 Starter pid 25957 exited with status 0
> 10/6 13:24:55 vm2: State change: starter exited
> 10/6 13:24:55 vm2: Changing activity: Busy -> Idle
> 10/6 13:24:55 DaemonCore: Command received via TCP from host
> <192.168.1.21:59999>
> 10/6 13:24:55 DaemonCore: received command 403 (DEACTIVATE_CLAIM), calling
> handler (command_handler)
> 10/6 13:24:55 vm2: Called deactivate_claim()
> 10/6 13:34:55 DaemonCore: Command received via TCP from host
> <192.168.1.21:40275>
> 10/6 13:34:55 DaemonCore: received command 443 (RELEASE_CLAIM), calling
> handler
> (command_release_claim)
> 10/6 13:34:55 vm2: State change: received RELEASE_CLAIM command
> 10/6 13:34:55 vm2: Changing state and activity: Claimed/Idle ->
> Preempting/Vacating
> 10/6 13:34:55 vm2: State change: No preempting claim, returning to owner
> 10/6 13:34:55 vm2: Changing state and activity: Preempting/Vacating ->
> Owner/Idle
> 10/6 13:34:55 vm2: State change: IS_OWNER is false
> 10/6 13:34:55 vm2: Changing state: Owner -> Unclaimed
> 10/6 14:56:45 State change: RunBenchmarks is TRUE
> 10/6 14:56:45 vm1: Changing activity: Idle -> Benchmarking
> 10/6 14:56:49 State change: benchmarks completed
> 10/6 14:56:49 vm1: Changing activity: Benchmarking -> Idle
> (...)
>
> ##########################
> # StarterLog.vm2 extract (at virtual machine 2 of the node running the main
> Condor process, i.e. 'mp1script', which launches the MPI processes on the
> other nodes)
> ##########################
>
> (...)
> 10/6 13:24:29 ******************************************************
> 10/6 13:24:29 ** condor_starter (CONDOR_STARTER) STARTING UP
> 10/6 13:24:29 ** /usr/local/condor-6.8.6/sbin/condor_starter
> 10/6 13:24:29 ** $CondorVersion: 6.8.6 Sep 13 2007 $
> 10/6 13:24:29 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
> 10/6 13:24:29 ** PID = 25957
> 10/6 13:24:29 ** Log last touched 10/4 18:08:06
> 10/6 13:24:29 ******************************************************
> 10/6 13:24:29 Using config source: /home/condor/condor_config
> 10/6 13:24:29 Using local config sources:
> 10/6 13:24:29    /home/condor/condor_config.local
> 10/6 13:24:29 DaemonCore: Command Socket at <192.168.1.25:54234>
> 10/6 13:24:29 Done setting resource limits
> 10/6 13:24:29 Communicating with shadow <192.168.1.21:45720>
> 10/6 13:24:29 Submitting machine is "bioxeon.ibmcp-cluster.upv.es"
> 10/6 13:24:29 Job has WantIOProxy=true
> 10/6 13:24:29 Initialized IO Proxy.
> 10/6 13:24:29 File transfer completed successfully.
> 10/6 13:24:30 Starting a PARALLEL universe job with ID: 2221.0
> 10/6 13:24:30 IWD: /home/condor/execute/dir_25957
> 10/6 13:24:30 Output file: /home/condor/execute/dir_25957/out.0.txt
> 10/6 13:24:30 Error file: /home/condor/execute/dir_25957/out.0.txt
> 10/6 13:24:30 About to exec /home/condor/execute/dir_25957/condor_exec.exe
> HYPHYMPI MPITest.bf
> 10/6 13:24:30 Create_Process succeeded, pid=25959
> 10/6 13:24:31 IOProxy: accepting connection from 192.168.1.25
> 10/6 13:24:31 IOProxyHandler: closing connection to 192.168.1.25
> 10/6 13:24:34 IOProxy: accepting connection from 192.168.1.25
> 10/6 13:24:34 IOProxyHandler: closing connection to 192.168.1.25
> 10/6 13:24:34 IOProxy: accepting connection from 192.168.1.25
> 10/6 13:24:34 IOProxyHandler: closing connection to 192.168.1.25
> 10/6 13:24:35 IOProxy: accepting connection from 192.168.1.25
> 10/6 13:24:35 IOProxyHandler: closing connection to 192.168.1.25
> 10/6 13:24:36 IOProxy: accepting connection from 192.168.1.25
> 10/6 13:24:36 IOProxyHandler: closing connection to 192.168.1.25
> 10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
> 10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
> 10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
> 10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
> 10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
> 10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
> 10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
> 10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
> 10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
> 10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
> 10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
> 10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
> 10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
> 10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
> 10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
> 10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
> 10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
> 10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
> 10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
> 10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
> 10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
> 10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
> 10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
> 10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
> 10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
> 10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
> 10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
> 10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
> 10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
> 10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
> 10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
> 10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
> 10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
> 10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
> 10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
> 10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
> 10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
> 10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
> 10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
> 10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
> 10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
> 10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
> 10/6 13:24:37 IOProxy: accepting connection from 192.168.1.25
> 10/6 13:24:37 IOProxyHandler: closing connection to 192.168.1.25
> 10/6 13:24:55 Process exited, pid=25959, status=0
> 10/6 13:24:55 Got SIGQUIT.  Performing fast shutdown.
> 10/6 13:24:55 ShutdownFast all jobs.
> 10/6 13:24:55 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0
>
>
> ###############
> # StartLog (at node executing one of the MPI parallel processes)
> ###############
>
> 10/6 13:18:01 DaemonCore: Command received via UDP from host
> <192.168.1.22:37956>
> 10/6 13:18:01 DaemonCore: received command 440 (MATCH_INFO), calling
> handler
> (command_match_info)
> 10/6 13:18:01 vm2: match_info called
> 10/6 13:18:01 vm2: Received match <192.168.1.22:57556>#1222440814#310#...
> 10/6 13:18:01 vm2: State change: match notification protocol successful
> 10/6 13:18:01 vm2: Changing state: Unclaimed -> Matched
> 10/6 13:18:01 DaemonCore: Command received via UDP from host
> <192.168.1.22:37956>
> 10/6 13:18:01 DaemonCore: received command 440 (MATCH_INFO), calling
> handler
> (command_match_info)
> 10/6 13:18:01 vm1: match_info called
> 10/6 13:18:01 vm1: Received match <192.168.1.22:57556>#1222440814#308#...
> 10/6 13:18:01 vm1: State change: match notification protocol successful
> 10/6 13:18:01 vm1: Changing state: Unclaimed -> Matched
> 10/6 13:18:01 DaemonCore: Command received via TCP from host
> <192.168.1.21:46001>
> 10/6 13:18:01 DaemonCore: received command 442 (REQUEST_CLAIM), calling
> handler
> (command_request_claim)
> 10/6 13:18:01 vm2: Request accepted.
> 10/6 13:18:01 vm2: Remote owner is
> DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
> 10/6 13:18:01 vm2: State change: claiming protocol successful
> 10/6 13:18:01 vm2: Changing state: Matched -> Claimed
> 10/6 13:18:01 DaemonCore: Command received via TCP from host
> <192.168.1.21:60616>
> 10/6 13:18:01 DaemonCore: received command 442 (REQUEST_CLAIM), calling
> handler
> (command_request_claim)
> 10/6 13:18:01 vm1: Request accepted.
> 10/6 13:18:01 vm1: Remote owner is
> DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
> 10/6 13:18:01 vm1: State change: claiming protocol successful
> 10/6 13:18:01 vm1: Changing state: Matched -> Claimed
> 10/6 13:18:05 DaemonCore: Command received via TCP from host
> <192.168.1.21:42541>
> 10/6 13:18:05 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling
> handler
> (command_activate_claim)
> 10/6 13:18:05 vm1: Got activate_claim request from shadow
> (<192.168.1.21:42541>)
> 10/6 13:18:05 vm1: Remote job ID is 2221.0
> 10/6 13:18:05 vm1: Got universe "PARALLEL" (11) from request classad
> 10/6 13:18:05 vm1: State change: claim-activation protocol successful
> 10/6 13:18:05 vm1: Changing activity: Idle -> Busy
> 10/6 13:18:05 DaemonCore: Command received via TCP from host
> <192.168.1.21:52511>
> 10/6 13:18:05 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling
> handler
> (command_activate_claim)
> 10/6 13:18:05 vm2: Got activate_claim request from shadow
> (<192.168.1.21:52511>)
> 10/6 13:18:05 vm2: Remote job ID is 2221.0
> 10/6 13:18:05 vm2: Got universe "PARALLEL" (11) from request classad
> 10/6 13:18:05 vm2: State change: claim-activation protocol successful
> 10/6 13:18:05 vm2: Changing activity: Idle -> Busy
> 10/6 13:18:32 DaemonCore: Command received via TCP from host
> <192.168.1.21:34804>
> 10/6 13:18:32 DaemonCore: received command 404
> (DEACTIVATE_CLAIM_FORCIBLY),
> calling handler (command_handler)
> 10/6 13:18:32 vm1: Called deactivate_claim_forcibly()
> 10/6 13:18:32 DaemonCore: Command received via TCP from host
> <192.168.1.21:42458>
> 10/6 13:18:32 DaemonCore: received command 404
> (DEACTIVATE_CLAIM_FORCIBLY),
> calling handler (command_handler)
> 10/6 13:18:32 vm2: Called deactivate_claim_forcibly()
> 10/6 13:18:32 Starter pid 25175 exited with status 0
> 10/6 13:18:32 vm1: State change: starter exited
> 10/6 13:18:32 vm1: Changing activity: Busy -> Idle
> 10/6 13:18:32 Starter pid 25176 exited with status 0
> 10/6 13:18:32 vm2: State change: starter exited
> 10/6 13:18:32 vm2: Changing activity: Busy -> Idle
> 10/6 13:18:32 DaemonCore: Command received via TCP from host
> <192.168.1.21:33267>
> 10/6 13:18:32 DaemonCore: received command 403 (DEACTIVATE_CLAIM), calling
> handler (command_handler)
> 10/6 13:18:32 vm1: Called deactivate_claim()
> 10/6 13:18:32 DaemonCore: Command received via TCP from host
> <192.168.1.21:37421>
> 10/6 13:18:32 DaemonCore: received command 403 (DEACTIVATE_CLAIM), calling
> handler (command_handler)
> 10/6 13:18:32 vm2: Called deactivate_claim()
> 10/6 13:28:31 DaemonCore: Command received via TCP from host
> <192.168.1.21:43553>
> 10/6 13:28:31 DaemonCore: received command 443 (RELEASE_CLAIM), calling
> handler
> (command_release_claim)
> 10/6 13:28:31 vm2: State change: received RELEASE_CLAIM command
> 10/6 13:28:31 vm2: Changing state and activity: Claimed/Idle ->
> Preempting/Vacating
> 10/6 13:28:31 vm2: State change: No preempting claim, returning to owner
> 10/6 13:28:31 vm2: Changing state and activity: Preempting/Vacating ->
> Owner/Idle
> 10/6 13:28:31 vm2: State change: IS_OWNER is false
> 10/6 13:28:31 vm2: Changing state: Owner -> Unclaimed
> 10/6 13:28:31 DaemonCore: Command received via TCP from host
> <192.168.1.21:48518>
> 10/6 13:28:31 DaemonCore: received command 443 (RELEASE_CLAIM), calling
> handler
> (command_release_claim)
> 10/6 13:28:31 vm1: State change: received RELEASE_CLAIM command
> 10/6 13:28:31 vm1: Changing state and activity: Claimed/Idle ->
> Preempting/Vacating
> 10/6 13:28:31 vm1: State change: No preempting claim, returning to owner
> 10/6 13:28:31 vm1: Changing state and activity: Preempting/Vacating ->
> Owner/Idle
> 10/6 13:28:31 vm1: State change: IS_OWNER is false
> 10/6 13:28:31 vm1: Changing state: Owner -> Unclaimed
> (...)
>
>
> ##################
> # StarterLog.vm2 (at virtual machine 2 of node executing one of the MPI
> parallel
> processes)
> ##################
>
> 10/4 18:59:01 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0
> 10/6 13:18:05 ******************************************************
> 10/6 13:18:05 ** condor_starter (CONDOR_STARTER) STARTING UP
> 10/6 13:18:05 ** /usr/local/condor-6.8.6/sbin/condor_starter
> 10/6 13:18:05 ** $CondorVersion: 6.8.6 Sep 13 2007 $
> 10/6 13:18:05 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
> 10/6 13:18:05 ** PID = 25176
> 10/6 13:18:05 ** Log last touched 10/4 18:59:01
> 10/6 13:18:05 ******************************************************
> 10/6 13:18:05 Using config source: /home/condor/condor_config
> 10/6 13:18:05 Using local config sources:
> 10/6 13:18:05    /home/condor/condor_config.local
> 10/6 13:18:05 DaemonCore: Command Socket at <192.168.1.22:42909>
> 10/6 13:18:05 Done setting resource limits
> 10/6 13:18:05 Communicating with shadow <192.168.1.21:45720>
> 10/6 13:18:05 Submitting machine is "bioxeon.ibmcp-cluster.upv.es"
> 10/6 13:18:05 Job has WantIOProxy=true
> 10/6 13:18:05 Initialized IO Proxy.
> 10/6 13:18:05 File transfer completed successfully.
> 10/6 13:18:06 Starting a PARALLEL universe job with ID: 2221.0
> 10/6 13:18:06 IWD: /home/condor/execute/dir_25176
> 10/6 13:18:06 Output file: /home/condor/execute/dir_25176/out.2.txt
> 10/6 13:18:06 Error file: /home/condor/execute/dir_25176/out.2.txt
> 10/6 13:18:06 About to exec /home/condor/execute/dir_25176/condor_exec.exe
> HYPHYMPI MPITest.bf
> 10/6 13:18:06 Create_Process succeeded, pid=25180
> 10/6 13:18:08 IOProxy: accepting connection from 192.168.1.22
> 10/6 13:18:08 IOProxyHandler: closing connection to 192.168.1.22
> 10/6 13:18:12 IOProxy: accepting connection from 192.168.1.22
> 10/6 13:18:12 IOProxyHandler: closing connection to 192.168.1.22
> 10/6 13:18:32 Got SIGQUIT.  Performing fast shutdown.
> 10/6 13:18:32 ShutdownFast all jobs.
> 10/6 13:18:32 Process exited, pid=25180, status=0
> 10/6 13:18:32 condor_write(): Socket closed when trying to write 260 bytes
> to
> <192.168.1.21:52511>, fd is 5
> 10/6 13:18:32 Buf::write(): condor_write() failed
> 10/6 13:18:32 Failed to send job exit status to shadow
> 10/6 13:18:32 JobExit() failed, waiting for job lease to expire or for a
> reconnect attempt
> 10/6 13:18:32 Last process exited, now Starter is exiting
> 10/6 13:18:32 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0
>
>
>
>
> --
> Javier Forment Millet
> Instituto de Biología Celular y Molecular de Plantas (IBMCP) CSIC-UPV
>  Ciudad Politécnica de la Innovación (CPI) Edificio 8 E, Escalera 7 Puerta E
>  Calle Ing. Fausto Elio s/n. 46022 Valencia, Spain
> Tlf.:+34-96-3877858
> FAX: +34-96-3877859
> jforment@xxxxxxxxxxxx
>
> _______________________________________________
> Condor-users mailing list
> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/condor-users/
>
>


-- 
Javier Forment Millet
Instituto de Biología Celular y Molecular de Plantas (IBMCP) CSIC-UPV
 Ciudad Politécnica de la Innovación (CPI) Edificio 8 E, Escalera 7 Puerta E
 Calle Ing. Fausto Elio s/n. 46022 Valencia, Spain
Tlf.:+34-96-3877858
FAX: +34-96-3877859
jforment@xxxxxxxxxxxx