
Re: [Condor-users] mpi jobs not dying properly



Well, I think the problem is becoming apparent.
Condor launches a startd, which calls mp2script.sh; the script runs mpdboot to start the MVAPICH2 daemons on all the machines, and then runs mpiexec to launch the job. When you do a condor_rm, it kills the startd and it kills the mpd ring, but that leaves the actual mpiexec processes orphaned. Somehow I need Condor to keep track of what the script is running, and clean up after itself.

On Aug 20, 2009, at 13:34 , Peter Doherty wrote:

I turned on D_NETWORK debugging on the STARTD.


from the machine with the zombie processes:


8/20 13:31:24 ACCEPT from=<10.0.10.43:47668> newfd=6 to=<10.0.10.43:34330>
8/20 13:31:24 condor_read(fd=6 <10.0.10.43:47668>,,size=4,timeout=1,flags=2)
8/20 13:31:24 condor_read(fd=6 <10.0.10.43:47668>,,size=5,timeout=1,flags=0)
8/20 13:31:24 condor_read(fd=6 <10.0.10.43:47668>,,size=698,timeout=1,flags=0)
8/20 13:31:24 encrypting secret
8/20 13:31:24 condor_read(fd=6 <10.0.10.43:47668>,,size=5,timeout=20,flags=0)
8/20 13:31:24 condor_read(fd=6 <10.0.10.43:47668>,,size=74,timeout=20,flags=0)
8/20 13:31:24 Stream::get(int) incorrect pad received: 4b
8/20 13:31:24 Can't read ClaimId
8/20 13:31:24 condor_write(fd=6 <10.0.10.43:47668>,,size=13,timeout=20,flags=0)
8/20 13:31:24 condor_write(): socket 6 is readable
8/20 13:31:24 condor_write(): Socket closed when trying to write 13 bytes to <10.0.10.43:47668>, fd is 6
8/20 13:31:24 Buf::write(): condor_write() failed

from the machine running the master MPI process:

8/20 13:31:00 ACCEPT from=<10.0.10.43:55702> newfd=6 to=<10.0.10.42:54879>
8/20 13:31:00 condor_read(fd=6 <10.0.10.43:55702>,,size=4,timeout=1,flags=2)
8/20 13:31:00 condor_read(fd=6 <10.0.10.43:55702>,,size=5,timeout=1,flags=0)
8/20 13:31:00 condor_read(fd=6 <10.0.10.43:55702>,,size=699,timeout=1,flags=0)
8/20 13:31:00 encrypting secret
8/20 13:31:00 condor_read(fd=6 <10.0.10.43:55702>,,size=5,timeout=20,flags=0)
8/20 13:31:00 condor_read(fd=6 <10.0.10.43:55702>,,size=74,timeout=20,flags=0)
8/20 13:31:00 Stream::get(int) incorrect pad received: ffffffc0
8/20 13:31:00 Can't read ClaimId
8/20 13:31:00 condor_write(fd=6 <10.0.10.43:55702>,,size=13,timeout=20,flags=0)
8/20 13:31:00 condor_write(): socket 6 is readable
8/20 13:31:00 condor_write(): Socket closed when trying to write 13 bytes to <10.0.10.43:55702>, fd is 6
8/20 13:31:00 Buf::write(): condor_write() failed
8/20 13:31:00 CLOSE <10.0.10.42:54879> fd=6



On Aug 20, 2009, at 12:54 , Peter Doherty wrote:

I finally got an MPI job to run on a couple nodes in the cluster from
a condor job.

When I do a condor_rm on the job, it only dies on the node that
is running the master process.
I've got these errors in my StartLog

8/20 12:52:05 Can't read ClaimId
8/20 12:52:05 condor_write(): Socket closed when trying to write 13 bytes to <10.0.10.43:37598>, fd is 6
8/20 12:52:05 Buf::write(): condor_write() failed


10.0.10.43 is the node that still has the child process running, and I
have to manually kill it.
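Until the cleanup works automatically, the manual kill on the affected node can be scripted with pkill. This is just a sketch, assuming the stray ranks still show mpiexec (or your job binary's name) in their command line; adjust the pattern to match your job:

```shell
# Ask leftover MPI processes (matched by full command line, -f) to exit,
# then force-kill anything still alive a few seconds later.
pkill -TERM -f mpiexec || true
sleep 5
pkill -KILL -f mpiexec || true
```

The `|| true` keeps the script from failing when no matching process is found (pkill exits nonzero in that case).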

Any thoughts?
Thanks.

--Peter
_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx
with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/

