
Re: [Condor-users] mpi jobs not dying properly



Well, I think the problem is becoming apparent.
Condor launches a startd, which calls mp2script.sh; the script runs mpdboot to start the MVAPICH2 daemons on all the machines, and then runs mpiexec to launch the job. When you do a condor_rm, it kills the startd and it kills the mpd ring, but that leaves the actual mpiexec processes orphaned. Somehow I need Condor to keep track of what the script is running, and clean up after itself.

On Aug 20, 2009, at 13:34 , Peter Doherty wrote:

I turned on D_NETWORK debugging on the STARTD.


from the machine with the zombie processes:


8/20 13:31:24 ACCEPT from=<10.0.10.43:47668> newfd=6 to=<10.0.10.43:34330>
8/20 13:31:24 condor_read(fd=6 <10.0.10.43:47668>,,size=4,timeout=1,flags=2)
8/20 13:31:24 condor_read(fd=6 <10.0.10.43:47668>,,size=5,timeout=1,flags=0)
8/20 13:31:24 condor_read(fd=6 <10.0.10.43:47668>,,size=698,timeout=1,flags=0)
8/20 13:31:24 encrypting secret
8/20 13:31:24 condor_read(fd=6 <10.0.10.43:47668>,,size=5,timeout=20,flags=0)
8/20 13:31:24 condor_read(fd=6 <10.0.10.43:47668>,,size=74,timeout=20,flags=0)
8/20 13:31:24 Stream::get(int) incorrect pad received: 4b
8/20 13:31:24 Can't read ClaimId
8/20 13:31:24 condor_write(fd=6 <10.0.10.43:47668>,,size=13,timeout=20,flags=0)
8/20 13:31:24 condor_write(): socket 6 is readable
8/20 13:31:24 condor_write(): Socket closed when trying to write 13 bytes to <10.0.10.43:47668>, fd is 6
8/20 13:31:24 Buf::write(): condor_write() failed

from the machine running the master MPI process:

8/20 13:31:00 ACCEPT from=<10.0.10.43:55702> newfd=6 to=<10.0.10.42:54879>
8/20 13:31:00 condor_read(fd=6 <10.0.10.43:55702>,,size=4,timeout=1,flags=2)
8/20 13:31:00 condor_read(fd=6 <10.0.10.43:55702>,,size=5,timeout=1,flags=0)
8/20 13:31:00 condor_read(fd=6 <10.0.10.43:55702>,,size=699,timeout=1,flags=0)
8/20 13:31:00 encrypting secret
8/20 13:31:00 condor_read(fd=6 <10.0.10.43:55702>,,size=5,timeout=20,flags=0)
8/20 13:31:00 condor_read(fd=6 <10.0.10.43:55702>,,size=74,timeout=20,flags=0)
8/20 13:31:00 Stream::get(int) incorrect pad received: ffffffc0
8/20 13:31:00 Can't read ClaimId
8/20 13:31:00 condor_write(fd=6 <10.0.10.43:55702>,,size=13,timeout=20,flags=0)
8/20 13:31:00 condor_write(): socket 6 is readable
8/20 13:31:00 condor_write(): Socket closed when trying to write 13 bytes to <10.0.10.43:55702>, fd is 6
8/20 13:31:00 Buf::write(): condor_write() failed
8/20 13:31:00 CLOSE <10.0.10.42:54879> fd=6



On Aug 20, 2009, at 12:54 , Peter Doherty wrote:

I finally got an MPI job to run on a couple nodes in the cluster from
a condor job.

When I do a condor_rm on the job, it only dies on the node that
is running the master process.
I've got these errors in my StartLog

8/20 12:52:05 Can't read ClaimId
8/20 12:52:05 condor_write(): Socket closed when trying to write 13 bytes to <10.0.10.43:37598>, fd is 6
8/20 12:52:05 Buf::write(): condor_write() failed


10.0.10.43 is the node that still has the child process running, and I
have to manually kill it.
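Until the cleanup works automatically, the manual kill on the affected node can be scripted with pkill. This is just a sketch, assuming the stray ranks still show mpiexec (or your job binary's name) in their command line; adjust the pattern to match your job:

```shell
# Ask leftover MPI processes (matched by full command line, -f) to exit,
# then force-kill anything still alive a few seconds later.
pkill -TERM -f mpiexec || true
sleep 5
pkill -KILL -f mpiexec || true
```

The `|| true` keeps the script from failing when no matching process is found (pkill exits nonzero in that case).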

Any thoughts?
Thanks.

--Peter
_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx
with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/

