Re: [HTCondor-users] Prevent slot from re-entering pool after Shadow exception/cannot allocate memory


On Wed, Mar 8, 2017 at 9:51 AM, Greg Thain <gthain@xxxxxxxxxxx> wrote:

On 03/08/2017 07:44 AM, Zell, Wesley wrote:

I am using Condor on a dedicated CentOS7 cluster to run a parameter estimation process (PEST/BeoPEST). In brief: a master instance of the software runs on the submit node, while worker instances of the software run within the Condor working directory on the execute nodes.

On occasion, after the process hums along happily for a few hours, I get the following combination of errors that crashes the job:

1. The communication between the submit node and an execute node is interrupted;

2. When the communication is re-established (or appears to be), the software instance on the submit node crashes.

My specific questions:

a. What is the cause of the first error (i.e., the communication interruption - see log file excerpt below)?

This is hard to say. Some network interruption between the submit side and the execute side is closing a TCP connection unexpectedly, even though the processes are running on both sides. It does appear from the error messages that the job is running out of memory -- could there be a memory leak, or some other problem that is causing excessive memory usage?

b. If the communication disruption cannot be prevented, can I prevent the slot from rejoining the pool after interruption? (My software has some resiliency with respect to failed workers, but apparently not to workers that disappear then rejoin.)

Assuming you want to prevent the *job* from restarting, there's a recipe on the wiki that does just this:

https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToAvoidJobRestarts

-greg
Thanks.

Here are more details:

***************************

(1) Excerpt from log file:

***************************

...

006 (113.032.000) 03/07 23:57:30 Image size of job updated: 2695712
    445  -  MemoryUsage of job (MB)
    2010192  -  ResidentSetSize of job (KB)

...

022 (113.045.000) 03/07 23:57:31 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to slot6@node3 <192.168.1.30:32998?addrs=192.168.1.30-32998>

...

023 (113.045.000) 03/07 23:57:31 Job reconnected to slot6@node3
    startd address: <192.168.1.30:32998?addrs=192.168.1.30-32998>
    starter address: <192.168.1.30:7920?addrs=192.168.1.30-7920>

...

007 (113.045.000) 03/07 23:57:31 Shadow exception!
    Error from slot6@node3: Couldn't reopen Logs/worker_113_45.out to stream stdout: Cannot allocate memory
    0  -  Run Bytes Sent By Job
    165274896  -  Run Bytes Received By Job
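
For anyone who wants to find this pattern in a longer log, here is a minimal Python sketch that scans a user log for jobs where a 022 (disconnected) event is later followed by a 007 (shadow exception) event, as in the excerpt above. It assumes event header lines look exactly like the ones quoted here (three-digit code, job id in parentheses, MM/DD date, HH:MM:SS time); other HTCondor versions may format the header differently. The names `parse_events` and `jobs_failing_after_reconnect` are made up for illustration and are not part of HTCondor.

```python
import re

# Header line of a user-log event, in the layout seen in the excerpt, e.g.
# "022 (113.045.000) 03/07 23:57:31 Job disconnected, attempting to reconnect"
EVENT_RE = re.compile(
    r"^(?P<code>\d{3}) \((?P<cluster>\d+)\.(?P<proc>\d+)\.\d+\) "
    r"(?P<date>\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) (?P<text>.*)$"
)


def parse_events(log_text):
    """Split log text into events; indented lines are attached as details."""
    events = []
    for raw in log_text.splitlines():
        line = raw.rstrip()
        match = EVENT_RE.match(line)
        if match:
            events.append({
                "code": match["code"],
                # "113.045" in the log corresponds to job id 113.45
                "job": f"{int(match['cluster'])}.{int(match['proc'])}",
                "when": f"{match['date']} {match['time']}",
                "text": match["text"],
                "details": [],
            })
        elif events and line.strip() and line.strip() != "...":
            # Continuation line belonging to the most recent event header
            events[-1]["details"].append(line.strip())
    return events


def jobs_failing_after_reconnect(events):
    """Map job id -> first detail line of a 007 event that follows a 022."""
    disconnected = set()
    flagged = {}
    for event in events:
        if event["code"] == "022":
            disconnected.add(event["job"])
        elif event["code"] == "007" and event["job"] in disconnected:
            details = event["details"]
            flagged[event["job"]] = details[0] if details else ""
    return flagged


# Shortened copy of the excerpt from this thread
SAMPLE = """\
006 (113.032.000) 03/07 23:57:30 Image size of job updated: 2695712
    445  -  MemoryUsage of job (MB)
    2010192  -  ResidentSetSize of job (KB)
...
022 (113.045.000) 03/07 23:57:31 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
...
023 (113.045.000) 03/07 23:57:31 Job reconnected to slot6@node3
...
007 (113.045.000) 03/07 23:57:31 Shadow exception!
    Error from slot6@node3: Couldn't reopen Logs/worker_113_45.out to stream stdout: Cannot allocate memory
"""

if __name__ == "__main__":
    for job, reason in jobs_failing_after_reconnect(parse_events(SAMPLE)).items():
        print(job, "->", reason)
```

Run against the full log, this only lists candidate jobs; it does not say whether the memory shortage is on the execute node or the submit node.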

**********************************************

(2) Error message on submit node stdout:

[this is actually the TCP reappearance of the worker after the disconnection]

***********************************************

New worker has appeared: slot6@node3

Fortran runtime error: cannot allocate memory
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@cs.wisc.edu with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/
