[Condor-users] Condor JobDisconnectedEvent::writeEvent() called without startd_addr error



Hi,
    I am in the unenviable position of communicating a problem without having the privileges for, or an understanding of, the underlying programmatic architecture, so please bear with me.

We have a user who can complete his ABAQUS job, which uses MPI to distribute the model across cluster nodes under Condor, when he requests only 8 compute elements. However, larger runs (24 CEs) terminate with the shadow exception included below:

Subject: Condor JobDisconnectedEvent::writeEvent() called without startd_addr

I am running an ABAQUS analysis on a cluster using Condor and MPI. However, after running for several hours, the job is shutting down without any reason related to ABAQUS. The condor log is showing a message as follows:
 
022 (461459.000.000) 10/10 13:59:48 007 (461459.000.000) 10/10 13:59:48
Shadow exception!
        JobDisconnectedEvent::writeEvent() called without startd_addr
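
A minimal sketch for pulling the event headers out of a user log like the excerpt above, so that their timestamps can be lined up against packet captures or NFS logs from the same moment. The pattern is inferred only from the lines shown here (event number, job id in parentheses, date, time); the function name `extract_events` and the `sample` string are illustrative assumptions and not part of Condor.

```python
import re

# Event header as it appears in the excerpt above: a three-digit event number,
# a (cluster.proc.subproc) job id, then an MM/DD date and an HH:MM:SS time.
EVENT_HEADER = re.compile(
    r"(?P<event>\d{3}) \((?P<cluster>\d+)\.(?P<proc>\d+)\.(?P<subproc>\d+)\) "
    r"(?P<date>\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2})"
)


def extract_events(log_text):
    """Return (event_number, job_id, date, time) tuples in order of appearance."""
    events = []
    for m in EVENT_HEADER.finditer(log_text):
        job_id = f"{m['cluster']}.{m['proc']}.{m['subproc']}"
        events.append((m["event"], job_id, m["date"], m["time"]))
    return events


# Sample text copied from the log excerpt in this message.
sample = (
    "022 (461459.000.000) 10/10 13:59:48 007 (461459.000.000) 10/10 13:59:48\n"
    "Shadow exception!\n"
)

for event in extract_events(sample):
    print(event)
```

Running this over the full log file (read in as a string) would give a list of timestamps to hand to the systems group as the windows of interest.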


We have tried several corrective actions, based on our assumption that this is a network/filesystem issue (a specific file not being available when needed):
1- moved the NFS-based filesystem from going through NAT on the head node of the cluster to being directly connected via Ethernet ports to each node in the cluster
2- changed the underlying NFS protocol to use TCP instead of UDP

Does anyone have suggestions on what information I need to ask our systems group to capture, such as packet data or the sockets in use at the time the error is thrown, in order to troubleshoot this problem? Thank you for any suggestions to help me through this.

Brandon

Brandon Leeds
Lehigh University