
[Condor-users] Properly terminated jobs get restarted



Dear all,

For a couple of weeks we have had the following problem: when a job terminates, a shadow exception occurs, the job is marked as idle, and Condor starts it again. According to the ShadowLog, the exception is due to a handshake failure (see the excerpt below). Error number 104 indicates that the connection was reset by the peer.
This behavior first appeared in mid-March, three weeks after the last change to our Condor configuration. Can anybody give me a hint on how to solve this problem? We are running condor-6.6.8 on a couple of x86 workstations and on a Xeon cluster.
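
For reference, the meaning of the error number can be checked on the submit machine itself. The following is only a minimal sketch; the helper name describe_errno is made up for illustration, and the numeric values of errno constants are platform-specific (on x86 Linux, 104 is ECONNRESET).

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return 'NAME: message' for an errno value on the current platform.

    describe_errno is a hypothetical helper, not part of Condor.
    """
    # errno.errorcode maps numeric values to symbolic names for this platform.
    name = errno.errorcode.get(code, "UNKNOWN")
    # os.strerror returns the C library's message for the value.
    return f"{name}: {os.strerror(code)}"


if __name__ == "__main__":
    # 104 is the value reported by condor_read() in the ShadowLog.
    print(describe_errno(104))
```

ECONNRESET means the remote side (here, apparently the schedd) closed the connection abruptly while the shadow was waiting for the handshake reply.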




excerpt from ShadowLog:

794:414755:4/20 15:33:20 (fd:5) (4623.17) (27035):SECMAN: Security Policy:
795:414773:4/20 15:33:20 (fd:5) (4623.17) (27035):SECMAN: negotiating security for command 1111.
796:414774:4/20 15:33:20 (fd:5) (4623.17) (27035):SECMAN: sending DC_AUTHENTICATE command
797:414775:4/20 15:33:20 (fd:5) (4623.17) (27035):SECMAN: sending following classad:
798:414793:4/20 15:33:20 (fd:5) (4623.17) (27035):SECMAN: startCommand succeeded.
799:414794:4/20 15:33:20 (fd:5) (4623.17) (27035):AUTHENTICATE: in authenticate( addr == '<134.76.88.65:49911>', methods == 'FS,KERBEROS,GSI')
800:414795:4/20 15:33:20 (fd:5) (4623.17) (27035):AUTHENTICATE: can still try these methods: FS,KERBEROS,GSI
801:414796:4/20 15:33:20 (fd:5) (4623.17) (27035):HANDSHAKE: in handshake(my_methods = 'FS,KERBEROS,GSI')
802:414797:4/20 15:33:20 (fd:5) (4623.17) (27035):HANDSHAKE: handshake() - i am the client
803:414798:4/20 15:33:20 (fd:5) (4623.17) (27035):HANDSHAKE: sending (methods == 100) to server
804:414799:4/20 15:33:20 (fd:5) (4623.17) (27035):condor_read(): recv() returned -1, errno = 104, assuming failure.
805:414800:4/20 15:33:20 (fd:5) (4623.17) (27035):AUTHENTICATE: handshake failed!
806:414801:4/20 15:33:20 (fd:5) (4623.17) (27035):AUTHENTICATE: auth_status == 0 (?!?)
807:414802:4/20 15:33:20 (fd:5) (4623.17) (27035):Authentication was a FAILURE.
808:414803:4/20 15:33:20 (fd:5) (4623.17) (27035):CLOSE <134.76.88.65:58830> fd=4
809:414804:4/20 15:33:20 (fd:4) (4623.17) (27035):Authentication Error
810:414806:4/20 15:33:20 (fd:4) (4623.17) (27035):Destroying Daemon object:
811:414807:4/20 15:33:20 (fd:4) (4623.17) (27035):Type: 3 (schedd), Name: (null), Addr: <134.76.88.65:49911>
812:414808:4/20 15:33:20 (fd:4) (4623.17) (27035):FullHost: (null), Host: (null), Pool: (null), Port: 49911
813:414809:4/20 15:33:20 (fd:4) (4623.17) (27035):IsLocal: N, IdStr: (null), Error: (null)
814:414810:4/20 15:33:20 (fd:4) (4623.17) (27035): --- End of Daemon object info ---
815:414811:4/20 15:33:20 (fd:4) (4623.17) (27035):ERROR "Failed to connect to schedd!" at line 1000 in file shadow.C
816:414812:4/20 15:33:20 (fd:4) (4623.17) (27035):PRIV_CONDOR --> PRIV_USER at user_log.C:240
817:414813:4/20 15:33:20 (fd:4) (4623.17) (27035):PRIV_USER --> PRIV_CONDOR at user_log.C:292
818:414814:4/20 15:33:20 (fd:4) (4623.17) (27035):Shadow: Entered DoCleanup()
819:414815:4/20 15:33:20 (fd:4) (4623.17) (27035):Shadow: DoCleanup: unlinking TmpCkpt '/home/condor/hosts/Sisko/spool/cluster4623.proc17.subproc0.tmp'
820:414816:4/20 15:33:20 (fd:4) (4623.17) (27035):Trying to unlink /home/condor/hosts/Sisko/spool/cluster4623.proc17.subproc0.tmp
821:414817:4/20 15:33:20 (fd:4) (4623.17) (27035):Remove from ckpt server returns -1
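
To see how often this happens and for which jobs, a full ShadowLog could be scanned for the same recv() failure. This is a rough sketch that assumes every line has the layout shown in the excerpt above; the function name find_recv_failures and the regular expression are my own and not part of Condor.

```python
import re

# Matches lines such as:
#   ... (4623.17) (27035):condor_read(): recv() returned -1, errno = 104, ...
# The layout is assumed from the excerpt; other debug levels may differ.
RECV_FAIL = re.compile(
    r"\((?P<job>\d+\.\d+)\) \((?P<pid>\d+)\):"
    r"condor_read\(\): recv\(\) returned -1, errno = (?P<errno>\d+)"
)


def find_recv_failures(lines):
    """Return (job id, shadow pid, errno) for every failed recv() line."""
    hits = []
    for line in lines:
        match = RECV_FAIL.search(line)
        if match:
            hits.append(
                (match["job"], int(match["pid"]), int(match["errno"]))
            )
    return hits


if __name__ == "__main__":
    sample = [
        "803:414798:4/20 15:33:20 (fd:5) (4623.17) (27035):"
        "HANDSHAKE: sending (methods == 100) to server",
        "804:414799:4/20 15:33:20 (fd:5) (4623.17) (27035):"
        "condor_read(): recv() returned -1, errno = 104, assuming failure.",
    ]
    print(find_recv_failures(sample))  # [('4623.17', 27035, 104)]
```

For a real log, the list of lines would come from reading the ShadowLog file, e.g. by passing an open file object to find_recv_failures.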


With best regards,

Stephan Kramer

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% %
% Stephan Kramer %
% Institut fuer Theoretische Physik %
% Universitaet Goettingen %
% Friedrich-Hund-Platz 1 %
% 37077 Goettingen %
% %
% phone: +49-551-39-9567 %
% fax: +49-551-39-9631 %
% %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%