
[HTCondor-users] Strange Condor Behavior - Possible Bug



I’m having a hard time troubleshooting a problem we’re seeing with our HTCondor jobs and was hoping to get some help from the community. I have looked at a lot of the online documentation and consulted our internal subject matter experts, and we still can’t figure out what’s going on.

Basically, we’re submitting 121 test programs to be run on an HTCondor cluster of 20 machines with 12 CPU cores each. Each test has a “launcher” script that we create and that gets passed to the executing host. These test programs run under valgrind, and a few of them run for several hours. After submitting the jobs, we have a process that waits up to 12 hours for the tests to complete. If not all jobs have completed after 12 hours, it assumes there is a problem and removes the remaining jobs.

All but about 10 of the tests run successfully as HTCondor jobs. For those 10 or so, something strange happens. The programs run for a while and then we get messages that the job disconnected and that job reconnection failed. So the job gets restarted. This pattern usually repeats a few times. In one of these repeated executions, the test program completes successfully and the STDOUT and STDERR get transferred back to the submitting host. However, condor_q still shows the job as running. In other words, the launcher script completes successfully, but HTCondor still thinks the job is running.

What’s puzzling is that the .sub file specifies that files are transferred back to the submitting host ON_EXIT, and we see STDOUT and STDERR indicating that the job completed. Yet HTCondor still reports the job as running. How is this possible? We are sure that the output files aren’t stale, because all files are removed before the build and test. We also checked our network statistics, and there does not appear to be anything unusual going on that would cause the socket between the submitting host and the executing host to close.

Does HTCondor have a bug in how it handles jobs that are restarted? Also, does HTCondor try to detect whether a job is hung? We think this might be what’s going on, because these are some of the longer-running tests. I have included part of one test’s log file below. In this specific instance, the test succeeded and the STDOUT/STDERR was successfully transferred back to the submitting host at 11:46AM. Any help would be greatly appreciated, as we are really clueless as to what’s going on.

 

                Kris Wempa

 

 

000 (4323.000.000) 09/28 01:26:29 Job submitted from host: <10.11.129.10:39405>

...

001 (4323.000.000) 09/28 01:26:43 Job executing on host: <10.11.132.73:37914?sock=9516_67e9_3>

 

006 (4323.000.000) 09/28 01:26:52 Image size of job updated: 325032

        78  -  MemoryUsage of job (MB)

        79152  -  ResidentSetSize of job (KB)

...

006 (4323.000.000) 09/28 01:31:53 Image size of job updated: 2152844

        1664  -  MemoryUsage of job (MB)

        1703752  -  ResidentSetSize of job (KB)

 

{ Several more heartbeat messages }

 

022 (4323.000.000) 09/28 03:29:24 Job disconnected, attempting to reconnect

    Socket between submit and execute hosts closed unexpectedly

    Trying to reconnect to slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx <10.11.132.73:37914?sock=9516_67e9_3>

...

024 (4323.000.000) 09/28 03:29:24 Job reconnection failed

    Job not found at execution machine

    Can not reconnect to slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx, rescheduling job

...

001 (4323.000.000) 09/28 03:29:54 Job executing on host: <10.11.132.72:51529?sock=27166_f857_3>

...

022 (4323.000.000) 09/28 05:24:06 Job disconnected, attempting to reconnect

    Socket between submit and execute hosts closed unexpectedly

    Trying to reconnect to slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx <10.11.132.72:51529?sock=27166_f857_3>

...

024 (4323.000.000) 09/28 05:24:06 Job reconnection failed

    Job not found at execution machine

    Can not reconnect to slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx, rescheduling job

...

001 (4323.000.000) 09/28 05:24:41 Job executing on host: <10.11.132.72:51529?sock=27166_f857_3>

...

022 (4323.000.000) 09/28 07:05:35 Job disconnected, attempting to reconnect

    Socket between submit and execute hosts closed unexpectedly

    Trying to reconnect to slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx <10.11.132.72:51529?sock=27166_f857_3>

...

024 (4323.000.000) 09/28 07:05:35 Job reconnection failed

    Job not found at execution machine

    Can not reconnect to slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx, rescheduling job

...

001 (4323.000.000) 09/28 07:06:14 Job executing on host: <10.11.132.72:51529?sock=27166_f857_3>

...

022 (4323.000.000) 09/28 09:01:37 Job disconnected, attempting to reconnect

    Socket between submit and execute hosts closed unexpectedly

    Trying to reconnect to slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx <10.11.132.72:51529?sock=27166_f857_3>

...

024 (4323.000.000) 09/28 09:01:45 Job reconnection failed

    Job not found at execution machine

    Can not reconnect to slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx, rescheduling job

...

001 (4323.000.000) 09/28 09:02:15 Job executing on host: <10.11.132.72:51529?sock=27166_f857_3>

...

022 (4323.000.000) 09/28 10:40:54 Job disconnected, attempting to reconnect

    Socket between submit and execute hosts closed unexpectedly

    Trying to reconnect to slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx <10.11.132.72:51529?sock=27166_f857_3>

...

024 (4323.000.000) 09/28 10:40:54 Job reconnection failed

    Job not found at execution machine

    Can not reconnect to slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx, rescheduling job

...

001 (4323.000.000) 09/28 10:41:19 Job executing on host: <10.11.132.72:51529?sock=27166_f857_3>

...

022 (4323.000.000) 09/28 12:45:39 Job disconnected, attempting to reconnect

    Socket between submit and execute hosts closed unexpectedly

    Trying to reconnect to slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx <10.11.132.72:51529?sock=27166_f857_3>

...

024 (4323.000.000) 09/28 12:45:39 Job reconnection failed

    Job not found at execution machine

    Can not reconnect to slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx, rescheduling job

...

001 (4323.000.000) 09/28 12:45:57 Job executing on host: <10.11.132.72:51529?sock=27166_f857_3>

...

004 (4323.000.000) 09/28 13:31:14 Job was evicted.

        (0) Job was not checkpointed.

                Usr 0 00:41:38, Sys 0 00:02:18  -  Run Remote Usage

                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage

        0  -  Run Bytes Sent By Job

        2013  -  Run Bytes Received By Job

        Partitionable Resources :    Usage  Request Allocated

           Cpus                 :                 1         1

           Disk (KB)            :      150      150   6747722

           Memory (MB)          :     1709     1709      1709

...
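
To make the restart pattern in a log like the one above easier to see, here is a minimal Python sketch that measures how long each execution attempt lasted. It is not an HTCondor tool. The function name run_durations, the SAMPLE text and the placeholder year are assumptions made for illustration; the event codes (001 executing, 022 disconnected, 004 evicted) are only the ones that appear in the excerpt.

```python
import re
from datetime import datetime

# Matches event header lines such as:
#   022 (4323.000.000) 09/28 03:29:24 Job disconnected, attempting to reconnect
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+\.\d+\.\d+)\) (\d\d/\d\d) (\d\d:\d\d:\d\d) (.*)$"
)

START = "001"         # "Job executing on host"
ENDS = {"022", "004"}  # "Job disconnected" and "Job was evicted"


def run_durations(log_text):
    """Return one (start, end, end_code) tuple per execution attempt."""
    runs = []
    started = None
    for line in log_text.splitlines():
        m = EVENT_RE.match(line.strip())
        if not m:
            continue  # detail lines, "..." separators and blank lines
        code, _job, date, clock, _msg = m.groups()
        # The log carries no year; 2000 is an arbitrary placeholder (assumption).
        when = datetime.strptime(f"2000 {date} {clock}", "%Y %m/%d %H:%M:%S")
        if code == START:
            started = when
        elif code in ENDS and started is not None:
            runs.append((started, when, code))
            started = None
    return runs


# A shortened copy of the excerpt: the first and the last attempt only.
SAMPLE = """\
001 (4323.000.000) 09/28 01:26:43 Job executing on host: <10.11.132.73:37914?sock=9516_67e9_3>
006 (4323.000.000) 09/28 01:26:52 Image size of job updated: 325032
022 (4323.000.000) 09/28 03:29:24 Job disconnected, attempting to reconnect
024 (4323.000.000) 09/28 03:29:24 Job reconnection failed
001 (4323.000.000) 09/28 12:45:57 Job executing on host: <10.11.132.72:51529?sock=27166_f857_3>
004 (4323.000.000) 09/28 13:31:14 Job was evicted.
"""

if __name__ == "__main__":
    for start, end, code in run_durations(SAMPLE):
        print(f"{start:%H:%M:%S} -> {end:%H:%M:%S}  {end - start}  ended by event {code}")
```

Applied to the timestamps in the excerpt, the six attempts that ended in a disconnect each ran for between roughly 1 hour 39 minutes and 2 hours 5 minutes, and the final attempt ran for about 45 minutes before the eviction.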



