
[Condor-users] Job Lease Duration / Jobs stop running



Hi,

I submitted 21 jobs to Condor, and 8 of them have stopped running. They must all have been running at some point, because the image size recorded for the 8 idle jobs is quite large. I then submitted another job, which started running immediately, but the other 8 remain idle. My job log file shows:

006 (056.000.000) 05/02 22:59:45 Image size of job updated: 1163528
...
022 (056.000.000) 05/02 22:59:51 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to vm3@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx <128.226.128.45:38130>
...
022 (057.000.000) 05/02 23:00:04 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to vm2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx <128.226.128.45:38130>
...
024 (057.000.000) 05/02 23:00:04 Job reconnection failed
    Job disconnected too long: JobLeaseDuration (1200 seconds) expired
    Can not reconnect to vm2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, rescheduling job
...
022 (055.000.000) 05/02 23:00:43 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to vm1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx <128.226.128.45:38130>
...
024 (055.000.000) 05/02 23:00:43 Job reconnection failed
    Job disconnected too long: JobLeaseDuration (1200 seconds) expired
    Can not reconnect to vm1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, rescheduling job
...
024 (056.000.000) 05/02 23:19:51 Job reconnection failed
    Job disconnected too long: JobLeaseDuration (1200 seconds) expired
    Can not reconnect to vm3@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, rescheduling job
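
To make the timing in a log like this easier to read, here is a minimal Python sketch that pairs each "Job disconnected" event (022) with the matching "Job reconnection failed" event (024) and reports the seconds between them. The event codes and line layout are taken only from the excerpt above. The function name reconnect_gaps is hypothetical, and the year is an assumption because the log lines do not record one.

```python
import re
from datetime import datetime

# Header lines look like this in the excerpt:
# 022 (056.000.000) 05/02 22:59:51 Job disconnected, attempting to reconnect
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$"
)


def reconnect_gaps(log_text, year=2000):
    """Return {"cluster.proc": seconds from event 022 to event 024}.

    The year is assumed (the log has none); it only matters for
    building a valid datetime, not for the computed differences.
    """
    disconnected = {}  # job id -> time of the most recent 022 event
    gaps = {}
    for line in log_text.splitlines():
        match = EVENT_RE.match(line)
        if match is None:
            continue  # indented detail lines and "..." separators
        code, cluster, proc, _subproc, stamp, _message = match.groups()
        job = f"{int(cluster)}.{int(proc)}"
        when = datetime.strptime(f"{year}/{stamp}", "%Y/%m/%d %H:%M:%S")
        if code == "022":
            disconnected[job] = when
        elif code == "024" and job in disconnected:
            delta = when - disconnected.pop(job)
            gaps[job] = int(delta.total_seconds())
    return gaps


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python gaps.py job.log
    with open(sys.argv[1]) as handle:
        for job, seconds in reconnect_gaps(handle.read()).items():
            print(f"job {job}: {seconds} s between disconnect and failure")
```

Run against the excerpt above, this gives 1200 seconds for job 56.0, which matches the JobLeaseDuration in the message, but 0 seconds for jobs 55.0 and 57.0, where the disconnect and the lease expiry are logged with the same timestamp.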


Any hints?

Thanks!
Askar