
[Condor-users] long term jobs on windows never complete



Dear friends,

 

I want to run a long-running job (about 10 days) on non-dedicated, always-turned-on Windows machines. My questions:

 

Why are the jobs evicted instead of being suspended?
Is there any policy that lets these jobs run to completion?
Which policy should I choose in condor_config?

 

 

I would really appreciate any help.

 

An example of the log file for this kind of job follows.
The job never finishes; it is always killed when it is evicted.

 

 

 

 

001 (354.000.000) 03/25 02:08:17 Job executing on host: < 194.225.71.156:1031 >

...

010 (354.000.000) 03/26 09:08:58 Job was suspended.

            Number of processes actually suspended: 1

...

011 (354.000.000) 03/26 09:14:57 Job was unsuspended.

...

004 (354.000.000) 03/26 12:45:44 Job was evicted.

            (0) Job was not checkpointed.

                        Usr 0 00:00:00, Sys 0 00:00:00  -   Run Remote Usage

                        Usr 0 00:00:00, Sys 0 00:00:00  -   Run Local Usage

            0  -  Run Bytes Sent By Job

            100864  -  Run Bytes Received By Job

...

001 (354.000.000) 03/26 12:53:02 Job executing on host: < 194.225.74.172:2936>

...

010 (354.000.000) 03/27 04:10:04 Job was suspended.

            Number of processes actually suspended: 1

...

011 (354.000.000) 03/27 04:19:13 Job was unsuspended.

...

010 (354.000.000) 03/27 09:03:19 Job was suspended.

            Number of processes actually suspended: 1

...

011 (354.000.000) 03/27 09:06:46 Job was unsuspended.

...

010 (354.000.000) 03/27 09:50:40 Job was suspended.

            Number of processes actually suspended: 1

...

011 (354.000.000) 03/27 09:51:26 Job was unsuspended.

...

010 (354.000.000) 03/27 09:55:16 Job was suspended.

            Number of processes actually suspended: 1

...

011 (354.000.000) 03/27 10:02:03 Job was unsuspended.

...

010 (354.000.000) 03/27 10:04:15 Job was suspended.

            Number of processes actually suspended: 1

...

011 (354.000.000) 03/27 10:10:14 Job was unsuspended.

...

010 (354.000.000) 03/27 10:21:51 Job was suspended.

            Number of processes actually suspended: 1

...

011 (354.000.000) 03/27 10:25:27 Job was unsuspended.

...

010 (354.000.000) 03/27 10:27:48 Job was suspended.

            Number of processes actually suspended: 1

...

011 (354.000.000) 03/27 10:35:48 Job was unsuspended.

...

010 (354.000.000) 03/27 10:41:42 Job was suspended.

            Number of processes actually suspended: 1

...

011 (354.000.000) 03/27 10:51:42 Job was unsuspended.

...

004 (354.000.000) 03/27 10:51:43 Job was evicted.

            (0) Job was not checkpointed.

                        Usr 0 00:00:00, Sys 0 00:00:00  -   Run Remote Usage

                        Usr 0 00:00:00, Sys 0 00:00:00  -   Run Local Usage

            0  -  Run Bytes Sent By Job

            100864  -  Run Bytes Received By Job

...

001 (354.000.000) 03/27 10:59:17 Job executing on host: < 194.225.74.21:1045>

...

004 (354.000.000) 03/27 10:59:29 Job was evicted.

            (0) Job was not checkpointed.

                        Usr 0 00:00:00, Sys 0 00:00:00  -   Run Remote Usage

                        Usr 0 00:00:00, Sys 0 00:00:00  -   Run Local Usage

            0  -  Run Bytes Sent By Job

            100864  -  Run Bytes Received By Job
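
To make the pattern in a log like the one above easier to see, here is a minimal sketch that tallies how often each event type occurs. It assumes the header-line layout shown in the log (three-digit event code, job ID in parentheses, date, time, message); the names `EVENT_RE`, `EVENT_NAMES` and `summarize` are made up for this illustration and are not part of Condor.

```python
import re
from collections import Counter

# Matches event header lines such as:
#   004 (354.000.000) 03/26 12:45:44 Job was evicted.
# The layout is assumed from the log excerpt in this message.
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d\d/\d\d \d\d:\d\d:\d\d) (.*)$"
)

# Only the event codes that appear in the log excerpt.
EVENT_NAMES = {
    "001": "executing",
    "004": "evicted",
    "010": "suspended",
    "011": "unsuspended",
}


def summarize(log_text):
    """Count how often each known event type occurs in the log text."""
    counts = Counter()
    for line in log_text.splitlines():
        match = EVENT_RE.match(line.strip())
        if match and match.group(1) in EVENT_NAMES:
            counts[EVENT_NAMES[match.group(1)]] += 1
    return counts


# A short sample taken from the first lines of the log.
sample = """\
001 (354.000.000) 03/25 02:08:17 Job executing on host: < 194.225.71.156:1031 >
...
010 (354.000.000) 03/26 09:08:58 Job was suspended.
            Number of processes actually suspended: 1
...
011 (354.000.000) 03/26 09:14:57 Job was unsuspended.
...
004 (354.000.000) 03/26 12:45:44 Job was evicted.
            (0) Job was not checkpointed.
            100864  -  Run Bytes Received By Job
"""

if __name__ == "__main__":
    for name, count in sorted(summarize(sample).items()):
        print(name, count)
```

Passing the contents of the whole log file to `summarize` gives the number of suspensions and evictions per run, which shows how often the machines' owners interrupt the job.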

 

 

Best

Majid