
Re: [HTCondor-users] How to forbid job restarts



In case attachments are stripped by the mailing list, here is the log file.

 

--------

 

000 (676395.000.000) 01/10 23:48:35 Job submitted from host: <10.82.184.49:9618?addrs=10.82.184.49-9618+[--1]-9618&noUDP&sock=3463746_3b81_4>

    DAG Node: SomeDagNodeName

...

001 (676395.000.000) 01/11 00:23:41 Job executing on host: <10.82.176.49:9618?addrs=10.82.176.49-9618+[--1]-9618&noUDP&sock=81524_42f8_3>

...

006 (676395.000.000) 01/11 00:23:51 Image size of job updated: 17260

                17  -  MemoryUsage of job (MB)

                17260  -  ResidentSetSize of job (KB)

...

006 (676395.000.000) 01/11 00:28:51 Image size of job updated: 2501404

                2443  -  MemoryUsage of job (MB)

                2501068  -  ResidentSetSize of job (KB)

...

006 (676395.000.000) 01/11 01:18:56 Image size of job updated: 2501408

                2443  -  MemoryUsage of job (MB)

                2501068  -  ResidentSetSize of job (KB)

...

001 (676395.000.000) 01/11 01:20:30 Job executing on host: <10.82.176.49:9618?addrs=10.82.176.49-9618+[--1]-9618&noUDP&sock=81524_42f8_3>

...

001 (676395.000.000) 01/11 02:20:48 Job executing on host: <10.82.176.49:9618?addrs=10.82.176.49-9618+[--1]-9618&noUDP&sock=81524_42f8_3>

...

001 (676395.000.000) 01/11 03:21:58 Job executing on host: <10.82.176.49:9618?addrs=10.82.176.49-9618+[--1]-9618&noUDP&sock=81524_42f8_3>

...

001 (676395.000.000) 01/11 04:15:49 Job executing on host: <10.82.176.49:9618?addrs=10.82.176.49-9618+[--1]-9618&noUDP&sock=81524_42f8_3>

...

001 (676395.000.000) 01/11 05:11:34 Job executing on host: <10.82.176.49:9618?addrs=10.82.176.49-9618+[--1]-9618&noUDP&sock=81524_42f8_3>

...

001 (676395.000.000) 01/11 06:00:36 Job executing on host: <10.82.176.49:9618?addrs=10.82.176.49-9618+[--1]-9618&noUDP&sock=81524_42f8_3>

...

001 (676395.000.000) 01/11 06:55:27 Job executing on host: <10.82.176.49:9618?addrs=10.82.176.49-9618+[--1]-9618&noUDP&sock=81524_42f8_3>

...

001 (676395.000.000) 01/11 07:52:15 Job executing on host: <10.82.176.49:9618?addrs=10.82.176.49-9618+[--1]-9618&noUDP&sock=81524_42f8_3>

...

001 (676395.000.000) 01/11 08:49:41 Job executing on host: <10.82.176.49:9618?addrs=10.82.176.49-9618+[--1]-9618&noUDP&sock=81524_42f8_3>

...

001 (676395.000.000) 01/11 09:52:17 Job executing on host: <10.82.176.49:9618?addrs=10.82.176.49-9618+[--1]-9618&noUDP&sock=81524_42f8_3>

...

001 (676395.000.000) 01/11 11:02:24 Job executing on host: <10.82.176.49:9618?addrs=10.82.176.49-9618+[--1]-9618&noUDP&sock=81524_42f8_3>

...

001 (676395.000.000) 01/11 12:04:34 Job executing on host: <10.82.176.49:9618?addrs=10.82.176.49-9618+[--1]-9618&noUDP&sock=81524_42f8_3>

...

001 (676395.000.000) 01/11 13:02:33 Job executing on host: <10.82.176.49:9618?addrs=10.82.176.49-9618+[--1]-9618&noUDP&sock=81524_42f8_3>

...

001 (676395.000.000) 01/11 13:42:04 Job executing on host: <10.82.176.49:9618?addrs=10.82.176.49-9618+[--1]-9618&noUDP&sock=81524_42f8_3>

...

001 (676395.000.000) 01/11 14:44:20 Job executing on host: <10.82.176.49:9618?addrs=10.82.176.49-9618+[--1]-9618&noUDP&sock=81524_42f8_3>

...
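The excerpt above shows the same job logging event 001 (Job executing) many times on the same host. A minimal sketch, assuming the log keeps the header-line layout seen above, of how one might count those events and measure the time between them. The function names `execute_times` and `restart_gaps_minutes` are hypothetical, and the year is an assumption because the log lines carry only month and day.

```python
import re
from datetime import datetime

# Header line of a user-log event as it appears in the excerpt, e.g.
# "001 (676395.000.000) 01/11 00:23:41 Job executing on host: <...>"
EVENT_RE = re.compile(
    r"^(?P<code>\d{3}) \((?P<cluster>\d+)\.(?P<proc>\d+)\.\d+\) "
    r"(?P<stamp>\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (?P<text>.*)$"
)


def execute_times(log_text, year=2021):
    """Return the timestamps of every 001 (Job executing) event.

    The year is an assumption: the log records only month/day.
    """
    times = []
    for line in log_text.splitlines():
        match = EVENT_RE.match(line.strip())
        if match and match.group("code") == "001":
            stamp = f"{year}/{match.group('stamp')}"
            times.append(datetime.strptime(stamp, "%Y/%m/%d %H:%M:%S"))
    return times


def restart_gaps_minutes(times):
    """Minutes elapsed between consecutive 001 events."""
    return [
        round((later - earlier).total_seconds() / 60, 1)
        for earlier, later in zip(times, times[1:])
    ]


# Three header lines copied from the excerpt (host address shortened to "<...>").
sample = """\
001 (676395.000.000) 01/11 00:23:41 Job executing on host: <...>
...
001 (676395.000.000) 01/11 01:20:30 Job executing on host: <...>
...
001 (676395.000.000) 01/11 02:20:48 Job executing on host: <...>
"""

starts = execute_times(sample)
gaps = restart_gaps_minutes(starts)
print(len(starts), "execute events; gaps in minutes:", gaps)
```

Run against the full log, this only summarizes how often the job started again; the events elided by "..." would still be needed to see why each run ended.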

 

From: Vaurynovich, Siarhei
Sent: Monday, 11 January, 2021 14:59
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: RE: [HTCondor-users] How to forbid job restarts

 

 

Hello Greg,

 

Thank you! Of course, I have attached one of today’s logs, with both Unix and Windows line endings.

 

Best,

Siarhei.

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Greg Thain
Sent: Monday, 11 January, 2021 14:48
To: htcondor-users@xxxxxxxxxxx
Subject: Re: [HTCondor-users] How to forbid job restarts

 


 

On 1/11/21 1:45 PM, Vaurynovich, Siarhei wrote:

 

Hello Stefano,

 

Thank you for your reply!

 

According to the documentation, it seems that the RETRY command does not address my problem:

 

https://htcondor.readthedocs.io/en/latest/users-manual/dagman-workflows.html#retrying-failed-nodes

 

Can you share your job log files with us? They might give some insight into why HTCondor is preempting these jobs.

-greg
