
[HTCondor-users] HTCondor - Force a dag.rescue file? or other workaround



  • When HTCondor is adding all of the individual node jobs to the HTCondor DAGMan .dag file, it sometimes has a small hiccup that prevents it from properly recording that one or two of the node jobs were successfully submitted.
  • All the individual node jobs finish, but because of the hiccup, DAGMan thinks some are not finished.
  • As a result, DAGMan doesn’t submit the next group of node jobs for our processing application, and our customers are left wondering what to do.

 

When this happens, we manually copy the .dag file to make a “dag.rescue” file.

We then manually edit dag.rescue to mark which node jobs are done, and run condor_submit_dag on dag.rescue until processing can finish.

 

My question is: do you know a way to make HTCondor generate its own dag.rescue file, or a setting or workaround to avoid this behavior?

We’ve helped our customers with a case like this where the .dag file was HUGE, which meant a lot of copying and pasting of the ‘ DONE’ keyword to mark the status of individual jobs.

Just wondering if you know a way to do it automatically.
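
To illustrate what automating that manual step might look like, here is a minimal Python sketch. It is not an HTCondor feature: mark_done and mark_done_file are hypothetical helper names, and the sketch assumes that node lines follow the usual "JOB <name> <submit file> ..." form, that the list of finished node names is already known from somewhere else, and that node names are matched exactly as written.

```python
import re

# Matches a DAG node line of the form "JOB <name> <submit file> ...".
# The keyword is matched case-insensitively; the node name is captured as-is.
JOB_LINE = re.compile(r"^\s*JOB\s+(\S+)\s+\S+", re.IGNORECASE)


def mark_done(dag_text, finished):
    """Return dag_text with ' DONE' appended to JOB lines for finished nodes.

    finished is a set of node names (an assumption: the caller supplies it).
    Lines that already end in DONE, and all non-JOB lines, are left unchanged.
    """
    out = []
    for line in dag_text.splitlines():
        match = JOB_LINE.match(line)
        if match and match.group(1) in finished:
            if line.split()[-1].upper() != "DONE":
                line = line.rstrip() + " DONE"
        out.append(line)
    return "\n".join(out) + "\n"


def mark_done_file(src_path, dst_path, finished):
    """Read the DAG file at src_path and write the marked copy to dst_path."""
    with open(src_path) as src:
        marked = mark_done(src.read(), finished)
    with open(dst_path, "w") as dst:
        dst.write(marked)
```

With something like this, the copy of the .dag file could be produced by calling mark_done_file with the original path, the new path, and the set of finished node names, and the result submitted with condor_submit_dag as before. The original file is never modified.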

 

 

Kind Regards,
Fernando M. Schapira
Senior Support Engineer

 

From: SCHAPIRA Fernando
Sent: Sunday, January 27, 2019 20:19
To: 'Greg Thain' <gthain@xxxxxxxxxxx>; 'John M Knoeller' <johnkn@xxxxxxxxxxx>; 'Todd Tannenbaum' <tannenba@xxxxxxxxxxx>
Subject: HTCondor - Force a dag.rescue file? or other workaround

 

Hi Greg, Hi JK,

 


Kind Regards,
Fernando M. Schapira
Senior Support Engineer

Pre-Sales and Commissioning Project Manager

Geospatial Content Solutions - GCS
*****************************************
Leica Geosystems AG
Heinrich-Wild-Strasse, 9435 Heerbrugg - Switzerland
Phone: +41 71 727 43 11, Fax: +41 71 727 43 01
e-mail:
fernando.schapira@xxxxxxxxxxxxxxxxxxxx
*********www.leica-geosystems.com*********

 
