
Re: [HTCondor-users] HTCondor - Force a dag.rescue file? or other workaround



Hello Fernando,

Do you have any sense of what is causing this hiccup? If DAGMan is not logging certain job submissions, that is a very serious problem that we need to look into. Can you attach your .dagman.out file so we can take a look and try to understand what is happening?

As for the rescue file, this is supposed to get generated automatically. Can you check that the following configuration options are set:

DAGMAN_AUTO_RESCUE = true
DAGMAN_WRITE_PARTIAL_RESCUE = true

Which version of HTCondor are you running? This feature has been in our codebase for several years now, but it's possible you're running an older version before it was implemented.
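
Until we understand the underlying problem, the manual step of marking finished nodes could at least be scripted. The following is only a rough sketch, not an HTCondor tool: the function name mark_done and the sample node names are hypothetical, and it assumes every node is declared on a single "JOB <name> <submit file> ..." line. It relies on the DONE keyword that DAGMan accepts at the end of a JOB line.

```python
def mark_done(dag_text, finished_nodes):
    """Return dag_text with DONE appended to the JOB line of each finished node.

    Hypothetical helper: assumes one JOB line per node and leaves every
    other line (PARENT/CHILD, SCRIPT, VARS, comments) untouched.
    """
    finished = set(finished_nodes)
    out = []
    for line in dag_text.splitlines():
        tokens = line.split()
        # A node declaration looks like: JOB <name> <submit file> [options]
        is_job = len(tokens) >= 3 and tokens[0].upper() == "JOB"
        already_done = bool(tokens) and tokens[-1].upper() == "DONE"
        if is_job and tokens[1] in finished and not already_done:
            line = line.rstrip() + " DONE"
        out.append(line)
    return "\n".join(out) + "\n"


# Small made-up DAG: node A has finished, node B has not.
example = "JOB A a.sub\nJOB B b.sub\nPARENT A CHILD B\n"
print(mark_done(example, ["A"]), end="")
```

The list of finished node names still has to come from somewhere you trust (for example, your own application's records), since the whole problem is that DAGMan's view of them is wrong.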

Mark


On Thu, Jan 31, 2019 at 2:29 AM SCHAPIRA Fernando <fernando.schapira@xxxxxxxxxxxxxxxxxxxx> wrote:
  • When HTCondor is adding all of the individual node jobs to the HTCondor DAGMan .dag file, it has a little hiccup which prevents it from properly recording that one or two of the node jobs were successfully submitted.
  • All the individual node jobs finish, but because of the hiccup, DAGMan thinks some are not finished.
  • As a result, DAGMan doesn't submit the next group of node jobs for our processing application, and our customers are wondering what to do.

 

When this happens, we're manually copying the .dag file and making a "dag.rescue".

We manually edit dag.rescue to mark which node jobs are done, and then condor_submit_dag the dag.rescue so that processing can finish.

My question is, do you know a way to make HTCondor generate its own dag.rescue file? Or a setting / workaround to avoid this behavior?

We've helped our customers with a case like this; the .dag file was HUGE, with lots of copy-paste of the " DONE" phrase to mark the status of individual jobs.

Just wondering if you know a way to do it automatically.

 

 

Kind Regards,
Fernando M. Schapira
Senior Support Engineer

 

From: SCHAPIRA Fernando
Sent: Sunday, January 27, 2019 20:19
To: 'Greg Thain' <gthain@xxxxxxxxxxx>; 'John M Knoeller' <johnkn@xxxxxxxxxxx>; 'Todd Tannenbaum' <tannenba@xxxxxxxxxxx>
Subject: HTCondor - Force a dag.rescue file? or other workaround

 

Hi Greg, Hi JK,

 


 

Kind Regards,
Fernando M. Schapira
Senior Support Engineer

Pre-Sales and Commissioning Project Manager

Geospatial Content Solutions - GCS
*****************************************
Leica Geosystems AG
Heinrich-Wild-Strasse, 9435 Heerbrugg - Switzerland
Phone: +41 71 727 43 11, Fax: +41 71 727 43 01
e-mail:
fernando.schapira@xxxxxxxxxxxxxxxxxxxx
*********www.leica-geosystems.com*********

 


 



--
Mark Coatsworth
Systems Programmer
Center for High Throughput Computing
Department of Computer Sciences
University of Wisconsin-Madison
+1 608 206 4703