
Re: [HTCondor-users] [External] Restart submitted dag



HTCondor's view into the jobs it runs extends only to the level of the process and its ID number. The only way HTCondor recognizes that a task has terminated is when the process itself terminates and delivers an exit code.

If the OSError is being caught in some way, and not resulting in the exit of the process, there's nothing visible to HTCondor that would indicate that it is not still running.
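As a rough illustration of that point, here is a minimal Python sketch, assuming the job is a Python script. The names flaky_task and run_task are hypothetical stand-ins, not anything from the original job. It catches the OSError, logs it, and turns it into a nonzero exit code instead of carrying on:

```python
import sys


def flaky_task():
    """Hypothetical stand-in for the real work; raises like the failing task did."""
    raise OSError("simulated transient I/O failure")


def run_task(task):
    """Run task() and translate the outcome into a process exit code."""
    try:
        task()
    except OSError as err:
        # Log to stderr (which lands in the job's error file), then report
        # the failure through a nonzero code rather than swallowing it.
        print(f"task failed: {err}", file=sys.stderr)
        return 1
    return 0


# At the bottom of the real script, hand the code to the OS so that the
# process actually ends and HTCondor can see the result:
#     sys.exit(run_task(flaky_task))
```

The important part is the final sys.exit() call: logging the error alone leaves the process alive, and a live process looks like a running job.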

You can see this kind of behavior sometimes with certain versions of MATLAB - when you call it from the command line and the function call or routine you specified fails, it drops you to the MATLAB command prompt instead of exiting MATLAB, leaving the process hanging waiting for user input that will never come. I think the "-batch" command line option for MATLAB does an implicit exit(); after the function call, but it's also common to put that in the command line as well.

So, take a closer look at the failed task and see what's going on around it. Maybe a subprocess failed and the parent process didn't pass along that failure into its own termination and exit code. Remember, the "startd" starts the "starter," and the starter starts the executable/arguments. I find "pstree" useful for dissecting this sort of situation.
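If the executable turns out to be a wrapper around a subprocess, a sketch along these lines (Python standard library only; run_child, the timeout value, and the exit code 124 are illustrative assumptions, not HTCondor settings) shows one way to pass the child's failure along and to give up on a child that hangs:

```python
import subprocess
import sys


def run_child(cmd, timeout_s=None):
    """Run cmd and return the exit code the wrapper should itself exit with."""
    try:
        result = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child before re-raising.
        print(f"child exceeded {timeout_s}s; giving up", file=sys.stderr)
        return 124  # arbitrary choice, borrowed from the coreutils `timeout` convention
    rc = result.returncode
    if rc < 0:
        # Negative means the child died from signal number -rc;
        # map it to the shell-style 128 + signal convention.
        return 128 - rc
    return rc


# In the real wrapper, finish with something like:
#     sys.exit(run_child(["./my_task", "arg"], timeout_s=3600))
```

Note that the timeout here only kills the direct child; anything that child spawned may survive, which is where pstree helps.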

Michael Pelletier
Principal Technologist
High Performance Computing
Infrastructure & Workplace Services

C: +1 339.293.9149
michael.v.pelletier@xxxxxxx

-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of ???
Sent: Thursday, October 26, 2023 8:41 PM
To: htcondor-users@xxxxxxxxxxx
Subject: [External] [HTCondor-users] Restart submitted dag

Hi there,

I have been using DAGMan to organize workflows. It's been great. Recently I ran into issues where a DAG has one or two tasks left unfinished. condor_q just shows these two tasks still running. The task's stderr shows that the task ran into an OSError, but condor does not stop the task. I have found that removing the whole DAG and resubmitting it via the rescue DAG fixes the issue (the error is unpredictable and transient). But to do that, I need to dig out the DAG file I submitted previously. I have two questions:

* Is there a smart way to remove a DAG and resubmit it, either through the CLI or the Python bindings, without knowing the location of the DAG file? Something like a restart function for a DAG that recognizes the rescue DAG.

* Are there known issues where a task is not recognized as terminated by HTCondor? I am using Debian 10, so I can only use HTCondor 9 on my system. Perhaps there are bugs? And maybe I can set a maximum job run time as a workaround? Any idea which config I need to set for condor? For context, I am running condor on my personal computer, so I can configure the pool.

Thanks a lot!

Best,
Lunyang