
Re: [HTCondor-users] [External] Restart submitted dag (Lunyang)



Thanks, Michael and Cole! All of your suggestions are very helpful. The OSError comes from Google Cloud Storage complaining that the files' MD5 hash does not match. I guess it is probably buried in some internal code, so it would be a hard fix for me. I am going to try improving the DAG restart workflow instead.

A side note on where to find the attached script: I was wondering where I could find the attachment. I usually read responses in the daily digest, which does not include the script attachment. Then I realized that the user group's archive webpage has the attachment. I will try Cole's script.

Best,
Lunyang

> On Oct 27, 2023, at 10:31 AM, htcondor-users-request@xxxxxxxxxxx wrote:
> 
> Message: 1
> Date: Fri, 27 Oct 2023 14:29:53 +0000
> From: Cole Bollig <cabollig@xxxxxxxx>
> To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
> Subject: Re: [HTCondor-users] [External]  Restart submitted dag
> 
> Hi Lunyang,
> 
> I got ahead of myself: before writing up an answer, I quickly wrote a Python script that can do what you want using the HTCondor Python bindings. Given the cluster id of a DAGMan proper job, the script does the following:
> 
>  1. Queries the local schedd for a job matching the provided cluster id
>     * Verifies that the job was found in the queue, has the UserLog ClassAd attribute defined, and is a DAGMan job
>  2. Removes the job from the schedd queue
>     * Waits for confirmation of job removal
>  3. Parses the UserLog attribute to get the working directory and the DAG filename
>  4. Changes directories to the working directory and submits the DAG again
> 
> The script makes some assumptions that keep it from being fully comprehensive:
> 
>  1. Only one DAG file was submitted for this DAGMan process. By that I mean you didn't run condor_submit_dag first.dag second.dag (SUBDAGs will work fine)
>  2. The DAG was submitted from the directory the DAG file is stored in.
>  3. The script is being run on the root DAG and never a SUBDAG (wonky things might occur otherwise)
>  4. Paths follow Unix conventions
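
For readers of the digest who cannot see the attachment, the steps and assumptions listed above could be sketched roughly as follows. This is not the attached dag_restart script, only a minimal illustration. The function names (split_dagman_userlog, restart_dag), the ".dagman.log" suffix check, the JobUniverse == 7 test for "is a DAGMan job", and the polling timeout are all assumptions made for this sketch; the htcondor calls are used as I understand them from the HTCondor 9 Python bindings and should be checked against your installed version.

```python
import os
import posixpath
import time

# Assumption: condor_submit_dag names the DAGMan job's user log "<dagfile>.dagman.log".
DAGMAN_LOG_SUFFIX = ".dagman.log"


def split_dagman_userlog(user_log):
    """Return (working_dir, dag_filename) derived from a DAGMan job's UserLog path.

    Uses posixpath on purpose, mirroring the Unix-only pathing assumption.
    """
    workdir, log_name = posixpath.split(user_log)
    if not log_name.endswith(DAGMAN_LOG_SUFFIX):
        raise ValueError(f"{user_log!r} does not look like a DAGMan user log")
    return workdir, log_name[: -len(DAGMAN_LOG_SUFFIX)]


def restart_dag(cluster_id, poll_seconds=2, timeout=120):
    """Remove the DAGMan job with this cluster id and submit its DAG again."""
    import htcondor  # imported here so the helper above works without the bindings

    schedd = htcondor.Schedd()
    constraint = f"ClusterId == {int(cluster_id)}"

    # Step 1: find the job and verify it has a UserLog and looks like DAGMan.
    ads = schedd.query(constraint=constraint, projection=["UserLog", "JobUniverse"])
    if not ads:
        raise RuntimeError(f"no job with cluster id {cluster_id} in the queue")
    ad = ads[0]
    # Assumption: DAGMan jobs run in the scheduler universe (JobUniverse 7).
    if "UserLog" not in ad or ad.get("JobUniverse") != 7:
        raise RuntimeError(f"job {cluster_id} does not look like a DAGMan job")

    # Step 3 (done early so a bad path fails before anything is removed).
    workdir, dag_file = split_dagman_userlog(ad["UserLog"])

    # Step 2: remove the job, then wait until it has left the queue.
    schedd.act(htcondor.JobAction.Remove, constraint)
    deadline = time.monotonic() + timeout
    while schedd.query(constraint=constraint, projection=["ClusterId"]):
        if time.monotonic() > deadline:
            raise TimeoutError(f"job {cluster_id} still in queue after {timeout}s")
        time.sleep(poll_seconds)

    # Step 4: resubmit from the original working directory. No "force" option
    # is passed, on the assumption that this lets DAGMan pick up a rescue DAG.
    os.chdir(workdir)
    submit = htcondor.Submit.from_dag(dag_file)
    return schedd.submit(submit).cluster()
```

Calling restart_dag(1234) on the submit machine would then return the new cluster id, where 1234 stands in for the cluster id of the stuck DAGMan job.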
> 
> I have attached the script. Feel free to check it out, use it, and/or modify it.
> Hope this helps,
> Cole Bollig
> ________________________________
> From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Pelletier, Michael V. RTX via HTCondor-users <htcondor-users@xxxxxxxxxxx>
> Sent: Friday, October 27, 2023 8:41 AM
> To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
> Cc: Pelletier, Michael V. RTX <Michael.V.Pelletier@xxxxxxx>
> Subject: Re: [HTCondor-users] [External] Restart submitted dag
> 
> The view of HTCondor into the jobs it is running only extends to the level of the process and its ID number. The only way that HTCondor recognizes that a task is terminated is when the process terminates and delivers an exit code.
> 
> If the OSError is being caught in some way, and not resulting in the exit of the process, there's nothing visible to HTCondor that would indicate that it is not still running.
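
To illustrate the point above, here is a minimal sketch of a job entry point that turns a caught OSError into a non-zero exit code, so that the process actually ends and HTCondor can see the failure. The names exit_code_for, flaky_task, and main are hypothetical and only stand in for whatever the real job does.

```python
import sys


def exit_code_for(task):
    """Run task() and translate an OSError into a non-zero exit code."""
    try:
        task()
    except OSError as err:
        # Log the error, but do not swallow it: report failure to the caller.
        print(f"task failed: {err}", file=sys.stderr)
        return 1
    return 0


def flaky_task():
    """Hypothetical stand-in for a step that hits a transient I/O error."""
    raise OSError("simulated transient I/O failure")


def main():
    # Exiting with the code is what makes the failure visible to HTCondor.
    sys.exit(exit_code_for(flaky_task))
```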
> 
> You can see this kind of behavior sometimes with certain versions of MATLAB - when you call it from the command line and the function call or routine you specified fails, it drops you to the MATLAB command prompt instead of exiting MATLAB, leaving the process hanging waiting for user input that will never come. I think the "-batch" command line option for MATLAB does an implicit exit(); after the function call, but it's also common to put that in the command line as well.
> 
> So, take a closer look at the failed task and see what's going on around it. Maybe a subprocess failed and the parent process didn't pass along that failure into its own termination and exit code. Remember, the "startd" starts the "starter," and the starter starts the executable/arguments. I find "pstree" useful for dissecting this sort of situation.
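
The subprocess case mentioned above can be sketched the same way: a wrapper that runs a child process should hand the child's exit code back as its own. The helper name run_child and the two example commands are made up for this illustration.

```python
import subprocess
import sys


def run_child(cmd):
    """Run cmd and return its exit code instead of discarding it."""
    result = subprocess.run(cmd)
    return result.returncode


# Hypothetical child commands: one that fails with exit code 3, one that succeeds.
failing_cmd = [sys.executable, "-c", "import sys; sys.exit(3)"]
passing_cmd = [sys.executable, "-c", "pass"]
```

A wrapper script would finish with sys.exit(run_child(cmd)) so that the starter sees the same exit code the child produced.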
> 
> Michael Pelletier
> Principal Technologist
> High Performance Computing
> Infrastructure & Workplace Services
> 
> C: +1 339.293.9149
> michael.v.pelletier@xxxxxxx
> 
> -----Original Message-----
> From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of ???
> Sent: Thursday, October 26, 2023 8:41 PM
> To: htcondor-users@xxxxxxxxxxx
> Subject: [External] [HTCondor-users] Restart submitted dag
> 
> Hi there,
> 
> I have been using DAGMan to organize workflows. It's been great. Recently I have run into issues where some DAGs have one or two tasks left unfinished. condor_q just shows these tasks as still running. The task's stderr shows that the task ran into an OSError, but Condor does not stop the task. I have found that removing the whole DAG and resubmitting it via the rescue DAG fixes the issue (the error is unpredictable and transient). But to do that, I need to dig out the DAG file I submitted previously. I have two questions:
> 
> * Is there a smart way to remove a DAG and resubmit it, either through the CLI or the Python bindings, without knowing the location of the DAG file? Something like a restart functionality for DAGs that recognizes the rescue DAG.
> 
> * Are there known issues where a task is not recognized as terminated by HTCondor? I am using Debian 10, so I can only use HTCondor 9 on my system. Probably there are bugs? And maybe I can set a maximum job run time as a workaround? Any idea which config I need to set for Condor? For context, I am running Condor on my personal computer, so I can configure the pool.
> 
> Thanks a lot!
> 
> Best,
> Lunyang
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://urldefense.us/v2/url?u=https-3A__lists.cs.wisc.edu_mailman_listinfo_htcondor-2Dusers&d=DwIGaQ&c=MASr1KIcYm9UGIT-jfIzwQg1YBeAkaJoBtxV_4o83uQ&r=4PJgb1eyyvhzSV4fRwSECGK3jb50YP8vZUAedXybzgaNykar_o0SxKOUPkRHE0WG&m=mSAlYyj4nzWLkREmXxdJbW8GGSfsF4nfK4pRMxeAChdyCHeFiejvACuYtg7jG-QN&s=0zLoofQWlpAWvo2xdR0Mz9ZpnmvHLLQZ1sMbYykn6E8&e=
> 
> The archives can be found at:
> https://urldefense.us/v2/url?u=https-3A__lists.cs.wisc.edu_archive_htcondor-2Dusers_&d=DwIGaQ&c=MASr1KIcYm9UGIT-jfIzwQg1YBeAkaJoBtxV_4o83uQ&r=4PJgb1eyyvhzSV4fRwSECGK3jb50YP8vZUAedXybzgaNykar_o0SxKOUPkRHE0WG&m=mSAlYyj4nzWLkREmXxdJbW8GGSfsF4nfK4pRMxeAChdyCHeFiejvACuYtg7jG-QN&s=agS-4wJQnrr0KLnqvD-GNREQ_zS_kl3CpfONXvucjtg&e=
> 
> -------------- next part --------------
> A non-text attachment was scrubbed...
> Name: dag_restart
> Type: application/octet-stream
> Size: 2923 bytes
> Desc: dag_restart
> URL: <https://www-auth.cs.wisc.edu/lists/htcondor-users/attachments/20231027/e5e2f243/attachment.obj>
> 
> End of HTCondor-users Digest, Vol 119, Issue 59
> ***********************************************