
Re: [HTCondor-users] Add jobs to a dag from a running job



Hi David,

What does an epoch mean in this scenario? Is it an internal cycle within the deep learning job itself, or is it one HTCondor job execution? If it is the latter, you could have a DAG with a node that runs the deep learning job with a retry of n attempts, where n is the number of desired epochs (60), and a post script that does two things. First, it checks whether the execution attempt is a multiple of 5 (or a different step if desired) and, if so, begins the analysis somehow. Second, it checks the current execution attempt's exit code and retry number. If the exit status was successful and the retry/attempt number is not the max (60), it exits non-zero to fail the node, which makes DAGMan retry it and so start a new epoch.
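As a rough illustration of that post script, here is a minimal Python sketch. It assumes one node attempt equals one training epoch and that the script receives DAGMan's $RETURN, $RETRY and $MAX_RETRIES macros as arguments. The node name, script name, and the decide() helper are hypothetical. Since $RETRY is 0 on the first attempt, the sketch uses RETRY 59 to get 60 attempts in total.

```python
#!/usr/bin/env python3
"""POST script sketch: one node attempt == one training epoch (assumption)."""
import sys

EVAL_EVERY = 5  # evaluate after every 5th epoch, as in the original question


def decide(job_return, retry, max_retries, eval_every=EVAL_EVERY):
    """Return (exit_code, run_evaluation) for the attempt that just ended.

    retry is DAGMan's $RETRY (0 on the first attempt), so the number of
    epochs finished so far is retry + 1.
    """
    if job_return != 0:
        # Training itself failed: pass the failure through unchanged.
        # Note that DAGMan will still count this attempt against RETRY.
        return job_return, False
    epochs_done = retry + 1
    run_evaluation = epochs_done % eval_every == 0
    # Exit non-zero while retries remain so DAGMan reruns the node
    # (the next epoch); exit 0 only after the last allowed attempt.
    exit_code = 1 if retry < max_retries else 0
    return exit_code, run_evaluation


if __name__ == "__main__" and len(sys.argv) == 4:
    # Assumed DAG lines (node and file names are hypothetical):
    #   RETRY Train 59
    #   SCRIPT POST Train post.py $RETURN $RETRY $MAX_RETRIES
    code, evaluate = decide(*(int(arg) for arg in sys.argv[1:4]))
    if evaluate:
        # Placeholder: this is where the analysis would be kicked off.
        print("evaluation due after this epoch", file=sys.stderr)
    sys.exit(code)
```

With this logic the node only ends up successful after the final attempt, so any child nodes in the DAG wait until all epochs are done.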

As for starting the analysis, things may get complex since DAGMan does not have a native way to add jobs to the DAG on the fly. You could use the post script to submit another job to the queue. However, that job will not run in the DAG's scope and will not be managed by DAGMan. Otherwise, the analysis could be a long-running job that is part of the DAG, and it would somehow need to be told when and what to analyze.
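For the "submit another job from the post script" option, a minimal sketch could look like the following. The submit file name evaluate.sub and the epoch macro are hypothetical, and the submit file is assumed to reference $(epoch) and to contain its own queue statement. The only HTCondor feature relied on is condor_submit's -append option. As noted above, a job submitted this way runs outside the DAG, so DAGMan will neither track it nor wait for it.

```python
import subprocess


def build_submit_command(epoch, submit_file="evaluate.sub"):
    """Build the condor_submit command line for one evaluation job.

    -append adds "epoch = <n>" to the submit description so that the
    (hypothetical) evaluate.sub can refer to it as $(epoch).
    """
    return ["condor_submit", "-append", f"epoch = {epoch}", submit_file]


def submit_evaluation(epoch, submit_file="evaluate.sub"):
    """Submit the evaluation job; raises CalledProcessError on failure."""
    subprocess.run(build_submit_command(epoch, submit_file), check=True)
```

The post script would call submit_evaluation(epochs_done) whenever the attempt number is a multiple of 5. Keeping the command construction in its own function makes it easy to check without a running schedd.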

-Cole Bollig

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Dudu Handelman <duduhandelman@xxxxxxxxxxx>
Sent: Monday, January 8, 2024 6:47 AM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Add jobs to a dag from a running job
 
Hi All.
I wonder what the best solution would be.

Just an example:
While running a deep learning job with 60 epochs, I wish to run an evaluation every 5 epochs.
The evaluation is async and can run in parallel with the train job. 

One solution is to create a DAG in which the training job exits every 5 epochs, an evaluation job runs, and the next job continues with the following epochs.

Another way might be to use a DAG with a service node: the job would use condor_chirp to update its progress, and the script (the service node) would submit evaluation jobs according to that progress.


Maybe there is a better way?

Thanks
David