
[HTCondor-users] Job dependencies without a DAG



I'm sure I'm not the only one to have run into a situation where you want a job to run after a previous one completes, but don't want the panoply of files and moving parts that come with setting up a DAGMan workflow. In my case, I wanted to e-mail the completed output files of a previous job to that job's notification list, rather than just the limited information in the hardcoded template of the "notification = Always" e-mail.

Since the first job runs inside a Singularity container with no e-mail or HTCondor client tools installed, I couldn't just use a PostCmd, because that would run inside the job's container. A single-node DAG would work, since the POST script for a DAG node runs on the submit machine and thus could have handled the e-mailing, but that brings with it the half-dozen DAG log files in a specific directory. Since this job will be running repeatedly, the cleanup from a DAG run would be a bit unwieldy.
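
For comparison, the single-node DAG alternative would have looked something like this sketch, where "wafer-infer.sub" and "mail-results.sh" are hypothetical stand-ins for the real submit description and a mailing script:

    # wafer.dag - single-node DAG; the POST script runs on the submit machine
    JOB WaferInfer wafer-infer.sub
    SCRIPT POST WaferInfer mail-results.sh

Each condor_submit_dag run of that, though, leaves behind the .dagman.out, .nodes.log, and related bookkeeping files to clean up.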

In my case, I don't care whether the main job succeeds or fails, since I want to e-mail the output regardless.

Here's the trick:

    # Main job
    ...other main job definitions...
    Notify_user = michael.v.pelletier@xxxxxxxxxxxx
    Notification = Error
    Log = wafer-infer-$(Cluster).log

    # Queue the main job
    Queue 1

    # Define the follow-up job
    MAIL_FROM_NAME = Wafer Defect Mailer
    executable = /bin/sh
    arguments = " -c 'export J=$$([ClusterId-1]) ; condor_wait -wait 3600 wafer-infer-$(DOLLAR)J.log $(DOLLAR)J.0; tail -n +2 wafer-infer-$(DOLLAR)J.0.out | mailx -r ""$(MAIL_FROM_NAME) <no-reply@xxxxxxxxxxxx>"" -s ""Inference Results for job $(DOLLAR)J.0"" -a predictions-$(DOLLAR)J.0.json $$([NotifyUser]) ; rm -f wafer-infer-$(DOLLAR)J.log' "

This sets up an inline shell script as the job; defining it inline avoids the need to maintain a separate script file outside of the submit description. Breaking it down (a worked expansion of the command follows the list):

1. Since the executable is changing, condor_submit generates a new ClusterId for the second job rather than incrementing the ProcId. As a result, we need to wait on the previous ClusterId, not the current one. Since both submissions are in the same file, we can be assured that ClusterId - 1 is the main job's ClusterId, so we run condor_wait on the previous job's log file to wait for its completion.

2. I use "tail -n +2" to skip the first line of the output file because the Singularity container is from NVIDIA, and it gripes if there are no NVIDIA binaries on the host system. The gripe contains color-changing control characters, which make the mail client assume it's a non-ASCII message and attach it as a .bin file instead of including it in the body of the message. For my main job, which is just sending images to a Triton Inference Server, the lack of NVIDIA binaries doesn't matter - no GPU is needed.

3. The "mailx" command uses '-r' to set the sender, '-s' to set the subject line, and '-a' to attach the JSON file generated by the previous job, and the recipients are pulled from the job's NotifyUser attribute via $$([NotifyUser]). I reckon "$(NOTIFY_USER)" would work there as well.

3a. I use the $$([]) expression to do the ClusterId math, rather than a submit description macro, because at that point in the submit process the new ClusterId hasn't yet been assigned and defaults to 1, and with multiple clusters in a single submission that tends to confuse condor_submit.

4. Finally, the HTCondor log file needed by condor_wait is removed to avoid cluttering the output directory.
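
For illustration, suppose the main job landed in a hypothetical cluster 1234, so the follow-up job is cluster 1235 and $$([ClusterId-1]) resolves to 1234. The inline script then expands to the equivalent of:

    # $$([ClusterId-1]) resolved to the main job's cluster at match time
    export J=1234

    # Block for up to an hour until job 1234.0 completes
    condor_wait -wait 3600 wafer-infer-$J.log $J.0

    # Drop the NVIDIA gripe on the first line, then mail the rest of the
    # output with the predictions file attached
    tail -n +2 wafer-infer-$J.0.out |
        mailx -r "Wafer Defect Mailer <no-reply@xxxxxxxxxxxx>" \
              -s "Inference Results for job $J.0" \
              -a predictions-$J.0.json \
              michael.v.pelletier@xxxxxxxxxxxx

    # Tidy up the log file condor_wait was watching
    rm -f wafer-infer-$J.log

All of this runs outside the container, on whatever node the follow-up job lands on, which is what makes the e-mailing possible.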

    START_DELAY = 60
    requirements = (CurrentTime - QDate > $(START_DELAY))

Next, I set up the follow-up job to wait 60 seconds after queueing before starting up, to give the main job a head start; you can set this delay to whatever's appropriate. Since it's a vanilla universe job it consumes a memory and CPU allocation while it runs, so delaying its startup saves a bit of resources. While the obvious approach here would be a "local" universe job, condor_submit isn't geared to allow a change of job universe within a single submit description - if you set "universe = local" for the second job, it won't start.

    description = $(MAIL_FROM_NAME) for $$([ClusterId-1]).0
    transfer_executable = false
    should_transfer_files = no
    transfer_input_files =
    MY.Args =
    MY.SingularityImage =
    MY.SingularityBindDirs =
    output = /dev/null
    error = /dev/null
    log = /dev/null
    request_memory = 1
    request_disk = 1
    rank = 0

All of the above override the relevant settings from the main job, which transfers in a batch of images and runs inside a Singularity container. Obviously, none of that is needed for a simple shell one-liner.

So upon submission, the two jobs go into the queue; the second waits 60 seconds before starting to run, then waits for the first to complete, and finally e-mails its output and the .json file it generated to the addresses in the comma-separated notify_user list.
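
For reference, here's the whole thing pulled together into one submit description - a sketch, with hypothetical stand-ins for my site-specific executable, input files, and container image:

    # Main job: sends an image batch to the Triton server from inside
    # the container. The executable, input list, and image below are
    # hypothetical placeholders.
    universe             = vanilla
    executable           = run-inference.sh
    transfer_input_files = image-batch/
    MY.SingularityImage  = "/images/triton-client.sif"
    output               = wafer-infer-$(Cluster).$(Process).out
    log                  = wafer-infer-$(Cluster).log
    Notify_user          = michael.v.pelletier@xxxxxxxxxxxx
    Notification         = Error
    Queue 1

    # Follow-up job: waits on the cluster queued above, then mails results
    MAIL_FROM_NAME = Wafer Defect Mailer
    START_DELAY    = 60
    executable     = /bin/sh
    arguments      = " -c 'export J=$$([ClusterId-1]) ; condor_wait -wait 3600 wafer-infer-$(DOLLAR)J.log $(DOLLAR)J.0; tail -n +2 wafer-infer-$(DOLLAR)J.0.out | mailx -r ""$(MAIL_FROM_NAME) <no-reply@xxxxxxxxxxxx>"" -s ""Inference Results for job $(DOLLAR)J.0"" -a predictions-$(DOLLAR)J.0.json $$([NotifyUser]) ; rm -f wafer-infer-$(DOLLAR)J.log' "
    requirements   = (CurrentTime - QDate > $(START_DELAY))
    description    = $(MAIL_FROM_NAME) for $$([ClusterId-1]).0
    transfer_executable   = false
    should_transfer_files = no
    transfer_input_files  =
    MY.Args               =
    MY.SingularityImage   =
    MY.SingularityBindDirs =
    output         = /dev/null
    error          = /dev/null
    log            = /dev/null
    request_memory = 1
    request_disk   = 1
    rank           = 0
    Queue 1

Note the second "Queue 1" - each Queue statement submits the job as defined up to that point, so the later settings affect only the follow-up cluster.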

Michael V. Pelletier
Information Technology
Digital Transformation & Innovation
Integrated Defense Systems
Raytheon Company