Re: [HTCondor-users] Appending file output for a vanilla job

Hi Duncan,

If the results are small enough, maybe you can use `condor_chirp` from within the job to store or update them as ClassAd attributes? [1] Alternatively, with `condor_chirp` the job could probably send a status/result file back to the submit side, or write its results into the job log. With a grep-able tag on each line, the results could then be harvested from the collected job logs.
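A rough, untested sketch of the job-log idea in Python, in case it helps. It assumes `condor_chirp` is on the job's PATH on the execute machine; the `RESULT` tag, the `LastResult` attribute name and the helper functions are made up for illustration. The `ulog` and `set_job_attr` subcommands are the ones `condor_chirp` provides, but I have not checked exactly how the message is laid out in the log, so the harvesting part only relies on the tag and the value ending up on the same line.

```python
import subprocess

TAG = "RESULT"  # hypothetical marker used to find result lines later


def ulog_command(result, tag=TAG):
    """Build the argv for `condor_chirp ulog`, which adds a message to the job's event log."""
    return ["condor_chirp", "ulog", f"{tag} {result}"]


def set_attr_command(name, value):
    """Build the argv for `condor_chirp set_job_attr`; the value is passed as a quoted ClassAd string."""
    return ["condor_chirp", "set_job_attr", name, f'"{value}"']


def report(result, run=subprocess.run):
    """Send one result to the job log and keep the latest value in the job ad."""
    run(ulog_command(result), check=True)
    run(set_attr_command("LastResult", result), check=True)


def harvest(log_text, tag=TAG):
    """Collect everything that follows the tag on each tagged line of a job log."""
    results = []
    for line in log_text.splitlines():
        _, found, rest = line.partition(tag + " ")
        if found:
            results.append(rest.strip())
    return results
```

The job would call `report(result)` once per loop iteration, and `harvest()` would be run on the submit side over the contents of the job logs. For a vanilla universe job the submit file also needs `+WantIOProxy = true`, otherwise `condor_chirp` cannot talk back to the submit side.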

If your jobs' workflows are somewhat complex, maybe they can be realized as a DAG [2] - but that might be overkill for just a few simple jobs.




On 03/03/2021 22.50, Duncan Brown via HTCondor-users wrote:
Hi all,

I'm trying to do something that feels like it should be HTCondor 101, but I am failing to figure it out:

We have a Python program running in the vanilla universe that looks like

while True:
    s = random_number_from( /dev/urandom )
    result = calculation_that_takes_about_ten_minutes( s )

The jobs are running on our OrangeGrid which consists of transient execute machines that have an average lifetime of 4 hours. We have

output = result.$(cluster).$(process)
stream_output = true

We then accumulate a bunch of results by cat-ing result.$(cluster).$(process) together. This works great while the jobs are running.

The problem is that if a job gets evicted by the execute machine and restarted, then the stdout file gets clobbered when the job starts back up again. We would just like to accumulate results from a bunch of jobs. The result files are simple enough that if the job got evicted while it was writing an ascii line to stdout, we can filter that out.

I cannot figure out how to prevent condor from clobbering stdout when the job is restarted. I also can't figure out how to stream to files that are not stdout or stderr. Writing to a specific file and using append_files won't work, as the code is Python and does not run in the standard universe. The only solution I can come up with is to:

1. Add transfer_input_files = result.$(cluster).$(process) to my submit file,

2. Submit the job into the held state to get the $(cluster) number,

3. Touch a bunch of result.$(cluster).$(process) files so they exist and are zero bytes.

4. Have my program cat result.$(cluster).$(process) to stdout at startup

5. Write print(result) to stdout and have condor stream stdout.

It feels like there has to be an easier way of doing this. What's the obvious thing that I'm missing?

