
Re: [HTCondor-users] Appending file output for a vanilla job

HTCondor's vanilla universe presumes that restarting a job means restarting it from the beginning, so appending to a prior run's stdout doesn't fit well in that model. You may find it easier to write the output to a separate file that doesn't have special meaning to HTCondor. The program could check whether it's running as a condor job, and behave differently if so, by seeing if the environment variable _CONDOR_SCRATCH_DIR is defined.
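
A minimal Python sketch of that check, not tested on a real pool. The helper names and the default "results.txt" path are made up for illustration; only _CONDOR_SCRATCH_DIR comes from HTCondor. The file also has to live somewhere that survives an eviction (a shared filesystem, for example), which this sketch does not handle.

```python
import os
import sys


def running_under_htcondor(environ=os.environ):
    """Return True if _CONDOR_SCRATCH_DIR is defined in the environment."""
    return "_CONDOR_SCRATCH_DIR" in environ


def write_result(line, path="results.txt", environ=os.environ):
    """Append one result line to `path` under HTCondor, else print to stdout.

    `path` is a hypothetical file name chosen for this example.
    """
    if running_under_htcondor(environ):
        # Append mode: a restarted job adds to what earlier runs wrote.
        with open(path, "a") as f:
            f.write(line + "\n")
            f.flush()
            os.fsync(f.fileno())  # push the line to disk before moving on
    else:
        sys.stdout.write(line + "\n")
        sys.stdout.flush()
```

The main loop would then call write_result(str(result)) in place of print(result).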

You might also be able to set up a PostCmd to copy the _condor_stdout file to a new file name:

MY.PostCmd = "../../../../../../../../bin/sh"
MY.PostArguments = "-c '/bin/cp -p _condor_stdout savedoutput.txt' "

Haven't tried this myself, so caveat emptor. Not sure if this runs during an eviction. Probably simpler to update the code to write the output directly.

You'll also want to take a look at the Vanilla Universe Checkpoint capability. I think that'll give you a more elegant solution. It's a supported feature in the 8.9 release, but it's implemented in the 8.8 release without any documentation or condor_submit keywords. This means you need to set the ClassAds directly like PostCmd, rather than using a submit macro like "checkpoint_exit_code."
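
As a rough sketch of how the job side of that might look: as I understand the feature, the job saves its state, exits with the agreed checkpoint exit code, and HTCondor then saves the checkpoint files and starts the job again. The value 85, the function name, the "results.txt" path and the batch size are all assumptions for illustration, and the ClassAd attributes that tell HTCondor which files to save are not shown. Check the 8.9 manual before relying on any of this.

```python
import os

# Assumed value; it must match whatever exit code the job ad declares.
CHECKPOINT_EXIT_CODE = 85


def run_batch(compute, results_path="results.txt", batch_size=6):
    """Append `batch_size` results to `results_path`, then return the
    exit code that asks for a checkpoint.

    `compute` is any zero-argument callable returning one result.
    """
    with open(results_path, "a") as f:
        for _ in range(batch_size):
            f.write(f"{compute()}\n")
            f.flush()
            os.fsync(f.fileno())  # make each result durable as it is written
    return CHECKPOINT_EXIT_CODE
```

The program would end with something like sys.exit(run_batch(my_calculation)), where my_calculation stands in for the ten-minute computation.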


Michael V Pelletier
Principal Engineer

Raytheon Technologies
Digital Technology
HPC Support Team

-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Duncan Brown via HTCondor-users
Sent: Wednesday, March 3, 2021 4:51 PM
To: htcondor-users@xxxxxxxxxxx
Cc: Duncan Brown <dabrown@xxxxxxx>; Erick Leon <eleon02@xxxxxxx>
Subject: [External] [HTCondor-users] Appending file output for a vanilla job

Hi all,

I'm trying to do something that feels like it should be HTCondor 101, but I am failing to figure it out:

We have a python program running in the vanilla universe that looks like

while True:
   s = random_number_from( /dev/urandom )
   result = calculation_that_takes_about_ten_minutes( s )

The jobs are running on our OrangeGrid which consists of transient execute machines that have an average lifetime of 4 hours. We have

output = result.$(cluster).$(process)
stream_output = true

We then accumulate a bunch of results by cat-ing result.$(cluster).$(process) together. This works great while the jobs are running.

The problem is that if a job gets evicted by the execute machine and restarted, then the stdout file gets clobbered when the job starts back up again. We would just like to accumulate results from a bunch of jobs. The result files are simple enough that if the job got evicted while it was writing an ascii line to stdout, we can filter that out.
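
For concreteness, the filtering could look something like the sketch below. The function names and the "result.*" glob pattern are illustrative. It only drops a truncated final line in each file; a truncated line followed by later output in the same file would need a format check on each line.

```python
import glob


def complete_lines(text):
    """Return the newline-terminated lines of `text`, without the newlines.

    Anything after the last newline is treated as a truncated write
    and dropped.
    """
    return text.split("\n")[:-1]


def gather_results(pattern="result.*"):
    """Collect complete lines from every file matching `pattern`."""
    results = []
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            results.extend(complete_lines(f.read()))
    return results
```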

I cannot figure out how to prevent condor from clobbering stdout when the job is restarted. I also can't figure out how to stream to files that are not stdout or stderr. Writing to a specific file and using append_files won't work, as the code is python and not standard universe. The only solution I can come up with is to:

1. Add transfer_input_files = result.$(cluster).$(process) to my submit file,

2. Submit the job into the held state to get the $(cluster) number,

3. Touch a bunch of result.$(cluster).$(process) files so they exist and are zero bytes.

4. Have my program cat result.$(cluster).$(process) to stdout at startup

5. Write print(result) to stdout and have condor stream stdout.
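
The program side of steps 4 and 5 might look roughly like this. The function names are made up, and the path of the transferred result file is assumed to be passed in by the caller (on the command line, for example).

```python
import sys


def replay_previous(path, out=sys.stdout):
    """Copy the complete lines of an earlier run's results to `out`.

    Returns the number of lines replayed. A missing file, or a zero-byte
    file as created in step 3, replays nothing.
    """
    try:
        with open(path) as f:
            previous = f.read()
    except FileNotFoundError:
        return 0
    # Keep everything up to and including the last newline, so a line
    # cut off by an eviction is not replayed.
    end = previous.rfind("\n") + 1
    out.write(previous[:end])
    out.flush()
    return previous[:end].count("\n")


def emit(result, out=sys.stdout):
    """Write one result line and flush so streaming sees it promptly."""
    print(result, file=out, flush=True)
```

At startup the program would call replay_previous once, then call emit(result) inside the while loop.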

It feels like there has to be an easier way of doing this. What's the obvious thing that I'm missing?



Duncan Brown                              Room 263-1, Physics Department
Charles Brightman Professor of Physics     Syracuse University, NY 13244
Physics Graduate Program Director     http://dabrown.expressions.syr.edu
