
Re: [HTCondor-users] Appending file output for a vanilla job



Hi Thomas,

Thanks, it looks like chirp is the solution. I can use condor_chirp put to send the results back to the submit machine.
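
For the archives, here is roughly what I have in mind (an untested sketch; it assumes condor_chirp is on the job's PATH, the file names and the RESULT_FILE environment variable are placeholders, and the submit file may also need +WantIOProxy = True for chirp to work):

    import os
    import random
    import subprocess
    import time

    rng = random.SystemRandom()   # draws entropy from /dev/urandom

    def calculation_that_takes_about_ten_minutes(s):
        time.sleep(1)             # stand-in for the real calculation
        return 2 * s

    local_file = "local_results.txt"
    # Destination name in the job's directory on the submit machine
    # (a placeholder; in practice it could be passed in by the submit file).
    remote_file = os.environ.get("RESULT_FILE", "result.out")

    with open(local_file, "a") as out:
        while True:
            result = calculation_that_takes_about_ten_minutes(rng.random())
            print(result, file=out, flush=True)
            # Copy the accumulated results back to the submit-side directory
            # after every iteration with condor_chirp put.
            subprocess.run(["condor_chirp", "put", local_file, remote_file],
                           check=False)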

Cheers,
Duncan.

> On Mar 4, 2021, at 4:28 AM, Thomas Hartmann <thomas.hartmann@xxxxxxx> wrote:
> 
> Hi Duncan,
> 
> if the results are small enough, maybe you can use `condor_chirp` from within the job to store/update the results as job ClassAd attributes? [1] Alternatively, with condor_chirp the job could probably send a status/result file back, or write its results into the job's user log (with a "grep'able" tag in the log, the results could maybe be harvested from the collected job logs later).
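> 
> As a rough sketch of the ClassAd variant (untested, and it assumes condor_chirp is on the job's PATH; "LatestResult" is just a made-up attribute name, and note that ClassAd string values need the embedded quotes):
> 
>     import subprocess
>     # Store the newest result as a job ClassAd attribute; it can then be
>     # read on the submit side with e.g. condor_q -af LatestResult.
>     subprocess.run(["condor_chirp", "set_job_attr", "LatestResult", '"42.7"'])
>     # Or append a grep'able line to the job's user log instead.
>     subprocess.run(["condor_chirp", "ulog", "RESULT 42.7"])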
> 
> If your jobs' workflows are somewhat complex, maybe they can be realized as a DAG [2] - but that might be overkill for just a few simple jobs.
> 
> Cheers,
>  Thomas
> 
> 
> [1]
> https://htcondor.readthedocs.io/en/latest/man-pages/condor_chirp.html?highlight=condor_chirp
> 
> 
> [2]
> https://htcondor.readthedocs.io/en/latest/users-manual/dagman-workflows.html#capturing-the-status-of-nodes-in-a-file
> 
> 
> On 03/03/2021 22.50, Duncan Brown via HTCondor-users wrote:
>> Hi all,
>> I'm trying to do something that feels like it should be HTCondor 101, but I am failing to figure it out:
>> We have a Python program running in the vanilla universe that generates results in a loop that looks like:
>> import random
>> rng = random.SystemRandom()   # draws from /dev/urandom
>> while True:
>>     s = rng.random()
>>     result = calculation_that_takes_about_ten_minutes(s)
>>     print(result)
>> The jobs are running on our OrangeGrid, which consists of transient execute machines with an average lifetime of about 4 hours. In the submit file we have
>> output = result.$(cluster).$(process)
>> stream_output = true
>> We then accumulate a bunch of results by cat-ing result.$(cluster).$(process) together. This works great while the jobs are running.
>> The problem is that if a job gets evicted from the execute machine and restarted, the stdout file gets clobbered when the job starts back up. We would just like to accumulate results from a bunch of jobs. The result files are simple enough that if the job got evicted while it was writing an ASCII line to stdout, we can filter that out.
>> I cannot figure out how to prevent HTCondor from clobbering stdout when the job is restarted. I also can't figure out how to stream to files other than stdout or stderr. Writing to a specific file and using append_files won't work, as the code is Python and not a standard universe job. The only solution I can come up with is to:
>> 1. Add transfer_input_files = result.$(cluster).$(process) to my submit file,
>> 2. Submit the job into the held state to get the $(cluster) number,
>> 3. Touch a bunch of result.$(cluster).$(process) files so they exist and are zero bytes.
>> 4. Have my program cat result.$(cluster).$(process) to stdout at startup
>> 5. Write print(result) to stdout and have condor stream stdout.
>> It feels like there has to be an easier way of doing this. What's the obvious thing that I'm missing?
>> Cheers,
>> Duncan.
> 

-- 

Duncan Brown                              Room 263-1, Physics Department
Charles Brightman Professor of Physics     Syracuse University, NY 13244
Physics Graduate Program Director     http://dabrown.expressions.syr.edu