
[HTCondor-users] rarely - file not transferred



An announcement for HTCondor 7.9.4 just came out, which got me thinking about upgrading from 7.8.4 to perhaps fix an issue.

Each of my HTCondor jobs produces a text file with summary information and, optionally, a data file. Because the data file is not always produced, the text file serves as the trigger to look for the associated data file. A submission might contain thousands of jobs, each producing these outputs. Occasionally (4 times in ~4 months), at least one of the text files was not transferred at job completion, and the program that manages the HTCondor submissions (a task manager on the submit machine) waited for a text file that never showed up. On the occasions after the first, the matching data files were transferred to the submit machine. Every time, condor_status eventually showed that all jobs had completed, yet the task manager kept waiting. The first time, a number of text files were missing (I didn't count; roughly 10-20), and I don't know whether any data files were missing as well.

Because the data output file is optional, the task manager would not know if it failed to transfer. It had not occurred to me that this could be a problem. A workaround would be to put a flag in the text file indicating that a data file was produced, so the task manager knows to look for it. What bothers me, though, is that not all the files are making it back to the task manager.
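
A minimal sketch of that flag workaround in Python, assuming the job writes a hypothetical `DATA_FILE=<name>` line into the summary text file whenever it produces a data file. The marker, the function names, and the timeout handling are illustrative assumptions, not part of HTCondor or of the actual task manager:

```python
import os
import time

# Hypothetical marker line the job would write into its summary file.
DATA_FLAG = "DATA_FILE="


def expected_data_file(summary_path):
    """Return the data file name announced in the summary, or None."""
    with open(summary_path) as fh:
        for line in fh:
            if line.startswith(DATA_FLAG):
                name = line[len(DATA_FLAG):].strip()
                return name or None
    return None


def check_job(summary_path, timeout=0.0, poll=1.0):
    """Classify one job's output as 'no_summary', 'missing_data' or 'ok'.

    Waits up to `timeout` seconds for the summary file instead of
    waiting forever, then checks for the data file only if the
    summary says one was produced.
    """
    deadline = time.monotonic() + timeout
    while not os.path.exists(summary_path):
        if time.monotonic() >= deadline:
            return "no_summary"
        time.sleep(poll)
    name = expected_data_file(summary_path)
    if name is None:
        return "ok"  # job reported no data file
    data_path = os.path.join(os.path.dirname(summary_path), name)
    return "ok" if os.path.exists(data_path) else "missing_data"
```

With a check like this, a job whose summary never arrives is reported as `no_summary` after the timeout rather than stalling the task manager, and a lost optional data file shows up as `missing_data`. It only detects the missing files; it does not explain why the transfer failed.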

Is this a network reliability issue? Is this something that other users have seen? Has it been addressed in a patch since version 7.8.4?