
[HTCondor-users] Problem with job submission with large input



Hello,

I've got a job submission that accepts a list of input directories as an argument on the condor_submit command line, and when that list grows beyond a certain point, the submission fails:

Submitting job(s)
08/18/20 11:10:44 condor_write() failed: send() 80 bytes to schedd at <127.0.0.1:9618> returned -1, timeout=0, errno=32 Broken pipe.
08/18/20 11:10:44 Buf::write(): condor_write() failed

ERROR: Failed submission for job 7269.-1 - aborting entire submit

ERROR: Failed to queue job.
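
In case the shape of the submission matters, it looks roughly like this; job.sub, process.sh, and input_dirs are simplified placeholders for the real names and paths:

    # job.sub: the directory list arrives as a macro on the command line
    executable              = process.sh
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    transfer_input_files    = $(input_dirs)
    queue

    # invocation, passing the comma-separated directory list as a submit variable
    condor_submit input_dirs="/data/run01,/data/run02" job.sub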

An strace of the condor_submit run confirms the EPIPE coming back from the underlying system call.
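
For reference, that trace came from something along these lines (the syscall filter and output file name are incidental):

    strace -f -e trace=network -o condor_submit.strace \
        condor_submit input_dirs="/data/run01,/data/run02" job.sub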

A submission containing 26,483 megabytes worth of inputs succeeds, but one with 29,180 megabytes fails. There's 136 GB of free space in the filesystem holding /var/lib/condor/execute and the spool directory, so I'm not running out of space. I'm also not spooling, so condor_submit is just enumerating the input files rather than moving them anywhere.

For the working submission, there are 35,744 individual files and directories in the inputs, and the combined length of all the input file paths is 2,999,469 bytes. Adding the directory that tips it into failure brings that to 38,892 items and 3,263,860 bytes of path length.
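
Those figures came from something roughly like this, with $DIRS standing in for the actual space-separated directory list:

    # count every file and directory under the inputs
    find $DIRS | wc -l

    # total length, in bytes, of all the input paths
    find $DIRS | awk '{ total += length($0) } END { print total }'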

Am I exceeding a buffer size limit associated with the file transfer enumeration, perhaps? Or is there some issue with condor_submit's communication with the schedd that's tripping things up, either a size limit or a timeout? I don't see anything in SchedLog at default debug levels.
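
The next thing I can try is turning up the verbosity on both ends, along these lines (standard knobs, nothing site-specific):

    # more detail from the submit tool itself
    condor_submit -debug input_dirs="/data/run01,/data/run02" job.sub 2> submit-debug.log

    # more detail from the schedd: add this to the local config, then reconfig
    #   SCHEDD_DEBUG = D_FULLDEBUG
    condor_reconfig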


Michael V Pelletier
Principal Engineer

Raytheon Technologies
Information Technology
50 Apple Hill Drive
Tewksbury, MA 01876-1198