
Re: [HTCondor-users] Problem with job submission with large input



I did a good bit of experimentation on it, and wound up more confused than before.

It appears that keeping a submission under 8 GB worth of input data consistently works in this instance, and that's the workaround I implemented. (On the plus side, the system can now split large inputs across multiple jobs to run in parallel, based on a user config setting.)
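For anyone curious, the splitting amounts to something like the sketch below. This is illustrative only -- split_inputs and the 8 GB default are not our actual code, and in practice the limit comes from the user config setting mentioned above:

import os

def split_inputs(dirs, max_bytes=8 * 1024**3):
    # Group input directories into chunks whose combined size stays
    # under max_bytes; each chunk becomes its own condor_submit.
    chunks, current, current_size = [], [], 0
    for d in dirs:
        # Total size of all regular files under this directory.
        size = sum(os.path.getsize(os.path.join(root, name))
                   for root, _, files in os.walk(d)
                   for name in files)
        if current and current_size + size > max_bytes:
            chunks.append(current)
            current, current_size = [], 0
        current.append(d)
        current_size += size
    if current:
        chunks.append(current)
    return chunks

A directory that is by itself larger than the limit still ends up in a chunk of its own, since the split is done at directory granularity.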

However, I also tested the total length of the list of files and directories produced by the OS walk, and a couple of other variables, to no avail.

I also thought a timeout might be at play, but the total submission time was consistently under 30 seconds. I'll try tweaking the submit timeout and see what happens. I assume it's either a client-side or a schedd setting; I'll reconfig the schedd just in case.
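Concretely, I'm planning to try something along these lines (job.sub stands in for my actual submit description, and I'm assuming the usual _CONDOR_ environment-override syntax applies to this knob):

_CONDOR_SUBMIT_TIMEOUT_MULTIPLIER=100 condor_submit job.sub

and, if a persistent setting turns out to be needed, SUBMIT_TIMEOUT_MULTIPLIER = 100 in the schedd's configuration followed by condor_reconfig -daemon schedd.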

Michael V Pelletier
Principal Engineer

Raytheon Technologies
Information Technology
Digital Transformation & Innovation


-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Greg Thain
Sent: Thursday, August 20, 2020 3:22 PM
To: Michael Pelletier via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Subject: [External] Re: [HTCondor-users] Problem with job submission with large input

Michael:


I don't know of any per-byte limit on file transfer, but there is a timeout when submit talks to the schedd -- could you be hitting this? Try setting either

_CONDOR_SUBMIT_TIMEOUT_MULTIPLIER=100 condor_submit submit-args


or maybe

_CONDOR_TOOL_TIMEOUT_MULTIPLIER=100 condor_submit submit-args


and see if you still hit the problem.

-greg

On 8/18/20 10:14 AM, Michael Pelletier via HTCondor-users wrote:
> Hello,
>
> I've got a job submission that accepts a list of input directories as an argument on the condor_submit command line, and when that list grows beyond a certain point, the submission fails:
>
> Submitting job(s)
> 08/18/20 11:10:44 condor_write() failed: send() 80 bytes to schedd at <127.0.0.1:9618> returned -1, timeout=0, errno=32 Broken pipe.
> 08/18/20 11:10:44 Buf::write(): condor_write() failed
>
> ERROR: Failed submission for job 7269.-1 - aborting entire submit
>
> ERROR: Failed to queue job.
>
> An strace of condor_submit shows the send() system call failing with EPIPE.
>
> A submission containing 26,483 megabytes worth of inputs succeeds, but one with 29,180 megabytes fails. There's 136 GB of free space in the /var/lib/condor/execute and spool filesystem, so I'm not running out of space. I'm also not spooling, so condor_submit is just enumerating the input files rather than moving them anywhere.
>
> For the working submission, there are 35,744 individual files and directories in the inputs, and the combined length of all the input file paths is 2,999,469 bytes. Adding the directory that causes the failure brings that to 38,892 items and 3,263,860 bytes.
>
> Am I exceeding a buffer size limit associated with the file transfer enumeration, perhaps? Or is there some issue with condor_submit's communication with the schedd that's tripping things up, either a size limit or a timeout? I don't see anything in SchedLog at default debug levels.
>
>
> Michael V Pelletier
> Principal Engineer
>
> Raytheon Technologies
> Information Technology
> 50 Apple Hill Drive
> Tewksbury, MA 01876-1198
>
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/