
[HTCondor-users] submit failures with huge transfer_input_files lists



Hello,

We are running HTCondor 8.8.9 and have observed an issue when submitting
jobs with a very large `transfer_input_files` list in the submit file. The
list contains about 46,000 entries, and this single line is 8.9 MB.

Occasionally the submission succeeds. It always succeeds with a shorter
`transfer_input_files` list.
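For context, a simplified sketch of how such a line ends up this large (the directory layout and helper below are hypothetical, not our actual tooling):

```python
from pathlib import Path

def build_transfer_line(input_dir):
    """Build a transfer_input_files line for a submit file from every
    file found under input_dir, and report how big it gets."""
    files = sorted(str(p) for p in Path(input_dir).rglob("*") if p.is_file())
    # HTCondor expects a single comma-separated list on one line.
    line = "transfer_input_files = " + ",".join(files)
    return line, len(files), len(line.encode())
```

With ~46,000 paths averaging a couple of hundred bytes each, the line reaches the several-megabyte size we see.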

With the full list, the submit process often aborts with:

ERROR: Failed submission for job 47724231.-1 - aborting entire submit
ERROR: Failed to queue job.

In the SchedLog we also see:

04/14/23 09:20:29 (pid:3421) condor_read(): timeout reading 5 bytes from <x.x.x.x:1071>.
04/14/23 09:20:29 (pid:3421) IO: Failed to read packet header

and often subsequently:

04/14/23 09:20:29 (pid:3421) TransferQueueManager stats: active up=0/100 down=0/100; waiting up=0 down=0; wait time up=0s down=0s
04/14/23 09:20:29 (pid:3421) TransferQueueManager upload 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
04/14/23 09:20:29 (pid:3421) TransferQueueManager download 1m I/O load: 352 bytes/s  0.000 disk load  0.000 net load

We haven't enabled any debug output yet.

I suspect we are hitting a timeout or some other limit, but I have not
found anything in the configuration that would explain or resolve this.
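One workaround we are considering (our own sketch, not something taken from the HTCondor documentation; paths and names are hypothetical) is to bundle all inputs into a single tar archive, so `transfer_input_files` holds only one entry, and unpack it in the job wrapper:

```python
import tarfile
from pathlib import Path

def bundle_inputs(file_list, archive_path):
    """Pack all input files into one gzipped tar archive so that
    transfer_input_files can reference a single file instead of
    ~46,000 individual entries."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for f in file_list:
            # Store files flat by basename; adjust arcname if the
            # directory structure must be preserved inside the job.
            tar.add(f, arcname=Path(f).name)
    return archive_path
```

The job wrapper would then run `tar xzf inputs.tar.gz` before starting the actual payload. Would that sidestep whatever limit we are hitting, or is there a supported way to keep the long list?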

Is this behavior expected? Can we circumvent this?
Do we need to provide more information?

Thank you,
Henning