
Re: [HTCondor-users] Matchmaking errors



Hi Christoph,

On 12.04.2018 at 13:18, Beyer, Christoph wrote:
> Hi,
> 
> just my 2 cents: the shadow daemon on the scheduler is not able to write the user log file to the location defined (probably in the submit file), hence the job is not transferred to the worker node but goes back to the queue.

Thanks for the ideas! 
The problem is that the file is definitely accessible and writable. The user submitted ~2000 jobs,
all writing log files to the same directory.
It does not even matter whether the user places the log in the home directory (only accessible on the schedd machine, not on the worker node)
or in the "cluster directory" (accessible from both the schedd machine and the worker node).
Homes are on GPFS and the "cluster directory" is on CephFS, so a filesystem issue does not seem likely (or it is a complicated one...).
The log file in question was even created and contains several entries(!).

In all cases, a random subset of jobs fails. 

> 
> As the same problem occurs on the next 'run' the job ends up being removed after JobRunCount > 10 (?)
> 
> Best
> Christoph
> 

What may be true for both filesystems, though, is that creating many small files in one directory comes at the price of latency.
Is it possible that doWriteEvent() fails hard if the write attempt takes more than a few hundred ms?
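
One way to check the latency side of this guess, independent of HTCondor, would be to time small appends to a file in the log directory. The following is only a rough sketch: the helper names time_appends() and summarize(), the payload, and the 100 ms threshold are made up for illustration, and the open/append/fsync/close pattern is an assumption about what a "write event" costs, not a reproduction of what the shadow actually does.

```python
import os
import tempfile
import time


def time_appends(path, n_writes=100, payload=b"probe event\n"):
    """Append payload to path n_writes times and return per-write latencies (s)."""
    latencies = []
    for _ in range(n_writes):
        start = time.perf_counter()
        # Assumed access pattern: open, append, flush to disk, close per event.
        fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
        try:
            os.write(fd, payload)
            os.fsync(fd)
        finally:
            os.close(fd)
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies, threshold=0.1):
    """Return max, median and the number of writes slower than threshold seconds."""
    ordered = sorted(latencies)
    return {
        "max": ordered[-1],
        "median": ordered[len(ordered) // 2],
        "slow": sum(1 for t in ordered if t > threshold),
    }


if __name__ == "__main__":
    # Replace with the real log directory on GPFS or CephFS to get a useful result.
    log_dir = tempfile.mkdtemp()
    print(summarize(time_appends(os.path.join(log_dir, "probe.log"))))
```

Running this on the schedd machine while the ~2000 jobs are being started, once against the home directory and once against the "cluster directory", should show whether individual writes ever take longer than a few hundred ms.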

Cheers,
	Oliver
