Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [HTCondor-users] Matchmaking errors

Date: Thu, 12 Apr 2018 13:50:45 +0200
From: Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx>
Subject: Re: [HTCondor-users] Matchmaking errors

Hi Christoph,

On 12.04.2018 at 13:18, Beyer, Christoph wrote:
> Hi,
> 
> just my 2 cents: the shadow daemon on the scheduler is not able to write the user log file in the location defined (probably in the submit file), hence the job is not transferred to the worker node but goes back to the queue.

Thanks for the ideas! 
The problem is that the file is definitely accessible and writable. The user submitted ~2000 jobs,
all writing log files to the same directory. 
It does not even matter whether the user puts the log files in the home directory (only accessible on the schedd machine, not on the worker node)
or in the "cluster directory" (accessible from both the schedd machine and the worker node).
Homes are on GPFS and the "cluster directory" is on CephFS, so it does not seem likely to be a filesystem issue (or maybe a complicated one...).
The log file in question was even created and contains several entries(!). 

In all cases, a random subset of jobs fails. 

> 
> As the same problem occurs on the next 'run' the job ends up being removed after JobRunCount > 10 (?)
> 
> Best
> Christoph
> 

What may be true, though, for both filesystems, is that the creation of many small files in one directory comes at the price of latency. 
Is it possible that doWriteEvent() fails hard if the write attempt takes more than a few hundred milliseconds?
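
To check the latency guess independently of HTCondor, something like the following rough sketch could be run against the log directory. It appends one line to many small files and reports the slowest write. The function names, the file naming, the use of fsync and the 100 ms threshold are all assumptions made for illustration; it does not reproduce how doWriteEvent() actually writes.

```python
import os
import tempfile
import time


def time_appends(directory, n_files=200, payload=b"test event\n"):
    """Append one line to each of n_files files; return per-write latencies in seconds."""
    latencies = []
    for i in range(n_files):
        # Hypothetical file naming, not HTCondor's.
        path = os.path.join(directory, f"latency_test_{i}.log")
        start = time.perf_counter()
        # Open in append mode, write, and force the data to storage so the
        # timing covers file creation and the flush (fsync is an assumption).
        with open(path, "ab") as fh:
            fh.write(payload)
            fh.flush()
            os.fsync(fh.fileno())
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies, threshold=0.1):
    """Return (worst latency, number of writes slower than threshold seconds)."""
    slow = sum(1 for t in latencies if t > threshold)
    return max(latencies), slow


if __name__ == "__main__":
    # A temporary directory is used here; to test the real filesystems, pass a
    # scratch directory on GPFS or CephFS to time_appends() instead.
    with tempfile.TemporaryDirectory() as d:
        worst, slow = summarize(time_appends(d))
        print(f"worst write: {worst * 1000:.1f} ms, writes over 100 ms: {slow}")
```

If the worst case stays far below a few hundred milliseconds even with ~2000 files, latency alone becomes a less likely explanation.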

Cheers,
	Oliver


