
[HTCondor-users] shared fs high latency slow down the schedd



Hi

We have an HTCondor cluster for local job submission that relies on a shared filesystem.


***
[root@ettore ~]# condor_schedd -version
$CondorVersion: 8.4.6 Apr 20 2016 BuildID: 364106 $
$CondorPlatform: x86_64_RedHat6 $
[root@ettore ~]#

[root@ettore ~]# condor_config_val -dump FILESYSTEM_DOMAIN

FILESYSTEM_DOMAIN = GPFS

[italiano@ui02 ~]$ condor_config_val -dump FILESYSTEM_DOMAIN
FILESYSTEM_DOMAIN = GPFS
***
Every time the filesystem experiences high latency while accessing files, for example during a restripe operation, the schedd serving local job submission hangs. In this state a condor_reconfig takes several minutes to be applied.

So it seems that slow filesystem performance degrades the schedd's response time, and sometimes the schedd becomes completely unresponsive:

***
[root@ettore ~]# condor_q

-- Failed to fetch ads from: <90.147.169.224:38705> : ettore.recas.ba.infn.it
SECMAN:2007:Failed to end classad message.
[root@ettore ~]#
***
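
To check whether the hangs really line up with filesystem latency, a probe along these lines could be run on the submit host while the problem is occurring. This is only a minimal sketch under stated assumptions: the function names and the `PROBE_DIR` environment variable are hypothetical, the directory must be pointed at a path on the shared filesystem, and it assumes `condor_q` is on the PATH. It times a small create/fsync/stat/delete cycle and, separately, how long `condor_q` takes to answer.

```python
import os
import subprocess
import tempfile
import time


def probe_fs_latency(directory):
    """Return the seconds taken to create, fsync, stat and delete a small file."""
    start = time.monotonic()
    fd, path = tempfile.mkstemp(dir=directory, prefix="latency_probe_")
    try:
        os.write(fd, b"x" * 4096)
        os.fsync(fd)  # force the write out to the filesystem
    finally:
        os.close(fd)
    os.stat(path)
    os.unlink(path)
    return time.monotonic() - start


def time_command(argv, timeout=60):
    """Return (seconds, exit_code).

    exit_code is None if the command timed out or could not be started.
    """
    start = time.monotonic()
    try:
        code = subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        ).returncode
    except (subprocess.TimeoutExpired, OSError):
        code = None
    return time.monotonic() - start, code


if __name__ == "__main__":
    # PROBE_DIR is a hypothetical variable: set it to a directory on the
    # shared filesystem. It falls back to the local temp dir otherwise.
    shared_dir = os.environ.get("PROBE_DIR", tempfile.gettempdir())
    fs_seconds = probe_fs_latency(shared_dir)
    q_seconds, q_code = time_command(["condor_q"], timeout=60)
    print(f"fs probe in {shared_dir}: {fs_seconds:.3f}s")
    print(f"condor_q: {q_seconds:.3f}s, exit code {q_code}")
```

Running it periodically during a restripe and during normal operation would show whether the two timings rise together. It does not explain why the schedd blocks, only whether the correlation holds.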

Is there a way to preserve schedd functionality in such situations?

In the same cluster there are also other schedds serving grid jobs, which are NOT affected by this behaviour.

Thanks in advance for any hints you would like to share.

Ale


