
[Condor-users] condor_schedd freeze/condor_submit very slow



Hi,

I have a problem with condor_schedd:
if there has been no activity in the queue for about an hour (no running
or idle jobs, no submissions), the next submit is very slow. Other
commands work fine (condor_q and condor_status, for example).

My pool is composed of 17 execution nodes (Gentoo Linux), one central
manager (Negotiator/Collector/CondorView, Debian sarge) and one
submission node (schedd only, Debian Linux). All run Condor 6.7.7.
The pool is not in production, so there is very little activity on it.


Here is an example:
#########################################################
baccon@nasca:~/bzip_bench$ condor_q


-- Submitter: nasca : <10.5.129.14:9551> : nasca
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD

0 jobs; 0 idle, 0 running, 0 held
baccon@nasca:~/bzip_bench$ condor_submit uncompress_100M.sub 
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 255.
baccon@nasca:~/bzip_bench$ 
#########################################################
The first condor_q works fine, but condor_submit takes about 15 minutes,
as shown in the condor_schedd log:
#########################################################
8/30 17:15:14 DaemonCore: Command received via TCP from host <10.5.129.14:9432>
8/30 17:15:14 DaemonCore: received command 1111 (QMGMT_CMD), calling handler (handle_q)
8/30 17:15:14 condor_read(): Socket closed when trying to read buffer
8/30 17:15:14 QMGR Connection closed
8/30 17:15:31 DaemonCore: Command received via TCP from host <10.5.129.14:9001>
8/30 17:15:31 DaemonCore: received command 1111 (QMGMT_CMD), calling handler (handle_q)
8/30 17:31:04 AUTHENTICATE_FS: used file /tmp/qmgr_aqbMWt, status: 1
8/30 17:31:04 OwnerCheck retval 1 (success),no ad
8/30 17:31:04 OwnerCheck retval 1 (success),no ad
8/30 17:31:04 get_file(): going to write to filename /u/condor/spool/cluster255.ickpt.subproc0
#########################################################
All commands sent to condor_schedd between 17:15 and 17:30 hang (condor_q,
for example), but condor_status works fine.
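To spot stalls like this one in a longer SchedLog, one can scan the timestamps for large gaps between consecutive entries. Below is a minimal Python sketch of that idea (my own helper, not a Condor tool). It assumes each entry starts with the `M/D HH:MM:SS` prefix seen in the excerpt above; the `find_gaps` name, the 5-minute threshold and the placeholder year (the log has no year; 2000 is used only so that 2/29 parses) are arbitrary choices.

```python
import re
import sys
from datetime import datetime, timedelta

# Matches the "M/D HH:MM:SS " prefix of SchedLog entries as shown above.
# Assumption: each entry begins on its own line with this prefix; lines
# without it (e.g. wrapped continuations) are skipped.
STAMP = re.compile(r"^(\d{1,2})/(\d{1,2}) (\d{2}):(\d{2}):(\d{2}) ")


def find_gaps(lines, threshold=timedelta(minutes=5), year=2000):
    """Return (previous_entry, next_entry, gap) for every pair of
    consecutive timestamped entries more than `threshold` apart."""
    gaps = []
    prev_time = prev_line = None
    for line in lines:
        m = STAMP.match(line)
        if not m:
            continue
        month, day, hh, mm, ss = map(int, m.groups())
        t = datetime(year, month, day, hh, mm, ss)
        if prev_time is not None and t - prev_time > threshold:
            gaps.append((prev_line.rstrip(), line.rstrip(), t - prev_time))
        prev_time, prev_line = t, line
    return gaps


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python find_gaps.py /path/to/SchedLog
    with open(sys.argv[1]) as f:
        for before, after, gap in find_gaps(f):
            print(f"{gap} gap:\n  {before}\n  {after}")
```

On an idle pool a gap may just mean nothing happened; what looks suspicious in the excerpt above is the gap between receiving the QMGMT_CMD at 17:15:31 and the AUTHENTICATE_FS line at 17:31:04.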

The file used for authentication (/tmp/qmgr_aqbMWt) is created at 17:15
and is owned by baccon, which seems fine, yet the schedd only logs using
it at 17:31:04.

There is nothing unusual in the master, collector, or negotiator logs.

I don't understand what causes this problem. Could you help me?


Best regards,

Jean-Christophe Baccon

-- 
Jean-Christophe Baccon
Service Informatique de Recherche
Division Informatique
Université de Cergy-Pontoise
01 34 25 70 69