
[HTCondor-users] log files on lustre



I've run across an odd issue.  If I submit a job with a large queue
count (> 10k), and I set the Log (single file), Output (one per
ProcId), and Error (one per ProcId) parameters in the submit file to
paths on a Lustre filesystem, the job will run, but it drives the load
on the submit machine up over 2000 and the jobs basically sit in
Claimed/Idle.
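
In case it's useful, a stripped-down sketch of the submit file looks
roughly like this (the executable name and paths are placeholders, not
the real ones):

    # one shared user log, but per-process stdout/stderr
    executable = my_job.sh
    log        = /lustre/scratch/jobs/run.log
    output     = /lustre/scratch/jobs/run.$(Cluster).$(Process).out
    error      = /lustre/scratch/jobs/run.$(Cluster).$(Process).err
    queue 20000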

If I change only the Log parameter to point at an NFS (NetApp)
filesystem, the job submits and runs normally: no high load, no
Claimed/Idle states.

The Lustre filesystem is more than large enough to handle the file I/O
load, and it's not currently under any load.

Has anyone seen this, or something like it, before?

Any thoughts on what Condor might be doing differently when writing
the log file on NFS as opposed to Lustre?

Any recommendations on tracing the system calls to see what Condor
might be doing?  strace on the schedd works, but it produces too much
data and I'm not sure how to whittle it down into anything useful.
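
For the record, what I've been running is roughly along these lines
(the output path and syscall list are just guesses at what matters):

    # attach to the running schedd, follow its children, timestamp
    # each call, and only record file/locking-related syscalls
    strace -f -tt -T -e trace=open,write,fsync,fcntl,flock \
           -o /tmp/schedd.strace -p $(pidof condor_schedd)

    # then pull out just the lines touching the lustre-hosted log
    grep /lustre /tmp/schedd.strace | less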

I'm running Condor 8.4.0 on RHEL 6.7 x86_64.