
Re: [HTCondor-users] Network filesystem failed to initialize logs



On 8/29/2019 11:55 AM, Christopher Harrison via HTCondor-users wrote:
> Are you using autofs to mount the glusterfs fuse mount?  If so, my 
> guess is you have a race condition whereby autofs is not mounting before 
> the condor jobs show up.  This happened to us a lot too (we use 
> autofs).  The way we got around this is by applying a precondition to 
> the job (through a shell script) to touch a file in the directory.
> 
> I hope this helps,
>     -C

I am very interested in the below problem and Christopher's wisdom about 
how he fixed it...

Please correct me if I am misunderstanding: It sounds like you guys are 
saying the first file access into a volume automounted by autofs can 
fail.  If the first file access is performed by the condor_starter in 
order to create/write the job error or log files, the job ends up on 
hold.  Sounds like Christopher worked around this by having a shell 
script run in advance of the job (somehow) which touches a file in the 
directory... this touch operation may fail just like it does when the 
condor_starter is trying to set up the error/log files, but nobody cares 
because the whole point of the touch operation was just to kick autofs 
into action.  Do I have it right?
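
To make sure we are talking about the same thing, here is a minimal 
sketch of the kind of "poke" I am picturing.  This is not Christopher's 
script (he said he used a shell script, which I have not seen); the 
function name wake_automount, the retry count, and the delay are all 
made up for illustration.  It simply tries to list the directory, 
ignores any failure, and tries again a few times:

```python
import os
import time


def wake_automount(directory, attempts=5, delay=1.0):
    """Poke `directory` until it can be listed, or give up.

    Hypothetical helper: returns True once a listing succeeds and
    False if every attempt failed.
    """
    for attempt in range(attempts):
        try:
            # The first access is what prompts autofs to mount the volume.
            # A failure here is expected and deliberately ignored.
            os.listdir(directory)
            return True
        except OSError:
            # Wait a little before retrying, except after the last attempt.
            if attempt < attempts - 1:
                time.sleep(delay)
    return False
```

Something would have to call this with the directory that holds the 
error/log files before the files themselves are opened; how exactly that 
gets hooked in ahead of the job is the part I am unsure about.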

ps question: is the autofs mount a 'hard' mount (i.e. I/O to glusterfs 
should block until it is performed successfully) or a 'soft' mount (i.e. 
I/O to glusterfs will not block, but could instead quickly return an 
error) ?
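
If it helps to answer that, below is a rough sketch for reading the 
options off a mount table.  It assumes text in the /proc/mounts layout 
(device, mount point, filesystem type, comma-separated options, ...); 
the function name mount_behaviour is made up.  Note that 'hard' and 
'soft' are NFS-style options, so a FUSE mount may well list neither, in 
which case the sketch just reports "unspecified":

```python
def mount_behaviour(mounts_text, mount_point):
    """Report 'hard', 'soft', or 'unspecified' for a mount point.

    Hypothetical helper: `mounts_text` is text in /proc/mounts layout.
    Returns None if `mount_point` does not appear in it.
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        # fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 4 and fields[1] == mount_point:
            options = fields[3].split(",")
            if "soft" in options:
                return "soft"
            if "hard" in options:
                return "hard"
            return "unspecified"
    return None
```

On a Linux execute node this could be fed the contents of /proc/mounts 
after the volume has been mounted, e.g. 
mount_behaviour(open("/proc/mounts").read(), "/path/to/volume"), where 
the path is a placeholder for the real mount point.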

Thanks guys
Todd



> 
> On 8/29/19 5:26 PM, João Baúto wrote:
>> Hi,
>>
>> Some of my users keep getting their jobs put on hold due to problems 
>> with the initialization of the error and logs files. They are setting 
>> the path of these files to the network filesystem (glusterfs mount via 
>> fuse).
>>
>> The only way to fix this is to force an ls on the target directory and 
>> then run condor_release.
>>
>> Any ideas on why this is happening?
>>
>> I'm running HTCondor v.8.8.4 on CentOS 7.6.
>>
>> Thanks!
>> *João Baúto*
>> ---------------
>> *Scientific Computing and Software Platform*
>> Champalimaud Research
>> Champalimaud Center for the Unknown
>> Av. Brasília, Doca de Pedrouços
>> 1400-038 Lisbon, Portugal
>> fchampalimaud.org <https://www.fchampalimaud.org/>
>>
>> _______________________________________________
>> HTCondor-users mailing list
>> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
>> subject: Unsubscribe
>> You can also unsubscribe by visiting
>> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>>
>> The archives can be found at:
>> https://lists.cs.wisc.edu/archive/htcondor-users/
> 
> 
> -- 
> 
> 
> Christopher Harrison
> Systems Engineer
> Department of Biostatistics & Medical Informatics
> University of Wisconsin School of Medicine and Public Health
> Office 240 Warf
> 610 Walnut Street
> Madison, WI 53726
> 608.3476.6967
> 
> 


-- 
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685