
Re: [HTCondor-users] Network filesystem failed to initialize logs



Initially, we had the mount set up with autofs and NFS, but noticed that after submitting a job that touched the mount, HTCondor would put the job on hold before autofs had a chance to mount it.

Now we use a standard mount via glusterfs fuse with pretty much default options. The only option that we use is direct-io-mode=false. What is strange is that the logs/error files fail to initialize but the empty files from transfer_output_files are created correctly (maybe they are created by different condor daemons at different stages).

João Baúto
---------------
Scientific Computing and Software Platform
Champalimaud Research
Champalimaud Center for the Unknown
Av. Brasília, Doca de Pedrouços
1400-038 Lisbon, Portugal

fchampalimaud.org


Michael Pelletier <Michael.V.Pelletier@xxxxxxxxxxxx> wrote on Thursday, 29/08/2019 at 21:22:
For autofs on RHEL7 with systemd, I've seen situations where HTCondor will start up and start accepting jobs before autofs has finished starting, so I added a startd_cron job that checks for the existence of the automount daemon process and publishes the result as a slot attribute. That attribute is added to the START expression, so the slot won't accept jobs unless the daemon is running.

It sounds like this may not be quite the crux of the problem in this case, but perhaps something along the same lines could be devised to address it.
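
A minimal sketch of what such a startd_cron probe might look like, written in Python purely for illustration. The attribute name AutomountRunning and the approach of scanning /proc for a process named "automount" are assumptions, not the actual script described above. The script prints a single "Attribute = Value" line on stdout, which is the form a startd cron job uses to publish attributes into the slot ad.

```python
#!/usr/bin/env python3
"""Hypothetical startd cron probe: report whether the automount daemon is running."""
import os


def process_running(name, proc_root="/proc"):
    """Return True if any process listed under proc_root has the given command name."""
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return False
    for entry in entries:
        # Only numeric directory names under /proc correspond to processes.
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "comm")) as handle:
                if handle.read().strip() == name:
                    return True
        except OSError:
            # The process exited between listdir() and open(); skip it.
            continue
    return False


def classad_line(attr, value):
    """Format a boolean as a ClassAd attribute assignment line."""
    return "{} = {}".format(attr, "True" if value else "False")


if __name__ == "__main__":
    # AutomountRunning is an illustrative attribute name, not an HTCondor built-in.
    print(classad_line("AutomountRunning", process_running("automount")))
```

A script along these lines would presumably be registered through STARTD_CRON_JOBLIST together with the matching STARTD_CRON_<JobName>_EXECUTABLE and STARTD_CRON_<JobName>_PERIOD settings, with the START expression then ANDed against the published attribute. The exact configuration depends on the site.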

Michael V. Pelletier
Information Technology
Digital Transformation & Innovation
Integrated Defense Systems
Raytheon Company

-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Todd Tannenbaum
Sent: Thursday, August 29, 2019 1:34 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [External] Re: [HTCondor-users] Network filesystem failed to initialize logs

On 8/29/2019 11:55 AM, Christopher Harrison via HTCondor-users wrote:
> Are you using autofs to mount the glusterfs fuse mount?  If so, my
> guess is you have a race condition whereby autofs is not mounting
> before the condor jobs show up.  This happened to us a lot too (we
> use autofs).  The way we got around this is by applying a
> precondition to the job (through a shell script) to touch a file in the directory.
>
> I hope this helps,
>     -C

I am very interested in the below problem and Christopher's wisdom about how he fixed it...

Please correct me if I am misunderstanding: it sounds like you guys are saying the first file access into a volume automounted by autofs can fail. If the first file access is performed by the condor_starter in order to create/write the job error or log files, the job ends up on hold. Sounds like Christopher worked around this by having a shell script run in advance of the job (somehow) which touches a file in the directory... this touch operation may fail just like it does when the condor_starter is trying to set up the error/log files, but nobody cares because the whole point of the touch operation was just to kick autofs into action.  Do I have it right?
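
To make the idea concrete, here is a rough sketch of that kind of pre-job poke, in Python rather than the shell script Christopher mentions. Everything in it is an assumption for illustration: the function name poke_directory, the retry count, and the delay. It lists the directory instead of touching a file, on the theory that any access is enough to wake the automounter, and it deliberately ignores the early failures.

```python
#!/usr/bin/env python3
"""Hypothetical pre-job step: access a directory so the automounter mounts it."""
import os
import sys
import time


def poke_directory(path, attempts=5, delay=1.0):
    """Try to list path until it succeeds; return True once it is accessible.

    Early failures are expected and ignored, because the only purpose of
    the access is to trigger the mount.
    """
    for attempt in range(attempts):
        try:
            os.listdir(path)
            return True
        except OSError:
            # Wait before retrying, but not after the final attempt.
            if attempt < attempts - 1:
                time.sleep(delay)
    return False


if __name__ == "__main__":
    # The directory comes from the command line; "." is only a demonstration fallback.
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    print("accessible" if poke_directory(target) else "not accessible")
```

How a step like this gets run ahead of the starter creating the error/log files is the open question in this thread; the sketch only shows the access-and-ignore-failure part.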

P.S. question: is the autofs mount a 'hard' mount (i.e. I/O to glusterfs should block until it is performed successfully) or a 'soft' mount (i.e. I/O to glusterfs will not block, but could instead quickly return an error)?

Thanks guys
Todd



>
> On 8/29/19 5:26 PM, João Baúto wrote:
>> Hi,
>>
>> Some of my users keep getting their jobs put on hold due to problems
>> with the initialization of the error and logs files. They are setting
>> the path of these files to the network filesystem (glusterfs mount
>> via fuse).
>>
>> The only way to fix this is to force an ls on the target directory
>> and then run condor_release.
>>
>> Any ideas on why this is happening?
>>
>> I'm running HTCondor v.8.8.4 on CentOS 7.6.
>>
>> Thanks!
>> *João Baúto*
>> ---------------
>> *Scientific Computing and Software Platform*
>> Champalimaud Research
>> Champalimaud Center for the Unknown
>> Av. Brasília, Doca de Pedrouços
>> 1400-038 Lisbon, Portugal
>> fchampalimaud.org <https://www.fchampalimaud.org/>
>>
>> _______________________________________________
>> HTCondor-users mailing list
>> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx
>> with a
>> subject: Unsubscribe
>> You can also unsubscribe by visiting
>> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>>
>> The archives can be found at:
>> https://lists.cs.wisc.edu/archive/htcondor-users/
>
>
> --
>
>
> Christopher Harrison
> Systems Engineer
> Department of Biostatistics & Medical Informatics University of
> Wisconsin School of Medicine and Public Health Office 240 Warf
> 610 Walnut Street
> Madison, WI 53726
> 608.3476.6967
>


--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/
