
Re: [Condor-users] watchdog pipe file missing



Fernando Rannou wrote:
Thanks Greg

but then, what should I do to create the file
in the meantime?

Fernando

I'm pretty sure that restarting the StartD (condor_restart -startd) on each machine that is missing the file should do the trick.

Greg
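
Before restarting, it can help to confirm which nodes are actually missing the pipes. The following is a minimal sketch, not part of Condor: the helper name `missing_procd_pipes` is hypothetical, and the two file names are taken from the directory listing quoted further down. It reports which of the expected names are absent or are not named pipes (FIFOs) in a given directory.

```python
import os
import stat

# Pipe names as they appear in the directory listing quoted in this thread.
EXPECTED_PIPES = ("procd_pipe.STARTD", "procd_pipe.STARTD.watchdog")


def missing_procd_pipes(directory, names=EXPECTED_PIPES):
    """Return the names that are absent from `directory` or are not FIFOs."""
    missing = []
    for name in names:
        path = os.path.join(directory, name)
        try:
            mode = os.stat(path).st_mode
        except OSError:
            # Covers "No such file or directory" as well as permission errors.
            missing.append(name)
            continue
        if not stat.S_ISFIFO(mode):
            # Something exists under that name, but it is not a named pipe.
            missing.append(name)
    return missing
```

For example, `missing_procd_pipes("/home/condor/hosts/wolf10/log")` run on the affected node would be expected to list the watchdog pipe; an empty list means both pipes are present.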

On Wed, Jan 28, 2009 at 4:53 PM, Greg Quinn <gquinn@xxxxxxxxxxx> wrote:

    Fernando,

    The "watchdog" pipe is created by the ProcD when it starts up, and is
    only ever deleted by Condor when the ProcD shuts down.

    Is it possible that something outside of Condor is deleting the pipe? We
    have seen problems like this before with programs like tmpwatch
    (although I guess it's doubtful that tmpwatch is running over your
    /home/condor/hosts/wolf10/log/ directory).

    Come to think of it, /home/condor/hosts/wolf10/log sounds like it could
    be on NFS. It's perfectly fine to have your LOG directory on NFS, but in
    that case you need a separate LOCK directory on local disk, since that
    is where things like the ProcD's pipes are stored. Please make sure that
    your LOCK setting refers to a local directory.

    Thanks,

    Greg Quinn
    Condor Team
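
One way to check the advice above about keeping LOCK on a local directory is to look up the filesystem type of the mount that contains it. The following is a rough sketch under stated assumptions, not a Condor tool: it assumes a Linux node where the mount table is available in the `/proc/mounts` format, the helper names `fs_type_for` and `is_network_fs` are hypothetical, and the set of network filesystem types is illustrative rather than exhaustive.

```python
import os

# Illustrative, not exhaustive: filesystem types treated as "not local".
NETWORK_FS_TYPES = {"nfs", "nfs4", "cifs", "smbfs", "afs", "lustre", "gpfs"}


def fs_type_for(path, mounts_text):
    """Return the filesystem type of the longest mount point containing `path`.

    `mounts_text` is the content of a mount table in /proc/mounts format:
    device, mount point, and filesystem type as the first three fields.
    """
    path = os.path.abspath(path)
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        # Compare with a trailing slash so /home does not match /homework.
        prefix = mount_point.rstrip("/") + "/"
        if (path + "/").startswith(prefix) and len(mount_point) >= len(best_mount):
            best_mount, best_type = mount_point, fs_type
    return best_type


def is_network_fs(path, mounts_text):
    """True if `path` sits on one of the network filesystem types above."""
    return fs_type_for(path, mounts_text) in NETWORK_FS_TYPES


# Possible use on a node (reads the live mount table):
#   with open("/proc/mounts") as f:
#       print(is_network_fs("/home/condor/hosts/wolf10/log", f.read()))
```

The directory to test is whatever LOCK resolves to on that node, which `condor_config_val LOCK` prints. If the check returns True for it, the pipes are being created on a network filesystem.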

    Fernando Rannou wrote:
     > Hello,
     > I'm getting the following error in one of the StarterLog files:
     > ------------------------
     > 1/28 11:20:04 About to exec /home/mpetct/sampproc --universal
     > 1/28 11:20:04 error opening watchdog pipe /home/condor/hosts/wolf10/log/procd_pipe.STARTD.watchdog: No such file or directory (2)
     > 1/28 11:20:04 ProcFamilyClient: error initializing LocalClient
     > 1/28 11:20:04 ProcFamilyProxy: error initializing ProcFamilyClient
     > 1/28 11:20:04 ERROR "ProcD has failed" at line 599 in file proc_family_proxy.C
     > 1/28 11:20:04 ShutdownFast all jobs.
     > --------------------------
     > Clearly the "pipe" files are not there. What should I do?
     > We restarted condor on all nodes but the files did not appear.
     >
     > This has happened on a couple of nodes. All other nodes do have the
     > watchdog file:
     >
     > prw-rw----    1 root     isl             0 Nov  4 16:08 procd_pipe.STARTD
     > prw-rw----    1 root     isl             0 Nov  4 16:08 procd_pipe.STARTD.watchdog
     > -
     > Thanks
     >
     > Fernando
    _______________________________________________
    Condor-users mailing list
    To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
    subject: Unsubscribe
    You can also unsubscribe by visiting
    https://lists.cs.wisc.edu/mailman/listinfo/condor-users

    The archives can be found at:
    https://lists.cs.wisc.edu/archive/condor-users/


