
Re: [HTCondor-users] condor_off broken?



On 11/27/2013 9:29 AM, Zachary Miller wrote:
On Wed, Nov 27, 2013 at 01:05:40PM +0100, Pek Daniel wrote:
So now I have some more information:

The condor_off command and friends don't work if the machine's hostname is set
to condorworker02. It has to be set to condorworker02.domain.tld.

The question: why is that?
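
A quick way to see which form of the name a machine reports is sketched below in Python. This is only an illustration: `is_fully_qualified` and `describe_local_host` are hypothetical helper names, and the idea that the name a daemon advertises follows the OS hostname is an assumption based on the observation above, not something taken from the HTCondor code.

```python
import socket


def is_fully_qualified(hostname):
    """Return True if the name contains a domain part (a dot inside it)."""
    return "." in hostname.strip(".")


def describe_local_host():
    """Report the local hostname and the FQDN the resolver gives for it."""
    short = socket.gethostname()   # what the OS reports as the hostname
    fqdn = socket.getfqdn()        # what name resolution returns for it
    return {
        "hostname": short,
        "fqdn": fqdn,
        "hostname_is_fqdn": is_fully_qualified(short),
    }
```

Calling `describe_local_host()` on a worker shows whether the hostname itself is short (as on condorworker02) or already carries the domain (as on lxbrb1815.domain.tld).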

In my *opinion*, this should work.

But clearly, it does not.  I will need to investigate the code, but my general
feeling is that at some point, the tool (condor_off in this case) gets clever
and "promotes" the short host name to the long host name.  Then, as you can see
in the collector, it doesn't match, and you get your "Daemon not found" error.
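
To make the suspected mismatch concrete, here is a small Python sketch. It is not HTCondor code: `promote_to_fqdn`, `find_master`, `DEFAULT_DOMAIN`, and the in-memory `collector_names` list are hypothetical stand-ins, and the exact-match lookup is an assumption that merely reproduces the behavior shown in the output quoted later in this thread.

```python
# Hypothetical illustration only; none of these names come from HTCondor.

DEFAULT_DOMAIN = "domain.tld"  # assumed default domain, as used in the thread


def promote_to_fqdn(name, domain=DEFAULT_DOMAIN):
    """Append the default domain when the name has no dot in it."""
    return name if "." in name else f"{name}.{domain}"


def find_master(requested, collector_names):
    """Promote the requested name, then look for an exact match."""
    target = promote_to_fqdn(requested)
    return target if target in collector_names else None


# Names as advertised in the quoted "condor_status -master" output.
collector_names = [
    "condormaster1",
    "condormaster2",
    "condorworker02",
    "lxbrb1815.domain.tld",
]

print(find_master("lxbrb1815", collector_names))       # lxbrb1815.domain.tld
print(find_master("condorworker02", collector_names))  # None
```

Under these assumptions, the short name of the physical node is promoted and found, while the VM that advertises only its short name is promoted to a name the collector has never seen.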


This is a known issue that should be improved in the code. For related info/background see

  1. https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=3694
  (in particular, remark by "tannenba", the second remark on the ticket),


  2. https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=636

regards,
Todd


Cheers,
-zach




2013/11/26 Pek Daniel <pekdaniel@xxxxxxxxx>

     OK, here is the problem in a bit more detail:

     I'm using this version:
     [root@lxbrb1815 ~]# condor_version
     $CondorVersion: 8.1.2 Oct 19 2013 BuildID: 189797 $
     $CondorPlatform: x86_64_RedHat5 $

     Here's a snippet from condor_status -master output:
     [root@condormaster1 ~]# condor_status -master
     Name

     condormaster1
     condormaster2
     condorworker02
     lxbrb1815.domain.tld
     ...

     I have physical nodes and VMs as startd nodes. Physical nodes have more
     than one core, and so more than one job slot, while VMs have only one core.

     Here's a snippet from condor_status -startd | head:
     Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

     condorworker02     LINUX      X86_64 Claimed   Busy      0.000  490  0+00:03:13
     slot1@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.060 1991  0+00:11:51
     slot2@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000 1991  0+00:12:13
     ...

     As you can see, condorworker02 is a VM, while lxbrb1815.domain.tld is a
     physical node with a lot of cores. And that's the only difference. The
     config file is exactly the same for both cases, and the condor version as
     well.

     Now, my questions:
     - Why do I see slotID@xxxxxxxxxxxxxxxxxxxxxx for physical nodes and
     just the hostname for VMs?
     - Why can't I query the status of a VM, when it works for a
     physical node:

     [root@condormaster1 ~]# condor_status -startd lxbrb1815
     Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

     slot1@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.060 1991  0+00:11:51
     slot2@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000 1991  0+00:12:13
     slot3@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000 1991  0+00:12:14
     slot4@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000 1991  0+00:12:15
     slot5@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000 1991  0+00:12:16
     slot6@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000 1991  0+00:12:17
     slot7@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000 1991  0+00:12:18
     slot8@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000 1991  0+00:12:11
                          Total Owner Claimed Unclaimed Matched Preempting Backfill

             X86_64/LINUX     8     0       0         8       0          0        0

                    Total     8     0       0         8       0          0        0
     [root@condormaster1 ~]# condor_status -startd lxbrb1815.domain.tld
     Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

     slot1@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.060 1991  0+00:11:51
     slot2@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000 1991  0+00:12:13
     slot3@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000 1991  0+00:12:14
     slot4@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000 1991  0+00:12:15
     slot5@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000 1991  0+00:12:16
     slot6@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000 1991  0+00:12:17
     slot7@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000 1991  0+00:12:18
     slot8@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000 1991  0+00:12:11
                          Total Owner Claimed Unclaimed Matched Preempting Backfill

             X86_64/LINUX     8     0       0         8       0          0        0

                    Total     8     0       0         8       0          0        0

     [root@condormaster1 ~]# condor_status -startd condorworker02
     [root@condormaster1 ~]# condor_status -startd condorworker02.domain.tld
     [root@condormaster1 ~]#

     - Why can't I send the condor_off command to VMs, when it works fine for
     physical nodes:
     [root@condormaster1 ~]# condor_off -startd lxbrb1815
     Sent "Kill-Daemon" command for "startd" to master lxbrb1815.domain.tld

     [root@condormaster1 ~]# condor_off -startd condorworker02
     Can't find address for master condorworker02.domain.tld
     Perhaps you need to query another pool.

     Thanks,
     Daniel



     2013/11/26 Zachary Miller <zmiller@xxxxxxxxxxx>

         On Tue, Nov 26, 2013 at 11:37:48AM +0100, Pek Daniel wrote:
         > I'm trying to "deactivate" some startd machines:
         > [root@cm1 ~]# condor_status
         > Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime
         >
         > condorworker01     LINUX      X86_64 Unclaimed Idle      0.000 2006  5+16:16:41
         > condorworker03     LINUX      X86_64 Unclaimed Idle      0.000  490  0+00:21:47
         > slot1@lxbrl2305    LINUX      X86_64 Unclaimed Idle      1.000 1991  4+18:20:46
         > slot2@lxbrl2305    LINUX      X86_64 Unclaimed Idle      1.000 1991  4+18:21:07
         > slot3@lxbrl2305    LINUX      X86_64 Unclaimed Idle      1.000 1991  4+18:21:08
         > slot4@lxbrl2305    LINUX      X86_64 Unclaimed Idle      1.000 1991  4+18:21:09
         > slot5@lxbrl2305    LINUX      X86_64 Unclaimed Idle      1.000 1991  4+18:21:10
         > slot6@lxbrl2305    LINUX      X86_64 Unclaimed Idle      0.960 1991  4+18:21:11
         > slot7@lxbrl2305    LINUX      X86_64 Unclaimed Idle      0.000 1991  4+18:21:12
         > slot8@lxbrl2305    LINUX      X86_64 Unclaimed Idle      0.000 1991  4+18:21:05
         >                      Total Owner Claimed Unclaimed Matched Preempting Backfill
         >
         >         X86_64/LINUX    10     0       0        10       0          0        0
         >
         >                Total    10     0       0        10       0          0        0
         >
         > [root@condormaster1 ~]# condor_off -startd -graceful condorworker01
         > Can't find address for master condorworker01

         Hmmm.  What does "condor_status -master" have to say?


         Cheers,
         -zach

         _______________________________________________
         HTCondor-users mailing list
         To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx
         with a
         subject: Unsubscribe
         You can also unsubscribe by visiting
         https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

         The archives can be found at:
         https://lists.cs.wisc.edu/archive/htcondor-users/








--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685