
Re: [HTCondor-users] schedds not returning the jobs correctly



Hi Marco,

The name "D_FULLDEBUG" is perhaps misleading.  To actually see everything in the debug output, use "D_ALL:2" instead when you run condor_q.

(You will get a lot of output, and feel free to send it to me offline so I can take a look at why you aren't seeing any jobs.)
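For reference, here is a minimal sketch of one way to capture that output in a file. It assumes the same _CONDOR_TOOL_DEBUG / -debug invocation used in the original message, with only the debug level swapped; the function name and the default file name are made up for illustration.

```shell
#!/bin/sh
# Sketch only: save the verbose condor_q debug output to a file so it can be
# sent offline. capture_condor_q_debug and condor_q_debug.log are
# hypothetical names, not part of HTCondor.
capture_condor_q_debug() {
    out_file="${1:-condor_q_debug.log}"
    # D_ALL:2 replaces D_FULLDEBUG from the original command line.
    # Both stdout and stderr are redirected so the debug messages and the
    # normal condor_q output end up in the same file.
    _CONDOR_TOOL_DEBUG="D_ALL:2" condor_q -debug -g > "$out_file" 2>&1
}
```

Calling "capture_condor_q_debug /tmp/condor_q_debug.log" on the factory host should leave everything in that one file.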


Cheers,
-zach


On 2/22/19, 11:47 AM, "HTCondor-users on behalf of Marco Mambelli" <htcondor-users-bounces@xxxxxxxxxxx on behalf of marcom@xxxxxxxx> wrote:

    The schedds on a GlideinWMS factory seem not to work correctly:
    - there are jobs running and queued and they are visible via condor_status -schedd
    - condor_q -g returns nothing, not even "All queues are empty"
    
    
    $ condor_q -g -xml
    <?xml version="1.0"?>
    <!DOCTYPE classads SYSTEM "classads.dtd">
    <classads>
    </classads>
    <?xml version="1.0"?>
    <!DOCTYPE classads SYSTEM "classads.dtd">
    <classads>
    </classads>
    <?xml version="1.0"?>
    <!DOCTYPE classads SYSTEM "classads.dtd">
    <classads>
    </classads>
    <?xml version="1.0"?>
    <!DOCTYPE classads SYSTEM "classads.dtd">
    <classads>
    </classads>
    <?xml version="1.0"?>
    <!DOCTYPE classads SYSTEM "classads.dtd">
    <classads>
    </classads>
    
    $ condor_q -g
    
    $ condor_status -schedd
    Name                                      Machine                  RunningJobs   IdleJobs   HeldJobs
    
    cmsgwms-factory.fnal.gov                  cmsgwms-factory.fnal.gov        1174       1001          0
    schedd_glideins2@myhost                   myhost                          1635        831         22
    schedd_glideins3@myhost                   myhost                           285        812         22
    schedd_glideins4@myhost                   myhost                          1794        997          1
    schedd_glideins5@myhost                   myhost                          2007       1074          8
    
                          TotalRunningJobs      TotalIdleJobs      TotalHeldJobs
    
    
                   Total              6895               4715                 53
    
    
    [I replaced the hostname w/ "myhost" here, it was correct]
    $ condor_q -version
    $CondorVersion: 8.6.11 May 10 2018 BuildID: 440910 $
    $CondorPlatform: x86_64_RedHat7 $
    
    The schedd logs are all unusually flat: a bunch of "Number of Active Workers 0" lines (rarely w/ N<>0) and one strange line,
    "Can't find address for startd myhost".
    There is no startd on the factory host; it is not in the daemon list.
    
    02/22/19 11:20:18 (pid:41978) Number of Active Workers 0
    02/22/19 11:20:19 (pid:41978) Number of Active Workers 0
    02/22/19 11:20:19 (pid:41978) TransferQueueManager stats: active up=0/100 down=0/100; waiting up=0 down=0; wait time up=0s down=0s
    02/22/19 11:20:19 (pid:41978) TransferQueueManager upload 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
    02/22/19 11:20:19 (pid:41978) TransferQueueManager download 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
    02/22/19 11:20:19 (pid:41978) Started condor_gmanager for owner cmsglobal_1 pid=1423401
    02/22/19 11:20:19 (pid:41978) Can't find address for startd myhost
    02/22/19 11:20:20 (pid:41978) Number of Active Workers 0
    
    
    Something is wrong but I cannot understand what.
    
    condor_config_val seems to return the correct spool and address files; I also tried querying directly w/ -address.
    D_FULLDEBUG is not much help:
    
    $ _CONDOR_TOOL_DEBUG="D_FULLDEBUG" condor_q -debug -g
    02/22/19 11:39:51 Result of reading /etc/issue:  \S
    
    02/22/19 11:39:51 Result of reading /etc/redhat-release:  Scientific Linux release 7.5 (Nitrogen)
    
    02/22/19 11:39:51 Using IDs: 16 processors, 8 CPUs, 8 HTs
    02/22/19 11:39:51 Reading condor configuration from '/etc/condor/condor_config'
    02/22/19 11:39:51 Enumerating interfaces: lo 127.0.0.1 up
    02/22/19 11:39:51 Enumerating interfaces: eth2 131.225.X.X up
    02/22/19 11:39:51 WARNING: Config source is empty: /etc/condor/config.d/90_condor_test
    02/22/19 11:39:51 Will use TCP to update collector cmsgwms-factory.fnal.gov <131.225.X.X:9618>
    02/22/19 11:39:51 Trying to query collector <131.225.X.X:9618>
    02/22/19 11:39:51 SharedPortClient: sent connection request to schedd at <131.225.X.X:9615> for shared port id 41892_018c_9
    02/22/19 11:39:51 Sent classad to schedd
    02/22/19 11:39:51 Got classad from schedd.
    02/22/19 11:39:51 Ad was last one from schedd.
    02/22/19 11:39:51 SharedPortClient: sent connection request to schedd at <131.225.X.X:9615> for shared port id 41892_018c_5
    02/22/19 11:39:51 Sent classad to schedd
    02/22/19 11:39:51 Got classad from schedd.
    02/22/19 11:39:51 Ad was last one from schedd.
    02/22/19 11:39:51 SharedPortClient: sent connection request to schedd at <131.225.X.X:9615> for shared port id 41892_018c_6
    02/22/19 11:39:51 Sent classad to schedd
    02/22/19 11:39:51 Got classad from schedd.
    02/22/19 11:39:51 Ad was last one from schedd.
    02/22/19 11:39:51 SharedPortClient: sent connection request to schedd at <131.225.X.X:9615> for shared port id 41892_018c_7
    02/22/19 11:39:51 Sent classad to schedd
    02/22/19 11:39:52 Got classad from schedd.
    02/22/19 11:39:52 Ad was last one from schedd.
    02/22/19 11:39:52 SharedPortClient: sent connection request to schedd at <131.225.X.X:9615> for shared port id 41892_018c_8
    02/22/19 11:39:52 Sent classad to schedd
    02/22/19 11:39:52 Got classad from schedd.
    02/22/19 11:39:52 Ad was last one from schedd.
    
    $ _CONDOR_TOOL_DEBUG="D_FULLDEBUG" condor_q -debug -g -xml
    02/22/19 11:42:45 Result of reading /etc/issue:  \S
    
    02/22/19 11:42:45 Result of reading /etc/redhat-release:  Scientific Linux release 7.5 (Nitrogen)
    
    02/22/19 11:42:45 Using IDs: 16 processors, 8 CPUs, 8 HTs
    02/22/19 11:42:45 Reading condor configuration from '/etc/condor/condor_config'
    02/22/19 11:42:45 Enumerating interfaces: lo 127.0.0.1 up
    02/22/19 11:42:45 Enumerating interfaces: eth2 131.225.X.X up
    02/22/19 11:42:45 WARNING: Config source is empty: /etc/condor/config.d/90_condor_test
    02/22/19 11:42:45 Will use TCP to update collector cmsgwms-factory.fnal.gov <131.225.X.X:9618>
    02/22/19 11:42:45 Trying to query collector <131.225.X.X:9618>
    02/22/19 11:42:45 SharedPortClient: sent connection request to schedd at <131.225.X.X:9615> for shared port id 41892_018c_9
    02/22/19 11:42:45 Sent classad to schedd
    02/22/19 11:42:45 Got classad from schedd.
    02/22/19 11:42:45 Ad was last one from schedd.
    <?xml version="1.0"?>
    <!DOCTYPE classads SYSTEM "classads.dtd">
    <classads>
    </classads>
    02/22/19 11:42:45 SharedPortClient: sent connection request to schedd at <131.225.X.X:9615> for shared port id 41892_018c_5
    02/22/19 11:42:45 Sent classad to schedd
    02/22/19 11:42:45 Got classad from schedd.
    02/22/19 11:42:45 Ad was last one from schedd.
    <?xml version="1.0"?>
    <!DOCTYPE classads SYSTEM "classads.dtd">
    <classads>
    </classads>
    02/22/19 11:42:45 SharedPortClient: sent connection request to schedd at <131.225.X.X:9615> for shared port id 41892_018c_6
    02/22/19 11:42:45 Sent classad to schedd
    02/22/19 11:42:45 Got classad from schedd.
    02/22/19 11:42:45 Ad was last one from schedd.
    <?xml version="1.0"?>
    <!DOCTYPE classads SYSTEM "classads.dtd">
    <classads>
    </classads>
    02/22/19 11:42:45 SharedPortClient: sent connection request to schedd at <131.225.X.X:9615> for shared port id 41892_018c_7
    02/22/19 11:42:45 Sent classad to schedd
    02/22/19 11:42:45 Got classad from schedd.
    02/22/19 11:42:45 Ad was last one from schedd.
    <?xml version="1.0"?>
    <!DOCTYPE classads SYSTEM "classads.dtd">
    <classads>
    </classads>
    02/22/19 11:42:45 SharedPortClient: sent connection request to schedd at <131.225.X.X:9615> for shared port id 41892_018c_8
    02/22/19 11:42:45 Sent classad to schedd
    02/22/19 11:42:45 Got classad from schedd.
    02/22/19 11:42:45 Ad was last one from schedd.
    <?xml version="1.0"?>
    <!DOCTYPE classads SYSTEM "classads.dtd">
    <classads>
    </classads>
    
    Krista also saw:
    ERROR "Assertion ERROR on (!(pjr->flags & 0x0004))" at line 4319 in file /slots/16/dir_3109781/userdir/.tmpVhmMVH/BUILD/condor-8.6.11/src/condor_q.V6/queue.cpp
    
    
    Any suggestion about what is wrong?
    Thank you,
    Marco
    
    