
Re: [HTCondor-users] core.STARTER generated



Todd,

Thank you for your reply. Answering your questions / thoughts:

I should have said that all of my new machines are VMs running true RHEL 7, and they are on a network isolated from the one where the machines with my old condor 7.8.1 on RHEL 6 live.
The only relation is that I carried over some of my custom START / PREEMPT settings from my old condor 7.8.1 config file into my new condor 8.6.8 environment.

Everything below here is about my new RHEL 7 machines on my new network.
All machines have the same binaries, but for now, I am only running condor on one machine, the central manager, and I submit from that same machine.

I am now using SEC_DEFAULT_NEGOTIATION=REQUIRED, confirmed with condor_config_val

I ran with STARTER_DEBUG = D_ALL in my config.

I am sending a link to the developers at htcondor-admin@xxxxxxxxxxx where they can download a tar file with the following contents:

[root@njrarltapp001a8 log]# tar tzvf files.with.core.STARTER.tar.gz
-rw-r--r-- condor/condor   9840 2017-12-21 21:42 CollectorLog
-rw-r--r-- root/root         52 2017-12-21 21:38 redhat-release
-rw------- root/root     774144 2017-12-21 21:41 core.STARTER
-rw-r--r-- root/root        617 2017-12-21 21:41 KernelTuning.log
-rw-r--r-- condor/condor   3750 2017-12-21 21:42 MasterLog
-rw-r--r-- condor/condor  20468 2017-12-21 21:42 NegotiatorLog
-rw-r--r-- condor/condor    510 2017-12-21 21:41 ScheddRestartReport
-rw-r--r-- condor/condor   2618 2017-12-21 21:41 ShadowLog
-rw-r--r-- condor/condor   3552 2017-12-21 21:41 startd_history
-rw-r--r-- condor/condor  51588 2017-12-21 21:41 StarterLog.slot1
-rw-r--r-- condor/condor    747 2017-12-21 21:41 XferStatsLog
-rw-r--r-- root/root      36179 2017-12-21 21:42 condor_config_val-dump.out
-rw-r--r-- condor/condor    278 2017-12-21 21:41 MatchLog
-rw-r--r-- root/root      36930 2017-12-21 21:42 ProcLog
-rw-r--r-- condor/condor   5528 2017-12-21 21:42 SchedLog
-rw-r--r-- condor/condor   2630 2017-12-21 21:42 SharedPortLog
-rw-r--r-- condor/condor   2883 2017-12-21 21:41 StarterLog
-rw-r--r-- condor/condor 158531 2017-12-21 21:42 StartLog




On Thu, Dec 21, 2017 at 5:30 PM, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
On 12/21/2017 2:36 PM, Lee Mitchell wrote:
Hello All,

I am upgrading from

$CondorVersion: 7.8.1 Jun 08 2012 $
$CondorPlatform: x86_64_rhap_6.2 $

to

$CondorVersion: 8.6.8 Nov 13 2017 BuildID: 424045 $
$CondorPlatform: x86_64_RedHat7 $

I set all SEC_* knobs to NEVER; I have been relaxing security trying to get my test job (a shell script that calls sleep) to run.
I finally have it negotiating and matching, but when the job begins to start, a core.STARTER file is created in my log dir, with the output below in my StarterLog.slot1

By the way, previously this same job would run successfully after it sat in the queue for some time (20 minutes?), following a message in the SchedLog saying something like "Have not heard from Negotiator for a while, running local jobs..."

Any advice is greatly appreciated. Thx, Lee


Some quick thoughts (hunches?), kinda in the order they occurred to me:

1. You upgraded your worker node (condor_startd/condor_starter) to v8.6.8. What version is your submit node (condor_schedd) running? Is it still running v7.8? We try to maintain compatibility across HTCondor versions, but there is a very big gap between v7.8 and v8.6, so it would not surprise me if problems appear. Try your tests using a submit node also running v8.6.

2. I would not advise setting SEC_DEFAULT_NEGOTIATION=NEVER. Try leaving that one at the default, or do SEC_DEFAULT_NEGOTIATION=REQUIRED.

3. You are using binaries compiled for RHEL7... this is indeed running on a RHEL7 or Centos7 system, right?

4. If the above doesn't help, try putting STARTER_DEBUG = D_ALL in the condor_config on your worker node and run again. This time your StarterLog should contain a lot more messages which could help make it more obvious where things are going wrong.

5. If things aren't clarified by #4 above, you could make the core.STARTER file available on the internet and send a message to the developers at htcondor-admin@xxxxxxxxxxx (or at htcondor-support@xxxxxxxxxxx if you have a support contract) telling them how to pick it up.


best regards
Todd



12/21/17 15:22:39 (pid:22864) ******************************************************
12/21/17 15:22:39 (pid:22864) ** condor_starter (CONDOR_STARTER) STARTING UP
12/21/17 15:22:39 (pid:22864) ** /opt/condor/sbin/condor_starter
12/21/17 15:22:39 (pid:22864) ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
12/21/17 15:22:39 (pid:22864) ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
12/21/17 15:22:39 (pid:22864) ** $CondorVersion: 8.6.8 Nov 13 2017 BuildID: 424045 $
12/21/17 15:22:39 (pid:22864) ** $CondorPlatform: x86_64_RedHat7 $
12/21/17 15:22:39 (pid:22864) ** PID = 22864
12/21/17 15:22:39 (pid:22864) ** Log last touched 12/21 15:22:30
12/21/17 15:22:39 (pid:22864) ******************************************************
12/21/17 15:22:39 (pid:22864) Using config source: /opt/condor/etc/condor_config
12/21/17 15:22:39 (pid:22864) Using local config sources:
12/21/17 15:22:39 (pid:22864)    /opt/condor/local/condor_config.local
12/21/17 15:22:39 (pid:22864) config Macros = 165, Sorted = 164, StringBytes = 5335, TablesBytes = 5988
12/21/17 15:22:39 (pid:22864) CLASSAD_CACHING is OFF
12/21/17 15:22:39 (pid:22864) Daemon Log is logging: D_ALWAYS D_ERROR
12/21/17 15:22:39 (pid:22864) SharedPortEndpoint: waiting for connections to named socket 19607_a7b6_4
12/21/17 15:22:39 (pid:22864) DaemonCore: command socket at <10.245.9.29:9618?addrs=10.245.9.29-9618+[--1]-9618&noUDP&sock=19607_a7b6_4>
12/21/17 15:22:39 (pid:22864) DaemonCore: private command socket at <10.245.9.29:9618?addrs=10.245.9.29-9618+[--1]-9618&noUDP&sock=19607_a7b6_4>
12/21/17 15:22:39 (pid:22864) Communicating with shadow <10.245.9.29:9618?addrs=10.245.9.29-9618+[--1]-9618&noUDP&sock=19606_f356_4>
12/21/17 15:22:39 (pid:22864) Submitting machine is "njrarltapp001a8.mgmt.ams1907.com"
12/21/17 15:22:39 (pid:22864) setting the orig job name in starter
12/21/17 15:22:39 (pid:22864) setting the orig job iwd in starter
12/21/17 15:22:39 (pid:22864) Chirp config summary: IO false, Updates false, Delayed updates true.
12/21/17 15:22:39 (pid:22864) Initialized IO Proxy.
12/21/17 15:22:39 (pid:22864) Done setting resource limits
12/21/17 15:22:39 (pid:22864) File transfer completed successfully.
Stack dump for process 22864 at timestamp 1513887760 (14 frames)
/opt/condor/sbin/../lib/libcondor_utils_8_6_8.so(dprintf_dump_stack+0x72)[0x7fcd88c3cea2]
/opt/condor/sbin/../lib/libcondor_utils_8_6_8.so(_Z18linux_sig_coredumpi+0x24)[0x7fcd88dc74a4]
/lib64/libpthread.so.0(+0xf5e0)[0x7fcd873165e0]
/opt/condor/sbin/../lib/libcondor_utils_8_6_8.so(_ZNK17CondorVersionInfo19built_since_versionEiii+0x10)[0x7fcd88c94dd0]
condor_starter(REMOTE_CONDOR_dprintf_stats+0x39)[0x440629]
condor_starter(_ZN9JICShadow17transferCompletedEP12FileTransfer+0x13b)[0x4295fb]
/opt/condor/sbin/../lib/libcondor_utils_8_6_8.so(_ZN12FileTransfer6ReaperEP7Serviceii+0x1b8)[0x7fcd88c70a38]
/opt/condor/sbin/../lib/libcondor_utils_8_6_8.so(_ZN10DaemonCore10CallReaperEiPKcii+0x12d)[0x7fcd88da73bd]
/opt/condor/sbin/../lib/libcondor_utils_8_6_8.so(_ZN10DaemonCore17HandleProcessExitEii+0x1b9)[0x7fcd88da9ca9]
/opt/condor/sbin/../lib/libcondor_utils_8_6_8.so(_ZN10DaemonCore24HandleDC_SERVICEWAITPIDSEi+0x7c)[0x7fcd88da9e6c]
/opt/condor/sbin/../lib/libcondor_utils_8_6_8.so(_ZN10DaemonCore6DriverEv+0x6b2)[0x7fcd88daa552]
/opt/condor/sbin/../lib/libcondor_utils_8_6_8.so(_Z7dc_mainiPPc+0x13a4)[0x7fcd88dcab04]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fcd86f65c05]
condor_starter[0x422840]
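
For anyone reading a stack dump like the one above, a small script can pull the frames out of a StarterLog so they are easier to scan. This is only a rough sketch, not an HTCondor tool: the names parse_frames and frame_after_trampoline are made up for illustration, and the script assumes the frames follow the glibc backtrace layout seen above, module(symbol+0xoffset)[0xaddress]. The "frame after the libpthread entry" rule is a heuristic that assumes the first frames belong to the signal handler and that the libpthread entry is the signal trampoline.

```python
import re

# One frame in glibc backtrace style: module(symbol+0xoff)[0xaddr].
# The parenthesised part may lack a symbol ("(+0xf5e0)") or be absent entirely.
FRAME_RE = re.compile(
    r"^(?P<module>[^\s(\[]+)"
    r"(?:\((?P<symbol>[^+)]*)(?:\+(?P<offset>0x[0-9a-fA-F]+))?\))?"
    r"\[(?P<address>0x[0-9a-fA-F]+)\]$"
)


def parse_frames(lines):
    """Return a list of dicts, one per line that looks like a stack frame."""
    frames = []
    for line in lines:
        match = FRAME_RE.match(line.strip())
        if match is None:
            continue  # ordinary log line, not a frame
        frames.append({
            "module": match.group("module").rsplit("/", 1)[-1],
            "symbol": match.group("symbol") or None,
            "offset": match.group("offset"),
            "address": match.group("address"),
        })
    return frames


def frame_after_trampoline(frames):
    """Heuristic (an assumption, see above): return the frame that follows
    the first libpthread entry, or None if there is no such frame."""
    for index, frame in enumerate(frames[:-1]):
        if frame["module"].startswith("libpthread"):
            return frames[index + 1]
    return None


# A few lines copied from the log above, used as sample input.
SAMPLE = """\
12/21/17 15:22:39 (pid:22864) File transfer completed successfully.
Stack dump for process 22864 at timestamp 1513887760 (14 frames)
/opt/condor/sbin/../lib/libcondor_utils_8_6_8.so(_Z18linux_sig_coredumpi+0x24)[0x7fcd88dc74a4]
/lib64/libpthread.so.0(+0xf5e0)[0x7fcd873165e0]
/opt/condor/sbin/../lib/libcondor_utils_8_6_8.so(_ZNK17CondorVersionInfo19built_since_versionEiii+0x10)[0x7fcd88c94dd0]
condor_starter(REMOTE_CONDOR_dprintf_stats+0x39)[0x440629]
condor_starter[0x422840]
"""

for frame in parse_frames(SAMPLE.splitlines()):
    print(frame["module"], frame["symbol"], frame["offset"])
```

The symbols stay in their mangled C++ form; piping them through c++filt from binutils turns them into readable names.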





--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685