
Re: [HTCondor-users] core.STARTER generated



On 12/21/2017 2:36 PM, Lee Mitchell wrote:
Hello All,

I am upgrading from

$CondorVersion: 7.8.1 Jun 08 2012 $
$CondorPlatform: x86_64_rhap_6.2 $

to

$CondorVersion: 8.6.8 Nov 13 2017 BuildID: 424045 $
$CondorPlatform: x86_64_RedHat7 $

I set all SEC_* knobs to NEVER; I have been relaxing security trying to get my test job (a shell script that calls sleep) to run. I finally have it negotiating and matching, but when it begins to start, a core.STARTER file is created in my log dir, with the below in my StarterLog.slot1

By the way, previously this same job would run successfully after it sat in the queue for some time (20 minutes?) after a message in the SchedLog saying something like "Have not heard from Negotiator for a while, running local jobs..."

Any advice is greatly appreciated. Thx, Lee


Some quick thoughts (hunches?), kinda in the order they occurred to me:

1. You upgraded your worker node (condor_startd/condor_starter) to v8.6.8. What version is your submit node (condor_schedd) running? Is it still running v7.8? We try to maintain compatibility across HTCondor versions, but there is a very big gap between v7.8 and v8.6, so it would not surprise me if problems appear. Try your tests using a submit node also running v8.6.

2. I would not advise setting SEC_DEFAULT_NEGOTIATION=NEVER. Try leaving that one at the default, or do SEC_DEFAULT_NEGOTIATION=REQUIRED.

3. You are using binaries compiled for RHEL7... this is indeed running on a RHEL7 or CentOS 7 system, right?

4. If the above doesn't help, try putting STARTER_DEBUG = D_ALL in the condor_config on your worker node and run again. This time your StarterLog should contain a lot more messages which could help make it more obvious where things are going wrong.

5. If things aren't clarified from #4 above, you could make the core.STARTER file available on the internet and send a message to the developers at htcondor-admin@xxxxxxxxxxx (or at htcondor-support@xxxxxxxxxxx if you have a support contract) telling them how to pick it up.
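
For items 2 and 4, here is a minimal sketch of what the worker node's local config might contain. I am assuming the local config file is the /opt/condor/local/condor_config.local shown in your StarterLog; adjust the path if yours differs.

```
# /opt/condor/local/condor_config.local on the worker node (assumed location)

# Item 2: do not force security negotiation off. REQUIRED is one option;
# deleting your NEVER setting to fall back to the default is the other.
SEC_DEFAULT_NEGOTIATION = REQUIRED

# Item 4: log every debug category from the starter into StarterLog.slot1
STARTER_DEBUG = D_ALL
```

Afterwards, condor_config_val STARTER_DEBUG on that node should print D_ALL, and a condor_reconfig (or a restart of HTCondor) should make the daemons pick up the change before you resubmit the test job.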


best regards
Todd



12/21/17 15:22:39 (pid:22864) ******************************************************
12/21/17 15:22:39 (pid:22864) ** condor_starter (CONDOR_STARTER) STARTING UP
12/21/17 15:22:39 (pid:22864) ** /opt/condor/sbin/condor_starter
12/21/17 15:22:39 (pid:22864) ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
12/21/17 15:22:39 (pid:22864) ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
12/21/17 15:22:39 (pid:22864) ** $CondorVersion: 8.6.8 Nov 13 2017 BuildID: 424045 $
12/21/17 15:22:39 (pid:22864) ** $CondorPlatform: x86_64_RedHat7 $
12/21/17 15:22:39 (pid:22864) ** PID = 22864
12/21/17 15:22:39 (pid:22864) ** Log last touched 12/21 15:22:30
12/21/17 15:22:39 (pid:22864) ******************************************************
12/21/17 15:22:39 (pid:22864) Using config source: /opt/condor/etc/condor_config
12/21/17 15:22:39 (pid:22864) Using local config sources:
12/21/17 15:22:39 (pid:22864)   /opt/condor/local/condor_config.local
12/21/17 15:22:39 (pid:22864) config Macros = 165, Sorted = 164, StringBytes = 5335, TablesBytes = 5988
12/21/17 15:22:39 (pid:22864) CLASSAD_CACHING is OFF
12/21/17 15:22:39 (pid:22864) Daemon Log is logging: D_ALWAYS D_ERROR
12/21/17 15:22:39 (pid:22864) SharedPortEndpoint: waiting for connections to named socket 19607_a7b6_4
12/21/17 15:22:39 (pid:22864) DaemonCore: command socket at <10.245.9.29:9618?addrs=10.245.9.29-9618+[--1]-9618&noUDP&sock=19607_a7b6_4>
12/21/17 15:22:39 (pid:22864) DaemonCore: private command socket at <10.245.9.29:9618?addrs=10.245.9.29-9618+[--1]-9618&noUDP&sock=19607_a7b6_4>
12/21/17 15:22:39 (pid:22864) Communicating with shadow <10.245.9.29:9618?addrs=10.245.9.29-9618+[--1]-9618&noUDP&sock=19606_f356_4>
12/21/17 15:22:39 (pid:22864) Submitting machine is "njrarltapp001a8.mgmt.ams1907.com"
12/21/17 15:22:39 (pid:22864) setting the orig job name in starter
12/21/17 15:22:39 (pid:22864) setting the orig job iwd in starter
12/21/17 15:22:39 (pid:22864) Chirp config summary: IO false, Updates false, Delayed updates true.
12/21/17 15:22:39 (pid:22864) Initialized IO Proxy.
12/21/17 15:22:39 (pid:22864) Done setting resource limits
12/21/17 15:22:39 (pid:22864) File transfer completed successfully.
Stack dump for process 22864 at timestamp 1513887760 (14 frames)
/opt/condor/sbin/../lib/libcondor_utils_8_6_8.so(dprintf_dump_stack+0x72)[0x7fcd88c3cea2]
/opt/condor/sbin/../lib/libcondor_utils_8_6_8.so(_Z18linux_sig_coredumpi+0x24)[0x7fcd88dc74a4]
/lib64/libpthread.so.0(+0xf5e0)[0x7fcd873165e0]
/opt/condor/sbin/../lib/libcondor_utils_8_6_8.so(_ZNK17CondorVersionInfo19built_since_versionEiii+0x10)[0x7fcd88c94dd0]
condor_starter(REMOTE_CONDOR_dprintf_stats+0x39)[0x440629]
condor_starter(_ZN9JICShadow17transferCompletedEP12FileTransfer+0x13b)[0x4295fb]
/opt/condor/sbin/../lib/libcondor_utils_8_6_8.so(_ZN12FileTransfer6ReaperEP7Serviceii+0x1b8)[0x7fcd88c70a38]
/opt/condor/sbin/../lib/libcondor_utils_8_6_8.so(_ZN10DaemonCore10CallReaperEiPKcii+0x12d)[0x7fcd88da73bd]
/opt/condor/sbin/../lib/libcondor_utils_8_6_8.so(_ZN10DaemonCore17HandleProcessExitEii+0x1b9)[0x7fcd88da9ca9]
/opt/condor/sbin/../lib/libcondor_utils_8_6_8.so(_ZN10DaemonCore24HandleDC_SERVICEWAITPIDSEi+0x7c)[0x7fcd88da9e6c]
/opt/condor/sbin/../lib/libcondor_utils_8_6_8.so(_ZN10DaemonCore6DriverEv+0x6b2)[0x7fcd88daa552]
/opt/condor/sbin/../lib/libcondor_utils_8_6_8.so(_Z7dc_mainiPPc+0x13a4)[0x7fcd88dcab04]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fcd86f65c05]
condor_starter[0x422840]



_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/



--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685