[HTCondor-users] Do Starters stop responding if they are queued for i/o?



Hi Todd,

I have some evidence (not conclusive) that if a Starter ends up queued for a long time to transfer data back to a shadow (i.e. in the q> state, because of low limits and high load on incoming condor file i/o), then the Starter can stop responding to the "are you alive" queries from the Startd and get hard killed. The job then gets rescheduled. Here are the relevant parts of the user log, StartLog and StarterLog. The user job exits with a checkpoint at 04/03/19 18:32:31 and is then waiting to transfer its checkpoint back to the busy schedd machine.

Any ideas? Is this possible?

Cheers,
Duncan.

006 (6083422.000.000) 04/03 18:29:42 Image size of job updated: 12500000
	12020  -  MemoryUsage of job (MB)
	12307696  -  ResidentSetSize of job (KB)
...
022 (6083422.000.000) 04/03 19:24:20 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to slot1@CRUSH-SUGWG-10-5-138-36 <10.5.138.36:9618?addrs=10.5.138.36-9618&noUDP&sock=2000_6980_3>
...
024 (6083422.000.000) 04/03 19:24:22 Job reconnection failed
    Job not found at execution machine
    Can not reconnect to slot1@CRUSH-SUGWG-10-5-138-36, rescheduling job
...


04/03/19 19:24:20 ERROR: Child pid 3999 appears hung! Killing it hard.
04/03/19 19:24:20 Starter pid 3999 died on signal 9 (signal 9 (Killed))


04/03/19 12:01:22 (pid:3999) ******************************************************
04/03/19 12:01:22 (pid:3999) ** condor_starter (CONDOR_STARTER) STARTING UP
04/03/19 12:01:22 (pid:3999) ** /usr/sbin/condor_starter
04/03/19 12:01:22 (pid:3999) ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
04/03/19 12:01:22 (pid:3999) ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
04/03/19 12:01:22 (pid:3999) ** $CondorVersion: 8.8.1 Feb 18 2019 BuildID: 461773 PackageID: 8.8.1-1 $
04/03/19 12:01:22 (pid:3999) ** $CondorPlatform: x86_64_RedHat7 $
04/03/19 12:01:22 (pid:3999) ** PID = 3999
04/03/19 12:01:22 (pid:3999) ** Log last touched 4/3 11:13:34
04/03/19 12:01:22 (pid:3999) ******************************************************
04/03/19 12:01:22 (pid:3999) Using config source: /etc/condor/condor_config
04/03/19 12:01:22 (pid:3999) Using local config sources: 
04/03/19 12:01:22 (pid:3999)    /etc/condor/condor_config.local
04/03/19 12:01:22 (pid:3999) config Macros = 106, Sorted = 105, StringBytes = 3722, TablesBytes = 3864
04/03/19 12:01:22 (pid:3999) CLASSAD_CACHING is OFF
04/03/19 12:01:22 (pid:3999) Daemon Log is logging: D_ALWAYS D_ERROR
04/03/19 12:01:22 (pid:3999) SharedPortEndpoint: waiting for connections to named socket 2033_2eac_13
04/03/19 12:01:22 (pid:3999) DaemonCore: command socket at <10.5.138.36:9618?addrs=10.5.138.36-9618&noUDP&sock=2033_2eac_13>
04/03/19 12:01:22 (pid:3999) DaemonCore: private command socket at <10.5.138.36:9618?addrs=10.5.138.36-9618&noUDP&sock=2033_2eac_13>
04/03/19 12:01:22 (pid:3999) Communicating with shadow <10.5.2.4:9615?sock=1563698_f214_15858>
04/03/19 12:01:22 (pid:3999) Submitting machine is "10.5.2.4"
04/03/19 12:01:22 (pid:3999) setting the orig job name in starter
04/03/19 12:01:22 (pid:3999) setting the orig job iwd in starter
04/03/19 12:01:22 (pid:3999) Job has WantIOProxy=true
04/03/19 12:01:22 (pid:3999) Chirp config summary: IO true, Updates true, Delayed updates true.
04/03/19 12:01:22 (pid:3999) Initialized IO Proxy.
04/03/19 12:01:22 (pid:3999) Done setting resource limits
04/03/19 14:44:10 (pid:3999) File transfer completed successfully.
04/03/19 14:44:10 (pid:3999) Job 6083422.0 set to execute immediately
04/03/19 14:44:10 (pid:3999) Starting a VANILLA universe job with ID: 6083422.0
04/03/19 14:44:10 (pid:3999) Current mount, /, is shared.
04/03/19 14:44:10 (pid:3999) IWD: /var/lib/condor/execute/dir_3999
04/03/19 14:44:10 (pid:3999) Output file: /var/lib/condor/execute/dir_3999/_condor_stdout
04/03/19 14:44:10 (pid:3999) Error file: /var/lib/condor/execute/dir_3999/_condor_stderr
04/03/19 14:44:10 (pid:3999) Renice expr "0" evaluated to 0
04/03/19 14:44:10 (pid:3999) Using wrapper /etc/condor/sugwg-job-wrapper.sh to exec /usr/libexec/condor/condor_pid_ns_init /home/daniel.finstad/opt/pisn/bin/pycbc_inference --processing-scheme cpu --nprocesses 24 --fake-strain-from-file H1:/home/daniel.finstad/projects/pycbc-pisn-paper/data/asd_june_2016/aLIGO_design.txt L1:/home/daniel.finstad/projects/pycbc-pisn-paper/data/asd_june_2016/aLIGO_design.txt V1:/home/daniel.finstad/projects/pycbc-pisn-paper/data/asd_june_2016/AdVirgo.txt --asd-file H1:/home/daniel.finstad/projects/pycbc-pisn-paper/data/asd_june_2016/aLIGO_design.txt L1:/home/daniel.finstad/projects/pycbc-pisn-paper/data/asd_june_2016/aLIGO_design.txt V1:/home/daniel.finstad/projects/pycbc-pisn-paper/data/asd_june_2016/AdVirgo.txt --fake-strain-seed 0 --pad-data 8 --strain-high-pass 5 --sample-rate 2048 --low-frequency-cutoff 10 --verbose --force --inj-seed 2345 --instruments H1 L1 V1 --gps-start-time 1126259454 --gps-end-time 1126259470 --channel-name V1:V1:LOSC-STRAIN H1:H1:LOSC-STRAIN L1:L1:LOSC-STRAIN --config-file pisn_inference_large_eoc.ini --fake-strain-seed V1:158 H1:156 L1:157 --injection-file H1L1V1-CREATE_INJECTIONS_2397-1126259454-16.hdf --seed 52 --output-file H1L1V1-INFERENCE_2397-1126259454-16.hdf
04/03/19 14:44:10 (pid:3999) Running job as user daniel.finstad
04/03/19 14:44:10 (pid:3999) Create_Process succeeded, pid=6356
04/03/19 14:44:10 (pid:3999) Limiting (soft) memory usage to 0 bytes
04/03/19 14:44:10 (pid:3999) Limiting memsw usage to 9223372036854775807 bytes
04/03/19 14:44:10 (pid:3999) Limiting (soft) memory usage to 21474836480 bytes
04/03/19 14:44:10 (pid:3999) Limiting (hard) memory usage to 47148670976 bytes
04/03/19 14:44:10 (pid:3999) Limiting memsw usage to 47148675072 bytes
04/03/19 18:32:31 (pid:3999) Process exited, pid=6356, status=0
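
To put a number on how long the Starter sat between the job exiting and the hard kill, the two timestamps from the StarterLog and StartLog excerpts above can be subtracted directly. This is only a small sketch: the helper name parse_ts is made up for illustration, and it assumes every daemon log line starts with a 17-character "MM/DD/YY HH:MM:SS" timestamp, as the lines quoted here do.

```python
from datetime import datetime

# Timestamp layout used at the start of the StartLog/StarterLog lines above.
FMT = "%m/%d/%y %H:%M:%S"


def parse_ts(line):
    """Parse the leading 'MM/DD/YY HH:MM:SS' stamp of a log line (hypothetical helper)."""
    return datetime.strptime(line[:17], FMT)


# Both lines are copied verbatim from the excerpts in this message.
exit_line = "04/03/19 18:32:31 (pid:3999) Process exited, pid=6356, status=0"
kill_line = "04/03/19 19:24:20 ERROR: Child pid 3999 appears hung! Killing it hard."

gap = parse_ts(kill_line) - parse_ts(exit_line)
gap_seconds = int(gap.total_seconds())

print(f"Starter was killed {gap} ({gap_seconds} s) after the job process exited")
```

The gap works out to just under 52 minutes. As far as I understand it, the Startd's hung-child check is governed by NOT_RESPONDING_TIMEOUT (3600 seconds by default, if the manual is read correctly), so it may be worth comparing this figure against whatever value is set on the execute node.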


-- 

Duncan Brown                              Room 263-1, Physics Department
Charles Brightman Professor of Physics     Syracuse University, NY 13244
http://dabrown.expressions.syr.edu                   Phone: 315 443 5993