
Re: [HTCondor-users] preempting many jobs before maxjobretirementtime is reached



Hi Carsten,

It seems that the starter and schedd logs point to an issue, potentially with the shadow.  Did you happen to catch the log files from the shadows?

Not entirely clear this is related to preemption.  Could be file I/O related.

Brian

> On May 21, 2020, at 12:02 PM, Carsten Aulbert <carsten.aulbert@xxxxxxxxxx> wrote:
> 
> Hi,
> 
> I hope everyone has fully recovered from the HTCondor week :)
> 
> Unfortunately, this problem only came up after yesterday's Q&A session
> finished, so I am asking it in the usual place.
> 
> In our new configuration (config summary at the end) we try to enforce
> preemption based on priority (and maybe other factors in the future),
> but right now jobs are killed before MaxJobRetirementTime is reached,
> and we currently have no idea what is causing this.
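> 
> For context, the mechanism we rely on can be sketched like this. The
> knob names below are standard HTCondor configuration macros, but the
> values are illustrative only, not our actual settings:
> 
> ```
> # Negotiator side: permit preemption when the incoming user's priority
> # is sufficiently better than that of the user currently on the slot.
> PREEMPTION_REQUIREMENTS = RemoteUserPrio > TARGET.SubmitterUserPrio * 1.2
> 
> # Startd side: once a job is preempted, let it retire gracefully for
> # up to 24 hours before it is killed.
> MAXJOBRETIREMENTTIME = 24 * 60 * 60
> ```
> 
> With a setup along these lines, a preempted job should keep running
> until MAXJOBRETIREMENTTIME expires; what we observe instead is that
> jobs are killed well before that.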
> 
> The lines copied below seem to indicate networking problems, but I do
> not see any hints on the switches or on the nodes. The submit machine
> in particular is quite busy, but the load should not overwhelm it (the
> system is mostly idle, with about 5% wait, 10% system, and mostly
> pegasus monitor processes in D state).
> 
> The service should have enough file handles:
> 
> condor.service - Condor Distributed High-Throughput-Computing
> Loaded: loaded (/lib/systemd/system/condor.service; enabled; vendor
> preset: enabled)
> Drop-In: /etc/systemd/system.control/condor.service.d
> └─50-TasksMax.conf
> /etc/systemd/system/condor.service.d
> └─ensure-run-dir.conf, limits.conf
> Active: active (running) since Thu 2020-05-21 08:30:44 UTC; 8h ago
> Process: 2440231 ExecStartPre=/bin/mkdir --parent --mode=0755
> /run/condor (code=exited, status=0/SUCCESS)
> Process: 2440232 ExecStartPre=/bin/chown condor:condor /run/condor
> (code=exited, status=0/SUCCESS)
> Main PID: 2440233 (condor_master)
> Status: "All daemons are responding"
> Tasks: 34255 (limit: 131071)
> Memory: 76.7G
> CGroup: /system.slice/condor.service
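> 
> A systemd drop-in along the following lines is the usual way to raise
> such limits; this is an illustrative sketch, not the actual contents
> of our limits.conf (TasksMax and LimitNOFILE are standard systemd
> [Service] directives):
> 
> ```
> # /etc/systemd/system/condor.service.d/limits.conf (illustrative):
> # raise the task and file-handle limits for all HTCondor daemons
> # started from this unit.
> [Service]
> TasksMax=131071
> LimitNOFILE=131072
> ```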
> 
> Thus, I would be grateful for any hint on where to look.
> 
> Cheers
> 
> Carsten
> 
> In the SchedLog we see these worrisome lines:
> 
> 05/21/20 16:43:26 (2061018.0) (1646287): **** condor_shadow
> (condor_SHADOW) pid 1646287 EXITING WITH STATUS 108
> 05/21/20 16:43:26 (2061021.0) (1646309): **** condor_shadow
> (condor_SHADOW) pid 1646309 EXITING WITH STATUS 108
> 05/21/20 16:43:26 (2006521.0) (1624977): File transfer completed
> successfully.
> 05/21/20 16:43:26 (2001479.0) (1610449): File transfer completed
> successfully.
> 05/21/20 16:43:26 (2013966.0) (1637250): **** condor_shadow
> (condor_SHADOW) pid 1637250 EXITING WITH STATUS 108
> 05/21/20 16:43:26 (2054171.0) (1339388): condor_read(): Socket closed
> abnormally when trying to read 21 bytes from schedd at
> <10.20.30.17:14867>, errno=104 Connection reset by peer
> 05/21/20 16:43:26 (2054171.0) (1339388): SetEffectiveOwner(ahnitz)
> failed with errno=110: Connection timed out.
> 05/21/20 16:43:26 (2052187.0) (1313342): condor_read(): Socket closed
> abnormally when trying to read 21 bytes from schedd at
> <10.20.30.17:14867>, errno=104 Connection reset by peer
> 05/21/20 16:43:26 (2052187.0) (1313342): SetEffectiveOwner(ahnitz)
> failed with errno=110: Connection timed out.
> 05/21/20 16:43:26 (2031965.0) (1283878): condor_read(): Socket closed
> abnormally when trying to read 21 bytes from schedd at
> <10.20.30.17:14867>, errno=104 Connection reset by peer
> 05/21/20 16:43:26 (2031965.0) (1283878): SetEffectiveOwner(ahnitz)
> failed with errno=110: Connection timed out.
> 05/21/20 16:43:26 (2051766.0) (1308655): condor_read(): Socket closed
> abnormally when trying to read 21 bytes from schedd at
> <10.20.30.17:14867>, errno=104 Connection reset by peer
> 05/21/20 16:43:26 (2039452.0) (1293958): condor_read(): Socket closed
> abnormally when trying to read 21 bytes from schedd at
> <10.20.30.17:14867>, errno=104 Connection reset by peer
> 
> In StarterLogs these lines
> 
> 05/21/20 15:24:32 (pid:48660) Create_Process succeeded, pid=48814
> 05/21/20 15:24:40 (pid:48660) Failed to open '.update.ad' to read update
> ad: No such file or directory (2).
> 05/21/20 15:24:40 (pid:48660) Failed to open '.update.ad' to read update
> ad: No such file or directory (2).
> 05/21/20 16:45:03 (pid:48660) Connection to shadow may be lost, will
> test by sending whoami request.
> 05/21/20 16:45:03 (pid:48660) condor_write(): Socket closed when trying
> to write 37 bytes to <10.20.30.17:12065>, fd is 9
> 05/21/20 16:45:03 (pid:48660) Buf::write(): condor_write() failed
> 05/21/20 16:45:03 (pid:48660) i/o error result is 0, errno is 0
> 05/21/20 16:45:03 (pid:48660) Lost connection to shadow, waiting 2400
> secs for reconnect
> 05/21/20 16:45:27 (pid:48660) Got SIGTERM. Performing graceful shutdown.
> 05/21/20 16:45:27 (pid:48660) ShutdownGraceful all jobs.
> 05/21/20 16:45:28 (pid:48660) Process exited, pid=48814, status=143
> 05/21/20 16:45:28 (pid:48660) Returning from CStarter::JobReaper()
> 
> matching lines from StartLog:
> 05/21/20 15:11:00 slot1_64: New machine resource of type -1 allocated
> 05/21/20 15:11:01 slot1_64: Request accepted.
> 05/21/20 15:11:01 slot1_64: Remote owner is ahnitz@xxxxxxxxxxx
> 05/21/20 15:11:01 slot1_64: State change: claiming protocol successful
> 05/21/20 15:11:01 slot1_64: Changing state: Owner -> Claimed
> 05/21/20 15:21:01 slot1_64: Response problem from schedd
> <10.20.30.17:9618?addrs=10.20.30.17-9618&noUDP&sock=2440233_7254_3> on
> ALIVE job -1.-1.
> 05/21/20 15:21:06 slot1_64: Response problem from schedd
> <10.20.30.17:9618?addrs=10.20.30.17-9618&noUDP&sock=2440233_7254_3> on
> ALIVE job -1.-1.
> 05/21/20 15:21:11 slot1_64: Couldn't send ALIVE to schedd at
> <10.20.30.17:9618?addrs=10.20.30.17-9618&noUDP&sock=2440233_7254_3>
> 05/21/20 15:23:16 slot1_64: Got activate_claim request from shadow
> (10.20.30.17)
> 05/21/20 15:23:16 slot1_64: Remote job ID is 2049848.0
> 05/21/20 15:23:16 slot1_64: Got universe "VANILLA" (5) from request classad
> 05/21/20 15:23:16 slot1_64: State change: claim-activation protocol
> successful
> 05/21/20 15:23:16 slot1_64: Changing activity: Idle -> Busy
> 05/21/20 15:23:19 slot1_64: Failed to open
> '/local/condor/execute/dir_48660/.update.ad.tmp' for writing update ad:
> No such file or directory (2).
> 05/21/20 15:24:01 slot1_64: Failed to open
> '/local/condor/execute/dir_48660/.update.ad.tmp' for writing update ad:
> No such file or directory (2).
> 05/21/20 15:24:04 slot1_64: Failed to open
> '/local/condor/execute/dir_48660/.update.ad.tmp' for writing update ad:
> No such file or directory (2).
> 05/21/20 15:24:04 slot1_64: Failed to open
> '/local/condor/execute/dir_48660/.update.ad.tmp' for writing update ad:
> No such file or directory (2).
> 05/21/20 16:45:27 slot1_64: Got ACTIVATE_CLAIM while in Claimed/Busy
> state, ignoring.
> 05/21/20 16:45:27 slot1_64: Called deactivate_claim()
> 05/21/20 16:45:27 slot1_64: Changing state and activity: Claimed/Busy ->
> Preempting/Vacating
> 05/21/20 16:45:38 slot1_64: Got RELEASE_CLAIM while in Preempting state,
> ignoring.
> 
> 
> -- 
> Dr. Carsten Aulbert, Max Planck Institute for Gravitational Physics,
> Callinstraße 38, 30167 Hannover, Germany
> Phone: +49 511 762 17185
> 
> <submit.sum><central.sum><execute.sum>
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> 
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/