
Re: [HTCondor-users] condor_starter processes at 100% CPU on Debian after completing job, before transferring files back



You'll want to read the entire thread. The options are:

1. disable cgroups entirely (Greg's suggestion).
2. configure the kernel to be able to use cgroups correctly (mine)

In #2, I don't explicitly say "run update-grub and reboot". I also gave a Condor Week presentation that applies to systemd platforms like yours.
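As a rough way to tell whether option 2 is still needed on a node, the sketch below checks whether the kernel reports the memory cgroup controller as enabled. The helper name `memcg_enabled` is made up for illustration, and the kernel parameter and HTCondor knob mentioned in the comments are assumptions about what the linked thread recommends, so check them against that thread before applying anything.

```shell
#!/bin/sh
# Hypothetical helper (not part of HTCondor): succeeds if a file in
# /proc/cgroups format lists the "memory" controller with enabled=1.
# Columns in /proc/cgroups: subsys_name hierarchy num_cgroups enabled
memcg_enabled() {
    awk '$1 == "memory" && $4 == 1 { found = 1 } END { exit !found }' \
        "${1:-/proc/cgroups}"
}

if memcg_enabled; then
    echo "memory cgroup controller: enabled"
else
    echo "memory cgroup controller: not enabled"
    # Assumption: on Debian the usual fix for option 2 is to add
    #   cgroup_enable=memory
    # to GRUB_CMDLINE_LINUX in /etc/default/grub, then run update-grub
    # and reboot.
    # Assumption: option 1 corresponds to leaving BASE_CGROUP empty in
    # the HTCondor configuration on the execute node.
fi
```

If the controller shows as not enabled, that would be consistent with the "Cgroup controller for memory accounting is not available" line in the Starter log quoted below.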


--
Tom Downes
Senior Scientist and Data Center Manager
Center for Gravitation, Cosmology and Astrophysics
University of Wisconsin-Milwaukee
414.229.2678

On Fri, Jul 14, 2017 at 7:46 AM, Samik Raychaudhuri <samikr@xxxxxxxxx> wrote:
Hi Andrew,
This is a known problem with Debian - I have faced this recently. Please check out this recent thread: [https://www-auth.cs.wisc.edu/lists/htcondor-users/2017-April/msg00059.shtml]
Thanks.
-Samik

On 14-Jul-17 4:08 AM, Andrew Cunningham wrote:
I am running Condor on Debian as a compute node. The Condor host (and also compute node) is Red Hat 7, running 8.6.4
The submit node is a Windows 7 machine running 8.6.4 64-bit.
The job is a vanilla job that sends some files and returns some files from a computation.

$CondorVersion: 8.6.4 Jun 21 2017 BuildID: 408625 $
$CondorPlatform: x86_64_Debian8 $

I have divided the Debian machine into 8 equal slots. Only condor_config.local was modified:

CONDOR_HOST=condor
ALLOW_WRITE = $(FULL_HOSTNAME),*.mydomainname.com
NUM_SLOTS =8
DAEMON_LIST=MASTER STARTD

When I submit a job to our "cluster" of 2 machines, the Debian machine completes the jobs, but the condor_starter processes get stuck at 100% CPU, seemingly spinning their wheels trying to transfer the files back to the submit host.
The other Linux node completes its jobs and sends back the computed data with no issues.

Examining the /var/lib/condor/execute directory shows the execute directories, with the output all normal and complete.
Nothing in the Starter logs on the Debian node indicates any issues.

07/13/17 15:25:33 (pid:34323) Communicating with shadow <192.168.0.211:9618?addrs=192.168.0.211-9618&noUDP&sock=9892_fcaa_3>
07/13/17 15:25:33 (pid:34323) Submitting machine is "bose"
07/13/17 15:25:33 (pid:34323) setting the orig job name in starter
07/13/17 15:25:33 (pid:34323) setting the orig job iwd in starter
07/13/17 15:25:33 (pid:34323) SLOT2_USER set, so running job as acu
07/13/17 15:25:33 (pid:34323) Chirp config summary: IO false, Updates false, Delayed updates true.
07/13/17 15:25:33 (pid:34323) Initialized IO Proxy.
07/13/17 15:25:33 (pid:34323) Done setting resource limits
07/13/17 15:25:33 (pid:34323) File transfer completed successfully.
07/13/17 15:25:33 (pid:34323) Job 84.1 set to execute immediately
07/13/17 15:25:33 (pid:34323) Starting a VANILLA universe job with ID: 84.1
07/13/17 15:25:33 (pid:34323) IWD: /var/lib/condor/execute/dir_34323
07/13/17 15:25:33 (pid:34323) Output file: /var/lib/condor/execute/dir_34323/_condor_stdout
07/13/17 15:25:33 (pid:34323) Error file: /var/lib/condor/execute/dir_34323/_condor_stderr
07/13/17 15:25:33 (pid:34323) Renice expr "0" evaluated to 0
07/13/17 15:25:33 (pid:34323) About to exec /var/lib/condor/execute/dir_34323/condor_exec.exe VA22_2017 BEM_Fluid 1 27012@licenseserver
07/13/17 15:25:33 (pid:34323) Running job as user acu
07/13/17 15:25:33 (pid:34323) Create_Process succeeded, pid=34340
07/13/17 15:25:33 (pid:34323) Cgroup controller for memory accounting is not available.



I am forced to restart condor on this node to stop the condor_starter processes.

Thanks for any advice.


_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@cs.wisc.edu with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/

