
Re: [Condor-users] Trouble forking many child processes from a Condor job



Ignore this. I'm an idiot. My main process is finishing *before* the
forked children, and Condor is absolutely, 100%, Doing The Right
Thing(tm) here by cleaning up the processes.

I need to wait at the end of my bash script for the children to complete
before finishing.
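
That is, the end of the script just needs a `wait` so the main process
outlives the backgrounded submits. A minimal sketch of the fix, with
`sleep` standing in for the arc-job-submit calls:

```shell
#!/bin/bash
# Same launch loop as before, but the script now blocks on its
# children before exiting. sleep stands in for arc-job-submit.
NUM_CHILD=5
for ((a=1; a <= NUM_CHILD; a++))
do
   sleep 1 &
done

wait   # block until every backgrounded child has exited
echo "all $NUM_CHILD children finished"
```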

Must be Monday...

- Ian

-----Original Message-----
From: Ian Chesal
Sent: Monday, March 23, 2009 5:27 PM
To: 'Condor-Users Mail List'
Subject: Trouble forking many child processes from a Condor job

I'm running into a very strange problem here with a test job. It's a job
that enters my Condor system (via a fetch work hook) and spawns one
hundred more jobs in parallel via a Stupidly Simple Bash Script(tm).

The script looks like this:

   #!/bin/bash

   NUM_CHILD=100
   echo "*********************** create-this.sh **********************"
   echo Simulates Create This Test
   echo
   echo Launch $NUM_CHILD jobs each running for 15 seconds
   for ((a=1; a <= NUM_CHILD ; a++))
   do
      echo Launching job $a
      arc-job-submit -t name="Create This $a" -- jobs/leaf_job.sh 2 &
   done
   echo
   # Annotate number of expected child jobs
   arc-job-info num_children=$NUM_CHILD

The arc-job-submit command puts more jobs into my database, from which
my fetch work hook pulls jobs to run.

If I run this script manually the jobs enter my system and run properly
on Condor nodes. There's no problem with the background fork on the
machine.

If I submit this script to run via Condor, fewer than half my jobs
enter my database. More than half the calls to arc-job-submit die
mysteriously mid-way through creating their jobs in the database. The
point-of-death is never the same; it appears to be quite random.

If I drop the fork-to-background & and submit the script to run via
Condor, all 100 jobs enter the system successfully.

I tried it under 7.2.1 on CentOS 4 both with:

USE_PROCD = True

And:

USE_PROCD = False

No difference. When the script is run under Condor, the success rate
for the forked arc-job-submit calls is always atrociously low.

I find this quite strange. Condor has always seemed to stay well out of
the way of my job processes, never interfering with this kind of
spawning. Is there something new in 7.2.x that might be limiting a
very fast child process startup rate?

There's absolutely nothing in the StarterLog on the machine that runs
the bash script to indicate that Condor is killing spawned child
processes on me. Here's a sample StarterLog output for a job placed on a
machine via my fetch work hook that runs this bash script to spawn 100
more jobs:

3/23 14:01:16 ******************************************************
3/23 14:01:16 Using config source: /etc/condor/condor_config
3/23 14:01:16 Using local config sources:
3/23 14:01:16    /tools/arc/condor/condor_config.basic
3/23 14:01:16    /tools/arc/condor/os/condor_config.LINUX
3/23 14:01:16    /tools/arc/condor/site/condor_config.SJDEV
3/23 14:01:16    /tools/arc/condor/machine/condor_config.sqal08
3/23 14:01:16    /tools/arc/condor/machine/condor_config.sqal08.LINUX
3/23 14:01:16    /tools/arc/condor/patch/condor_config.sqal08
3/23 14:01:16    /tools/arc/condor/patch/condor_config.sqal08.LINUX
3/23 14:01:16    /tools/arc/condor/cycleserver/sqal08.config
3/23 14:01:16 DaemonCore: Command Socket at <137.57.202.213:48941>
3/23 14:01:16 Done setting resource limits
3/23 14:01:16 Starter running a local job with no shadow
3/23 14:01:16 Reading job ClassAd from "STDIN"
3/23 14:01:16 Found ClassAd data in "STDIN"
3/23 14:01:16 setting the orig job name in starter
3/23 14:01:16 setting the orig job iwd in starter
3/23 14:01:16 Job 1.0 set to execute immediately
3/23 14:01:17 Starting a VANILLA universe job with ID: 1.0
3/23 14:01:17 IWD: /data/tmp/compute_farm_load_test
3/23 14:01:17 Output file:
/data/sslany/job/20090323/1400/39140/stdout.txt
3/23 14:01:17 Error file:
/data/sslany/job/20090323/1400/39140/stderr.txt
3/23 14:01:17 Renice expr "((False =?= True) * 10)" evaluated to 0
3/23 14:01:17 About to exec /tools/arc/scripts/arc_execute.sh 39140
3/23 14:01:17 Create_Process succeeded, pid=23245
3/23 14:01:56 Process exited, pid=23245, status=0
3/23 14:01:56 All jobs have exited... starter exiting
3/23 14:01:56 **** condor_starter (condor_STARTER) pid 23243 EXITING
WITH STATUS 0

From that call only 23 of the 100 arc-job-submit calls completed
successfully. The other 77 were terminated at random points.

- Ian
