[Condor-users] condor_shadow "D" state in processes
- Date: Tue, 4 Dec 2007 11:13:01 -0500
- From: "Robert E. Parrott" <parrott@xxxxxxxxxxxxxxxx>
- Subject: [Condor-users] condor_shadow "D" state in processes
I'm seeing an unfortunate behavior with condor_shadow jobs in the
vanilla universe. This is Linux x86_64 running Condor v6.8.6.
A user submits a large number (500-1000) of jobs on a cluster with
150 processors, and has about 100 jobs running simultaneously. These
jobs all run for about 3 minutes, and then complete at nearly the
same time. At this time, the load on the submit machine, which is
also the head node, reaches a little over N, where N is the number of
this user's running jobs.
Closer inspection shows that all of the condor_shadow processes owned
by this user are in the "D" state, contending for what appears to be
the same resources.
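For anyone who wants to check for the same symptom, this is roughly how I'm spotting the blocked shadows (standard ps options; the wchan column is what suggests they're all waiting on the same thing):

```shell
# List condor_shadow processes in uninterruptible sleep (D state),
# along with the kernel wait channel each one is blocked in.
ps -eo pid,stat,wchan:32,comm | awk 'NR == 1 || ($2 ~ /^D/ && $4 == "condor_shadow")'
```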
At first I thought that this contention arose as the output data was
returned from the compute nodes to the submit node. So I asked
the user to add
initialdir = [ the run dir ]
should_transfer_files = NO
to the submit file, but this doesn't help. Also, looking at the
actual output, each job produces less than 20 K of output data.
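For context, the relevant portion of the submit file now looks roughly like this (the path and executable name are placeholders, not the user's actual values):

```
universe              = vanilla
initialdir            = /path/to/run/dir
should_transfer_files = NO
executable            = the_job
queue
```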
What could be causing such contention in a vanilla universe
condor_shadow job, if not the final file transfer process? Has
anyone seen such behavior before in the vanilla universe? Any hints
or guesses about things to look at?