[Condor-users] condor_shadow "D" state in processes
- Date: Tue, 4 Dec 2007 11:13:01 -0500
- From: "Robert E. Parrott" <parrott@xxxxxxxxxxxxxxxx>
- Subject: [Condor-users] condor_shadow "D" state in processes
I'm seeing an unfortunate behavior with condor_shadow jobs in the
vanilla universe. This is Linux x86_64 running Condor v6.8.6.
A user submits a large number (500-1000) of jobs on a cluster with
150 processors, and has about 100 jobs running simultaneously. These
jobs all run for about 3 minutes, and then complete at nearly the
same time. At this time, the load on the submit machine, which is
also the head node, reaches a little over N, where N is the number of
this user's running jobs.
Closer inspection shows that all of the condor_shadow processes owned
by this user are in the "D" state, contending for what appears to be
the same resources.
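For anyone who wants to check for the same symptom, this is roughly how I'm spotting the blocked shadows (standard ps options; the wchan column is what suggests they're all waiting on the same thing):

```shell
# List condor_shadow processes in uninterruptible sleep (D state),
# along with the kernel wait channel each one is blocked in.
ps -eo pid,stat,wchan:32,comm | awk 'NR == 1 || ($2 ~ /^D/ && $4 == "condor_shadow")'
```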
At first I thought that this contention arose as the output data was
returned from the compute nodes to the submit node. So I asked
the user to add
initialdir = [ the run dir ]
should_transfer_files = NO
to the submit file, but this doesn't help. Also, looking at the
actual output, each job produces less than 20 K of output data.
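For context, the relevant portion of the submit file now looks roughly like this (the path and executable name are placeholders, not the user's actual values):

```
universe              = vanilla
initialdir            = /path/to/run/dir
should_transfer_files = NO
executable            = the_job
queue
```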
What could be causing such contention in a vanilla universe
condor_shadow job, if not the final file transfer process? Has
anyone seen such behavior before in the vanilla universe? Any hints
or guesses about things to look at?