
Re: [Condor-users] condor_shadow "D" state in processes



Does the ShadowLog contain any clues about what the shadows are doing during the time of high load?

If not, it may be enlightening to run 'strace -p <pid of a shadow>' and see what the shadow is trying to do.
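
For example, something along these lines (a rough sketch; the ShadowLog location assumes the default LOG directory, which 'condor_config_val LOG' will report on your submit machine):

   # Watch the shadow log while a batch of jobs is completing
   tail -f "$(condor_config_val LOG)/ShadowLog"

   # Attach to one of the stuck shadows to see which system call it is blocked in
   strace -p <pid of a shadow>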

--Dan

Robert E. Parrott wrote:

Hi Folks,

I'm seeing an unfortunate behavior with condor_shadow processes in the vanilla universe. This is LINUX X86_64 and Condor v6.8.6.

A user submits a large number (500-1000) of jobs to a cluster with 150 processors, and has about 100 jobs running simultaneously. These jobs all run for about 3 minutes and then complete at nearly the same time. When they do, the load on the submit machine, which is also the head node, climbs to a little over N, where N is the number of this user's running jobs.

Closer inspection shows that all of the condor_shadow processes owned by this user are in the "D" (uninterruptible sleep) state, contending for what appears to be the same resource.
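
This is roughly what I'm looking at (just a sketch; the exact ps columns may vary a bit by distribution):

   # List the condor_shadow processes in uninterruptible sleep, with their wait channel
   ps -C condor_shadow -o pid,user,stat,wchan:30,cmd | awk 'NR==1 || $3 ~ /^D/'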

At first I thought that the contention arose when the output data was returned from the compute nodes to the submit node. So I asked the user to add

   initialdir = [ the run dir ]
   should_transfer_files = NO

to the submit file, but this doesn't help. Also, looking at the actual output, each job produces less than 20 KB of output data.
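
For reference, the relevant submit file looks roughly like this (a sketch from memory; the executable and directory names are placeholders, not the user's actual ones):

   universe              = vanilla
   executable            = run_job.sh
   initialdir            = /home/user/rundir
   should_transfer_files = NO
   output                = job.$(Process).out
   error                 = job.$(Process).err
   log                   = jobs.log
   queue 1000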

What could be causing such contention in a vanilla universe condor_shadow job, if not the final file transfer process? Has anyone seen such behavior before in the vanilla universe? Any hints or guesses about things to look at?


thanks,
rob





