
[Condor-users] Jobs put on Hold at file transfer



I have Condor installed on two IBM HS20 blade servers running Windows 2003.
Blade 1 is the master and can run jobs on both Blade 1 and Blade 2 (the slave), as long as they do
not attempt to transfer files back to Blade 1.
When a job running on Blade 2 finishes and tries to send files back to Blade 1,
the job is put on hold (H) in the Condor queue and never completes.

 

I can ping between the two blades, so they can see each other on the network.

 

condor_status shows that all blade processors are available (there are 8, since each blade
has four processors).

 

For jobs running on Blade 2, Condor transfers files from Blade 1 to Blade 2 before execution
(I can see the files in the C:\condor\execute\ folder on Blade 2, and can see the job
run to completion on Blade 2).

 

condor_q shows all jobs as running (R) until one running on Blade 2 tries to transfer its
output files back to Blade 1, at which point condor_q shows that job as on hold (H).

 

Jobs that don't require file transfer back to Blade 1 (e.g., a test job that transfers back only the condor.out file)
work fine (i.e., they are not put on hold) and exit the queue normally.
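
For reference, a submit description file for a job of the failing kind would look
roughly like the sketch below. The executable and input file names are placeholders
rather than the actual files, and the vanilla universe is an assumption; only
condor.out comes from the setup described above.

    # Sketch only: myjob.exe and input.dat are hypothetical placeholder names
    universe                = vanilla
    executable              = myjob.exe
    # Ask Condor to copy files to the execute machine and back again
    should_transfer_files   = YES
    # Output files are sent back to the submit machine when the job exits
    when_to_transfer_output = ON_EXIT
    transfer_input_files    = input.dat
    output                  = condor.out
    error                   = condor.err
    log                     = condor.log
    queue

With ON_EXIT, files the job creates in its execute directory are transferred back
to the submit machine at exit, which is the step where the jobs on Blade 2 go on hold.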


Does anyone know what I can do about this?

 

Thanks,

Diane