
[HTCondor-users] Fwd: Job not completing when run in Out-of-Core



Hello all,

We are using Condor version 7.8.8 at our office. We mainly use our compute
nodes to solve CEM problems in large batches (usually one simulation per
frequency). Condor works great most of the time when the simulations are run
in-core (in local RAM). However, when really large problems are run in
out-of-core mode (on the local HDD), the simulations get "stuck" in Condor.
When we condor_ssh_to_job, the simulation output file states that the
simulation completed normally, yet the job remains active in Condor. In the
node logs, the job's process (PID) never exits. These out-of-core files are
saved when the simulation completes (so the impedance matrix can be reused)
and are usually over 100 GB in size. Is there a file size limit on what
Condor can return? Could it cause problems that the JOB_SIZE is so much
larger than what Condor predicts?
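To put a concrete number on how much data a stuck job leaves in its sandbox, a small script like the one below could be run from a condor_ssh_to_job session. This is a minimal sketch, not an HTCondor tool: the function names `tree_size_bytes` and `human` are hypothetical, it assumes Python 3 is available on the execute node, and it assumes the `_CONDOR_SCRATCH_DIR` environment variable (which HTCondor normally sets in the job environment) is visible in the session, falling back to the current directory otherwise.

```python
import os
import sys


def tree_size_bytes(root):
    """Total size in bytes of regular files under root (symlinks are not followed)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # skip symlinks so linked files are not double-counted
            try:
                total += os.path.getsize(path)
            except OSError:
                # The file may have been removed while we were walking; skip it.
                pass
    return total


def human(n):
    """Format a byte count using binary units (B, KiB, MiB, GiB, TiB)."""
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if n < 1024 or unit == "TiB":
            return f"{n:.1f} {unit}"
        n /= 1024


if __name__ == "__main__":
    # Assumption: _CONDOR_SCRATCH_DIR points at the job sandbox; otherwise use
    # the first argument, or the current directory.
    root = sys.argv[1] if len(sys.argv) > 1 else os.environ.get("_CONDOR_SCRATCH_DIR", ".")
    size = tree_size_bytes(root)
    print(f"{root}: {human(size)} ({size} bytes)")
```

The reported total could then be compared by hand with the disk usage Condor shows for the job in `condor_q -l`, to see how far the actual output exceeds the estimate.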

Thanks for your time,

Michael Murphy
Engineer
IERUS Technologies, Inc.
2904 Westcorp Blvd, Ste 210
Huntsville, AL  35805
(256) 319-2026 ext 007