[Condor-users] DAG File descriptor panic when quota is exceeded




I ran condor_rm earlier today on a 100k-node DAG. Condor became intermittently responsive and then stopped responding entirely for more than 45 minutes. condor_restart and other attempts to revive it did not work (we only tried these after about 30 minutes). Could this be a side effect of the rescue DAG being created for such a large DAG?

Thanks,

Ian

Some more details:

30k nodes had completed, about 2k were queued, and only a handful (fewer than 100) were actively executing. The job was submitted around 9pm, and at 9am today we could see that nothing had finished overnight between about midnight and 4am (I don't have the log files available at the moment). We discovered bad data in one of the key input files, hence the decision to cancel the DAG with condor_rm.

--
Ian Stokes-Rees, Research Associate
SBGrid, Harvard Medical School
http://sbgrid.org
