
Re: [HTCondor-users] condor_dagman not creating jobs



If you're familiar with the basics of gdb, running condor_submit under gdb and getting a stack trace at the point of the crash is a good start.
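
As a rough sketch of what that could look like, assuming gdb is installed on the submit host and you run it from the directory holding the submit file (refit.prob.sub, per the dagman log quoted below). The helper name trace_crash is only illustrative, not something HTCondor provides:

```shell
# Hypothetical helper: run a command under gdb non-interactively.
#   -batch   exit gdb once the listed commands have finished
#   -ex run  start the program
#   -ex bt   print a backtrace (useful if the program stopped on SIGSEGV)
#   --args   treat everything after it as the program and its arguments
trace_crash() {
    gdb -batch -ex run -ex bt --args "$@"
}

# Example invocation (commented out so that sourcing this file has no side effects):
# trace_crash /usr/bin/condor_submit refit.prob.sub
```

The -a arguments from the failing command line in .condor_dagman.out can be appended after the submit file name to reproduce the DAGMan invocation more closely. Without debug symbols for condor_submit and the Globus libraries, the backtrace will likely show only addresses or library names, which is still worth posting.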

 - Jaime

> On Oct 27, 2021, at 3:20 PM, Vladimir Brik <vladimir.brik@xxxxxxxxxxxxxxxx> wrote:
> 
> Submitting by hand also crashes. What do I need to do to track down the cause?
> 
> Thanks
> Vlad
> 
> 
> On 10/27/21 1:51 PM, Jaime Frey wrote:
>> This indeed looks like a proxy-related issue. condor_submit doesn't send the proxy to the schedd, but it does read the proxy to verify it's valid.
>> Can you try submitting one of the affected node jobs by hand from the command line? If that also crashes, then it'll be easier to track down the cause.
>>  - Jaime
>>> On Oct 27, 2021, at 1:02 PM, Vladimir Brik <vladimir.brik@xxxxxxxxxxxxxxxx> wrote:
>>> 
>>> Hello
>>> 
>>> I've run into an issue where dagman seems to be unable to create jobs because condor_submit segfaults.
>>> 
>>> .condor_dagman.out contains:
>>> 10/27/21 12:52:35 ERROR: submit attempt failed
>>> 10/27/21 12:52:35 submit command was: /usr/bin/condor_submit -a dag_node_name' '=' 'job2 -a submit_event_notes' '=' 'DAG' 'Node:' 'job2 -a dagman_log' '=' '/mnt/scratch/tyuan/refit/./refit.prob.dag.nodes.log -a +DAGManNodesMask' '=' '"0,1,2,4,5,7,9,10,11,12,13,16,17,24,27,35,36" -a JOB=job2 -a OUTPUT_DIR' '=' '/data/user/tyuan/studies/tablemaker/refits/prob -a INPUT_DIR' '=' '/data/user/chill/photo-table -a FILE_NAME' '=' 'cascade_halftable_spice_3.2.1_flat_z0_zen100_azi180_nevents40000_0_range.fits -a DAG_STATUS' '=' '2 -a FAILED_COUNT' '=' '1 -a notification' '=' 'never -a +DAGParentNodeNames' '=' '"" refit.prob.sub
>>> 10/27/21 12:52:35 Job submit try 1/6 failed, will try again in >= 1 second.
>>> 
>>> dmesg contains:
>>> [2335469.858471] condor_submit[2260162]: segfault at a ip 00007efd3f70e2cb sp 00007ffd24306b40 error 4 in libglobus_gsi_credential.so.1.6.14[7efd3f707000+9000]
>>> [2335469.864387] Code: 00 48 c7 44 24 08 00 00 00 00 48 85 ff 74 07 e8 9b 93 ff ff 89 c5 4d 85 ff 74 3f 4c 8d 6c 24 08 49 8b 07 4c 89 ee 48 8b 40 20 <48> 8b 78 08 e8 bc 92 ff ff 85 c0 75 78 48 8b 03 48 8b 54 24 08 48
>>> 
>>> We are running version 9.0.6 on CentOS 8.
>>> 
>>> My simple test dags seem to be fine, so it doesn't always fail. Perhaps it has something to do with sending x509 proxies with the jobs?
>>> 
>>> Any help would be appreciated.
>>> 
>>> 
>>> Vlad