
Re: [HTCondor-users] condor_dagman not creating jobs



Submitting by hand also crashes. What do I need to do to track down the cause?

Thanks
Vlad


On 10/27/21 1:51 PM, Jaime Frey wrote:
This indeed looks like a proxy-related issue. condor_submit doesn't send the proxy to the schedd, but it does read the proxy to verify it's valid.

Can you try submitting one of the affected node jobs by hand from the command line? If that also crashes, then it'll be easier to track down the cause.

  - Jaime

On Oct 27, 2021, at 1:02 PM, Vladimir Brik <vladimir.brik@xxxxxxxxxxxxxxxx> wrote:

Hello

I've run into an issue where dagman seems to be unable to create jobs because condor_submit segfaults.

.condor_dagman.out contains:
10/27/21 12:52:35 ERROR: submit attempt failed
10/27/21 12:52:35 submit command was: /usr/bin/condor_submit -a dag_node_name' '=' 'job2 -a submit_event_notes' '=' 'DAG' 'Node:' 'job2 -a dagman_log' '=' '/mnt/scratch/tyuan/refit/./refit.prob.dag.nodes.log -a +DAGManNodesMask' '=' '"0,1,2,4,5,7,9,10,11,12,13,16,17,24,27,35,36" -a JOB=job2 -a OUTPUT_DIR' '=' '/data/user/tyuan/studies/tablemaker/refits/prob -a INPUT_DIR' '=' '/data/user/chill/photo-table -a FILE_NAME' '=' 'cascade_halftable_spice_3.2.1_flat_z0_zen100_azi180_nevents40000_0_range.fits -a DAG_STATUS' '=' '2 -a FAILED_COUNT' '=' '1 -a notification' '=' 'never -a +DAGParentNodeNames' '=' '"" refit.prob.sub
10/27/21 12:52:35 Job submit try 1/6 failed, will try again in >= 1 second.

dmesg contains:
[2335469.858471] condor_submit[2260162]: segfault at a ip 00007efd3f70e2cb sp 00007ffd24306b40 error 4 in libglobus_gsi_credential.so.1.6.14[7efd3f707000+9000]
[2335469.864387] Code: 00 48 c7 44 24 08 00 00 00 00 48 85 ff 74 07 e8 9b 93 ff ff 89 c5 4d 85 ff 74 3f 4c 8d 6c 24 08 49 8b 07 4c 89 ee 48 8b 40 20 <48> 8b 78 08 e8 bc 92 ff ff 85 c0 75 78 48 8b 03 48 8b 54 24 08 48
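
For reference, the segfault line above can be decoded with a few lines of Python. This is only a rough sketch: the function name parse_segfault and the dictionary keys are made up for illustration, and it assumes the usual x86-64 Linux kernel message layout (fault address, instruction pointer, stack pointer, hex error code, then library[mapping base+size]). Subtracting the mapping base from the instruction pointer gives the offset of the crashing instruction within that mapping, and the low three bits of the error code say whether the page was present, whether the access was a write, and whether it came from user mode.

```python
import re

# Kernel segfault line copied from the dmesg output above.
DMESG_LINE = (
    "condor_submit[2260162]: segfault at a ip 00007efd3f70e2cb "
    "sp 00007ffd24306b40 error 4 in "
    "libglobus_gsi_credential.so.1.6.14[7efd3f707000+9000]"
)

# Assumed layout of the kernel message; all numeric fields are hexadecimal
# except the PID.
SEGFAULT_RE = re.compile(
    r"(?P<prog>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<lib>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfault(line):
    """Hypothetical helper: split a segfault line into its fields."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        raise ValueError("not a recognised segfault line")
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    err = int(m["err"], 16)
    return {
        "program": m["prog"],
        "pid": int(m["pid"]),
        "fault_address": int(m["addr"], 16),
        "library": m["lib"],
        # Offset of the faulting instruction from the start of the mapping.
        "offset_in_mapping": ip - base,
        # x86 page-fault error code bits.
        "page_present": bool(err & 1),
        "was_write": bool(err & 2),
        "user_mode": bool(err & 4),
    }


info = parse_segfault(DMESG_LINE)
print(f"{info['library']} + {info['offset_in_mapping']:#x}")
print(f"fault address: {info['fault_address']:#x}")
```

For this crash the result is a user-mode read of address 0xa, at offset 0x72cb into the mapped part of libglobus_gsi_credential.so.1.6.14. A fault address that small is consistent with a near-null pointer being dereferenced inside the library. Note that the offset is relative to the start of the mapping, which is not necessarily the start of the file, so it may need adjusting before being matched against symbols; a backtrace from a debugger with debug symbols installed would be more direct.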

We are running HTCondor version 9.0.6 on CentOS 8.

My simple test DAGs seem to be fine, so it doesn't always fail. Perhaps it has something to do with sending x509 proxies with the jobs?

Any help would be appreciated.


Vlad


_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/