
[HTCondor-users] Attempts to connect ARC to Condor, causing condor_schedd to crash



Hello all,

We're currently in the process of getting ARC CE working with HTCondor.

We can submit jobs through ARC CE, and when queried with arcstat they show as Accepted:

arctest -c <host>.cern.ch -J 1

# arcstat -a
Job: gsiftp://<host>.cern.ch:2811/jobs/CZ3NDmnjyelniof7oo6yBzRnABFKDmABFKDm1qGKDmABFKDmnSpOan
 Name: arctest1
 State: Accepted

Status of 1 jobs was queried, 1 jobs returned information

However, the job never progresses beyond this state.

After some debugging and log checking, it appears that condor_schedd crashes after a submit.

A look at /var/spool/job.helper.errors gives the following output:

-- Failed to fetch ads from: <ip:39890> : <host>.cern.ch
CEDAR:6001:Failed to connect to <ip:39890>
Non-zero exit status returned by /usr/bin/condor_q
[2015-02-09 08:30:09] scan-condor-job: lrms_list_jobs failed

The troubleshooting guide suggests that ALLOW_READ might be behind this.

However, checking as the guide describes shows that ALLOW_READ is not defined:

# condor_config_val -v ALLOW_READ
Not defined: ALLOW_READ

We are currently running condor-8.3.2-288596.x86_64 and ARC CE 4.2.0-1.

Any suggestions or help you could provide would be much appreciated.

Regards,

Iain

Other relevant logs:

/var/log/condor/SchedLog gives the following:

02/09/15 10:59:34 (pid:6762) Received a superuser command
02/09/15 10:59:34 (pid:6762) Number of Active Workers 0
02/09/15 10:59:35 (pid:6491) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
02/09/15 10:59:36 (pid:6491) TransferQueueManager upload 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
02/09/15 10:59:36 (pid:6491) TransferQueueManager download 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
02/09/15 10:59:37 (pid:6508) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
02/09/15 10:59:37 (pid:6508) TransferQueueManager upload 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
02/09/15 10:59:37 (pid:6508) TransferQueueManager download 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
02/09/15 10:59:39 (pid:6541) procd (pid = 28813) exited unexpectedly with status 256
02/09/15 10:59:39 (pid:6541) attempting to restart the Procd
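
For anyone who wants to look for the same pattern in their own SchedLog, here is a minimal sketch in Python that picks out the "exited unexpectedly" lines. The function name find_unexpected_exits and the regular expression are illustrative only. They are based purely on the line format shown in the excerpt above and are not part of HTCondor or ARC.

```python
import re

# Matches lines of the form seen in the excerpt above, e.g.:
# 02/09/15 10:59:39 (pid:6541) procd (pid = 28813) exited unexpectedly with status 256
# The pattern is an assumption derived from that single example line.
EXIT_RE = re.compile(
    r"^(?P<stamp>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"\(pid:(?P<logger_pid>\d+)\) "
    r"(?P<daemon>\S+) \(pid = (?P<daemon_pid>\d+)\) "
    r"exited unexpectedly with status (?P<status>\d+)$"
)


def find_unexpected_exits(lines):
    """Return one dict per log line that reports an unexpected exit."""
    events = []
    for line in lines:
        match = EXIT_RE.match(line.strip())
        if match:
            events.append(
                {
                    "stamp": match.group("stamp"),
                    "logger_pid": int(match.group("logger_pid")),
                    "daemon": match.group("daemon"),
                    "daemon_pid": int(match.group("daemon_pid")),
                    "status": int(match.group("status")),
                }
            )
    return events


if __name__ == "__main__":
    # Path taken from this post; adjust it for your installation.
    with open("/var/log/condor/SchedLog") as log:
        for event in find_unexpected_exits(log):
            print(event)
```

Run against the excerpt above, this would report a single event for procd (pid 28813), logged by pid 6541.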

/var/log/condor/ProcLog has the following events:

02/09/15 10:59:40 : ProcAPI: new boottime = 1421741930; old_boottime = 1421741930; /proc/stat boottime = 1421741930; /proc/uptime boottime = 1421741930
02/09/15 10:59:40 : process 28813 (not in monitored family) has exited
02/09/15 10:59:40 : process 28812 (not in monitored family) has exited
02/09/15 10:59:40 : process 28811 (not in monitored family) has exited
02/09/15 10:59:40 : process 28632 (not in monitored family) has exited
02/09/15 10:59:40 : process 28630 (not in monitored family) has exited
02/09/15 10:59:40 : process 28629 (not in monitored family) has exited
02/09/15 10:59:40 : process 28612 (not in monitored family) has exited
02/09/15 10:59:40 : process 28611 (not in monitored family) has exited
02/09/15 10:59:40 : process 28610 (not in monitored family) has exited
02/09/15 10:59:40 : process 28609 (not in monitored family) has exited
02/09/15 10:59:40 : process 28608 (not in monitored family) has exited
02/09/15 10:59:40 : process 28607 (not in monitored family) has exited
02/09/15 10:59:40 : process 28606 (not in monitored family) has exited
02/09/15 10:59:40 : process 28603 (not in monitored family) has exited
02/09/15 10:59:40 : process 28602 (not in monitored family) has exited
02/09/15 10:59:40 : process 28601 (not in monitored family) has exited
02/09/15 10:59:40 : no methods have determined process 28816 to be in a monitored family
02/09/15 10:59:40 : no methods have determined process 28817 to be in a monitored family
02/09/15 10:59:40 : no methods have determined process 28819 to be in a monitored family
02/09/15 10:59:40 : no methods have determined process 28822 to be in a monitored family
02/09/15 10:59:40 : no methods have determined process 28823 to be in a monitored family
02/09/15 10:59:40 : no methods have determined process 28824 to be in a monitored family
02/09/15 10:59:40 : no methods have determined process 28825 to be in a monitored family
02/09/15 10:59:40 : no methods have determined process 28826 to be in a monitored family
02/09/15 10:59:40 : no methods have determined process 28827 to be in a monitored family
02/09/15 10:59:40 : no methods have determined process 28853 to be in a monitored family
02/09/15 10:59:40 : no methods have determined process 28870 to be in a monitored family
02/09/15 10:59:40 : no methods have determined process 28871 to be in a monitored family
02/09/15 10:59:40 : no methods have determined process 28873 to be in a monitored family
02/09/15 10:59:40 : no methods have determined process 28939 to be in a monitored family
02/09/15 10:59:40 : no methods have determined process 28944 to be in a monitored family
02/09/15 10:59:40 : no methods have determined process 29056 to be in a monitored family
02/09/15 10:59:40 : ...snapshot complete
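
To make a long ProcLog snapshot like the one above easier to read, here is a rough sketch in Python that groups the process IDs by the two message types that appear in it. The name summarize_proclog and both regular expressions are hypothetical and are derived only from the lines quoted in this post.

```python
import re

# Patterns assumed from the ProcLog lines quoted above.
EXITED_RE = re.compile(
    r": process (\d+) \(not in monitored family\) has exited$"
)
UNTRACKED_RE = re.compile(
    r": no methods have determined process (\d+) to be in a monitored family$"
)


def summarize_proclog(lines):
    """Collect PIDs reported as exited and PIDs not placed in any family."""
    summary = {"exited": [], "untracked": []}
    for line in lines:
        line = line.strip()
        match = EXITED_RE.search(line)
        if match:
            summary["exited"].append(int(match.group(1)))
            continue
        match = UNTRACKED_RE.search(line)
        if match:
            summary["untracked"].append(int(match.group(1)))
    return summary


if __name__ == "__main__":
    # Path taken from this post; adjust it for your installation.
    with open("/var/log/condor/ProcLog") as log:
        result = summarize_proclog(log)
    print("exited:", len(result["exited"]), result["exited"])
    print("untracked:", len(result["untracked"]), result["untracked"])
```

Checking whether the PID from the SchedLog "exited unexpectedly" line (28813 here) appears in the "exited" list is a quick way to line the two logs up.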