
Re: [HTCondor-users] Attempts to connect ARC to Condor, causing condor_schedd to crash



On Feb 9, 2015, at 4:09 AM, Iain Bradford Steers <iain.steers@xxxxxxx> wrote:

Hello all,

We're currently in the process of getting ARC CE working with HTCondor.

We can submit jobs through ARC CE, and when we query them with arcstat they show as Accepted:
arctest -c <host>.cern.ch -J 1

[]# arcstat -a
Job: gsiftp://<host>.cern.ch:2811/jobs/CZ3NDmnjyelniof7oo6yBzRnABFKDmABFKDm1qGKDmABFKDmnSpOan
 Name: arctest1
 State: Accepted

Status of 1 jobs was queried, 1 jobs returned information

However, the job never gets any further than this.

After some debugging and log checking, it appears that condor_schedd is crashing after a submit.

A look at /var/spool/job.helper.errors gives the following output:

-- Failed to fetch ads from: <ip:39890> : <host>.cern.ch
CEDAR:6001:Failed to connect to <ip:39890>
Non-zero exit status returned by /usr/bin/condor_q
[2015-02-09 08:30:09] scan-condor-job: lrms_list_jobs failed

A look at the troubleshooting guide appears to indicate that ALLOW_READ might be behind this.

However, as per the guide it isn't defined:

# condor_config_val -v ALLOW_READ
Not defined: ALLOW_READ

We are currently running condor-8.3.2-288596.x86_64 and 4.2.0-1 of ARC CE.

Any suggestions or help you could provide would be much appreciated.

Regards,

Iain

Other relevant logs:

/var/log/condor/SchedLog gives the following:

02/09/15 10:59:34 (pid:6762) Received a superuser command
02/09/15 10:59:34 (pid:6762) Number of Active Workers 0
02/09/15 10:59:35 (pid:6491) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
02/09/15 10:59:36 (pid:6491) TransferQueueManager upload 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
02/09/15 10:59:36 (pid:6491) TransferQueueManager download 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
02/09/15 10:59:37 (pid:6508) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
02/09/15 10:59:37 (pid:6508) TransferQueueManager upload 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
02/09/15 10:59:37 (pid:6508) TransferQueueManager download 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
02/09/15 10:59:39 (pid:6541) procd (pid = 28813) exited unexpectedly with status 256
02/09/15 10:59:39 (pid:6541) attempting to restart the Procd

/var/log/condor/ProcLog has the following events:

02/09/15 10:59:40 : ProcAPI: new boottime = 1421741930; old_boottime = 1421741930; /proc/stat boottime = 1421741930; /proc/uptime boottime = 1421741930
02/09/15 10:59:40 : process 28813 (not in monitored family) has exited
02/09/15 10:59:40 : process 28812 (not in monitored family) has exited
02/09/15 10:59:40 : process 28811 (not in monitored family) has exited
02/09/15 10:59:40 : process 28632 (not in monitored family) has exited
02/09/15 10:59:40 : process 28630 (not in monitored family) has exited
02/09/15 10:59:40 : process 28629 (not in monitored family) has exited
02/09/15 10:59:40 : process 28612 (not in monitored family) has exited
02/09/15 10:59:40 : process 28611 (not in monitored family) has exited
02/09/15 10:59:40 : process 28610 (not in monitored family) has exited
02/09/15 10:59:40 : process 28609 (not in monitored family) has exited
02/09/15 10:59:40 : process 28608 (not in monitored family) has exited
02/09/15 10:59:40 : process 28607 (not in monitored family) has exited
02/09/15 10:59:40 : process 28606 (not in monitored family) has exited
02/09/15 10:59:40 : process 28603 (not in monitored family) has exited
02/09/15 10:59:40 : process 28602 (not in monitored family) has exited
02/09/15 10:59:40 : process 28601 (not in monitored family) has exited
02/09/15 10:59:40 : no methods have determined process 28816 to be in a monitored family
02/09/15 10:59:40 : no methods have determined process 28817 to be in a monitored family
02/09/15 10:59:40 : no methods have determined process 28819 to be in a monitored family
02/09/15 10:59:40 : no methods have determined process 28822 to be in a monitored family
02/09/15 10:59:40 : no methods have determined process 28823 to be in a monitored family
02/09/15 10:59:40 : no methods have determined process 28824 to be in a monitored family
02/09/15 10:59:40 : no methods have determined process 28825 to be in a monitored family
02/09/15 10:59:40 : no methods have determined process 28826 to be in a monitored family
02/09/15 10:59:40 : no methods have determined process 28827 to be in a monitored family
02/09/15 10:59:40 : no methods have determined process 28853 to be in a monitored family
02/09/15 10:59:40 : no methods have determined process 28870 to be in a monitored family
02/09/15 10:59:40 : no methods have determined process 28871 to be in a monitored family
02/09/15 10:59:40 : no methods have determined process 28873 to be in a monitored family
02/09/15 10:59:40 : no methods have determined process 28939 to be in a monitored family
02/09/15 10:59:40 : no methods have determined process 28944 to be in a monitored family
02/09/15 10:59:40 : no methods have determined process 29056 to be in a monitored family
02/09/15 10:59:40 : ...snapshot complete

The log snippets you provided don’t show what could be going wrong with the schedd, though a crash is a likely explanation for the error you’re getting from condor_q. If the schedd crashes, there will be an entry in the MasterLog giving the exit status. When the schedd is restarted, it will print a banner like the following. Right above that banner will be the last things it logged before exiting.

01/28/15 08:09:06 Setting maximum file descriptors to 4096.
01/28/15 08:09:06 ******************************************************
01/28/15 08:09:06 ** condor_schedd (CONDOR_SCHEDD) STARTING UP
01/28/15 08:09:06 ** SubsystemInfo: name=SCHEDD type=SCHEDD(5) class=DAEMON(1)
01/28/15 08:09:06 ** Configuration: subsystem:SCHEDD local:<NONE> class:DAEMON
01/28/15 08:09:06 ** $CondorVersion: 8.2.7 Jan 23 2015 BuildID: 295090 $
01/28/15 08:09:06 ** $CondorPlatform: x86_64_RedHat6 $
01/28/15 08:09:06 ** PID = 22424
01/28/15 08:09:06 ** Log last touched 1/28 08:08:34
01/28/15 08:09:06 ******************************************************

If you can send me those bits of log, that should help explain what’s going wrong.
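
If it helps to pull those lines out automatically, here is a minimal Python sketch. It assumes the schedd log is at /var/log/condor/SchedLog (the path quoted above) and that each restart is marked by the "STARTING UP" banner line shown in the example. The function name and the default of 20 context lines are illustrative choices, not part of HTCondor.

```python
from collections import deque

# Marker text taken from the example banner above.
BANNER = "** condor_schedd (CONDOR_SCHEDD) STARTING UP"


def lines_before_restarts(lines, context=20):
    """For each startup banner, return the `context` lines logged just before it.

    The returned context may include the row of asterisks and the
    'Setting maximum file descriptors' line that precede the banner.
    """
    recent = deque(maxlen=context)  # rolling window of the most recent lines
    results = []
    for line in lines:
        line = line.rstrip("\n")
        if BANNER in line:
            results.append(list(recent))
            recent.clear()
        else:
            recent.append(line)
    return results


if __name__ == "__main__":
    # Assumed log location; adjust to match the LOG setting on your host.
    with open("/var/log/condor/SchedLog") as log:
        for number, block in enumerate(lines_before_restarts(log), start=1):
            print(f"--- last lines before restart {number} ---")
            print("\n".join(block))
```

Each printed block is what the schedd logged immediately before one restart. The exit status itself still has to be looked up in the MasterLog.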

Are you able to successfully submit and run jobs by submitting directly to HTCondor locally?
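
One way to run that local check is with a trivial sleep job. The sketch below only writes a submit description file; the file name localtest.sub, the /bin/sleep executable and the 60-second duration are assumptions chosen for illustration, so adapt them to your site.

```python
def sleep_submit_description(seconds=60, prefix="localtest"):
    """Build a minimal vanilla-universe submit description for a sleep job."""
    lines = [
        "universe = vanilla",
        "executable = /bin/sleep",   # assumed to exist on the execute nodes
        f"arguments = {seconds}",
        f"output = {prefix}.out",
        f"error = {prefix}.err",
        f"log = {prefix}.log",
        "queue",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical file name; submit it by hand afterwards.
    with open("localtest.sub", "w") as submit_file:
        submit_file.write(sleep_submit_description())
    print("Now run: condor_submit localtest.sub, then check with condor_q")
```

If condor_submit or condor_q fails here in the same way, the problem is in the HTCondor setup itself rather than in the ARC CE integration.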

Thanks and regards,
Jaime Frey
UW-Madison HTCondor Project