
Re: [HTCondor-users] Suddenly all jobs stuck in idle ("Can't receive eom from schedd")



Dear John,

Indeed, somehow some nodes (including the head) got updated to version 8.2.3 while most of the rest stayed at 8.0.7. After performing updates all over the pool, it appears that the pool is now working properly.
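
For anyone hitting the same symptom, a quick way to cross-check which versions the daemons in a pool are actually running is sketched below; it assumes the standard HTCondor command-line tools are installed and the collector is reachable:

# Version of the locally installed binaries (run on each node, or via ssh)
condor_version

# Versions advertised by every startd, as recorded by the collector
condor_status -format "%s\t" Machine -format "%s\n" CondorVersion

# Versions advertised by the schedd(s)
condor_status -schedd -format "%s\t" Name -format "%s\n" CondorVersion

Any node whose CondorVersion differs from the schedd's is a candidate for the kind of protocol mismatch TJ describes below.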

Thanks a lot for your help!

Regards,
Nikiforos

On Thu, Oct 16, 2014 at 12:22 AM, John (TJ) Knoeller <johnkn@xxxxxxxxxxx> wrote:
slot1: Can't receive eom from schedd

Sounds to me like a protocol mismatch. What is the HTCondor version of the startd and of the schedd?

-tj


On 10/15/2014 9:59 AM, Nikiforos Nikiforou wrote:
Hello,

I had set up a Condor pool of ~140 slots, consisting of my desktop (call it pchead; 4 slots, 1 listening to keyboard activity) as a Master, Scheduler, Submission AND execution node, and a set of other desktops and VMs as execution nodes. It had been working excellently until this morning; however, I had been away for a few days, so I can't pinpoint exactly when the problem appeared.

There was a network glitch this morning and I restarted the master node. Since then, when I submit my jobs, only the jobs matched to the 3 slots on pchead run (the 4th being reserved for the owner). All other slots refuse to start jobs and stay Idle. The job resource requirements are not likely to be the problem, since I am running the exact same jobs on the same machines. Furthermore, when I run condor_q -better-analyze I get:

212899.000:  Run analysis summary.  Of 138 machines,
      0 are rejected by your job's requirements
      1 reject your job because of their own requirements
      3 match and are already running your jobs
      0 match but are serving other users
    134 are available to run your job

The Requirements expression for your job is:

    ( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) &&
    ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) &&
    ( TARGET.HasFileTransfer )

Your job defines the following attributes:

    DiskUsage = 75000
    ImageSize = 35
    RequestDisk = 75000
    RequestMemory = 1

The Requirements expression for your job reduces to these conditions:

         Slots
Step    Matched  Condition
-----  --------  ---------
[0]         138  TARGET.Arch == "X86_64"
[1]         138  TARGET.OpSys == "LINUX"
[3]         138  TARGET.Disk >= RequestDisk
[5]         138  TARGET.Memory >= RequestMemory
[7]         138  TARGET.HasFileTransfer


Therefore I would conclude that the job resource requirements are not the issue.

From my investigations, everything points to a network or misconfiguration issue; however, I have been unable to pinpoint where the problem is. For completeness, I should mention that all machines are within a private network, inaccessible from the outside world, so it is safe enough to disable the firewall to allow free communication between the nodes. Indeed, I have done just that, but it does not seem to fix the problem: nodes with the firewall up and nodes with the firewall down exhibit the same behavior, even with the head node's (pchead) firewall down. It could be that the admins have modified the network somehow to restrict traffic, but I have not seen any relevant announcement. In addition, some quick scans I performed show that some ports, including 9618, are accessible on the head node.
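
As a sketch only, a few connectivity checks that might help narrow this down; pchead, pcnodevm00, 9618 and 43045 are simply the names and ports that appear elsewhere in this message:

# On pcnodevm00: which collector does this node think it should report to?
condor_config_val COLLECTOR_HOST

# From pcnodevm00: is the collector port on the head node reachable?
nc -zv pchead 9618

# From pchead: is the startd's advertised command port on the execute node reachable?
# (43045 is the port shown in the StartLog excerpt further down)
nc -zv pcnodevm00 43045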

For simplicity, I removed all but one execution node (call it "pcnodevm00", with two slots) by stopping the daemons on all other machines and removing their hostnames from ALLOW_WRITE in the head node's configuration. For the head node, I also set START to FALSE to avoid starting any jobs there.
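
As a sketch of how one might sanity-check a stripped-down setup like this with the standard tools (host names and settings as used above):

# On pchead: confirm the head node will not start jobs itself
condor_config_val START

# On pchead: confirm which hosts are still allowed to advertise/write to the pool
condor_config_val ALLOW_WRITE

# Tell the running daemons to re-read the changed configuration
condor_reconfig
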
Looking at the log files, I see messages like the following in pchead's CollectorLog:

10/15/14 16:39:58 StartdPvtAd   : Inserting ** "< slot1@pcnodevm00 , XXX.XXX.XXX.XXX >"
10/15/14 16:39:58 stats: Inserting new hashent for 'StartdPvt':'slot1@pcnodevm00':'XXX.XXX.XXX.XXX'

which look fine. The daemons on pcnodevm00 appear to be running (per pstree), and the StartLog on that node looks fine, ending with:
10/15/14 16:40:18 slot1: Changing activity: Benchmarking -> Idle
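
As a cross-check from the head node's side (a sketch; the Machine value may need to be the fully qualified hostname rather than the short placeholder used here):

# On pchead: is the startd on pcnodevm00 in the pool, and in which state/activity?
condor_status -constraint 'Machine == "pcnodevm00"'

# On pcnodevm00: are condor_master and condor_startd actually running?
pgrep -fl condor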


I then proceed to submit 6 jobs. As expected, all of them get stuck in the Idle state. Looking at the StartLog on pcnodevm00 again, I see messages like the following (where XXX.XXX.XXX.XXX is pcnodevm00's IP address):


10/15/14 16:52:03 slot1: Can't receive eom from schedd
10/15/14 16:52:03 slot1: State change: claiming protocol failed
10/15/14 16:52:03 slot1: Changing state: Unclaimed -> Owner
10/15/14 16:52:03 slot1: State change: IS_OWNER is false
10/15/14 16:52:03 slot1: Changing state: Owner -> Unclaimed
10/15/14 16:52:04 slot2: Can't receive eom from schedd
10/15/14 16:52:04 slot2: State change: claiming protocol failed
10/15/14 16:52:04 slot2: Changing state: Unclaimed -> Owner
10/15/14 16:52:04 slot2: State change: IS_OWNER is false
10/15/14 16:52:04 slot2: Changing state: Owner -> Unclaimed
10/15/14 16:52:04 Error: can't find resource with ClaimId (<XXX.XXX.XXX.XXX:43045>#1413383989#2#...)
10/15/14 16:52:04 Error: can't find resource with ClaimId (<XXX.XXX.XXX.XXX:43045>#1413383989#1#...)
10/15/14 16:53:03 slot1: match_info called
10/15/14 16:53:03 slot1: Received match <XXX.XXX.XXX.XXX:43045>#1413383989#3#...
10/15/14 16:53:03 slot1: State change: match notification protocol successful
10/15/14 16:53:03 slot1: Changing state: Unclaimed -> Matched
10/15/14 16:53:03 slot2: match_info called
10/15/14 16:53:03 slot2: Received match <XXX.XXX.XXX.XXX:43045>#1413383989#4#...
10/15/14 16:53:03 slot2: State change: match notification protocol successful
10/15/14 16:53:03 slot2: Changing state: Unclaimed -> Matched
10/15/14 16:53:03 slot1: Can't receive eom from schedd
10/15/14 16:53:03 slot1: State change: claiming protocol failed
10/15/14 16:53:03 slot1: Changing state: Matched -> Owner
10/15/14 16:53:03 slot1: State change: IS_OWNER is false
10/15/14 16:53:03 slot1: Changing state: Owner -> Unclaimed
10/15/14 16:53:03 slot2: Can't receive eom from schedd
10/15/14 16:53:03 slot2: State change: claiming protocol failed
10/15/14 16:53:03 slot2: Changing state: Matched -> Owner


Can anybody deduce from the "Can't receive eom from schedd" error where exactly I am messing up the configuration?

Regards,
Nikiforos


_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/



--
Nikiforos K. Nikiforou
CERN Bat. 51/1-023
CH-1211 Genève 23
Switzerland
T: +41 76 487 9495 / 16 9495
Nikiforos.Nikiforou<at>cern.ch