
Re: [Condor-users] Condor on X86_64 no run works



Hi.

Kewley, J (John) wrote:
> Some thoughts:
>
> 1. You mention "flock". You shouldn't need this if you just have a
> single pool.
>   
Yes, I know, but there are 2 pools.
I know nothing about the second pool, only that I have to
enable flocking.


> 2. I notice you have vm1, vm2 ... vm5 mentioned, that implies more than
> 4 processors
>    per node, you might have hyperthreading turned on, in which case
> condor will register
>    (possibly) 8 slots per node.
>   
OK, 2 quad-core processors = 8 CPUs per node


> 3. Have you tried
>    condor_q -anal
>    or
>    condor_q -better-anal
>    to see why it isn't matching?
>   
gargamel:~ # condor_q -analyze
Error: Could not connect to negotiator ((null))

This worked before, but now it doesn't. I'm searching Google for the cause.

But condor_q shows:

-- Submitter: gargamel.localdomain : <XXXXXXXXXX:38974> : gargamel.localdomain
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
  41.0   condor         11/5  13:20   0+00:00:00 I  0   9.8  for
  41.1   condor         11/5  13:20   0+00:00:00 I  0   9.8  for
  41.2   condor         11/5  13:20   0+00:00:00 I  0   9.8  for
  41.3   condor         11/5  13:20   0+00:00:00 I  0   9.8  for
  41.4   condor         11/5  13:20   0+00:00:00 I  0   9.8  for
  42.0   condor         11/5  15:21   0+00:00:00 I  0   9.8  loop
  42.1   condor         11/5  15:21   0+00:00:00 I  0   9.8  loop
  42.2   condor         11/5  15:21   0+00:00:00 I  0   9.8  loop
  42.3   condor         11/5  15:21   0+00:00:00 I  0   9.8  loop
  42.4   condor         11/5  15:21   0+00:00:00 I  0   9.8  loop




> 4. You do a "queue 5", but all the jobs write to the same error and
> output files,
>    this may not be what is desired. To write to different ones, use
> something like
>    output = loop$(PROCESS).out
>    error = loop$(PROCESS).err
>   
OK, thanks.

After submitting loop.sub I have 5 .err files and 5 .out files,
all of them empty.
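
For reference, here is a minimal sketch of the submit file with suggestions 4 and 5 applied, created from the shell. The executable and Requirements lines come from my original submit file; the names loop.sub and loop.log follow the messages in this thread. Writing the file with a heredoc is only one way to do it, and the condor_submit step is left commented out because it assumes condor_submit is on the PATH.

```shell
#!/bin/sh
# Write a submit description file in which each of the 5 jobs gets its own
# output and error file, and all jobs share a single user log.
# The quoted 'EOF' keeps the shell from touching $(PROCESS) and the backslash.
cat > loop.sub <<'EOF'
universe     = vanilla
executable   = loop
output       = loop$(PROCESS).out
error        = loop$(PROCESS).err
log          = loop.log
Requirements = (Arch == "INTEL" && OpSys == "LINUX") || \
               (Arch == "X86_64" && OpSys == "LINUX")
queue 5
EOF

# Show the result for a quick visual check.
cat loop.sub

# Submit it (assumes condor_submit is installed and on the PATH):
# condor_submit loop.sub
```

With "queue 5", $(PROCESS) takes the values 0 to 4, which matches the five loopN.out and loopN.err files I see.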


> 5. I can't see a
>    log = loop.log
>    line, this is useful - have a look in there to see what is produced.
>    [Note: don't use $(PROCESS) for this one]
>   
OK, thanks

000 (042.000.000) 11/05 15:21:08 Job submitted from host: <MY_IP:38974>
...
000 (042.001.000) 11/05 15:21:08 Job submitted from host: <MY_IP:38974>
...
000 (042.002.000) 11/05 15:21:08 Job submitted from host: <MY_IP:38974>
...
000 (042.003.000) 11/05 15:21:08 Job submitted from host: <MY_IP:38974>
...
000 (042.004.000) 11/05 15:21:08 Job submitted from host: <MY_IP:38974>

> 6. Have a look in the SchedLog of your submit node to see what is in
> there
>   
Here are the last 50 lines after running loop.sub:

11/5 15:18:32 (pid:4513) ******************************************************
11/5 15:18:33 (pid:4513) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
11/5 15:18:33 (pid:4513) ** /home/condor/sbin/condor_schedd
11/5 15:18:33 (pid:4513) ** $CondorVersion: 6.8.6 Sep 13 2007 $
11/5 15:18:33 (pid:4513) ** $CondorPlatform: I386-LINUX_DEBIAN40 $
11/5 15:18:33 (pid:4513) ** PID = 4513
11/5 15:18:33 (pid:4513) ** Log last touched 11/5 15:09:11
11/5 15:18:34 (pid:4513) ******************************************************
11/5 15:18:34 (pid:4513) Using config source: /home/condor/condor_config
11/5 15:18:34 (pid:4513) Using local config sources:
11/5 15:18:34 (pid:4513)    /home/condor/etc/gargamel.local
11/5 15:18:34 (pid:4513) DaemonCore: Command Socket at <MY_IP:38974>
11/5 15:18:35 (pid:4513) History file rotation is enabled.
11/5 15:18:35 (pid:4513)   Maximum history file size is: 20971520 bytes
11/5 15:18:35 (pid:4513)   Number of rotated history files is: 2
11/5 15:18:36 (pid:4513) Sent ad to central manager for condor@localdomain
11/5 15:18:37 (pid:4513) Sent ad to 3 collectors for condor@localdomain
11/5 15:18:41 (pid:4513) GCB: [GCB_connect(17)]<192.168.3.100:9618>: direct connect using _CB_do_connect failed
11/5 15:18:41 (pid:4513) attempt to connect to <192.168.3.100:9618> failed: Transport endpoint is already connected (connect errno = 106). Will keep trying for 20 total seconds (15 to go).

11/5 15:21:11 (pid:4513) DaemonCore: Command received via UDP from host <MY_IP:32788>
11/5 15:21:11 (pid:4513) DaemonCore: received command 421 (RESCHEDULE), calling handler (reschedule_negotiator)
11/5 15:21:11 (pid:4513) Sent ad to central manager for condor@localdomain
11/5 15:21:12 (pid:4513) Sent ad to 3 collectors for condor@localdomain
11/5 15:21:12 (pid:4513) Called reschedule_negotiator()
11/5 15:21:12 (pid:4513) failed to send RESCHEDULE command to negotiator
11/5 15:23:39 (pid:4513) DaemonCore: PERMISSION DENIED to unknown user from host <MY_IP:59285> for command 493 (NEGOTIATE_WITH_SIGATTRS)
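
To pull only the suspicious lines out of a longer SchedLog, a small filter like the one below may help. It is just a sketch: the function name schedlog_errors is my own, the three patterns are the messages that appear in the excerpt above, and the log path has to be passed in because I am not assuming where it lives.

```shell
#!/bin/sh
# Print the SchedLog lines that point at negotiator or connection trouble.
# The patterns are taken from the log excerpt in this message; extend as needed.
schedlog_errors() {
    grep -E 'PERMISSION DENIED|failed to send RESCHEDULE|attempt to connect to' "$1"
}

# When run as a script, filter the log file given as the first argument.
if [ "$#" -ge 1 ]; then
    schedlog_errors "$1"
fi
```

Saved under a hypothetical name such as schedlog_errors.sh, it would be run as "sh schedlog_errors.sh /path/to/SchedLog".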

> 7. Are these nodes on a cluster, i.e. on a private network, if so then
> you
>    will need full connectivity between all submit nodes and all execute
> nodes.
>    See paper and presentation on
>    http://epubs.cclrc.ac.uk/work-details?w=34452
>    for more details
>
> Good luck
>   
I'm reading it now, thanks.
> JK
>
>   
>> -----Original Message-----
>> From: condor-users-bounces@xxxxxxxxxxx 
>> [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of jmferrer
>> Sent: Monday, November 05, 2007 12:34 PM
>> To: condor-users@xxxxxxxxxxx
>> Subject: [Condor-users] Condor on X86_64 no run works
>>
>> Hi.
>>
>> I'm trying to build a cluster with:
>>
>>     OpenSuse 10.2
>>     Condor-6.8.6
>>     Kernel suse 2.6.18.2-34-default
>>
>>
>> System:
>>
>> 1 central manager, 1 CPU (P4): does not execute jobs, flocking enabled
>> 19 nodes, each with 2 quad-core Intel X86_64 processors
>>
>> I share /home from the central manager to all nodes over NFS.
>>
>> If I run condor_status
>>
>> gargamel:/home/condor # condor_status
>>
>> Name          OpSys       Arch   State      Activity   LoadAv Mem   ActvtyTime
>>
>> vm1@smurf0    LINUX       X86_64 Owner      Idle       0.000   996  0+00:06:45
>> vm2@smurf0    LINUX       X86_64 Unclaimed  Idle       0.000   996  4+23:45:04
>> vm3@smurf0    LINUX       X86_64 Unclaimed  Idle       0.000   996  4+23:45:05
>> vm4@smurf0    LINUX       X86_64 Unclaimed  Idle       0.000   996  4+23:45:07
>> vm5@smurf0    LINUX       X86_64 Unclaimed  Idle       0.000   996  4+23:45:08
>> ..............................
>>                Total    87     1       0        86       0         0        0
>>
>> Some nodes are off.
>>
>> My submit file
>> gargamel:/home/condor # cat /home/pepe/test_condor/loop.submit
>> # automatically generated description file
>> universe = vanilla
>> executable = loop
>> output = loop.out
>> error = loop.err
>> Requirements   = (Arch =="INTEL" && OpSys == "LINUX") || \
>>                  (Arch =="X86_64" && OpSys == "LINUX")
>> queue 5
>>
>>
>>
>>
>> Can somebody show me how to make this work?
>>
>>
>>
>> Sorry for my English, I'm from Almeria IR.
>> _______________________________________________
>> Condor-users mailing list
>> To unsubscribe, send a message to 
>> condor-users-request@xxxxxxxxxxx with a
>> subject: Unsubscribe
>> You can also unsubscribe by visiting
>> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>>
>> The archives can be found at: 
>> https://lists.cs.wisc.edu/archive/condor-users/
>>
>>     