Re: [Condor-users] i have a problem

Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

Hi friends,

I read on the Condor list about the situation quoted below in this mail. I have the same problem, but I could not find what the solution was. Can somebody help me?

best regards,

victor hinojosa

> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx
> [mailto:condor-users-bounces@xxxxxxxxxxx]On Behalf Of Matt Hope
> Sent: Friday, December 09, 2005 3:57 AM
> To: Condor-Users Mail List
> Subject: Re: [Condor-users] loadavg thread died, restarting. (exit
> code=2)
>
>
> On 12/8/05, Orchard, Bob <Robert.Orchard@xxxxxxxxxxxxxx> wrote:
> >
> > Running on Windows 2000. Condor client version 6.6.10.
>
> On the 64 bit windows platform this is a known bug (it is to do with
> adding performance counters) though in that instance you get an exit
> code of 1 not 2 so it appears to be getting past initializing the WMI
> perf counters query but is not able to add a counter.
>
> Definitely keep on 6.6.10 since on 6.6.8 and below this would cause a
> massive memory leak in your startd.
>
> > When a job is NOT running I get the following messages every 5 minutes or so.
> >
> > 11/30 10:23:22 loadavg thread died, restarting. (exit code=2)
> > 11/30 10:23:27 no loadavg samples this minute, maybe thread died???
>
> This suggests it is failing to add a counter to "\\System\\Processor
> Queue Length" but this is likely to not be specific to that counter
> and more likely to be a general issue polling the perf counters.
>
> Is your "Windows Management Instrumentation" service running (is it
> set to disabled?)

This service is running

>
> > 12/6 11:24:03 ProcFamily::currentfamily: ERROR: family_size is 0
> > 12/6 11:24:03 WARNING: No processes found in starter's family
>
> This is not necessarily a critical error; if your job creates a lot
> of short-lived processes it could just have stale data.
>

The jobs submitted are quite long and each runs for 40 minutes to 2 hours

>
> > Has anyone had this problem or does anyone know what the source of the
> > problem could be? It seems specific to my machine and not others in our pool.
> >
> > Some supplemental information. My machine sometimes also allows
> > more than 1 job to be scheduled at the same time. So I end up with many
> > sub-directories under condor/execute. I've had up to 65 directories
> > created and many of these were the same job running at the same time.
> > Output from StarterLog file below shows the same job being started within
> > 30 seconds and both running at the same time. This is not
> > supposed to happen.
>
> That looks bad. Are you running multiple startd's on this
> machine by accident?

No there are not multiple startd's but there are many condor_exec.exe processes

>
> What does your process list show for executables starting
> with "condor_"?

Not running right now and I didn't capture that but I'm quite certain that there
was just the master, schedd, and startd plus the condor_exec processes

>
> > A second bit of information that may be relevant. It is possible that some
> > time ago when I was cleaning up user accounts, that I deleted the condor_reuse_vm1
> ...
> > I've installed and uninstalled condor several times to try to get rid of this
> > unusual problem
>
> Sounds very unhappy. Is the machine suffering from other
> issues/symptoms? Is it SMP/Hyperthreaded?

No the machine behaves quite well.

>
> Have you considered the 'nuclear' option of reinstalling from
> the OS up...
>

I've thought that this might be the only option but it is a significant
effort to get back to my current state and I'll only do this if I
think this workstation is critical to the condor pool. I was hoping
for a simple fix ...

> Matt

My submit (.sub) file is the following:

universe = vanilla
executable = sim_rebounding_DT.exe
requirements = Memory >= 128
rank = kflops

should_transfer_files = YES
when_to_transfer_output = ON_EXIT

transfer_input_files = meas.txt

error = FIR_SRA.err
log = FIR_SRA.log

output = FIR_SRA_cmeans_432.txt
arguments = 0 0
queue

output = FIR_SRA_cmeans_422.txt
arguments = 0 1
queue
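
A note for readers reusing this submit file: the repeated output/arguments/queue stanzas can be generated rather than typed by hand. The following is only a minimal sketch and not part of the original message. The helper name `build_submit` and the `HEADER`/`RUNS` variables are hypothetical; the submit commands and the two output/argument pairs are copied from the file above.

```python
def build_submit(header_lines, runs):
    """Return submit-file text: the shared header, then one stanza per run.

    `runs` is a list of (output_file, arguments) pairs. Each pair gets its
    own `queue` statement, so each pair becomes one job in the cluster.
    """
    parts = list(header_lines)
    for output_file, arguments in runs:
        parts.append("")  # blank line between stanzas, as in the post
        parts.append(f"output = {output_file}")
        parts.append(f"arguments = {arguments}")
        parts.append("queue")
    return "\n".join(parts) + "\n"


# Shared settings, copied from the submit file in the post.
HEADER = [
    "universe = vanilla",
    "executable = sim_rebounding_DT.exe",
    "requirements = Memory >= 128",
    "rank = kflops",
    "",
    "should_transfer_files = YES",
    "when_to_transfer_output = ON_EXIT",
    "",
    "transfer_input_files = meas.txt",
    "",
    "error = FIR_SRA.err",
    "log = FIR_SRA.log",
]

# The two (output file, arguments) pairs from the post.
RUNS = [
    ("FIR_SRA_cmeans_432.txt", "0 0"),
    ("FIR_SRA_cmeans_422.txt", "0 1"),
]

submit_text = build_submit(HEADER, RUNS)
print(submit_text)
```

Writing `submit_text` to a file and passing that file to condor_submit would queue one job per entry in `RUNS`.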

When I submit this file to the Condor pool, the condor_status command shows this:

C:\thesis\simulation>condor_status

Name          OpSys       Arch   State      Activity   LoadAv Mem   ActvtyTime

vm1@dtc-mvill WINNT51     INTEL  Unclaimed  Idle       0.000   251  0+02:08:28
vm2@dtc-mvill WINNT51     INTEL  Unclaimed  Idle       0.000   251  0+02:08:29
dtc-snaranjo. WINNT51     INTEL  Unclaimed  Idle       0.030   478  0+02:03:27
dtc-vhinojosa WINNT51     INTEL  Claimed    Busy       0.000  1015  0+00:02:21
id-vhinojosa. WINNT51     INTEL  Unclaimed  Idle       0.840   254  0+02:05:22

                     Machines Owner Claimed Unclaimed Matched Preempting

       INTEL/WINNT51        5     0       1         4       0          0

               Total        5     0       1         4       0          0

dtc-vhinojosa is running one of the jobs. According to my constraints, the next machine should be dtc-snaranjo, but I don't know why it doesn't run the job.

I ran condor_q -analyze and the result is the following:

C:\thesis\simulation>condor_q -analyze

-- Submitter: dtc-vhinojosa : <10.0.1.171:4685> : dtc-vhinojosa
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
---
045.000: Request is being serviced

---
045.001: Run analysis summary. Of 5 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own requirements
      1 match, but are serving users with a better priority in the pool
      4 match, but reject the job for unknown reasons
      0 match, but will not currently preempt their existing job
      0 are available to run your job
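
For readers who want to tabulate a summary like this one, here is a minimal sketch (not part of the original message) that reads the category counts and checks that they add up to the number of machines. The function name `parse_analysis` is hypothetical, and the sample text is transcribed from the output above.

```python
import re

# Summary text transcribed from the condor_q -analyze output in the post.
ANALYSIS = """\
045.001: Run analysis summary. Of 5 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own requirements
      1 match, but are serving users with a better priority in the pool
      4 match, but reject the job for unknown reasons
      0 match, but will not currently preempt their existing job
      0 are available to run your job
"""


def parse_analysis(text):
    """Return (total_machines, {category description: count})."""
    total = int(re.search(r"Of (\d+) machines", text).group(1))
    counts = {}
    for line in text.splitlines():
        # Category lines are indented: "<count> <description>".
        m = re.match(r"\s+(\d+)\s+(.*\S)", line)
        if m:
            counts[m.group(2)] = int(m.group(1))
    return total, counts


total, counts = parse_analysis(ANALYSIS)
for description, count in counts.items():
    if count:
        print(f"{count} of {total}: {description}")
print("categories cover every machine:", sum(counts.values()) == total)
```

Here the counts cover all 5 machines, and 4 of them fall in the "unknown reasons" category, which is the one the question below is about.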

Can anyone help me understand why Condor shows the message "reject the job for unknown reasons", or where I should look for the mistake?

thanks for your help

regards,

victor

From: condor-users-bounces@xxxxxxxxxxx on behalf of David A. Kotz
Sent: Thu 01/06/2006 02:40 p.m.
To: Condor-Users Mail List
Subject: Re: [Condor-users] i have a problem

Victor,

The first step is to use the -analyze switch to condor_q. Try using
this command on the submit node:

   condor_q -analyze 26.0

and also this one (if it works in Windows):

   condor_q -better-analyze 26.0

Those commands should give you some indication of why job 26.0 is not
starting.

If you get nothing useful from those commands, compare long listings of
the jobs and the machines:

   condor_q -l 26.0
   condor_status -l dtc-mvill

to see if you can spot incompatibilities between the job's requirements
and the machine's requirements.
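
One way to do that comparison, as a minimal sketch under stated assumptions: save the two long listings as text, parse the "Attribute = value" lines, and list the machine attributes that the job's Requirements expression mentions. The function names and the two sample ads below are hypothetical (real listings are much longer); only Memory = 251 and OpSys = "WINNT51" echo values shown in this thread. The sketch does not evaluate ClassAd expressions; it only puts the relevant values side by side.

```python
import re


def parse_classad(text):
    """Parse 'Attribute = value' lines from a long listing into a dict."""
    ad = {}
    for line in text.splitlines():
        name, sep, value = line.partition(" = ")
        if sep:
            ad[name.strip()] = value.strip()
    return ad


def referenced_attributes(expression, machine_ad):
    """Return the machine attributes whose names appear in `expression`.

    ClassAd attribute names are case-insensitive, so compare in lower case.
    """
    words = {w.lower() for w in re.findall(r"[A-Za-z_]\w*", expression)}
    return {name: value for name, value in machine_ad.items()
            if name.lower() in words}


# Hypothetical excerpts standing in for `condor_q -l 26.0` and
# `condor_status -l dtc-mvill` output saved as text.
JOB_TEXT = '''Requirements = (Memory >= 128) && (OpSys == "WINNT51")
Rank = kflops'''
MACHINE_TEXT = '''Name = "vm1@dtc-mvill"
OpSys = "WINNT51"
Memory = 251
Start = TRUE'''

job = parse_classad(JOB_TEXT)
machine = parse_classad(MACHINE_TEXT)

print("Job requirements:", job["Requirements"])
for name, value in sorted(referenced_attributes(job["Requirements"], machine).items()):
    print(f"  machine {name} = {value}")
print("Machine Start:", machine.get("Start"))
```

The same helper can be pointed the other way, at the machine's Requirements or Start expression and the job ad, to see which job attributes the machine's policy depends on.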

- dave

Víctor Hinojosa wrote:
> I have a Condor pool. The summary is the following:
>
> Name          OpSys       Arch   State      Activity   LoadAv Mem   ActvtyTime
> vm1@dtc-mvill WINNT51     INTEL  Unclaimed  Idle       0.000   508  0+02:08:29
> vm2@dtc-mvill WINNT51     INTEL  Unclaimed  Idle       0.330   508  0+02:08:30
> dtc-vhinojosa WINNT51     INTEL  Unclaimed  Idle       0.000  1015  0+00:08:06
> id-vhinojosa. WINNT51     INTEL  Unclaimed  Idle       0.010   254  0+02:08:09
>                      Machines Owner Claimed Unclaimed Matched Preempting
>        INTEL/WINNT51        4     0       0         4       0          0
>                Total        4     0       0         4       0          0
>
> I submitted a task with condor_submit and checked the status of my job with the condor_q command.
>
> -- Submitter: dtc-vhinojosa : <10.0.1.171:2934> : dtc-vhinojosa
> ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
>   26.0   Victor          6/1 13:02   0+00:00:00 I 0   0.3 sim_rebounding_DT
>   26.1   Victor          6/1 13:02   0+00:00:00 I 0   0.3 sim_rebounding_DT
>   26.2   Victor          6/1 13:02   0+00:00:00 I 0   0.3 sim_rebounding_DT
>   26.3   Victor          6/1 13:02   0+00:00:00 I 0   0.3 sim_rebounding_DT
>   26.4   Victor          6/1 13:02   0+00:00:00 I 0   0.3 sim_rebounding_DT
>   26.5   Victor          6/1 13:02   0+00:00:00 I 0   0.3 sim_rebounding_DT
>   26.6   Victor          6/1 13:02   0+00:00:00 I 0   0.3 sim_rebounding_DT
>   26.7   Victor          6/1 13:02   0+00:00:00 I 0   0.3 sim_rebounding_DT
>   26.8   Victor          6/1 13:02   0+00:00:00 I 0   0.3 sim_rebounding_DT
>   26.9   Victor          6/1 13:02   0+00:00:00 I 0   0.3 sim_rebounding_DT
> 10 jobs; 10 idle, 0 running, 0 held
>
> When I installed the Condor pool, I set up all machines with the option "always run Condor jobs", so I don't know what is happening. Can somebody help me, or tell me where I can look for the mistake?
>
> regards,
>
>
> victor hinojosa
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> Condor-users mailing list
> Condor-users@xxxxxxxxxxx
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
