
[Condor-users] condor - guru needed!!!! (testing app)



further update!!

it appears that i've actually gotten my test up and running, and i can
confirm that the test perl apps are running on both the master and client
nodes within my test Condor setup.

however, when i look at the "condor_q" output, it only shows two
processes running at a time, which i imagine equates to one process
running on each machine (the master and the client).

i'd like to have multiple instances running in parallel on both machines.
any ideas/pointers as to how to make this happen?

i should easily be able to have 10-20 of these test apps running in parallel
on each of my test machines...

thanks

-bruce



update...

hi. this is a continuation of my getting a two node Condor setup up and testing it.

i performed a:
  condor_submit stest.sub

i then did:
  condor_q
where i see the queued up test pl scripts. however, they all show as idle
(ST is I) with a run time of zero:
-- Submitter: laptop2.mesa.com : <192.168.1.33:56278> : laptop2.mesa.com
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   9.0   test            7/31 08:47   0+00:00:00 I  0   9.8  stest.pl
   9.1   test            7/31 08:47   0+00:00:00 I  0   9.8  stest.pl
   9.2   test            7/31 08:47   0+00:00:00 I  0   9.8  stest.pl
   9.3   test            7/31 08:47   0+00:00:00 I  0   9.8  stest.pl
   9.4   test            7/31 08:47   0+00:00:00 I  0   9.8  stest.pl
   9.5   test            7/31 08:47   0+00:00:00 I  0   9.8  stest.pl
 .
 .
 .

which seems to imply that something's wrong in my config files...

i'm pretty sure that whatever is wrong is rather simple/subtle!
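
For reference, the ST column in condor_q output is the job state: I is idle, R is running, H is held. A minimal Python sketch for tallying those states from pasted condor_q text is below. It assumes the column layout shown in the listing above (ID, OWNER, two SUBMITTED fields, RUN_TIME, ST, ...); the names job_states and is_job_id are made up for illustration and are not part of Condor.

```python
from collections import Counter

# A few rows copied from the condor_q listing in this message.
SAMPLE = """\
-- Submitter: laptop2.mesa.com : <192.168.1.33:56278> : laptop2.mesa.com
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   9.0   test            7/31 08:47   0+00:00:00 I  0   9.8  stest.pl
   9.1   test            7/31 08:47   0+00:00:00 I  0   9.8  stest.pl
   9.2   test            7/31 08:47   0+00:00:00 I  0   9.8  stest.pl
"""


def is_job_id(field):
    """True for a cluster.proc id such as '9.0' (hypothetical helper)."""
    cluster, _, proc = field.partition(".")
    return cluster.isdigit() and proc.isdigit()


def job_states(text):
    """Count jobs per ST code, assuming ST is the 6th whitespace-separated field."""
    counts = Counter()
    for line in text.splitlines():
        fields = line.split()
        # Skip the submitter banner, the header row, and blank lines.
        if len(fields) >= 6 and is_job_id(fields[0]):
            counts[fields[5]] += 1
    return counts


if __name__ == "__main__":
    for state, count in sorted(job_states(SAMPLE).items()):
        print(state, count)
```

If every job lands under I, nothing has been matched and started yet, which is the situation in the listing above.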

is there a Condor guru that i can talk to for a few minutes about this?

my basic needs are to:
 1) allow any user to submit a job
 2) allow each job to run as fast as possible on the network/machine
 3) allow multiple jobs to run on a given machine at the same time
 4) track which jobs/apps run on which machine

i want to submit a job/app and throw it on the network to run as fast as
possible, which means i want to run multiple apps on the same machine at the
same time... Condor should be great for this, if i could get my head around
how to properly configure it!

thanks

-bruce



hi...

this is a further continuation of my testing with condor.

i've been able to get a sample app running with a 2 node system. i can do
'condor_submit' from both the master/child node and i see both machines.

the condor_config file for both machines is pretty much the sample file,
with limited changes. using the sample, my test apps appear to have a
wait/delay of 5 mins. my goal is to be able to run as many apps as fast as i
possibly can, on the machines in the network.. i'd also like to be able to
see what machines the app(s) are actually running on...

i tried to run the test function listed in the 'condor_config' file, using:

   ##  Replace UWCS_* with TESTINGMODE_* if you wish to do testing mode.

i also used the following:
  StartIdleTime		= 2 * $(MINUTE)
  ContinueIdleTime	=  $(MINUTE)
  MaxSuspendTime		= 1 * $(MINUTE)
  MaxVacateTime		= 1 * $(MINUTE)

in an attempt to try to run as fast as possible during the tests.

my test doesn't run. instead, the StartLog indicates that i have some kind
of permission error. a sample of the StartLog contents is listed below. as i
indicated, the test submit app i'm running ran successfully with the
initial condor_config file, prior to my changes...

any thoughts/suggestions/help would be appreciated!!

thanks

-bruce


sample StartLog contents...
7/30 23:51:40 match_info called
7/30 23:51:40 Received match <192.168.1.33:42714>#1154324088#25
7/30 23:51:40 State change: match notification protocol successful
7/30 23:51:40 Changing state: Unclaimed -> Matched
7/30 23:51:41 DaemonCore: PERMISSION DENIED to unknown user from host <192.168.1.55:33433> for command 442 (REQUEST_CLAIM)
7/30 23:51:41 DaemonCore: PERMISSION DENIED to unknown user from host <192.168.1.55:33062> for command 443 (RELEASE_CLAIM)
7/30 23:53:40 State change: match timed out
7/30 23:53:40 Changing state: Matched -> Owner
7/30 23:53:40 State change: IS_OWNER is false
7/30 23:53:40 Changing state: Owner -> Unclaimed
7/30 23:56:41 DaemonCore: Command received via UDP from host <192.168.1.55:33073>
7/30 23:56:41 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
7/30 23:56:41 match_info called
7/30 23:56:41 Received match <192.168.1.33:42714>#1154324088#27
7/30 23:56:41 State change: match notification protocol successful
7/30 23:56:41 Changing state: Unclaimed -> Matched
7/30 23:56:41 DaemonCore: PERMISSION DENIED to unknown user from host <192.168.1.55:33458> for command 442 (REQUEST_CLAIM)
7/30 23:56:41 DaemonCore: PERMISSION DENIED to unknown user from host <192.168.1.55:33073> for command 443 (RELEASE_CLAIM)
7/30 23:58:41 State change: match timed out
7/30 23:58:41 Changing state: Matched -> Owner
7/30 23:58:41 State change: IS_OWNER is false
7/30 23:58:41 Changing state: Owner -> Unclaimed
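
The pattern in this log is that each match is followed by PERMISSION DENIED on REQUEST_CLAIM from 192.168.1.55, after which the match times out. A rough Python sketch for pulling the denied host and command out of a StartLog excerpt is below. It assumes the line format seen in this sample only, and it rejoins entries that a mail client wrapped onto a second line; the names join_wrapped and denied_commands are hypothetical helpers, not Condor tools.

```python
import re
from collections import Counter

# Entries copied from the StartLog sample above, wrapped the way
# long lines tend to get wrapped when pasted into mail.
SAMPLE = """\
7/30 23:51:40 Changing state: Unclaimed -> Matched
7/30 23:51:41 DaemonCore: PERMISSION DENIED to unknown user from host
<192.168.1.55:33433> for command 442 (REQUEST_CLAIM)
7/30 23:51:41 DaemonCore: PERMISSION DENIED to unknown user from host
<192.168.1.55:33062> for command 443 (RELEASE_CLAIM)
7/30 23:53:40 State change: match timed out
"""

# A log entry starts with "M/D HH:MM:SS " (format assumed from the sample).
TIMESTAMP = re.compile(r"^\d{1,2}/\d{1,2} \d{2}:\d{2}:\d{2} ")
DENIED = re.compile(
    r"PERMISSION DENIED to (?P<user>.+?) from host "
    r"<(?P<host>[\d.]+):(?P<port>\d+)> "
    r"for command (?P<num>\d+) \((?P<name>\w+)\)"
)


def join_wrapped(text):
    """Glue lines without a leading timestamp onto the previous entry."""
    entries = []
    for line in text.splitlines():
        if TIMESTAMP.match(line) or not entries:
            entries.append(line.strip())
        else:
            entries[-1] += " " + line.strip()
    return entries


def denied_commands(text):
    """Count PERMISSION DENIED entries per (host, command name)."""
    counts = Counter()
    for entry in join_wrapped(text):
        match = DENIED.search(entry)
        if match:
            counts[(match.group("host"), match.group("name"))] += 1
    return counts


if __name__ == "__main__":
    for (host, name), count in sorted(denied_commands(SAMPLE).items()):
        print(host, name, count)
```

Grouping by host shows at a glance which machine is being refused, which is the address to check against the host-based authorization settings in condor_config.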