[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Condor-users] ACCESS_VIOLATION under Windows



Hi All,

I'm experiencing a problem setting up a windows box as a condor execute node - specifically to execute java jobs.

I have a windows box running xp sp2. It is purely set up as an execute node. The start deamon picks successfully picks up the job and attempts to execute it. It spawns the condor_starter - but the condor_starter seems to crash with an exception (an ACCESS_VIOLATION).

As you can see in log below, the starter process seems to try to launch java, but this ends in an exception? The starter crashes immediately after that last log. I've confirmed that java exists at the location specified etc.

I assume this might be some sort of windows security issue, but I'm not sure how to debug it. The condor vm user was given rights to execute the java directory - though i'm not sure whether this is enough.

Any help or tips for debugging are most welcome.

-james


Start Log
-------------

7/25 16:04:52 (fd:3) (pid:3636) DC_AUTHENTICATE: setting sock->decode()
7/25 16:04:52 (fd:3) (pid:3636) DC_AUTHENTICATE: allowing an empty message for sock.
7/25 16:04:52 (fd:3) (pid:3636) DC_AUTHENTICATE: Success.
7/25 16:04:52 (fd:3) (pid:3636) DaemonCore: Command received via UDP from host < 172.19.189.3:9629>
7/25 16:04:52 (fd:3) (pid:3636) DaemonCore: received command 60011 (DC_NOP), calling handler (handle_nop())
7/25 16:04:52 (fd:3) (pid:3636) PRIV_CONDOR --> PRIV_CONDOR at ..\src\condor_daemon_core.V6\daemon_core.C:2743
7/25 16:04:52 (fd:3) (pid:3636) Calling Handler <HandleDC_SERVICEWAITPIDS()> for Signal 60009 <DC_SERVICEWAITPIDS>
7/25 16:04:52 (fd:3) (pid:3636) KEYCACHEX: removing session hp-test-02:3636:1185343491:6 for <172.19.189.3:9618 >
7/25 16:04:52 (fd:3) (pid:3636) DaemonCore: pid 3940 exited with status -1073741819, invoking reaper 1 <reaper>
7/25 16:04:52 (fd:3) (pid:3636) Starter pid 3940 died on signal -1073741819 (exception ACCESS_VIOLATION)
7/25 16:04:52 (fd:3) (pid:3636) Entering ProcFamily::hardkill
7/25 16:04:52 (fd:3) (pid:3636) PRIV_CONDOR --> PRIV_CONDOR at ..\src\condor_c++_util\killfamily.C:274
7/25 16:04:52 (fd:3) (pid:3636) Destroying Daemon object:
7/25 16:04:52 (fd:3) (pid:3636) Type: 1 (any), Name: (null), Addr: < 172.19.189.3:9611>
7/25 16:04:52 (fd:3) (pid:3636) FullHost: (null), Host: (null), Pool: (null), Port: -1
7/25 16:04:52 (fd:3) (pid:3636) IsLocal: N, IdStr: (null), Error: (null)
7/25 16:04:52 (fd:3) (pid:3636)  --- End of Daemon object info ---
7/25 16:04:52 (fd:3) (pid:3636) ProcAPI: pid # 3940 was not found (OpenProcess err=1308)
7/25 16:04:52 (fd:3) (pid:3636) ProcAPI: pid # 3940 was not found (OpenProcess err=1308)
7/25 16:04:52 (fd:3) (pid:3636) ProcFamily: parent: 3940 family:
7/25 16:04:52 (fd:3) (pid:3636) ProcFamily: alive_cpu_user = 0, exited_cpu = 0, max_image = 3624k
7/25 16:04:52 (fd:3) (pid:3636) PRIV_CONDOR --> PRIV_CONDOR at ..\src\condor_c++_util\killfamily.C:475
7/25 16:04:52 (fd:3) (pid:3636) Attempting to remove C:\condor\execute\dir_3940 as SuperUser (system)
7/25 16:04:52 (fd:3) (pid:3636) Deleted ProcFamily w/ pid 3940 as parent
7/25 16:04:52 (fd:3) (pid:3636) State change: starter exited
7/25 16:04:52 (fd:3) (pid:3636) Changing activity: Busy -> Idle
7/25 16:04:52 (fd:3) (pid:3636) PRIV_CONDOR --> PRIV_CONDOR at ..\src\condor_daemon_core.V6\daemon_core.C:2743
7/25 16:04:52 (fd:3) (pid:3636) In cancel_timer(), id=66
7/25 16:04:52 (fd:3) (pid:3636) PRIV_CONDOR --> PRIV_CONDOR at ..\src\condor_daemon_core.V6\daemon_core.C:2743
7/25 16:04:52 (fd:3) (pid:3636) In DaemonCore Timeout()
7/25 16:04:52 (fd:3) (pid:3636)

Starter Log
----------------
7/25 16:04:51 (fd:8) (pid:3940) DC_AUTHENTICATE: setting sock->decode()
7/25 16:04:51 (fd:8) (pid:3940) DC_AUTHENTICATE: allowing an empty message for sock.
7/25 16:04:51 (fd:8) (pid:3940) DC_AUTHENTICATE: Success.
7/25 16:04:51 (fd:8) (pid:3940) DaemonCore: Command received via UDP from host < 172.19.189.3:9614>
7/25 16:04:51 (fd:8) (pid:3940) DaemonCore: received command 60011 (DC_NOP), calling handler (handle_nop())
7/25 16:04:51 (fd:8) (pid:3940) PRIV_CONDOR --> PRIV_CONDOR at ..\src\condor_daemon_core.V6\daemon_core.C:2743
7/25 16:04:51 (fd:8) (pid:3940) Calling Handler <HandleDC_SERVICEWAITPIDS()> for Signal 60009 <DC_SERVICEWAITPIDS>
7/25 16:04:51 (fd:8) (pid:3940) DaemonCore: tid 3300 exited with status 1, invoking reaper 2 <FileTransfer::Reaper()>
7/25 16:04:51 (fd:8) (pid:3940) File transfer completed successfully.
7/25 16:04:51 (fd:6) (pid:3940) Destroying Daemon object:
7/25 16:04:51 (fd:6) (pid:3940) Type: 1 (any), Name: (null), Addr: <172.19.189.3:9618>
7/25 16:04:51 (fd:6) (pid:3940) FullHost: (null), Host: (null), Pool: (null), Port: -1
7/25 16:04:51 (fd:6) (pid:3940) IsLocal: N, IdStr: (null), Error: (null)
7/25 16:04:51 (fd:6) (pid:3940)  --- End of Daemon object info ---
7/25 16:04:52 (fd:6) (pid:3940) Calling client FileTransfer handler function.
7/25 16:04:52 (fd:6) (pid:3940) in DaemonCore NewTimer()
7/25 16:04:52 (fd:6) (pid:3940)
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> Timers
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> ~~~~~~
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> id = 7, when = 1185343492, period = 0, handler_descrip=<deferred job start>
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> id = 6, when = 1185343551, period = 0, handler_descrip=<dc_touch_log_file>
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> id = 3, when = 1185343731, period = 240, handler_descrip=<self_monitor>
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> id = 1, when = 1185343791, period = 300, handler_descrip=<check_session_cache>
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> id = 5, when = 1185344661, period = 1170, handler_descrip=<DaemonCore::SendAliveToParent>
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> id = 2, when = 1185345292, period = 1801, handler_descrip=<handle_cookie_refresh>
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> id = 4, when = 1185372290, period = 0, handler_descrip=<DaemonCore::ReInit()>
7/25 16:04:52 (fd:6) (pid:3940)
7/25 16:04:52 (fd:6) (pid:3940) leaving DaemonCore NewTimer, id=7
7/25 16:04:52 (fd:6) (pid:3940) Job 71.0 set to execute immediately
7/25 16:04:52 (fd:6) (pid:3940) PRIV_CONDOR --> PRIV_CONDOR at ..\src\condor_daemon_core.V6\daemon_core.C:2743
7/25 16:04:52 (fd:6) (pid:3940) PRIV_CONDOR --> PRIV_CONDOR at ..\src\condor_daemon_core.V6\daemon_core.C:2743
7/25 16:04:52 (fd:6) (pid:3940) In DaemonCore Timeout()
7/25 16:04:52 (fd:6) (pid:3940)
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> Timers
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> ~~~~~~
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> id = 7, when = 1185343492, period = 0, handler_descrip=<deferred job start>
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> id = 6, when = 1185343551, period = 0, handler_descrip=<dc_touch_log_file>
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> id = 3, when = 1185343731, period = 240, handler_descrip=<self_monitor>
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> id = 1, when = 1185343791, period = 300, handler_descrip=<check_session_cache>
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> id = 5, when = 1185344661, period = 1170, handler_descrip=<DaemonCore::SendAliveToParent>
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> id = 2, when = 1185345292, period = 1801, handler_descrip=<handle_cookie_refresh>
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> id = 4, when = 1185372290, period = 0, handler_descrip=<DaemonCore::ReInit()>
7/25 16:04:52 (fd:6) (pid:3940)
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore: Calling handler for Timer 7 (deferred job start)
7/25 16:04:52 (fd:6) (pid:3940) Starting a JAVA universe job with ID: 71.0
7/25 16:04:52 (fd:6) (pid:3940) In OsProc::OsProc()
7/25 16:04:52 (fd:6) (pid:3940) Main job KillSignal: 15 (Unknown)
7/25 16:04:52 (fd:6) (pid:3940) Main job RmKillSignal: 15 (Unknown)
7/25 16:04:52 (fd:6) (pid:3940) Main job HoldKillSignal: 15 (Unknown)
7/25 16:04:52 (fd:6) (pid:3940) SYSAPI_GET_LOADAVG is undefined, using default value of True
7/25 16:04:52 (fd:6) (pid:3940) JavaProc: Cmd="C:\\Program Files\\Java\\jre1.5.0_06\\bin\\JAVA.EXE"
7/25 16:04:52 (fd:6) (pid:3940) JavaProc: Args=-Xmx247m -classpath C:\condor/lib;C:\condor/lib/scimark2lib.jar;. -Dchirp.config=C:\condor\execute\dir_3940\chirp.config CondorJavaWrapper C:\condor\execute\dir_3940\jvm.start C:\condor\execute\dir_3940\jvm.end JavaTest
7/25 16:04:52 (fd:6) (pid:3940) in VanillaProc::StartJob()
7/25 16:04:52 (fd:6) (pid:3940) in OsProc::StartJob()
7/25 16:04:52 (fd:6) (pid:3940) IWD: C:\condor/execute\dir_3940
7/25 16:04:52 (fd:6) (pid:3940) get_port_range - (LOWPORT,HIGHPORT) is (9600,9700).
7/25 16:04:52 (fd:6) (pid:3940) TokenCache contents:
condor-reuse-vm1@.
7/25 16:04:52 (fd:6) (pid:3940) PRIV_CONDOR --> PRIV_USER at ..\src\condor_starter.V6.1\os_proc.C:227
7/25 16:04:52 (fd:7) (pid:3940) Input file: NUL
7/25 16:04:52 (fd:8) (pid:3940) Output file: C:\condor/execute\dir_3940\JavaTest.output.0
7/25 16:04:52 (fd:9) (pid:3940) Error file: C:\condor/execute\dir_3940\JavaTest.error.0
7/25 16:04:52 (fd:9) (pid:3940) Doing CONDOR_begin_execution
7/25 16:04:52 (fd:9) (pid:3940) condor_read(): nfds=0
7/25 16:04:52 (fd:9) (pid:3940) condor_read(): nfound=1
7/25 16:04:52 (fd:9) (pid:3940) condor_read(): nfds=0
7/25 16:04:52 (fd:9) (pid:3940) condor_read(): nfound=1
7/25 16:04:52 (fd:9) (pid:3940) Renice expr "10" evaluated to 10
7/25 16:04:52 (fd:9) (pid:3940) About to exec C:\condor/execute\dir_3940\"C:\\Program Files\\Java\\jre1.5.0_06\\bin\\JAVA.EXE" -Xmx247m -classpath C:\condor/lib;C:\condor/lib/scimark2lib.jar;. - Dchirp.config=C:\condor\execute\dir_3940\chirp.config CondorJavaWrapper C:\condor\execute\dir_3940\jvm.start C:\condor\execute\dir_3940\jvm.end JavaTest
7/25 16:04:52 (fd:9) (pid:3940) Env = _CONDOR_SCRATCH_DIR=C:\condor\execute\dir_3940 _CONDOR_HIGHPORT=9700 _CONDOR_LOWPORT=9600
7/25 16:04:52 (fd:9) (pid:3940) JOB_INHERITS_STARTER_ENVIRONMENT is undefined, using default value of False
7/25 16:04:52 (fd:9) (pid:3940) PRIV_USER --> PRIV_CONDOR at ..\src\condor_starter.V6.1\os_proc.C:343
7/25 16:04:52 (fd:9) (pid:3940) In DaemonCore::Create_Process(C:\condor/execute\dir_3940\"C:\\Program Files\\Java\\jre1.5.0_06\\bin\\JAVA.EXE",...)