Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [HTCondor-users] checkpointing the vanilla universe under windows

Date: Mon, 20 May 2013 19:16:26 +0000
From: Ian Cottam <Ian.Cottam@xxxxxxxxxxxxxxxx>
Subject: Re: [HTCondor-users] checkpointing the vanilla universe under windows

Could it be a firewall issue?
-Ian
--
Ian Cottam
IT Services Research Lead
IT Services -- supporting research
The University of Manchester
[ATD - Action This Day - Churchill]





On 20/05/2013 18:47, "Antonis Sergis" <sergis_antonis@xxxxxxxxxxx> wrote:

>Hello Ian,
>
>Yes, I have set want_vacate = TRUE in the condor configuration file and
>reconfigured condor. Still the same problem. Apparently the
>"_condor_stdout" file is never created inside the spool directory, even
>though the directory is there. Here is the full log file up to the point
>where I force the job to vacate a node:
>
>000 (042.000.000) 05/20 18:43:51 Job submitted from host: <155.***>
>...
>001 (042.000.000) 05/20 18:44:11 Job executing on host: <155.***>
>...
>006 (042.000.000) 05/20 18:44:21 Image size of job updated: 750
>    1  -  MemoryUsage of job (MB)
>    112  -  ResidentSetSize of job (KB)
>...
>006 (042.000.000) 05/20 18:44:31 Image size of job updated: 5520
>    1  -  MemoryUsage of job (MB)
>    112  -  ResidentSetSize of job (KB)
>...
>007 (042.000.000) 05/20 18:44:31 Shadow exception!
>    Error from slot1@***: STARTER at 155.*** failed to send file(s) to
>    <155.***>; SHADOW at 155.*** failed to write to file
>    c:\condor/spool\42\0\cluster42.proc0.subproc0.tmp\_condor_stdout:
>    (errno 2) No such file or directory
>    179244  -  Run Bytes Sent By Job
>    1528896  -  Run Bytes Received By Job
>...
>012 (042.000.000) 05/20 18:44:31 Job was held.
>    Error from slot1@***: STARTER at 155.*** failed to send file(s) to
>    <155.***>; SHADOW at 155.*** failed to write to file
>    c:\condor/spool\42\0\cluster42.proc0.subproc0.tmp\_condor_stdout:
>    (errno 2) No such file or directory
>    Code 12 Subcode 2
>...
>
>
>-Antonis
>
>-----Original Message-----
>From: Ian Cottam
>Sent: Monday, May 20, 2013 6:17 PM
>To: HTCondor-Users Mail List
>Cc: HTCondor-Users Mail List
>Subject: Re: [HTCondor-users] checkpointing the vanilla universe under
>windows
>
>Did you check that every node has
>WANT_VACATE = True
>in all local configs? And restart them?
>-Ian
>
>
>
>On 20 May 2013, at 17:04, "Antonis Sergis" <sergis_antonis@xxxxxxxxxxx>
>wrote:
>
>> Hello Ian,
>>
>> thank you for the useful notes. I have now changed my source code to
>> extract more information about what condor is doing (I added some
>> counters that record how many times the code has restarted, etc.).
>>
>> To my surprise, when I intentionally vacated the job to see what
>> happens, condor logged the following error each time the job was
>> vacated (asterisks added to protect privacy):
>>
>> Error from slot2@***.uk: STARTER at 155.*** failed to send file(s) to
>> <155.***>; SHADOW at 155.*** failed to write to file
>> c:\condor/spool\48\0\cluster48.proc0.subproc0.tmp\_condor_stdout:
>> (errno 2) No such file or directory
>>   Code 12 Subcode 2
>>
>> The positive thing though is that I can read and write inside the same
>> file (read the input flag and then I was able to change it).
>>
>> Checking the spool directory, the folder
>> c:\condor/spool\48\0\cluster48.proc0.subproc0.tmp is present but always
>> empty. It appears that whenever a job is vacated, the files are not
>> transferred into the spool directory, so the code cannot be restarted
>> on a different node. The counters in my source code indicate the same
>> (the code basically cannot be restarted and completes only if it is
>> left running uninterrupted on a single node).
>>
>> Any ideas?
>>
>> P.S. periodic_release=True is there to enable more frequent automatic
>> releases of jobs that get held (I am running around 60k jobs).
>>
>> Antonis
>>
>> -----Original Message----- From: Ian Cottam
>> Sent: Monday, May 20, 2013 1:51 PM
>> To: HTCondor-Users Mail List
>> Subject: Re: [HTCondor-users] checkpointing the vanilla universe under
>> windows
>>
>> We are Linux based here in Manchester, but we took advice from Liverpool
>> who are Windows based, so it should all work.
>>
>> Here is our web page on the topic:
>> 
>> <http://condor.eps.manchester.ac.uk/examples/user-level-checkpointing-an-example-in-c/>
>> The has_checkpointing is just a local thing: you don't need it if all
>> your clients are set up identically and to support user checkpointing.
>> To that end, I believe they need
>> WANT_VACATE = True
>> in all their local configs. Then all should work.
>>
>> Here is the Liverpool page
>> <http://condor.liv.ac.uk/checkpoint/>
>>
>> I'm not sure why you have periodic_release=True; we don't use that. But
>> if you could explain it to me, maybe we should be!
>>
>> Hope that helps.
>> -Ian
>>
>>
>> On 20/05/2013 12:41, "Antonis Sergis" <sergis_antonis@xxxxxxxxxxx>
>>wrote:
>>
>>> Dear all,
>>>
>>> I am coming back to the hot "checkpointing the vanilla universe" issue
>>> under windows. I have a fortran 90 code which can run for a while. For
>>> longer runs, condor's performance drops significantly as jobs get
>>> interrupted by users, and with the lack of a native checkpointing
>>> function and the inability to use the "standard" universe, the code has
>>> to restart from the beginning on a different machine. As a result, jobs
>>> seldom manage to finish. I changed my source code to accommodate a
>>> checkpointing feature.
>>> The code reads a "flag" file (which is also one of the initial input
>>> files) and creates a checkpoint file with all the data required to
>>> resume a job from where it left off. The flag file initially contains a
>>> "0". As soon as a given elapsed time passes (1 hr, and then every hour
>>> from there onwards), a checkpoint takes place: the flag file is
>>> supposed to be updated with a value of "1" and a "history" file is
>>> created saving the required checkpoint data. The idea is that when the
>>> code gets evicted, it will read the flag file as "1" and then use the
>>> "history" file to read the last checkpoint data and resume from where
>>> it left off. This doesn't seem to be working. I am unsure whether the
>>> flag file gets updated and re-read upon restarting the job.
>>> I am also not sure whether condor will be able to read the "history"
>>> file, which was created as an output file and is not in the initial
>>> input files list.
>>>
>>>
>>> Any ideas?
>>>
>>> This is the current submit file I am using to accommodate the
>>> checkpoint function:
>>>
>>> ************************
>>> ************************
>>> Requirements = (Memory >=900) && (Arch=="X86_64") && (OpSys=="WINDOWS")
>>> Executable = \\htcondor\htcondorjobs\\****\T2\mds.exe
>>> initialdir = \\htcondor\htcondorjobs\\****\T2
>>> transfer_input_files = mds.exe, input, flag
>>> Universe = vanilla
>>> Getenv = False
>>> output = Test_cores.out
>>> error = Test_cores.err
>>> log = Test_cores.log
>>> should_transfer_files = ALWAYS
>>> when_to_transfer_output = ON_EXIT_OR_EVICT
>>> periodic_release = TRUE
>>> Queue 250
>>> ************************
>>> ************************
>>>
>>> Regards
>>> Antonis
>>
>>
>> -- 
>> Ian Cottam x61851
>> IT Services Research Lead
>> IT Services -- supporting research
>> The University of Manchester
>> [ATD - Action This Day - Churchill]
>>
>>
>>
>>
>> _______________________________________________
>> HTCondor-users mailing list
>> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx
>> with a subject: Unsubscribe
>> You can also unsubscribe by visiting
>> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>>
>> The archives can be found at:
>> https://lists.cs.wisc.edu/archive/htcondor-users/
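
The flag/history scheme described in the original message can be sketched roughly as below. This is only an illustration in Python, not the Fortran 90 code discussed in the thread: the file name "history", the JSON format, the step-based checkpoint trigger (the original uses elapsed time), and the write-then-rename helper are all assumptions made for the sketch. It also assumes that the flag and history files written by the job are the ones present when the job restarts, which is exactly what the thread is trying to establish.

```python
import json
import os

FLAG_FILE = "flag"        # listed in the submit file's transfer_input_files
HISTORY_FILE = "history"  # hypothetical name for the checkpoint data file


def read_flag(path=FLAG_FILE):
    """Return True if a previous run left a checkpoint behind (flag is "1")."""
    try:
        with open(path) as f:
            return f.read().strip() == "1"
    except FileNotFoundError:
        return False


def write_atomically(path, text):
    """Write to a temporary file and rename it, so an eviction in the middle
    of a write cannot leave a half-written file under the final name."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(text)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def save_checkpoint(state):
    # Write the data first and flip the flag last: the flag only becomes "1"
    # once a complete history file exists.
    write_atomically(HISTORY_FILE, json.dumps(state))
    write_atomically(FLAG_FILE, "1")


def load_state():
    """Resume from the history file if the flag says so, else start fresh."""
    if read_flag() and os.path.exists(HISTORY_FILE):
        with open(HISTORY_FILE) as f:
            return json.load(f)
    return {"step": 0, "restarts": 0}


def run(total_steps, checkpoint_every):
    state = load_state()
    if state["step"] > 0:
        state["restarts"] += 1        # counter like the one added for debugging
    while state["step"] < total_steps:
        state["step"] += 1            # stand-in for one unit of real work
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state)
    return state
```

Writing the history file before the flag means a restart never sees a "1" without matching checkpoint data. The restart counter gives a quick way to tell, from the job's own output, whether a resumed run actually picked up the checkpoint.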

