
Re: [Condor-users] Standard Universe blues...



It sounds like you want the checkpointing facility of the "standard" universe
without the shadow's remote I/O handling (which is actually quite useful from
a security point of view).

I am sure there is a way of getting this, but I can't remember the best way
to do it; no doubt someone else will post it. In the meantime ...

Is it possible to use:
===
  local_files = file1,file2,...

    If your job attempts to access a file mentioned in this list, Condor will cause
    it to be read or written at the execution machine. This is most useful for
    temporary files not used for input or output. This list uses the same syntax
    as compress_files, shown above. 

    local_files = /tmp/*

    This option only applies to standard-universe jobs. 
===

For instance, you could create a tmp directory in your execute directory and
set

  local_files = tmp/*    # I hope relative names are OK

Adding tmp/* to your transfer_output_files might also be possible, to avoid
having to copy the files up a level yourself, but use of that option comes
with warnings in the manual.
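
Something like the following might work as a submit file: an untested
sketch, where myjob and the file names are placeholders and I am assuming
local_files accepts the relative tmp/* form.

===
  universe    = standard
  executable  = myjob

  # keep the scratch files on the execute machine instead of routing
  # every read/write back through the shadow on the submit machine;
  # the job itself (or a wrapper) must create tmp/ first
  local_files = tmp/*

  # possibly also, to get the scratch files back when the job finishes
  # (the manual attaches warnings to this option):
  # transfer_output_files = tmp/*

  output = myjob.out
  error  = myjob.err
  log    = myjob.log
  queue
===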

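On the wrapper-script question quoted below: if I remember right, a
condor_compile'd binary run outside Condor ("standalone" mode) writes a
checkpoint and keeps going when sent SIGUSR2, checkpoints and exits on
SIGTSTP, and can be resumed with -_condor_restart. An untested sketch along
those lines (myjob, the checkpoint file name and the interval are all made
up, and it does not solve the background-load problem you describe):

===
  #!/bin/sh
  # Sketch: periodically checkpoint a condor_compile'd binary running
  # standalone inside a vanilla-universe wrapper.
  CKPT=myjob.ckpt
  INTERVAL=3600    # seconds between checkpoints

  if [ -f "$CKPT" ]; then
      ./myjob -_condor_restart "$CKPT" &    # resume from old checkpoint
  else
      ./myjob -_condor_ckpt "$CKPT" &       # fresh start
  fi
  PID=$!

  # helper loop: ask the job to checkpoint and carry on every INTERVAL
  (
    while sleep "$INTERVAL"; do
        kill -USR2 "$PID" 2>/dev/null || exit 0
        # SIGUSR2 returns at once; the checkpoint is written
        # asynchronously, so leave some slack before copying $CKPT
        # (and any scratch files) back to safe storage here.
    done
  ) &
  HELPER=$!

  wait "$PID"                # block until the job finishes or is evicted
  STATUS=$?
  kill "$HELPER" 2>/dev/null
  exit "$STATUS"
===
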
Cheers

JK

> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx
> [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Angel de Vicente
> Sent: Friday, December 23, 2005 12:56 PM
> To: Condor-Users Mail List
> Subject: [Condor-users] Standard Universe blues...
> 
> 
> Hi all,
> 
> I've been struggling with this problem for a few days now... I wonder if
> anyone would be able to suggest a possible solution.
> 
> We've got a Fortran code that we have compiled with the Condor libraries.
> It runs OK in the standard universe, but very slowly, since it is very
> I/O intensive (it reads/writes around 70GB): run locally it takes around
> 25 hours, and under Condor around 80-90.
> 
> I thought a possible solution was to use the option fetch_files:
> 
>        fetch_files = file1, file2, ...
> 
>           If your job attempts to access a file mentioned in this list,
>           Condor will automatically copy the whole file to the executing
>           machine, where it can be accessed quickly. When your job closes
>           the file, it will be copied back to its original location. This
>           list uses the same syntax as compress_files, shown above.
> 
>           This option only applies to standard-universe jobs.
> 
> but this is no good, as the files are copied back every time the file is
> closed, which happens many, many times. Ideally I would like something
> like this, but one that only copies the files back when a checkpoint is
> made, so that in the 10 hours or so between evictions a lot of progress
> can be made on local files rather than over the network.
> 
> I've tried to fool the system by submitting it as a vanilla job (though
> condor_compile'd), wrapped in a script that manually checkpoints the job
> every 60 minutes or so, but I'm having trouble with this as well. (The
> remaining problem right now is that I launch the process as a background
> job, so Condor considers it non-Condor load and the job gets suspended
> and then evicted continuously.) Does anybody have an example of a script
> that can regularly checkpoint a program, move some files around, and
> then continue until the job is evicted?
> 
> Thanks a lot,
> Angel de Vicente
> -- 
> ----------------------------------
> http://www.iac.es/galeria/angelv/
> 
> PostDoc Software Support
> Instituto de Astrofisica de Canarias
> 
> _______________________________________________
> Condor-users mailing list
> Condor-users@xxxxxxxxxxx
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>