[ale] mass file modification

Jim Kinney jim.kinney at gmail.com
Sun Mar 30 22:21:03 EDT 2008


So "cleaning up bad MS-HTML" did not include "unlink $crapfile". Your
patience and tolerance are astounding :-)

On Sun, Mar 30, 2008 at 6:19 PM, Mike Harrison <meuon at geeklabs.com> wrote:

> Jim
> > I need to update about 43k files and sed just won't cut it for this
> > task.   What I need to do is replace 2 lines with 4 new ones, and the
> > lines contain URLs (backslashes, brackets, etc.).  What I would like
> > to do is put the new text in a file and pass it and the search text to
> > some program that will modify all the files.   Any ideas on what's
> > available to do that?
>
> I haven't done as much of this as I used to, back when I was fixing mail
> queues and such at an ISP, but I always ended up doing it in Perl,
> often with a switch for processing just 10 files and writing the changed
> files to /tmp so I could manually verify them before bulk-changing hundreds
> of thousands (or more) of files. I'm not as good with find/sed/awk, but one
> of the reasons I did things like this in Perl is that it worked well
> when there were lots of files in a single directory, and shell scripting
> couldn't handle the lists of files well.
>
> I also often found it easier to write and debug complex regexes in Perl
> as several steps. Regexes are incredibly powerful, and it is really easy
> to make them do things you didn't intend in the exceptional cases.
>
> I don't have my old Perl scripts from those days, but they all had
> something like what is below (which cleans up bad MS-HTML). Note: the
> character encoding in the regexes didn't cut and paste well into e-mail.
>
> -------------------------------------------------------------------------------------------
> opendir(INC, $dd) or die "Cannot open $dd: $!";
> print "Opening: $dd\n";
> @incfiles = readdir(INC);
> closedir INC;
> foreach (sort @incfiles) {
>   next if /^\./;               # skip dotfiles, "." and ".."
>   if (/\.html$/) {             # only process *.html files
>       fixheader($_);
>       #sleep 1;  # let the server breathe. Optional.
>   }
> }
>
> sub fixheader {
>   my ($file) = @_;
>   my $page = '';
>   my $body = 'F';
>   open(IN, "<", "$dd/$file") or die "Cannot read $dd/$file: $!";
>   while (<IN>) {
>     if (/<body/) { $body = "T"; }   # don't process headers..
>     if ($body eq "T") {
>       $page .= $_;
>     }
>   } # end while IN
>   close IN;
>   $page =~ s/\r//g;            # deletes CRs (a literal ^M in the original)
>   $page =~ s/&#13;/[[P]]/g;    # turns encoded CRs into <P> markers
>   # NOTE: assumes "Magic Char 95" is byte 0x95, the Windows-1252 bullet.
>   $page =~ s/\x95/[[li]]/g;    # turns into bullets/listed items
>   $page =~ s/\n//g;            # deletes LFs
>   #lots more of these..
>   open(OUT, ">", "$dd/$file.new") or die "Cannot write $dd/$file.new: $!";
>   print OUT $page;
>   close OUT;
> }
>
>
> _______________________________________________
> Ale mailing list
> Ale at ale.org
> http://mail.ale.org/mailman/listinfo/ale
>
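
For the original question (swapping 2 lines for 4 across ~43k files, with
both the old and the new text kept in files), a minimal Perl sketch along the
lines Mike describes might look like the following. The script name, the sub
names and the argument order are all made up for illustration. It has not
been run against the real files, and it assumes the old block appears
byte-for-byte the same in every file that needs changing.

```perl
#!/usr/bin/perl
# Sketch: replace one literal multi-line block with another in every
# *.html file of a single directory. All names here are hypothetical.
use strict;
use warnings;

# Read a whole file into one string.
sub slurp {
    my ($path) = @_;
    open(my $in, '<', $path) or die "Cannot read $path: $!";
    local $/;                      # no record separator: read everything
    my $text = <$in>;
    close $in;
    return defined $text ? $text : '';
}

# Replace every literal occurrence of $old in $text with $new.
# \Q...\E quotes regex metacharacters, so URLs, backslashes and
# brackets in $old are matched as plain text.
sub replace_block {
    my ($text, $old, $new) = @_;
    my $count = ($text =~ s/\Q$old\E/$new/g);
    return ($text, $count || 0);
}

# Process up to $limit changed files from $dir (0 means no limit),
# writing the changed copies into $outdir so they can be checked by hand.
sub process_dir {
    my ($dir, $old, $new, $outdir, $limit) = @_;
    die "Search text is empty\n" if $old eq '';
    opendir(my $dh, $dir) or die "Cannot open $dir: $!";
    my $changed = 0;
    # readdir hands back one name at a time, so no huge argument list
    while (defined(my $name = readdir($dh))) {
        next unless $name =~ /\.html$/;
        my ($text, $count) = replace_block(slurp("$dir/$name"), $old, $new);
        next unless $count;        # leave files without the old block alone
        open(my $out, '>', "$outdir/$name")
            or die "Cannot write $outdir/$name: $!";
        print {$out} $text;
        close $out or die "Cannot close $outdir/$name: $!";
        $changed++;
        last if $limit && $changed >= $limit;
    }
    closedir $dh;
    return $changed;
}

# Usage (hypothetical): perl massfix.pl DIR OLD.txt NEW.txt OUTDIR [LIMIT]
if (@ARGV >= 4) {
    my ($dir, $oldfile, $newfile, $outdir, $limit) = @ARGV;
    my $n = process_dir($dir, slurp($oldfile), slurp($newfile),
                        $outdir, $limit || 0);
    print "Changed $n file(s); copies are in $outdir\n";
}
```

Running it first with a LIMIT of 10 and OUTDIR set to a scratch directory
under /tmp gives the "check a handful by hand before touching everything"
step. Note that old.txt and new.txt are matched and inserted exactly as
stored, trailing newline included.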



-- 
James P. Kinney III