[ale] Riddle me this awk man

Richard Bronosky Richard at Bronosky.com
Mon Feb 21 19:12:41 EST 2011


If it's not sensitive information, I'd love to get my hands on a
gzipped tar with the original file, the script, the outputs, etc. I
really like to drill into this stuff. I use awk every day of my life,
and I would like to know if there are issues.
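
Since the quoted message below points at control characters in the
input, here is a minimal sketch of how one might locate the affected
lines before handing the data to any awk. It is written in Python so
that it does not depend on which awk is installed. The names `scan` and
`scan_file` are made up for illustration, and the choice of which bytes
count as "control characters" is an assumption, not something taken
from the original data set.

```python
import re

# Assumption: "control characters" means every byte below 0x20 except
# tab and newline, plus DEL (0x7f). This includes the NUL bytes that
# show up as ^@ in the diff. Bytes >= 0x80 (such as the 0xFE 0xFF
# byte-order mark) are deliberately not flagged.
CONTROL = re.compile(rb"[\x00-\x08\x0b-\x1f\x7f]")


def scan(lines):
    """Return (total_lines, line_numbers_with_control_bytes).

    `lines` is any iterable of bytes objects, one per input line.
    Line numbers are 1-based, matching what awk reports as NR.
    """
    total = 0
    flagged = []
    for total, raw in enumerate(lines, start=1):
        # Strip the line terminator first so a trailing CR/LF is not
        # itself reported as a control character.
        if CONTROL.search(raw.rstrip(b"\r\n")):
            flagged.append(total)
    return total, flagged


def scan_file(path):
    """Scan a file in binary mode so no decoding step can hide bytes."""
    with open(path, "rb") as fh:
        return scan(fh)
```

Calling `scan_file()` on the input file gives the input line count to
compare against the number of lines each awk writes out, plus the line
numbers worth inspecting first.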

On Mon, Feb 21, 2011 at 7:07 PM, Greg Freemyer <greg.freemyer at gmail.com> wrote:
> On Thu, Feb 17, 2011 at 8:41 PM, Greg Freemyer <greg.freemyer at gmail.com> wrote:
>> On Thu, Feb 17, 2011 at 8:15 PM, Geoffrey Myers
>> <lists at serioustechnology.com> wrote:
>>> Greg Freemyer wrote:
>>>> It works in cygwin!!!!!!
>>>>
>>>> That may be a first for me.  A linux bug that does not exist in the
>>>> cygwin version.
>>>>
>>>> It might be an awk vs. gawk thing.  I'll worry with it tomorrow.
>>>
>>> Then I'd say it's a bug in cygwin.
>>
>> I'll have to try it in openSUSE tomorrow.  If it matches cygwin, then
>> Ubuntu is the loser.
>>
>> If it matches Ubuntu, then I guess you could say that cygwin fails to
>> duplicate the linux bugs present in awk!
>>
>> I'm not sure I want that good of an emulation!
>>
>> Greg
>
> I just ran this on openSUSE 11.3.
>
> It worked fine, i.e., one output line for each input line.
>
> So it is an Ubuntu issue of some sort that is eating almost 400K lines
> of data out of my expected 500+K lines of output.
>
> In openSUSE awk is a link to /bin/gawk.  In Ubuntu it is a link to mawk.
>
> But I also tried mawk from openSUSE and it also gave me one line of
> output per line of input.
>
> Looking at a diff between the two outputs it appears there are some
> control chars in the input data set.  I can understand Ubuntu
> mishandling those lines, but it apparently just goes bonkers.  At
> first it just drops 10 or so output lines for each input line with
> control chars.
>
> But at line 174130 it just dies.
>
> Here's an intriguing part of the diff between the output file on
> openSUSE and the same on Ubuntu that shows the line that finally did
> Ubuntu's mawk in.
>
> Remember awk on openSUSE and cygwin are handling this data in at least
> a more or less sane way.
>
> Also, looking visually at the first few hundred of the missing lines,
> only a handful of them have control chars in them.
>
> 174207,541986c174130
> < 15-Aug-2007 14:41:14,0,macb,0,0,0,73800,[PDF Metadata]
> (creationdate) User: þÿ^@k^@i^@m File created. Title :
> (þÿ^@I^@n^@t^@u^@i^@t^@_^@Q^@B^@O^@B^@_^@I^@n^@t^@e^@r^@n^@a^@l^@.^@p^@d^@f)
> Author: [þÿ^@k^@i^@m] Creator: [PFU ScanSnap Manager 4.0.11]
> produced by: [Adobe PDF Scan Library 2.1] (file:
> /mnt/windows7_mount//Documents and Settings/Administrator/Local
> Settings/Temp/_tmpAT/attFEFB.tmp)
> < 15-Aug-2007 15:11:02,0,macb,0,0,0,67976,[PDF Metadata]
> (creationdate) User: Olde English Manor File created. Title :
> (OEM-Rent Schedule-7-26-07.xls) Author: [Olde English Manor] Creator:
> [Acrobat PDFMaker 7.0.7 for Excel] produced by: [Acrobat Distiller
> 7.0.5 /(Windows/] (file: /mnt/windows7_mount//Documents and
> Settings/Administrator/Local Settings/Temp/_tmpAT/att4111.tmp)
> < 15-Aug-2007 15:11:04,0,macb,0,0,0,67976,[PDF Metadata] (moddate)
> User: Olde English Manor File modified. Title : (OEM-Rent
> Schedule-7-26-07.xls) Author: [Olde English Manor] Creator: [Acrobat
> PDFMaker 7.0.7 for Excel] produced by: [Acrobat Distiller 7.0.5
> /(Windows/] (file: /mnt/windows7_mount//Documents and
> Settings/Administrator/Local Settings/Temp/_tmpAT/att4111.tmp)
> < 15-Aug-2007 15:29:09,0,macb,0,0,0,73800,[PDF Metadata] (moddate)
> User: þÿ^@k^@i^@m File modified. Title :
> (þÿ^@I^@n^@t^@u^@i^@t^@_^@Q^@B^@O^@B^@_^@I^@n^@t^@e^@r^@n^@a^@l^@.^@p^@d^@f)
> Author: [þÿ^@k^@i^@m] Creator: [PFU ScanSnap Manager 4.0.11]
> produced by: [Adobe PDF Scan Library 2.1] (file:
> /mnt/windows7_mount//Documents and Settings/Administrator/Local
> Settings/Temp/_tmpAT/attFEFB.tmp)
> < 15-Aug-2007 16:16:22,0,.acb,0,0,0,9630,[Internet Explorer] (Content
> viewed/Content saved to drive)
> URL:http://z-ecx.images-amazon.com/images/G/01/digital/sitb/js/prototype.1187147005._V28380147_.js
> cache stored in: R7MT4AO6/prototype.1187147005._V28380147_[2].js -
> HTTP/1.1 200 OK - Content-Length: 39057 - Content-Type:
> application/x-javascript (file: /mnt/windows7_mount//Documents and
> Settings/Administrator/Local Settings/Temporary Internet
> Files/Content.IE5/index.dat)
> < 15-Aug-2007 18:25:13,0,macb,0,0,0,67973,[PDF Metadata]
> (creationdate) User: admin File created. Title : (Microsoft Word -
> OEM-Owners Report-7-31-07.doc) Author: [admin] Creator: [PScript5.dll
> Version 5.2.2] produced by: [Acrobat Distiller 7.0.5 /(Windows/]
> (file: /mnt/windows7_mount//Documents and Settings/Administrator/Local
> Settings/Temp/_tmpAT/att
>
> If someone thinks this is worth pursuing, I can send them the first
> 10,000 lines of data from the original input file.
>
> Ubuntu's mawk only drops 36 lines of that in the output I think.  So
> it's a more manageable problem.
>
> Even though I'm pretty sure there is nothing proprietary in that
> dataset, and definitely there is no client data, I still don't want to
> see it posted somewhere public like a bugzilla would be.
>
> Greg



-- 
.!# RichardBronosky #!.


