[ale] Dealing with really big log files....

JK jknapka at kneuro.net
Mon Mar 23 17:51:26 EDT 2009


Michael B. Trausch wrote:
> On Mon, 23 Mar 2009 12:29:02 -0600
> JK <jknapka at kneuro.net> wrote:
> 
>> Yeah but... who cares?  You can just trim any partial lines from
>> the front and back of the resulting file.  And doing a binary search
>> (manually if need be) for the interesting chunk is probably quicker
>> than scanning through 100 GB of junk.
> 
> Possibly.  Depends on how heavily loaded the system is from an I/O
> standpoint---you've got the advantage of readahead caching if you're
> scanning sequentially, and that advantage doesn't seem like much until
> the system is really heavily bogged down.  I wasn't making the
> assumption that this work was being done on a lightly-loaded desktop
> machine.  A very heavily loaded machine would really suck to be doing a
> binary search on, since you'd likely be I/O bound at every iteration.
> 
> Anyway, tomato/tomahto.  They're both valid approaches, but the right
> one would (as is always the case) depend on more variables than were
> ever discussed in the first place.


Er.  I'm going to have to disagree.  The OP said he knows approximately
how far into the file the data of interest is.  Pulling 100GB of data off
the disk and through the cache only to throw it away again, just to get
to data that is 100.001GB into the file, is never going to be a good
answer.  It is going to blow everything else out of the kernel's page
cache except for pages that get hit extremely frequently (like libc.so).
So most likely, every process that isn't actively using the CPU will
get a ton of useful pages evicted from RAM in order to make way for
this monster do-nothing job. Ugh.

If you seek() to the appropriate place (or just a point 500MB
before the appropriate place, if you're not sure exactly where to look)
and start reading from there (in a line-oriented fashion, if you like),
then the first 99.5GB of data is simply never touched.  The kernel only
has to read the file's block-mapping metadata (the indirect blocks or
extent tree hanging off the inode) -- which admittedly is gonna be kinda
big for a 100GB file, but still a tiny, tiny fraction of the whole.
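
For what it's worth, here is a rough Python sketch of the seek-then-read
idea.  The function name, its parameters, and the path and offset in the
usage note are all made up for illustration; nothing here comes from the
OP's actual setup.  It seeks to a byte offset, throws away the partial
line it most likely lands in, and then hands back whole lines from there.

```python
import os


def lines_from_offset(path, offset, max_lines=None):
    """Yield complete lines (as bytes) starting at or just after `offset`.

    Hypothetical helper: `path` and `offset` are whatever the caller
    estimates; nothing before `offset` is read from disk.
    """
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        # Clamp the requested offset to the bounds of the file.
        offset = max(0, min(offset, size))
        if offset > 0:
            # Step back one byte, then discard through the next newline.
            # If a line starts exactly at `offset`, this consumes only the
            # preceding newline, so that line is kept.
            f.seek(offset - 1)
            f.readline()
        count = 0
        for line in f:
            if max_lines is not None and count >= max_lines:
                break
            yield line
            count += 1
```

You'd call it along the lines of
lines_from_offset("/var/log/huge.log", 995 * 10**8, max_lines=1000)
and move the offset around (by hand, or with a binary search on the
timestamps) until you hit the interesting chunk.  Opening the file in
binary mode keeps the offsets in plain bytes, so an encoding never gets
in the way of the seek.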

-- JK

-- 
I do not particularly want to go where the money is -
  it usually does not smell nice there. -- A. Stepanov