[ale] 100 million Facebook pages leaked on torrent site

Michael B. Trausch mike at trausch.us
Sun Aug 1 11:15:45 EDT 2010


On Fri, 2010-07-30 at 11:55 -0400, Jim Philips wrote:
> I saw a report today that major corporations are already downloading
> the file through BitTorrent. A free goldmine of information for them!

I have already downloaded it myself, just to take a look at what's
actually in the whole thing.

There is a *lot* of data, mostly names, but also URLs to profile pages
for each of those names.  It's about 17GB worth of data, small enough to
burn to a single BD-R for storage.  It's not indexed, just plain text,
along with counts for the various names, which could be used to
determine popularity, for example.

I can see some of this data taking the place of 1930 Census data as a
stored list of proper names, which would benefit businesses that rely
on such lists to parse free-form documents.
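
To make the "parse free-form documents" idea concrete, here is a
minimal sketch in Python.  The two name sets are a tiny hypothetical
sample, not the real data, and find_names is a made-up helper; a real
run would load the sets from the dump, whose exact file layout I'm not
describing here.

```python
import re

# Hypothetical stand-ins for name lists loaded from the dump.
KNOWN_FIRST = {"michael", "john", "david", "jessica"}
KNOWN_LAST = {"smith", "johnson", "jones", "lee"}


def find_names(text):
    """Return (first, last) pairs of adjacent words found in the name lists."""
    words = re.findall(r"[A-Za-z']+", text)
    hits = []
    # Walk over each pair of neighbouring words in the text.
    for first, last in zip(words, words[1:]):
        if first.lower() in KNOWN_FIRST and last.lower() in KNOWN_LAST:
            hits.append((first, last))
    return hits


print(find_names("Please forward this to John Smith and Jessica Lee today."))
```

This is deliberately naive: it ignores punctuation between the two
words and knows nothing about middle names, so treat it as a starting
point only.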

Here are the ten most listed first names (with frequency of occurrence):

 977014 michael
 963693 john
 924816 david
 819879 chris
 640957 mike
 602088 james
 584438 mark
 515686 jason
 503658 robert
 484403 jessica

And the ten most listed last names (also with frequency of occurrence):

 913465 smith
 571819 johnson
 512312 jones
 503266 williams
 471390 brown
 386764 lee
 360010 khan
 355639 singh
 343220 kumar
 324972 miller

I guess "Michael Smith" would be the most generic name possible if you
look at those numbers. :-)
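
For anyone who wants to check that sort of thing mechanically, here is
a rough Python sketch.  It assumes the counts are available as text in
the same "<count> <name>" layout as the listings above (which may not
match the files in the dump exactly), and parse_counts and most_common
are just names I made up for illustration.  Only the top three entries
of each listing are reproduced.

```python
def parse_counts(listing):
    """Parse lines of the form '<count> <name>' into a dict of name -> count."""
    counts = {}
    for line in listing.splitlines():
        parts = line.split()
        # Skip blank or malformed lines rather than failing on them.
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        counts[parts[1]] = int(parts[0])
    return counts


def most_common(counts):
    """Return the name with the highest count."""
    return max(counts, key=counts.get)


FIRST = """
 977014 michael
 963693 john
 924816 david
"""

LAST = """
 913465 smith
 571819 johnson
 512312 jones
"""

print(most_common(parse_counts(FIRST)), most_common(parse_counts(LAST)))
# prints: michael smith
```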

I'm not sure what there really is in terms of useful data that companies
could use, other than having a large pool of names to be able to pick
from for things like random name generators, or parsers that look for
proper names in free-form documents, or other fairly specific things
such as that.  Perhaps it's possible to use it for more than I envision,
as well.
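
As a sketch of the random name generator idea, the frequency counts
can serve directly as sampling weights, so common names come up more
often than rare ones.  The dictionaries below hold only the top five
entries from each list in this message, and random_name is a
hypothetical helper, not anything that ships with the data.

```python
import random

# Top five entries from each list in this message, used as weights.
FIRST_NAMES = {
    "michael": 977014, "john": 963693, "david": 924816,
    "chris": 819879, "mike": 640957,
}
LAST_NAMES = {
    "smith": 913465, "johnson": 571819, "jones": 512312,
    "williams": 503266, "brown": 471390,
}


def random_name(rng):
    """Pick a first and last name, weighted by how often each occurs."""
    first = rng.choices(list(FIRST_NAMES), weights=list(FIRST_NAMES.values()), k=1)[0]
    last = rng.choices(list(LAST_NAMES), weights=list(LAST_NAMES.values()), k=1)[0]
    return f"{first.title()} {last.title()}"


rng = random.Random()  # pass a seed here for repeatable output
for _ in range(3):
    print(random_name(rng))
```

Passing in the random.Random instance keeps the function easy to test
with a fixed seed.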

It seems (at least from where I sit) that the Web site that is supposed
to have more information about the whole thing is unreachable; I get 17
hops before my packets to the thing enter some form of black hole on the
Internet in Canada.  Oops.

Anyway, it's interesting, though of only limited use, I think; I don't
know that it contains enough information (by itself) to be harmful,
though I suppose that if you could combine it with other databases that
have additional data, it could be potentially detrimental.

One thing that I had expected to see based on all the chatter about it
was some form of relationship graph, say, showing who has friended who
on Facebook.  That would be something that I could see companies easily
(ab)using for things like debt collection purposes.  However, that sort
of data doesn't seem to be present, which I would consider to be a good
thing.

	--- Mike


