[ale] language(locale) detection in gmail

Jerry Yu jjj863 at gmail.com
Fri Oct 6 10:35:48 EDT 2006


The email has a few photos attached. Somewhere in the headers I found
"Content-Type: text/html; charset=GB2312". GB2312 is a charset for
encoding Simplified Chinese, most commonly used in mainland China. So
one couldn't really mistake it for a Japanese encoding if this line
were used for detection, rather than the UTF-8 charset declarations
found elsewhere in the headers.
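
To make the idea concrete, here is a rough Python sketch of how a
detector could use that header. It is only my guess at one possible
approach, not how gmail actually does it. The CHARSET_LANGUAGE table,
the function names and the language tags are all made up for
illustration. Legacy charsets such as GB2312 or Shift_JIS point at a
language directly; for UTF-8 the sketch falls back to looking for kana,
which Japanese text normally contains and Chinese text does not.

```python
from email import message_from_string

# Hypothetical table: legacy charsets that strongly imply one language.
CHARSET_LANGUAGE = {
    "gb2312": "zh-CN",
    "gbk": "zh-CN",
    "big5": "zh-TW",
    "shift_jis": "ja",
    "euc-jp": "ja",
    "iso-2022-jp": "ja",
    "euc-kr": "ko",
}


def charsets_in(raw_message):
    """Return the declared charset of every text/* MIME part."""
    msg = message_from_string(raw_message)
    return [
        part.get_content_charset()  # already lowercased by the email module
        for part in msg.walk()
        if part.get_content_maintype() == "text" and part.get_content_charset()
    ]


def has_kana(text):
    """True if text contains hiragana or katakana (U+3040..U+30FF)."""
    return any("\u3040" <= ch <= "\u30ff" for ch in text)


def has_han(text):
    """True if text contains CJK unified ideographs (U+4E00..U+9FFF)."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)


def guess_language(charsets, text=""):
    """Prefer a language-specific charset; otherwise inspect the characters."""
    for charset in charsets:
        if charset in CHARSET_LANGUAGE:
            return CHARSET_LANGUAGE[charset]
    # Heuristic only: a short Japanese text written entirely in kanji
    # would be misreported as Chinese here.
    if has_kana(text):
        return "ja"
    if has_han(text):
        return "zh"
    return None


raw = "Content-Type: text/html; charset=GB2312\n\n<p>hi</p>\n"
print(guess_language(charsets_in(raw)))
```

With the header from my email this reports Simplified Chinese without
looking at the body at all. A detector that only saw a UTF-8
declaration would have to rely on the characters themselves.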



On 10/6/06, Michael B. Trausch <fd0man at gmail.com> wrote:
>
> Jerry Yu wrote:
> > Reading an email in Chinese, I noticed that all sponsored links served
> > by gmail are in Japanese. With my limited Japanese training (one year in
> > college), I can tell the links, albeit in the wrong language, are actually
> > pertinent to the content of the email.
> > This leads to a question: does anybody know how Google, or anybody for
> > that matter, detects the locale (charset encoding?), given a chunk of text?
> >
>
> My guess -- and mind you, this is only a guess -- is that it would be
> something to do with a combination of headers, characters and words.
> For example, the e-mails that I compose are in UTF-8, even though I
> mostly use the ASCII characters used by Latin-alphabet languages.
>
> The messages that I send go out with the following header in the
> plain-text portion of the e-mail:
>
> Content-Type: text/plain; charset=UTF-8
>
> Which tells whatever software is reading it that it can expect to find
> plain text encoded as UTF-8 in that section.  From there, all it
> needs to do is identify the language.  If I were to put a bunch of
> Chinese/Japanese characters in my e-mail, it would likely identify that,
> because those characters live only in a subset of the characters that
> make up Unicode, just as these ASCII letters that I am typing do.
>
> As I understand it, Chinese and Japanese share a large set of written
> characters, and if that is the case, it is possible that it is detecting
> the language based on the characters that are used, and pulling
> advertisements from their files that consist of the same subset of
> Unicode characters, even though it is a different language.
>
>         -- Mike
>
> --
> Michael B. Trausch <fd0man at gmail.com> - Jabber: fd0man at livejournal.com
>
> Demand freedom: Use open and free protocols, standards, and software.
>