What have you found for these years?

2008-06-25

ruby character encoding detection

It started out as Mozilla's code:
Mozilla Charset Detectors
source code
related papers

Someone ported it to Python:
http://chardet.feedparser.org/

Very nice, so let's have it ported to Ruby as well:
http://rubyforge.org/projects/chardet/

There doesn't seem to be much documentation, so here is an example:

> sudo gem install chardet
Password:
Successfully installed chardet-0.9
1 gem installed

> irb
irb(main):001:0> require 'rubygems'; require 'UniversalDetector'
=> true
irb(main):002:0> require 'open-uri'
=> true
irb(main):003:0> UniversalDetector.encoding open('http://www.falcom.co.jp').read
=> "SHIFT_JIS"
irb(main):004:0> UniversalDetector.encoding open('http://godfat.org').read
=> "utf-8"
irb(main):005:0> UniversalDetector.encoding open('http://www.blizzard.com').read
=> "ascii"
irb(main):006:0> UniversalDetector.encoding open('http://sysoev.ru/nginx').read
=> "KOI8-R"
irb(main):007:0> UniversalDetector.encoding open('http://www.softstar.com.tw').read
=> "Big5"
irb(main):008:0> UniversalDetector.encoding open('http://www.baidu.com').read
=> "GB2312"
irb(main):009:0> UniversalDetector.encoding open('http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html').read
=> "ISO-8859-2"

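I don't know how it performs, and setting aside the inconsistent letter case of the returned names, the accuracy seems decent.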
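
Since the detector reports names in mixed case ("SHIFT_JIS", "utf-8", "Big5"), here is a minimal sketch of how the result might be used afterwards. It assumes Ruby 2.0 or later, which is newer than the Ruby used in the session above. `same_label?` and `to_utf8` are hypothetical helpers written for this post, not part of the chardet gem, and the sample bytes stand in for a downloaded page so the sketch runs without the gem or a network connection.

```ruby
# Hypothetical helpers, not part of the chardet gem; assumes Ruby 2.0 or later.

# Compare two encoding labels without regard to letter case,
# e.g. "utf-8" versus "UTF-8".
def same_label?(a, b)
  a.casecmp(b) == 0
end

# Turn raw bytes plus a detector-reported label into a UTF-8 string.
def to_utf8(bytes, detected_label)
  # Encoding.find matches names case-insensitively and raises
  # ArgumentError when the label is unknown to Ruby.
  source = Encoding.find(detected_label)
  bytes.dup.force_encoding(source)
       .encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
end

# Stand-in for a downloaded page: the Shift_JIS bytes of three kanji.
raw = "\u65E5\u672C\u8A9E".encode("Shift_JIS").b

# "SHIFT_JIS" is the label style seen in the irb session above.
puts to_utf8(raw, "SHIFT_JIS")
```

With the gem installed, the label would presumably come from `UniversalDetector.encoding(raw)` as in the session above. Any label Ruby does not recognize makes `Encoding.find` raise, so a real script would want to rescue that.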

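btw, although I dislike every encoding other than Unicode,
I hope these sites don't change their encodings, or they won't match the examples here anymore XD
(Actually, I thought it would be hard to find a Big5 site by now...)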

2 retries:

MB said...

Good thing they really haven't changed it (shh~)

I mean the 畢個拜 one~

Lin Jen-Shin (godfat) said...

hehe
long live 畢個拜



All texts are licensed under CC Attribution 3.0