星之一角: module XhtmlFormatter

(updated: maybe try adding coderay support for <code lang="ruby" style="twilight"> or
<code lang="c++">, and drop the code tag once it has been translated)

To meet an odd requirement, I brute-forced an xhtml formatter with hpricot...
The input is an article; the output is (x)html code that can be used as a comment.

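format_article 'article text', :a, :img, :pre, :b, :em, :strong, :i
The first argument is the article content; the rest are all the allowed tags.
For tags that are not allowed, the < gets escaped.
For now only < is escaped, to &lt;. The reason is...
(updated: & inside pre is escaped too)
I forget @@ Anyway, when I tested just now, escaping everything caused some minor trouble...
So there may be a few bugs around & and the like; this needs more testing.
(Honestly I'm a bit annoyed, and on top of that I couldn't concentrate all day...)
(In other words, I'm in bad shape.)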
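
To make the allow-list idea concrete, here is a minimal sketch that does not use hpricot at all. The helper name escape_disallowed_tags is made up for this note and is not part of XhtmlFormatter; it only assumes that "escaping a tag" means turning its < into &lt; while leaving > and & alone, as described above.

```ruby
require 'set'

# Hypothetical helper (not part of XhtmlFormatter): escapes the '<' of every
# tag whose name is not in the allow-list; '>' and '&' are left untouched.
def escape_disallowed_tags(html, *allowed_tags)
  allowed = Set.new(allowed_tags.map(&:to_s))
  # Match the start of an opening or closing tag: '<', optional '/', tag name.
  html.gsub(%r{<(/?)([a-zA-Z][a-zA-Z0-9]*)}) do
    slash, name = $1, $2
    allowed.include?(name.downcase) ? "<#{slash}#{name}" : "&lt;#{slash}#{name}"
  end
end

puts escape_disallowed_tags('<b>bold</b> <script>alert(1)</script>', :b, :em)
# prints: <b>bold</b> &lt;script>alert(1)&lt;/script>
```

A plain regexp like this cannot repair broken markup, which is presumably why the real module walks an hpricot tree instead.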

I'll add it to ludy some other day...

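The conversion rules are as follows:
1. Except inside blocks wrapped in pre, every \n, \r\n, and \r is replaced with <br />
2. Inside blocks wrapped in pre, html is not allowed; every < and & is escaped
3. Because of the rule above, and because the regexp is greedy, only the outermost pre takes effect
4. If you forget to close a pre, the closing tag is added automatically at the very end of the article
5. All URLs are converted into a href links; the regexp is taken from drupal...
6. URLs include strings starting with the various protocols, addresses starting with www., and email addresses.
7. URLs inside blocks wrapped in pre are converted into links as well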
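
Rules 1 to 4 can be sketched without hpricot. The method names below (escape_pre, newlines_to_br, format_sketch) are made up for illustration, and splitting the text around the pre block is my own shortcut here, not how the module does it.

```ruby
# Hypothetical standalone sketch of rules 1-4; none of these names exist
# in XhtmlFormatter.
def escape_pre(html)
  # Rule 4: close a forgotten <pre> at the very end of the article.
  html += '</pre>' if html =~ /<pre>/i && html !~ %r{</pre>}i
  # Rules 2-3: (.*) is greedy, so it spans from the first <pre> to the last
  # </pre>; any nested <pre> ends up escaped as plain text.
  html.gsub(%r{<pre>(.*)</pre>}mi) do
    "<pre>#{$1.gsub('&', '&amp;').gsub('<', '&lt;')}</pre>"
  end
end

def newlines_to_br(text)
  # Rule 1: normalise \r\n and \r to \n, then turn each \n into <br />.
  text.gsub("\r\n", "\n").tr("\r", "\n").gsub("\n", '<br />')
end

def format_sketch(html)
  # Split around the single outermost pre block; only text outside it gets <br />.
  escape_pre(html).split(%r{(<pre>.*</pre>)}mi).map { |part|
    part =~ /\A<pre>/i ? part : newlines_to_br(part)
  }.join
end

puts format_sketch("line1\r\nline2<pre>if a < b && c")
# prints: line1<br />line2<pre>if a &lt; b &amp;&amp; c</pre>
puts format_sketch("<pre>a<pre>b</pre>c</pre>")
# prints: <pre>a&lt;pre>b&lt;/pre>c</pre>
```

The second call shows rule 3: the inner pre survives only as escaped text, because the greedy match already swallowed it.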

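hpricot is pretty powerful, and it even repairs some html automatically,
but I'd rather pass on its interface; I'll skip the details (right now I just want to take notes)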

source code (a good chance to see how this turns out!):


require 'set'
require 'rubygems'
require 'hpricot'

# 2008-05-09 godfat
module XhtmlFormatter
  module_function
  def format_article html, *allowed_tags
    allowed_tags = Set.new allowed_tags
    XhtmlFormatter.format_article_elems Hpricot.parse(
      XhtmlFormatter.escape_all_inside_pre(html, allowed_tags)), allowed_tags
  end

  def format_autolink html
    doc = Hpricot.parse html
    doc.each_child{ |c|
      next unless c.kind_of?(Hpricot::Text)
      c.content = format_url c.content
    }
    doc.to_html
  end

  def format_url text
    # translated from drupal-6.2/modules/filter/filter.module
    # Match absolute URLs.
    text.gsub(
%r{((http://|https://|ftp://|mailto:|smb://|afp://|file://|gopher://|news://|ssl://|sslv2://|sslv3://|tls://|tcp://|udp://|www\.)([a-zA-Z0-9@:%_+*~#?&=.,/;-]*[a-zA-Z0-9@:%_+*~#&=/;-]))([.,?!]*?)}i){ |match|
      url = $1 # first capture group: the URL without trailing punctuation
      caption = XhtmlFormatter.trim url
      if url =~ %r{\Awww\.}i # Match www domains/addresses: they need a scheme.
        '<a href="http://'+url+'" title="'+url+'">'+caption+'</a>'
      else # Absolute URLs already carry their own scheme (https://, ftp://, ...).
        '<a href="'+url+'" title="'+url+'">'+caption+'</a>'
      end

    # Match e-mail addresses.
    }.gsub( %r{([A-Za-z0-9._-]+@[A-Za-z0-9._+-]+\.[A-Za-z]{2,4})([.,?!]*?)}i,
            '<a href="mailto:\1">\1</a>')
  end

  def format_newline text
    # windows: \r\n
    # mac os 9: \r
    text.gsub("\r\n", "\n").tr("\r", "\n").gsub("\n", '<br />')
  end

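  # NOTE: `private` has no effect on the `def self.` methods below; they stay
  # public, which is why they can be called as XhtmlFormatter.trim and so on.
  private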
  def self.trim text, length = 50
    # Reserve 3 characters of the length for the trailing '...'.
    if text.size <= 3
      '...'
    elsif text.size > length
      "#{text[0...length-3]}..."
    else
      text
    end
  end
  def self.escape_all_inside_pre html, allowed_tags
    return html unless allowed_tags.member? :pre
    # don't bother nested pre, because we escape all tags in pre
    # only append the closing tag when a <pre> was opened but never closed
    html = html + '</pre>' if html =~ /<pre>/i && html !~ %r{</pre>}i
    html.gsub(%r{<pre>(.*)</pre>}mi){
      # stop escaping for '>' because drupal's url filter would make &gt; into url...
      # is there any other way to get $1?
      "<pre>#{XhtmlFormatter.escape_lt(XhtmlFormatter.escape_amp($1))}</pre>"
    }
  end
  def self.format_article_elems elems, allowed_tags = Set.new, no_format_newline = false
    elems.children.map{ |e|
      if e.kind_of?(Hpricot::Text)
        if no_format_newline
          format_url(e.content)
        else
          format_newline format_url(e.content)
        end
      elsif e.kind_of?(Hpricot::Elem)
        if allowed_tags.member? e.name.to_sym
          if e.empty?
            e.to_html
          else
            e.stag.inspect +
              XhtmlFormatter.format_article_elems(e, allowed_tags, e.stag.name == 'pre') +
              (e.etag || Hpricot::ETag.new(e.stag.name)).inspect
          end
        else
          if e.empty?
            XhtmlFormatter.escape_lt(e.stag.inspect)
          else
            XhtmlFormatter.escape_lt(e.stag.inspect) +
              XhtmlFormatter.format_article_elems(e, allowed_tags) +
              XhtmlFormatter.escape_lt((e.etag || Hpricot::ETag.new(e.stag.name)).inspect)
          end
        end
      end
    }.join
  end
  def self.escape_amp text
    text.gsub('&', '&amp;')
  end
  def self.escape_lt text
    text.gsub('<', '&lt;')
  end
end
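
As a quick check of rules 5 and 6, here is a simplified, standalone sketch of the link conversion. It is not the listing's format_url: the name autolink_sketch is made up, only a few schemes are kept, and the e-mail pass and caption trimming are left out. The character classes are the ones taken from drupal.

```ruby
# Hypothetical, simplified stand-in for the URL rule (fewer schemes, no
# e-mail pass, no caption trimming).
def autolink_sketch(text)
  # The last character class omits '.', ',' and '?', so punctuation that
  # ends a sentence is not swallowed into the link.
  url_re = %r{(?:https?://|ftp://|www\.)[a-zA-Z0-9@:%_+*~#?&=.,/;-]*[a-zA-Z0-9@:%_+*~#&=/;-]}i
  text.gsub(url_re) do |url|
    # Only bare www. addresses need a scheme prepended to the href.
    href = url =~ /\Awww\./i ? "http://#{url}" : url
    %(<a href="#{href}" title="#{url}">#{url}</a>)
  end
end

puts autolink_sketch('see www.example.com, or https://example.org/a?b=1.')
```

The comma after www.example.com and the final period both stay outside the generated links, and only the www. address gets http:// added to its href.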

2008-05-09
