Sanitizing UTF8 in Ruby

Rails 2.3.4 was released largely to address an XSS exploit where user input containing specially crafted UTF8 could bypass the normal sanitize helpers.

The problem is that a string like

"\xF0<script"

would be interpreted by Ruby as

["\xf0<", "s", "c", "r", "i", "p", "t"]

However, a web browser might perform additional cleanup, discard the invalid byte, and interpret the rest as a script tag. Two further, though less serious, problems: such input can sometimes cause Hpricot to segfault, and libxml-ruby will raise an exception.
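
To make the mismatch concrete, here is a rough sketch that looks at the raw bytes. It leans on two facts about UTF8: a lead byte in the range 0xF0 to 0xF4 announces a four-byte sequence, and every byte after the lead must have the bit pattern 10xxxxxx. The helper names expected_utf8_length and continuation_byte? are made up for this illustration, and the sketch assumes a reasonably modern Ruby (2.0 or later).

```ruby
# Number of bytes a UTF8 sequence should have, judged by its lead byte.
# Returns nil for bytes that can never start a valid sequence.
def expected_utf8_length(lead_byte)
  case lead_byte
  when 0x00..0x7F then 1   # plain ASCII
  when 0xC2..0xDF then 2
  when 0xE0..0xEF then 3
  when 0xF0..0xF4 then 4
  end
end

# Continuation bytes always look like 0b10xxxxxx.
def continuation_byte?(byte)
  (byte & 0xC0) == 0x80
end

bytes = "\xF0<script".bytes.to_a
lead = bytes.first
following = bytes[1, 3]

expected_length = expected_utf8_length(lead)                # 4
valid_tail = following.all? { |b| continuation_byte?(b) }   # false, "<" is 0x3C
```

So \xF0 promises three continuation bytes and none arrive. A lenient parser can glue the next byte onto it, while a stricter one throws the lone byte away, and that difference is exactly where the "<" goes missing from the sanitizer's view.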

The fix was to strip invalid UTF8 byte sequences before sending output to the browser. If you need this outside of the SanitizeHelper methods, ActiveSupport::Multibyte::clean() and ActiveSupport::Multibyte::verify() were added for that purpose.

If you need something similar but aren’t running Rails or can’t upgrade to 2.3.4, here’s a solution. You can split a string into its UTF8 characters by splitting on an empty regex with the u (UTF8) flag.

"\xF0<script".split(//u)

Then grep the pieces against a regex (borrowed from Instiki) that matches only valid UTF8 characters:

UTF8_REGEX = /\A(
       [\x09\x0A\x0D\x20-\x7E]            # ASCII
     | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
     |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
     | [\xE1-\xEC\xEE][\x80-\xBF]{2}      # straight 3-byte
     |  \xEF[\x80-\xBE]{2}                # 
     |  \xEF\xBF[\x80-\xBD]               # excluding U+fffe and U+ffff
     |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
     |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
     | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
     |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
   )*\Z/nx;
"\xF0<script".split(//u).grep(UTF8_REGEX).join # => "cript"

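The split-and-grep trick above depends on how the Ruby of that era tokenized broken input. As a hedged alternative, assuming Ruby 2.1 or later (this is not what Rails 2.3.4 itself does), the same "keep only well-formed characters" idea can be sketched with String#each_char, String#valid_encoding? and String#scrub:

```ruby
raw = "\xF0<script"

# Walk the string character by character and keep only the well-formed ones,
# mirroring the split-then-grep idea.
kept = raw.each_char.select(&:valid_encoding?).join   # "<script"

# String#scrub (Ruby 2.1+) replaces each invalid byte sequence; an empty
# replacement simply drops it.
scrubbed = raw.scrub("")                              # "<script"
```

Note that here the stray \xF0 is treated as a one-byte invalid character of its own, so the "<" survives as a separate character. That matches what a browser would see, which means the cleaned string still has to go through your normal HTML sanitizing.
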
Once the invalid bytes are gone, you can safely sanitize your HTML and expect web browsers to interpret it the same way Ruby does.
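
The order of operations matters: clean first, sanitize second. Here is a minimal sketch of that order, assuming Ruby 2.1 or later and using ERB::Util.html_escape from the standard library as a stand-in for whatever sanitizer you actually run. The helper name clean_then_escape is hypothetical.

```ruby
require "erb"

# Hypothetical helper: drop invalid UTF8 first, then HTML-escape what is left.
def clean_then_escape(raw)
  cleaned = raw.scrub("")            # String#scrub needs Ruby 2.1+
  ERB::Util.html_escape(cleaned)
end

safe = clean_then_escape("\xF0<script")   # "&lt;script"
```

Because the invalid byte is removed before escaping, the escaper and the browser agree on where the "<" is.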