How to remove non-UTF8 characters from strings with PHP

A short post, this one. I had some trouble generating an XML feed from a database of over 10,000 listings and needed to remove non-UTF-8 characters from the feed. Well, PHP to the rescue. There are many ways of doing this, but below are some regexes that I have tried and tested and that work pretty well.

// Strip every byte outside the ASCII range \x20-\x7F (note: this also removes valid multi-byte UTF-8 characters)
$string = preg_replace('/[^\x20-\x7F]+/', '', $string);
// Reject overly long 2-byte sequences, as well as characters above U+10000, and replace them with ?
$string = preg_replace('/[\x00-\x08\x10\x0B\x0C\x0E-\x19\x7F]|[\x00-\x7F][\x80-\xBF]+|([\xC0\xC1]|[\xF0-\xFF])[\x80-\xBF]*|[\xC2-\xDF]((?![\x80-\xBF])|[\x80-\xBF]{2,})|[\xE0-\xEF](([\x80-\xBF](?![\x80-\xBF]))|(?![\x80-\xBF]{2})|[\x80-\xBF]{3,})/S', '?', $string);
// Reject overly long 3-byte sequences and UTF-16 surrogates and replace them with ?
$string = preg_replace('/\xE0[\x80-\x9F][\x80-\xBF]|\xED[\xA0-\xBF][\x80-\xBF]/S', '?', $string);
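
Another way to look at the same problem is to whitelist instead of blacklist: describe what a well-formed UTF-8 sequence looks like and drop every byte that does not fit. The sketch below is only an illustration of that idea, not the code I ran on the feed. The function name keep_valid_utf8 is made up for this example, and it assumes you also want ASCII control characters (other than tab, LF and CR) and DEL removed, which tends to suit XML output.

```php
<?php
// Hypothetical helper (name invented for this example): keeps only
// well-formed UTF-8 sequences and silently drops every other byte.
// Returns null if PCRE itself reports an error.
function keep_valid_utf8(string $input): ?string
{
    $pattern = '/
        (
            [\x09\x0A\x0D\x20-\x7E]            # tab, LF, CR and printable ASCII
          | [\xC2-\xDF][\x80-\xBF]             # 2-byte sequences, no overlongs
          | \xE0[\xA0-\xBF][\x80-\xBF]         # 3-byte sequences, no overlongs
          | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # ordinary 3-byte sequences
          | \xED[\x80-\x9F][\x80-\xBF]         # 3-byte sequences, no surrogates
          | \xF0[\x90-\xBF][\x80-\xBF]{2}      # 4-byte sequences, planes 1 to 3
          | [\xF1-\xF3][\x80-\xBF]{3}          # 4-byte sequences, planes 4 to 15
          | \xF4[\x80-\x8F][\x80-\xBF]{2}      # 4-byte sequences, plane 16
        )
      | .                                      # any other single byte: dropped
    /xs';

    // Group 1 is empty whenever the fallback "." branch matched,
    // so replacing with $1 deletes the stray byte.
    return preg_replace($pattern, '$1', $input);
}

// Usage: an overlong sequence (\xC0\xAF) and a stray \xFF are removed,
// while the valid two-byte character (\xC3\xA9) survives.
$clean = keep_valid_utf8("caf\xC3\xA9 \xC0\xAFok\xFF");
var_dump($clean === "caf\xC3\xA9 ok"); // bool(true)
```

Unlike the regexes above, this version deletes the offending bytes rather than replacing them with ?, and it keeps 4-byte characters (U+10000 and up) instead of rejecting them. If you need the ? marker, the blacklist patterns are the ones to use.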

Should these not work for you, leave a comment!
