I spent way too long this weekend on a problem that turned out to have a very simple solution. I suspect part of the cause was that I use the CodeIgniter framework, which does so much of the hard work for you that it’s easy to get complacent.
I have been working with text files that contain multi-byte characters. I had already made sure that my database and tables were set up for UTF-8 and that everything in CodeIgniter was correctly configured. Yet still I was getting invalid character errors on the database insert.
As the text files came in varying formats, including Excel’s Unicode CSV format, I had already made sure that reading each file included a conversion to UTF-8. Thanks to the script on Practical Web Ltd, I was detecting the format of each file and converting it to UTF-8 on the fly. Yet still I was getting invalid character errors on the database insert.
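To give a rough idea of what that kind of on-the-fly conversion looks like, here is a minimal sketch. It is not the Practical Web Ltd script: the function name to_utf8 is made up for illustration, it only looks at the byte order mark at the start of the data, and the Windows-1252 fallback is purely an assumption about what the leftover files might contain.

```php
<?php
// Hypothetical helper (not the Practical Web Ltd script): convert raw file
// contents to UTF-8 by checking for a byte order mark (BOM) first.
function to_utf8(string $raw): string
{
    // UTF-8 BOM: the data is already UTF-8, just drop the marker.
    if (strncmp($raw, "\xEF\xBB\xBF", 3) === 0) {
        return substr($raw, 3);
    }
    // UTF-16 little-endian BOM.
    if (strncmp($raw, "\xFF\xFE", 2) === 0) {
        return mb_convert_encoding(substr($raw, 2), 'UTF-8', 'UTF-16LE');
    }
    // UTF-16 big-endian BOM.
    if (strncmp($raw, "\xFE\xFF", 2) === 0) {
        return mb_convert_encoding(substr($raw, 2), 'UTF-8', 'UTF-16BE');
    }
    // No BOM: keep the data if it is already valid UTF-8.
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw;
    }
    // Assumption: anything else is treated as Windows-1252.
    return mb_convert_encoding($raw, 'UTF-8', 'Windows-1252');
}
```

Note that every mbstring call here names its encodings explicitly, so the result does not depend on any configured default.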
I even ran through my code line by line and checked for any string manipulation that used string functions which are not multi-byte safe. Yet still I was getting invalid character errors on the database insert.
If I had any decent amount of hair left, I would certainly have pulled it all out by the time I figured out what was wrong. I only discovered the answer by accident when I decided to remove the string manipulation altogether. As soon as I did that, it worked a treat. Had I discovered a bug in the multibyte string functions? No.
I had not checked the default encoding of mbstring.
So please, make sure it is on your checklist of things to do when dealing with multi-byte strings. Either set the default encoding correctly, or religiously pass the encoding parameter to the multi-byte string functions.
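For anyone who wants to see the difference, here is a small sketch of both options, along with the kind of damage a wrong default can do. The sample string is my own, and ISO-8859-1 simply stands in for whatever single-byte default a server might have. As far as I know, recent PHP versions take the default from default_charset, which is normally UTF-8, but I would not rely on that for an older install.

```php
<?php
// "naïve café" written with escapes so the example does not depend on how
// this file is saved: 10 characters, 12 bytes in UTF-8.
$word = "na\xC3\xAFve caf\xC3\xA9";

// What goes wrong: treated as a single-byte encoding, the first three
// "characters" are really three bytes, which cuts the two-byte "ï" in half.
$broken = mb_substr($word, 0, 3, 'ISO-8859-1');
$brokenIsValid = mb_check_encoding($broken, 'UTF-8'); // false

// Option 1: set the default once, early in the request.
mb_internal_encoding('UTF-8');
$length = mb_strlen($word); // counts characters, not bytes

// Option 2: pass the encoding explicitly on every call.
$prefix = mb_substr($word, 0, 3, 'UTF-8'); // "naï", still valid UTF-8
```

A truncated string like $broken is exactly the sort of value that a UTF-8 database column will reject on insert.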
Even better, you could use the great checklist on nicknettleton.com (see below), which seems to cover everything.
I totally deserved the dunce hat.
Edit: Looks like the link on nicknettleton.com is no longer available (thanks @Les). A little digging around led me to the same PHP UTF-8 checklist on another site.
@Les Thanks for pointing out the dead link. I have updated the post – I found the same article on another website. I hope it is of some use.
Your link to nicknettleton is also broken. Yours isn’t the only one, though, as many other blogs link to the same site too.
If you cannot resolve the url, could you remove it?
Thanks
Thanks for pointing out the bad link.
Your link to CodeIgniter is wrong. The correct URL is: http://codeigniter.com.
I also struggled with UTF-8 over the weekend. The cheatsheet is very helpful!