posted by kevin on October 21, 2009

Microsoft Smart Quotes & PHP

It seems like I run into this issue on over half of the websites I work on. The user wants to copy and paste an article, document, or whatever from MS Word into a textarea and save it. The problem is that Word uses funky quotes, dashes, and other characters. Once the form is submitted, PHP receives the text, the characters end up encoded differently, and they display incorrectly.

We've tried several methods to eliminate the problem, but whenever I googled for a fix, I never found anything that worked well.

My boss suggested that we could fix it on the client side, so I dove in and found what seems to be a promising fix. It's clean and simple. The function currently replaces only the single and double smart quotes and the strange dash character (the em dash) that MS Word uses. Feel free to submit more character conversion codes and I'll add them to the function; I'll also add to it as I come across more problems.

JavaScript function to replace Microsoft smart quotes with regular quotes.

 
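function removeMSWordChars(str) {
    // Lookup table: Word character code -> ASCII replacement code.
    var myReplacements = {};
    var myCode, c;
    myReplacements[8216] = 39; // left single smart quote -> '
    myReplacements[8217] = 39; // right single smart quote -> '
    myReplacements[8220] = 34; // left double smart quote -> "
    myReplacements[8221] = 34; // right double smart quote -> "
    myReplacements[8212] = 45; // em dash -> -
    // c is declared locally so the loop no longer leaks a global variable.
    for (c = 0; c < str.length; c++) {
        myCode = str.charCodeAt(c);
        if (myReplacements[myCode] !== undefined) {
            // Swap the character in place; the string length stays the same.
            str = str.substr(0, c) + String.fromCharCode(myReplacements[myCode]) + str.substr(c + 1);
        }
    }
    return str;
}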
 

This is the jQuery that runs the filter on every textarea on your page when you tab away from it. (It assumes jQuery is loaded on the page.)

 
$(function(){
    $("textarea").blur(function(){
        $(this).val(removeMSWordChars(this.value));
    });
});
 




Or, if you don't use jQuery and you're a little green to JavaScript, you can do this:

 
<textarea onblur="this.value = removeMSWordChars(this.value);" name="a" rows="5" cols="10"></textarea>
 

19 comments to "Replace Microsoft Word smart quotes and other characters with Javascript"

#78
November 11, 2009 at 11:02 pm
Consider using this: str = str.value.replace(new RegExp(String.fromCharCode(8216), 'g'), "'" );
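The one-line replace in the comment above handles a single character. A minimal sketch of extending the same regular-expression idea to several characters in one pass might look like the following. The names `wordCharMap`, `wordCharPattern`, and `replaceSmartQuotes` are hypothetical and not part of the original post; the codes are the five the post's function handles.

```javascript
// Hypothetical lookup table: Word character code -> plain replacement.
var wordCharMap = {
    8216: "'",  // left single smart quote
    8217: "'",  // right single smart quote
    8220: '"',  // left double smart quote
    8221: '"',  // right double smart quote
    8212: "-"   // em dash
};

// Build one character class containing every character in the table.
var wordCharPattern = new RegExp(
    "[" + Object.keys(wordCharMap).map(function (code) {
        return String.fromCharCode(Number(code));
    }).join("") + "]",
    "g"
);

function replaceSmartQuotes(str) {
    // The callback receives each matched character and looks up its replacement.
    return str.replace(wordCharPattern, function (ch) {
        return wordCharMap[ch.charCodeAt(0)];
    });
}
```

Building the pattern from the table keeps the list of characters in one place, so adding a new code only means adding one line to `wordCharMap`.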
#83
gusti says:
November 19, 2009 at 03:22 am
You can also try this one: http://www.dancrintea.ro/doc-to-pdf/
#84
November 19, 2009 at 08:54 am
Gusti, how would a Document to PDF generator using JAVA work for this purpose?
#100
colin says:
January 4, 2010 at 12:16 pm
Just a big-picture question: Why use the jQuery code when the "green" version is preferable? Neither form allows for code reuse, but the green version is shorter, simpler, and more self-documenting.
#101
January 4, 2010 at 12:32 pm
The difference is that I can create an app.js file or something similar that is loaded on every page from our header or template PHP file. When I created this, our application had over 120 forms with probably around 300 textareas in total, and this function is safe to bind to each one. So that jQuery code binds the action to every textarea site-wide. Otherwise I'd have to put the 'onblur' action on every single textarea I wanted the function to run on.

jQuery syntax is a bit different if you're not used to using it, but it's very powerful. Hope that answers your question.
#107
Phil says:
February 25, 2010 at 10:54 am
Great code!! I have been looking for something like this for a long time... Thanks!
#118
Somit Baranwal says:
March 30, 2010 at 04:22 am
The special dash character is not working. When I copied and pasted from Microsoft Word, the dash character got converted to some special hyphen character and does not change on blur. This is the string I entered: "he is –there"
#122
Nick Ippoliti says:
April 23, 2010 at 11:26 am
Kevin, this worked very well for me. What about bullets from Word doc? They also render the goofy chars. Thanks.
#134
mark says:
May 24, 2010 at 07:00 am
I decided to use it with an onInput handler instead of onBlur and it works well. There are many more Word characters out there, so I'll be able to build on this as we run into them. Thanks.
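A sketch of the `input`-event variant described above, written without jQuery. `bindWordFilter` is a hypothetical helper, not code from the post; it assumes the field exposes `addEventListener` and `value`, as a standard textarea does, and it takes the filter function (for example `removeMSWordChars`) as a parameter.

```javascript
function bindWordFilter(field, filter) {
    // Run the filter on every change instead of waiting for blur.
    field.addEventListener("input", function () {
        var cleaned = filter(field.value);
        // Only write back when something changed, since assigning to
        // value typically moves the caret to the end of the field.
        if (cleaned !== field.value) {
            field.value = cleaned;
        }
    });
}
```

On a page it might be wired up with something like `bindWordFilter(document.querySelector("textarea"), removeMSWordChars);`.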
#136
June 1, 2010 at 08:56 pm
Kevin, if I had a hero, you would be it for today. I've been trying to solve this problem for weeks, and the only thing I've found is, "Microsoft is non-standard, so there's no standard way of fixing it." Thanks. It seems to work fine.
#137
June 1, 2010 at 08:58 pm
Hahaha, thanks Aaron! I had this issue for years, I had tried to solve it many different ways, but found this to really work the best. I'm glad to see this is helping out others.
#143
June 10, 2010 at 09:00 am
Cool code. I've rewritten it in PHP and am using it after a user posts HTML content in the admin panel of my CMS. Thanks.
#144
June 10, 2010 at 12:40 pm
Thanks for this - I do find it's really common for people to paste MS Word content into textareas, and the use of jQuery makes applying this fix across the site really simple. I'm sure the list of characters will need adding to, but it's a great start.
#145
June 10, 2010 at 12:47 pm
Please let me know what other characters you run into; I'll add them to the list and give you credit. Thanks!
#146
June 11, 2010 at 03:15 am
No problem. I think a decent starting point for a list of the more common characters would be...

myReplacements[8211] = "-";
myReplacements[8212] = "-";
myReplacements[8216] = "'";
myReplacements[8217] = "'";
myReplacements[8218] = "'";
myReplacements[8220] = '"';
myReplacements[8221] = '"';
myReplacements[8222] = '"';
myReplacements[8224] = "+";
myReplacements[8226] = ".";
myReplacements[8230] = "...";
myReplacements[8249] = "<";
myReplacements[8250] = ">";

I've used strings rather than numeric codes for the substituted values here just so you can see what's being substituted and to allow multiple characters.
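Because the replacement values in the list above are strings rather than character codes, the `String.fromCharCode` call in the original function no longer applies. A hedged sketch of how the function might be adapted follows; `removeMSWordCharsWithStrings` is a hypothetical name, and the table is copied from the comment above.

```javascript
function removeMSWordCharsWithStrings(str) {
    // Word character code -> replacement string (may be several characters).
    var myReplacements = {
        8211: "-",   // en dash
        8212: "-",   // em dash
        8216: "'",   // left single quote
        8217: "'",   // right single quote
        8218: "'",   // low single quote
        8220: '"',   // left double quote
        8221: '"',   // right double quote
        8222: '"',   // low double quote
        8224: "+",   // dagger
        8226: ".",   // bullet
        8230: "...", // ellipsis
        8249: "<",   // single left angle quote
        8250: ">"    // single right angle quote
    };
    var out = "";
    for (var c = 0; c < str.length; c++) {
        var replacement = myReplacements[str.charCodeAt(c)];
        // Append the replacement if there is one, else the original character.
        out += (replacement !== undefined) ? replacement : str.charAt(c);
    }
    return out;
}
```

Building a separate output string avoids having to adjust the loop index when a replacement is longer than the character it replaces.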
#151
Satish Gadhave says:
June 22, 2010 at 07:40 am
This is working great! Is there any similar trick for data imported from a CSV file, since JavaScript doesn't work there? Thanks.
#158
Sam says:
July 23, 2010 at 10:26 am
I am trying to replace myReplacements[8230] = "..." but I could NOT find the ASCII code for "...". One website suggested 133, but it actually converts to a square. Any help?
#159
July 23, 2010 at 10:40 am
Sam, i created this for you. Find Ascii code
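For context on the question above: the ellipsis that Word inserts is not an ASCII character, so it has no ASCII code. 8230 is its Unicode code point (U+2026), which is what `charCodeAt` reports in JavaScript, while 133 is the same character's byte value in the Windows-1252 encoding, which is why `String.fromCharCode(133)` does not produce an ellipsis. A small sketch for looking up the codes in any pasted string; `listCharCodes` is a hypothetical helper, not the tool linked in the reply.

```javascript
function listCharCodes(str) {
    var codes = [];
    for (var i = 0; i < str.length; i++) {
        // charCodeAt returns the UTF-16 code unit at position i.
        codes.push(str.charCodeAt(i));
    }
    return codes;
}

// "\u2026" is the single-character ellipsis; this logs the codes 97 and 8230.
console.log(listCharCodes("a\u2026"));
```

The replacement itself is just three ordinary periods, each with character code 46.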
#166
lee says:
August 16, 2010 at 11:12 pm
This was just what the dr ordered. Thanks.