posted by kevin on October 21, 2009

Microsoft Smart Quotes & PHP

It seems like I run into this issue on over half of the websites that I work on. The user wants to copy and paste their article, document, or whatever from MS Word into a textarea and save it. The problem is that Word uses funky quotes, dashes, and other characters. Once the form is submitted, PHP gets the text, the characters are encoded differently, and they display weird.

We've tried several different methods to eliminate the problem, but every time I googled for a fix, I never really found anything that worked well.

My boss mentioned that maybe we could fix it on the client side, so I dove in and found what seems to be a promising fix. It's clean and simple. This function currently only replaces the single and double smart quotes, and the strange dash character that MS Word uses. Feel free to submit more character conversion codes and I'll add them to the function. I'll also add to this as I come across more problems.
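
If you want to find the code for a character that is giving you trouble, something like the sketch below may help. `listNonAsciiCodes` is a hypothetical helper name, not part of the function in this post. It reports each character above code 127 with its decimal code (the form `charCodeAt` returns) and its hex code, since the same character is often listed in hex elsewhere (8216 decimal is 2018 in hex).

```javascript
// Hypothetical helper (not part of the original function): lists every
// character in a string whose code is above 127, with decimal and hex codes.
function listNonAsciiCodes(str) {
    var found = [];
    for (var i = 0; i < str.length; i++) {
        var code = str.charCodeAt(i);
        if (code > 127) {
            found.push({
                character: str.charAt(i),
                decimal: code,
                hex: code.toString(16).toUpperCase()
            });
        }
    }
    return found;
}
```

Paste some Word text into a string, run it through the helper, and the `decimal` values are the numbers you would add to the replacement list.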

Javascript function to replace Microsoft Smart Quotes with regular quotes.

 
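function removeMSWordChars(str) {
    // Maps a character code (decimal) to the code of its plain replacement.
    var myReplacements = {};
    var myCode, intReplacement;
    myReplacements[8216] = 39; // left single smart quote  -> '
    myReplacements[8217] = 39; // right single smart quote -> '
    myReplacements[8220] = 34; // left double smart quote  -> "
    myReplacements[8221] = 34; // right double smart quote -> "
    myReplacements[8212] = 45; // long dash                -> -
    // c is declared with var so it does not leak into the global scope.
    for (var c = 0; c < str.length; c++) {
        myCode = str.charCodeAt(c);
        if (myReplacements[myCode] !== undefined) {
            intReplacement = myReplacements[myCode];
            str = str.substr(0, c) + String.fromCharCode(intReplacement) + str.substr(c + 1);
        }
    }
    return str;
}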
 
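
The same idea can be written with string replacements and a single regular expression, which also allows a replacement longer than one character. This is only a sketch: `wordCharMap` and `replaceWordChars` are names made up for illustration, and the character list (which adds the shorter dash, 8211, and the ellipsis, 8230) is an assumption you should adjust to your own content.

```javascript
// Sketch of a string-based variant. The character list is an assumption;
// extend or trim it to suit your content.
var wordCharMap = {
    "\u2013": "-",   // 8211 en dash
    "\u2014": "-",   // 8212 em dash
    "\u2018": "'",   // 8216 left single quote
    "\u2019": "'",   // 8217 right single quote
    "\u201C": '"',   // 8220 left double quote
    "\u201D": '"',   // 8221 right double quote
    "\u2026": "..."  // 8230 ellipsis
};

function replaceWordChars(str) {
    // One pass over the string; the character class lists the same keys as the map.
    return str.replace(/[\u2013\u2014\u2018\u2019\u201C\u201D\u2026]/g, function (ch) {
        return wordCharMap[ch];
    });
}
```

If you add a character to the map, remember to add it to the character class in the regular expression as well, otherwise it will never be matched.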

This is the jQuery that will run the filter on every textarea on your page when you tab away from it (this assumes you have jQuery loaded and running on the page).

 
$(function(){
    $("textarea").blur(function(){
        $(this).val(removeMSWordChars(this.value));
    });
});
 

Removing smart quotes javascript example



Or if you don't use jQuery and you're a little green to javascript you can do this:

 
<textarea onBlur="this.value=removeMSWordChars(this.value);" name="a" rows=5 cols=10></textarea>
 
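
If you don't use jQuery but still want one binding that covers every textarea, a plain JavaScript version might look like the sketch below. `bindCleanerToTextareas` is a hypothetical helper name. It assumes a browser that supports `querySelectorAll` and `addEventListener`, and it only covers textareas that exist when it runs.

```javascript
// Hypothetical helper: attaches a blur handler to every textarea under `root`
// (normally `document`) and runs `cleaner` on the textarea's value.
function bindCleanerToTextareas(root, cleaner) {
    var areas = root.querySelectorAll("textarea");
    for (var i = 0; i < areas.length; i++) {
        areas[i].addEventListener("blur", function (event) {
            // currentTarget is the textarea the handler was attached to.
            var area = event.currentTarget;
            area.value = cleaner(area.value);
        });
    }
    // Return the number of textareas that were bound.
    return areas.length;
}
```

Once the page has loaded, you would call it as `bindCleanerToTextareas(document, removeMSWordChars);`.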

26 comments to "Replace Microsoft Word smart quotes and other characters with Javascript"

#78
November 11, 2009 at 11:02 pm
Consider using this: str = str.replace(new RegExp(String.fromCharCode(8216), 'g'), "'");
#83
gusti says:
November 19, 2009 at 03:22 am
You can also try this one: http://www.dancrintea.ro/doc-to-pdf/
#84
November 19, 2009 at 08:54 am
Gusti, how would a Document to PDF generator using JAVA work for this purpose?
#100
colin says:
January 4, 2010 at 12:16 pm
Just a big-picture question: Why use the jQuery code when the "green" version is preferable? Neither form allows for code reuse, but the green version is shorter, simpler, and more self-documenting.
#101
January 4, 2010 at 12:32 pm
The difference is that I can create an app.js file or something similar that is loaded on every page from our header or template PHP file. When I created this, our application had over 120 forms using probably around 300 total textareas, and this function is safe to bind to each textarea. So that jQuery code binds the action to each textarea site-wide. Otherwise I'd have to put the 'onblur' action on every single textarea I wanted the function to run on.

jQuery syntax is a bit different if you're not used to using it, but it's very powerful. Hope that answers your question.
#107
Phil says:
February 25, 2010 at 10:54 am
Great code!! I have been looking for something like this for a long time... Thanks!
#118
Somit Baranwal says:
March 30, 2010 at 04:22 am
The special dash character is not working. When I copy and paste from Microsoft Word, the dash character gets converted to some special hyphen character and does not change on blur. This is the string I entered: "he is –there"
#122
Nick Ippoliti says:
April 23, 2010 at 11:26 am
Kevin, this worked very well for me. What about bullets from Word doc? They also render the goofy chars. Thanks.
#134
mark says:
May 24, 2010 at 07:00 am
i decided to use it with an onInput command instead of onBlur and it works well. there are many more word characters out there.. so i'll be able to build on this as we run into them. thanks
#136
June 1, 2010 at 08:56 pm
Kevin, if I had a hero, you would be it for today. I've been trying to solve this problem for weeks, and the only thing I've found is, "Microsoft is non-standard, so there's no standard way of fixing it." Thanks. It seems to work fine.
#137
June 1, 2010 at 08:58 pm
Hahaha, thanks Aaron! I had this issue for years, I had tried to solve it many different ways, but found this to really work the best. I'm glad to see this is helping out others.
#143
June 10, 2010 at 09:00 am
Cool code. I've rewritten it in PHP and am using it after the user posts HTML content in the admin panel of my CMS. Thanks.
#144
June 10, 2010 at 12:40 pm
Thanks for this - I do find it's really common for people to paste MS Word content into textareas but the use of jQuery makes applying this fix across the site really simple. I'm sure the list of characters will need adding to but it's a great start.
#145
June 10, 2010 at 12:47 pm
Please let me know what other characters you run into, I'll add it to the list and give you credit. Thanks!
#146
June 11, 2010 at 03:15 am
No problem. I think a decent starting point for a list of the more common characters would be...
myReplacements[8211] = "-";
myReplacements[8212] = "-";
myReplacements[8216] = "'";
myReplacements[8217] = "'";
myReplacements[8218] = "'";
myReplacements[8220] = '"';
myReplacements[8221] = '"';
myReplacements[8222] = '"';
myReplacements[8224] = "+";
myReplacements[8226] = ".";
myReplacements[8230] = "...";
myReplacements[8249] = "<";
myReplacements[8250] = ">";
I've used strings rather than numeric codes for the substituted values here just so you can see what's being substituted and to allow multiple characters.
#151
Satish Gadhave says:
June 22, 2010 at 07:40 am
This is working great! Is there any similar trick for data imported from a CSV file, as JavaScript doesn't work there? Thanks.
#158
Sam says:
July 23, 2010 at 10:26 am
I am trying to replace myReplacements[8230] = "..." but I could NOT find what is the ASCII for "..." , one website suggested 133 but it actually converts to a square..any help ??????
#159
July 23, 2010 at 10:40 am
Sam, i created this for you. Find Ascii code
#166
lee says:
August 16, 2010 at 11:12 pm
This was just what the dr ordered. Thanks.
#212
Debbie says:
November 10, 2010 at 10:54 am
This code is great! I have one question though ... Has anyone had experience with removing the extra line breaks and spaces that result from someone copying/pasting from an Outlook message into a form textarea field? Any ideas would be appreciated. So far, I have only managed to replace \n or \r with a <p> (paragraph) tag, which does not seem to skew the display when there are multiple paragraph tags in the results, but I'm wondering if there is a better way to do it.

#219
November 16, 2010 at 03:08 pm
If you want to use John Patrick's tip on using strings instead of integers for replacements, you need to modify the replace statement to be:
if(myReplacements[myCode] != undefined) {
    intReplacement = myReplacements[myCode];
    str = str.substr(0,c) + myReplacements[myCode] + str.substr(c+1);
}
#221
Swetha says:
November 17, 2010 at 05:34 am
This code works great. Where can I get more codes (like 8212, 8230 etc) for replacement? Which encoding is this based on? Thanks.
#231
Rich says:
December 16, 2010 at 09:02 am
Swetha, If you open a Word doc, then click on "insert" and "symbol", you'll get a dialogue in which you can select each symbol and see the character code for that symbol toward the bottom right of the dialogue.
#232
Rich says:
December 16, 2010 at 09:05 am
Swetha, oops. I'm not sure my suggestion works. The codes in the symbol dialogue do not start with "82" like the codes above. Can anyone enlighten me on this?
#289
Visulis says:
June 5, 2011 at 07:04 pm
Thanks. To do it with PHP: http://www.toao.net/48-replacing-smart-quotes-and-em-dashes-in-mysql Many pages do the opposite: you enter the keyboard default quotes (vertical) and after submitting the data they convert them to curly ones. They may do it with PHP code similar to this one (works well): http://pastebin.com/CEK0NN43 But the problem on these pages is that if the conversion is applied to computer code, the code normally stops working, so you have to convert all the quotes back to vertical ones.
#292
gaffe says:
August 19, 2011 at 12:25 pm
THANKS! ON THE WHOLE WEB THIS SEEMS LIKE THE ONLY SOLUTION THAT WORKS PAINLESSLY. WELL WRITTEN CODE TOO. But...does it ever replace the wrong characters? I mean what if a user isn't using a word character set but plain text, does it mess up normal text as a tradeoff to getting word cut and paste to work or are these character codes not common in other character sets?