Nick Interesting – how does it 'break' such things? A stray BOM in the page should just be interpreted as a ZWNJ. If it were in the middle of a long word that happened to fall near the end of a line, you’d get an unexpected word break, but I don’t see how that could be thought of as being particularly 'broken'. Moreover, I don’t see how you’d end up with two parts of a word in different files. Or does it output the 'no glyph' box symbol for ZWNJs?
Can you expand on this? In any case, the general problem is one of the main reasons why BOMs are discouraged: it breaks the semantics of being able to concatenate two text files together simply by appending one stream of bytes to the other. You now have an extra character in the resulting stream that was not 'in' either of the original files.
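To make the concatenation point concrete, here is a minimal sketch in Python (the file names are invented for the example): two files each saved as 'UTF-8 with BOM' are appended byte-for-byte, and the second file's BOM ends up stranded in the middle of the result.

    import codecs

    # Two hypothetical files, each saved as "UTF-8 with BOM" by an editor.
    with open("a.txt", "wb") as f:
        f.write(codecs.BOM_UTF8 + "Hello, ".encode("utf-8"))
    with open("b.txt", "wb") as f:
        f.write(codecs.BOM_UTF8 + "world".encode("utf-8"))

    # Byte-wise concatenation, i.e. what `copy a.txt+b.txt c.txt` or `cat` does.
    with open("c.txt", "wb") as out:
        for name in ("a.txt", "b.txt"):
            with open(name, "rb") as src:
                out.write(src.read())

    text = open("c.txt", encoding="utf-8-sig").read()
    print(repr(text))   # 'Hello, \ufeffworld' -- an extra U+FEFF in the middle of the stream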
Pcooper — True, but depending on the dialog, they might not have to know anything about character encodings to answer it, either. If it’s possible to narrow the (large) number of possible encodings down to a smaller number (say two to five), it may be possible to design a dialog that would allow the user to choose an encoding from this smaller list. Show them a preview of the text as it would display under the currently-selected encoding; then they can just switch between encodings until they find one that makes the text look right. I don’t think it’d be a good idea to require the user to go through this dialog whenever Notepad isn’t sure (because as Raymond says, it almost never is sure). It’s probably a bad idea to require this dialog to come up at any point, in fact.
But it should be possible to have it as a right-click option or a menu item somewhere; that way when a file is guessed wrong, at least the user can override the wrong choice. (And to make it even more usable, it may be good to give the dialog an option for “save this choice with this file” if the chosen encoding will allow it — when the user chooses that, you’d add a BOM, or something, so there’s no ambiguity when reloading the file. Or you’d add an alternate data stream to the file that acts as an encoding hint (which, since it isn’t a BOM, won’t affect concatenating files — though I haven’t thought much about that, so it may have other problems). But if the user doesn’t have to keep bringing up this dialog every time they look at the file, that’d be good.) Of course maybe this is all too much complexity for a simple program like Notepad, too.
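A rough sketch of what that could look like, in Python (the candidate list and the alternate-data-stream name 'encoding.hint' are made up for illustration, and the stream part only works on an NTFS volume):

    CANDIDATES = ["utf-8", "cp1252", "cp932", "utf-16"]

    def previews(path, nbytes=200):
        # Decode the first few bytes under each candidate so the user can
        # pick whichever preview "looks right".
        raw = open(path, "rb").read(nbytes)
        return {enc: raw.decode(enc, errors="replace") for enc in CANDIDATES}

    def save_hint(path, encoding):
        # Remember the user's choice in an NTFS alternate data stream; unlike a
        # BOM, it does not change the main data, so concatenation is unaffected.
        with open(path + ":encoding.hint", "w", encoding="ascii") as f:
            f.write(encoding)

    def load_hint(path):
        try:
            with open(path + ":encoding.hint", encoding="ascii") as f:
                return f.read().strip()
        except OSError:
            return None   # no hint stored (or not an NTFS volume)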
I guess one of the traits of a nitpicker is that he does not realize he is a nitpicker. For him, the point is valid and important. So, even at the risk of being classified as a nitpicker, I will make two (unrelated) points. Not sure what '8-bit ANSI' is. In Windows lingo, 'ANSI' means 'the default system locale' (or 'Language for Non-Unicode programs', in XP UI terminology). This includes double-byte code pages like 932 (Japanese), 936 (Chinese Simplified), 949 (Korean), and 950 (Chinese Traditional). They can still be considered 8-bit if you consider the 'code unit' (Unicode terminology), but it is confusing.
For this article, just 'ANSI' should suffice. I agree that it is not really possible to make the detection smarter. But there are a couple of easy improvements: a) when I select 'Save As' and UTF-8, also give me a BOM option (I am in 'help the application' mode anyway); b) when opening a no-BOM UTF-8 file whose encoding is properly detected, don't add the BOM 'just in case' on save. It was probably BOM-less for a reason. I will not even try to suggest 'if you cannot detect the encoding 100% reliably, ask the user' :-) It would be nice, but for that there are some free tools out there, and they are easy to find.

Adam – The IE behavior is varied and unpredictable. I've seen it cause stylesheets to refuse to load unless one manually refreshes the page with F5 (regardless of whether the page is already cached). It can also result in bizarre encoding-vomit characters being output, and sometimes extra line breaks.
But the most annoying is the stylesheet issue. To be fair, some people say that PHP or the SSI on your server of choice should be able to detect the 'first' include and strip the BOM from subsequent ones, but it seems like this would be hard to figure out, especially considering that output buffering may or may not be off (if output buffering is on, the order of includes is not necessarily the order they appear on the page). Side note: it should also be noted that there's a lot of misinformation on the issue, especially since HTTP expresses the encoding through 'charset', even though what it specifies is not a character set (HTML is Unicode) but an encoding. (I personally find the whole practice of using 'Unicode' to refer to both the UTF-16 encoding and the character set, as Notepad's save dropdown does, terribly confusing. I mean, you have 'UTF-8, Unicode, Unicode big endian'. They're all Unicode. But, to MS's credit, I do believe that's standard practice.) Some linkage from the Google:
It is because of this sort of issue that the XML spec is very specific about how a file begins: an XML file must start with '<?xml'.

'You are going to be flat-out wrong when you run into a Unicode file that lacks a BOM, since you're going to misinterpret it as either UTF-8 or (more likely) 8-bit ANSI.' UTF-8 is also a Unicode encoding, right? So a UTF-8 file is also a Unicode file.
Of course, the real problem here is with the handling of file types and file metadata. If a file's type could be transported perfectly in a platform-independent way, this problem would not exist. Unfortunately, we have kludges like using the part after a period to judge the file type. A Content-Type header from HTTP is never saved along with the file, and it would be lost anyway when the file is transmitted to, for instance, an FTP server. Mac OS uses a data fork for the file and a resource fork to store metadata, but it too can't be psychic when it receives a 'text/plain' file over HTTP. It does, however, allow the user to rename the file to anything.
Nautilus (and AFAIK Konqueror too) sniffs the MIME type from the file contents itself, but it warns about a security risk when there's a mismatch with the filename and refuses to open the file until the name matches its actual type. Now, about the problem with IE and a BOM in the middle of a page: I haven't heard of it before, but it might be avoidable by doing something like this:
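(One plausible shape for that, sketched in Python rather than PHP/SSI; the helper name and the file list are invented: strip a leading BOM from every included fragment before it is echoed into the page, so only the page's declared encoding matters.)

    import codecs, io

    def emit_fragment(out, path):
        data = open(path, "rb").read()
        if data.startswith(codecs.BOM_UTF8):          # b'\xef\xbb\xbf'
            data = data[len(codecs.BOM_UTF8):]        # drop the stray BOM
        out.write(data)

    # e.g. building a response body from header/content/footer includes
    body = io.BytesIO()
    for part in ("header.inc", "content.inc", "footer.inc"):
        emit_fragment(body, part)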
One solution that hasn't been considered is to create a new file extension specifically for UTF-8 or UCS text. You saw something similar to this during the BBS era, where text files with ANSI colour codes had the extension .ans while plain ASCII text used .asc.
Anyway, the first thing most experienced developers who work with plain text do when setting up a system is to install a third-party editor, due to the deficiencies of Notepad. My personal fave is Notepad++, a Scintilla-based editor which supports code folding, macros and even a hex editor.

There needs to be a metric from the IsTextUnicode API (or preferably a replacement) that gives the caller a measure of probability rather than a boolean value.
If there’s a BOM, it’s pretty sure. If there’s a fair bit of text, it’s going to be pretty sure too. If it’s not sure (we want a probability score here, let’s say less than 70% for the sake of argument) then the user is prompted. And of course, no user wants to be prompted about code pages.
My grandmother doesn't know anything about code pages. You simply show some UI with a short preview of the text for all the probable Unicode options, and it's easy for them to choose which looks right by clicking on it. This is better than getting it wrong and displaying garbage, and allowing morons to come up with conspiracy theories. If mixed file encodings are a regular problem while editing (programmatically it's a whole different problem), then switch to another editor which does allow encodings to be switched easily. I use Notepad2, despite a couple of crash bugs; many others exist. (I also replace Notepad.exe with Notepad2.exe.) And for the people calling for Notepad to recognise all line endings (0x0A, 0x0D, 0x0A+0x0D, 0x0D+0x0A): what should Notepad do when a file contains mixed endings or new lines are added to the document? Should Save always change a file to DOS line endings? I have occasionally used Notepad to edit .exe files, which is one very good thing about Notepad: what comes in goes out verbatim.
All of a sudden there are backwards compatibility issues.

'Sorry, I assume people do some basic research before asking a question.' Yes, I'd assume that too. But I wasn't asking anything. I was just pointing out something, and wasn't aware that I had missed a memo. 'If you're going to complain about Notepad's UCS-2 support, you probably should know what UCS-2 is.'
I’m not complaining at all! Why would I complain about notepad anyway? Just because I could? Maybe I could please you by trying to be a nitpicker, but I’m not bothering.
As others have pointed out already, a coder always happily installs his/her favorite editor that makes Notepad look like a clay tablet. Notepad is a fine clay tablet; there's still a use for it sometimes.

Really, the problem is that there's not really a concept of a 'plain text' file. Any file needs additional information (such as that given in the HTTP Content-Type header) in order for an application to know what the bytes are supposed to mean. Windows tries to encode that information in the file's extension, and does a good job 99% of the time, but it's really not enough unless we can get people to name their files file.txt.UTF-8, file.txt.windows-1252, and so on (and we add corresponding application support). A UTF-8 text file is a different type of file than a windows-1252 text file, just like they're both different from an HTML or Microsoft Word file.
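If applications ever grew that support, the reading side would be easy enough; a rough Python sketch of the idea (the trailing-suffix convention is just the proposal above, not any existing standard):

    import codecs

    def open_labelled_text(path, fallback="cp1252"):
        suffix = path.rsplit(".", 1)[-1]
        try:
            codecs.lookup(suffix)       # is the last suffix a known encoding name?
            encoding = suffix
        except LookupError:
            encoding = fallback         # plain file.txt: back to guessing / ANSI
        return open(path, "r", encoding=encoding)

    # open_labelled_text("notes.txt.utf-8") reads as UTF-8,
    # open_labelled_text("notes.txt.windows-1252") reads as Windows-1252.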
Nick – that does sound bizarre, especially from the links posted. The oddest report of all is when it appears to output the characters that are the Windows-1252 interpretations of the UTF-8 BOM bytes – makes it sound like IE is interpreting /part/ of a UTF-8 stream as Win-1252?!? WRT the server-side removing the BOM, I'd say that it should strip a BOM (if present) from all of the files it reads. It should be sending a 'Content-Type' HTTP header back with the correct charset for the rest of the page, so none of the BOMs will be necessary in the output. As for the charset/encoding distinction – most of the time it does not matter.
A character encoding implicitly defines a character set; it is the set of characters expressible by the encoding. Similarly, a character set is nearly always defined in terms of a specific encoding – even Unicode is defined in terms of a specific encoding – UCS-4. Unicode is just different from most other charsets in that it defines a number of encodings that can express it. Also, HTML doesn’t have to be in unicode – it’s perfectly acceptable to store it in Windows-1252 and have 'Content-type: text/html; charset=Windows-1252' as an HTTP header or HTML META tag. And Windows-1252 is as much a character set as it is an encoding. I think I do stand corrected as to my blanket 'HTML is a Unicode charset' statement.
I think what confuses the pants off me is that you can take an HTML file, save it as iso-8859-1, and then shove in an HTML entity reference relating to a Unicode code point. Like, take some crazy character that's not in iso-8859-1, like 'Upwards Double Arrow' (↟), written as &#8607;. That 8607 corresponds to a Unicode code point that the browser looks up, but the actual &, #, 8, 6, 0, and 7 characters were transmitted to the browser in Windows Latin 1. The 8607 never changes regardless of the text file's encoding, because 'that part' of HTML is Unicode. That popping noise was my brain exploding.

If server-side includes are printing BOMs into the output, then it's the SSI code that's broken, not whatever app is trying to parse it (although it should be able to tolerate it, under the '70% of the web is broken' rule).
Imagine including three files, one of which is UTF-8, another UTF-16, and a third is ANSI with a Cyrillic codepage. Whoever is constructing the output is responsible for ensuring the result is in a single unified format (so it can be specified in the headers), so they’ll have to perform the necessary character conversion before outputting it. In an ideal world, at least. Of course in practice most don’t give a damn and just assume that everything is in the same encoding. But that’s not technically correct behaviour.
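In code, that 'decode each piece with its own encoding, then emit one declared encoding' step is straightforward; a sketch in Python (the file names and their encodings are invented for the example):

    FRAGMENTS = [
        ("header.inc", "utf-8"),
        ("body.inc",   "utf-16"),
        ("footer.inc", "cp1251"),    # a Cyrillic "ANSI" code page
    ]

    def build_page(output_encoding="utf-8"):
        pieces = []
        for path, enc in FRAGMENTS:
            raw = open(path, "rb").read()
            pieces.append(raw.decode(enc))       # to one common representation first
        page = "".join(pieces)
        # Now the Content-Type header can honestly say charset=utf-8.
        return page.encode(output_encoding)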
I just figured out why my JSP was not transforming an XML file (using either a DTD or Schema)! I kept getting an error message about having data in the XML head.

Of course: the BOM. The insanity when a dotNET nut is given the assignment to do Java! So I guess the Xerces/Xalan XML parser methods in Java can't deal with a real UTF-8 encoded XML file with a BOM. That's real convenient. Crikey, can't even get it in to filer. And NetBeans 5.5 uses UTF-8 as default. Well, it is nearly 2am, so I'm glad I took a break to come here. Gotta get up in 4 hours.
At least I solved the problem. Just ignore this.

IE has its own heuristics for guessing, since a lot of pages neglect to specify their charset properly (in the Content-Type header or the equivalent), and of course there are the full-Unicode charsets like UTF-8.
But when it doesn't, most users have learned to change the encoding to everything with 'Hebrew' in it until it looks OK. I've seen many otherwise non-sophisticated users accomplish this. I wish Notepad had this option (and not only during open). BTW, using 'Unicode' when meaning 'UTF-16 LE' is a historical remnant from the time when Unicode thought '16-bit should be enough for everyone!', which is about the formative years of Windows NT and Win32 (I think, don't quote me on that).

'So I guess the Xerces/Xalan XML parser methods in Java can't deal with a real UTF-8 encoded XML with a BOM.' The Java XML parsers can cope with a BOM, but you have to present them with an unparsed byte sequence. Most likely, you have unwittingly converted the bytes into a String (using some erroneous auto-guessed encoding like iso-latin-1) and presented the resulting string, which contained a two-character sequence 0x00FE 0x00FF (or 0x00FF 0x00FE) instead of the one-character preamble 0xFEFF (or 0xFFFE).

You already have the option to specify the encoding manually: Ctrl+O.
The options presented are ANSI (Shift-JIS, in which a very small minority of the characters are 8 bits), and three varieties of Unicode. If a file contains European characters then some versions of Word sometimes provide more possibilities and even provide previews, but Notepad doesn’t. If I choose any Notepad option the result will be garbage. Why does Notepad call an API that has such a heavy bias towards guessing a usually wrong encoding?
Most files are Shift-JIS, a few are other ANSI encodings (some of which are 8-bit encodings), and a few are Unicode. If there's no obvious indicator then the first attempt should be the user's ANSI code page, and if that doesn't work then the second attempt should be the system's ANSI code page, and Unicode should come after that.

Tuesday, April 17, 2007 2:25 PM by John: 'Sometimes I wonder how different it would be if 20 years ago everybody knew the problems we had today.' Everybody did, except for one country.

To explain the references to HTML being Unicode earlier, the situation is that the HTML standard is (these days) explicitly Unicode. Thus when you send a Windows-1252 HTML file, the HTML standard considers that a Unicode document which merely happens to be encoded as Windows-1252, just as other Unicode documents might be encoded as UTF-8 or UCS-2. There are a great many corner cases that are simplified by this assertion. The practical consequence is that you must program for the web using Unicode, because it's virtually impossible to implement the standard correctly without doing so.
Most web browsers and similar tools convert to their internal Unicode representation (usually UTF-8 or native-endian UTF-16) during or soon after downloading.

'IsTextUnicode is EXACTLY how notepad does its auto-detection.' Dean, what a completely false statement! As far as I can tell, if IsTextUnicode is used at all, it is ONLY used to detect the UTF-16 BOM. It doesn't even detect a UTF-8 BOM! And it doesn't have UTF-8 byte-sequence auto-detection. None of the IsTextUnicode UTF-16 statistical-likeness detection is used, because that would give Notepad a behavior that could be deemed inconsistent.
All in all, Notepad makes the right choices, but IsTextUnicode has little to do with it.

Thanks for responding to say 're-read'.
You did touch the important points, but I guess I'm adding: a) Notepad's UTF-8 auto-detection is not from IsTextUnicode. I guess you were referring to this auto-detection where you said 'looks like valid UTF-8'. b) UTF-8 auto-detection is much more reliable (despite the example I gave which breaks it), so it can be seriously considered, whereas UTF-16 auto-detection should not be even remotely considered, especially since the BOM is always recommended for UCS-2 and UTF-16.

@cmov: 'That doesn't help you when you save the file and try to open it.' But as a nice bonus: the Visual Studio 2005 editor does. And it also saves the file according to the meta.

@d.n.hotch: 'Hope for something ASCII-compatible? You might need the encoding before you get the encoding.' All the popular code pages are ASCII-compatible enough for you to recognize the meta.
Exceptions are EBCDIC (and using EBCDIC in an HTML file is a WTF :-), and UTF-16/UTF-32. For UTF-16/32, the recommendation is to use a BOM.
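Pulling the thread's suggestions together, the lookup order several people argued for (explicit BOM first, then a strict UTF-8 validity check, then fall back to the ANSI code page) is easy to sketch in Python. This is only an illustration of the idea, not what Notepad or IsTextUnicode actually does, and cp1252 stands in for whatever the real ANSI code page would be:

    import codecs

    def guess_encoding(raw, ansi_fallback="cp1252"):
        # 1. An explicit BOM settles it. Check UTF-32 before UTF-16, because the
        #    UTF-32 LE BOM begins with the same bytes as the UTF-16 LE BOM.
        if raw.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
            return "utf-32"        # this codec consumes the BOM itself
        if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
            return "utf-16"
        if raw.startswith(codecs.BOM_UTF8):
            return "utf-8-sig"
        # 2. No BOM: if the bytes are strictly valid UTF-8, that is the best bet
        #    (though short or specially crafted files can still fool this).
        try:
            raw.decode("utf-8", errors="strict")
            return "utf-8"
        except UnicodeDecodeError:
            # 3. Otherwise fall back to the ANSI code page.
            return ansi_fallback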