Character Encoding

From FileZilla Wiki

Latest revision as of 07:40, 12 October 2023

Overview

Users sometimes encounter problems with FTP transfers that garble non-English characters in filenames, such as umlauts, accented letters or completely different scripts like Chinese or Arabic.

FTP is a rather old protocol and things we take for granted now were not even considered when it was designed. One of these things is support for non-English characters in filenames. When the FTP protocol was designed, computers mostly spoke English and were unable to display any non-English characters. As such, the FTP protocol was designed to be used with English characters only, namely 7-bit ASCII.

The problem is that many FTP clients and servers purposely violate the FTP specifications in order to support other, non-standard character sets. Which of these character sets are used is not subject to any negotiation. For any character set in existence, you can find a server using it with no way of detecting the proper encoding. The result: non-English characters are not transferred correctly.
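The effect of such a mismatch can be reproduced locally. The following is a minimal sketch, assuming a client that sends UTF-8 and a server that silently interprets the bytes as ISO 8859-1 (Latin-1); the filename and the choice of legacy encoding are illustrative assumptions, not taken from any particular server.

```python
# Illustrative only: the filename and the legacy encoding are assumptions.
filename = "für.txt"

# A UTF-8 client puts these bytes on the control connection.
wire_bytes = filename.encode("utf-8")

# A server that assumes ISO 8859-1 interprets the very same bytes differently.
misread = wire_bytes.decode("latin-1")

print(misread)  # fÃ¼r.txt

# The damage is reversible only if one knows which wrong encoding was applied.
recovered = misread.encode("latin-1").decode("utf-8")
```

Because nothing in the original protocol tells either side which encoding the other used, the recovery step at the end is guesswork in practice.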

To solve this problem, the FTP protocol has been extended in a backwards compatible way to use UTF-8 as the character set. (This solution is backwards compatible only with servers in compliance with the original specifications.)

If you have problems with filenames containing non-English characters, there are two possible causes:

  • The server or client follows the original specifications to the letter and rightfully rejects those filenames
  • The server or client violates the specifications and uses a custom encoding that the other party does not understand

Both FileZilla Client and Server are fully compliant with the updated specifications and use UTF-8. FileZilla will not break FTP specifications by supporting non-standard encodings in order to accommodate the user.

If you have problems with other clients or servers, please upgrade (or ask the server to upgrade) to FTP software capable of UTF-8 or refrain from using foreign characters. Anything else is in violation of the FTP specifications and will only work if you manually ensure that the server and client use the same character encoding (which may not even be possible).

Technical details

The FTP protocol is specified in RFC 959, which was published in 1985. The FTP protocol is designed on top of the original Telnet protocol, which is specified in RFC 854. The relevant sections of the Telnet specification regarding FTP are those covering the Network Virtual Terminal (NVT). According to RFC 854, the NVT requires the use of (7-bit) ASCII as the character set. Use of any other character set requires explicit negotiation. The ASCII character set contains only 128 different characters: English letters, digits, punctuation characters and a number of control characters. Accented letters, umlauts and other scripts are not part of ASCII.

In order to support non-English characters, the FTP specifications were extended in 1999 in RFC 2640. This extension requires the use of UTF-8 as the character set. UTF-8 is a strict superset of ASCII: every valid ASCII character is encoded as the same byte in UTF-8. UTF-8 can encode any valid Unicode character. That includes umlauts, accented letters and also different scripts. This extension is fully backwards compatible with RFC 959.
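The superset property can be checked directly. This is a small sketch with made-up filenames; the helper `is_seven_bit` is a hypothetical name introduced here for illustration.

```python
def is_seven_bit(data: bytes) -> bool:
    """Return True if every byte fits into 7-bit ASCII (values 0-127)."""
    return all(byte < 0x80 for byte in data)

# An ASCII-only name produces identical bytes in both encodings.
ascii_name = "readme.txt"
same_bytes = ascii_name.encode("ascii") == ascii_name.encode("utf-8")

# A non-ASCII character becomes a multi-byte sequence in UTF-8,
# and every byte of that sequence has the high bit set.
accented = "é".encode("utf-8")

print(same_bytes, is_seven_bit(ascii_name.encode("utf-8")))
print(len(accented), is_seven_bit(accented))
```

Since the bytes of non-ASCII characters never fall in the 7-bit range, an ASCII-only filename looks exactly the same to software that knows nothing about RFC 2640.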

As long as you're using only English characters, it doesn't matter if the software you are using supports RFC 2640 or not. However, if you use non-English characters without using RFC 2640 compatible software, there will be problems--problems which are entirely self-made by not obeying the specifications.

UTF8 feature negotiation

An RFC 2640 compliant server must support the FEAT command and must include a line containing UTF8 in its response:

Command:  FEAT
Response: 211-Features:
 [...]
Response:  UTF8
 [...]
Response: 211 End
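A client can detect this by scanning the feature lines of the reply. The sketch below assumes the whole FEAT reply is available as one string (for example, as returned by Python's `ftplib.FTP.sendcmd("FEAT")`); the function name `feat_lists_utf8` and the sample reply are hypothetical.

```python
def feat_lists_utf8(feat_response: str) -> bool:
    """Return True if a FEAT reply contains a UTF8 feature line."""
    lines = feat_response.splitlines()
    # The first and last lines carry the 211 status; features sit in between.
    for line in lines[1:-1]:
        # Take the feature name, ignoring leading space and any parameters.
        feature = line.strip().split(" ", 1)[0]
        if feature.upper() == "UTF8":
            return True
    return False

# Sample reply, made up for illustration.
sample = "211-Features:\r\n MDTM\r\n UTF8\r\n SIZE\r\n211 End"
print(feat_lists_utf8(sample))
```

A server that does not implement FEAT at all answers with an error reply instead, which a client would treat the same as a missing UTF8 line.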

Conflicting specification

There exists a long expired IETF draft that is in conflict with RFC 2640. This draft also requires the FEAT response to include UTF8, but in addition requires the client to send OPTS UTF-8 ON to enable UTF-8 support.

If an RFC 2640 compliant client sends OPTS UTF-8 ON, it has to use UTF-8 regardless of whether OPTS UTF-8 ON succeeds or not.

RFC 2640 compliant servers must not make UTF-8 dependent on OPTS UTF-8 ON.
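The client-side rule can be sketched as follows. This is one possible design under stated assumptions, not FileZilla's actual implementation: `send_command` is a hypothetical callable that sends one command and raises an exception on an error reply, and falling back to ASCII when FEAT does not list UTF8 is an assumption of this sketch.

```python
from typing import Callable

def choose_filename_encoding(
    feat_has_utf8: bool,
    send_command: Callable[[str], str],
) -> str:
    """Pick the filename encoding for a session.

    send_command is a hypothetical hook: it sends one FTP command and
    raises an exception if the server replies with an error.
    """
    if not feat_has_utf8:
        # Assumption of this sketch: without UTF8 in FEAT, stay with ASCII.
        return "ascii"
    try:
        # Sent only as a courtesy to servers that follow the expired draft.
        send_command("OPTS UTF-8 ON")
    except Exception:
        # The outcome is deliberately ignored: UTF-8 is used either way.
        pass
    return "utf-8"
```

The important point is that the return value does not depend on the reply to OPTS UTF-8 ON, only on the FEAT response.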

SFTP

The situation for SFTP is similar to the one for FTP. Current versions of the SFTP specifications (beginning with version 4) require filenames to be encoded as UTF-8.

However, the most commonly used SFTP protocol version is version 3 as implemented in OpenSSH. This version of the SFTP specifications does not require UTF-8. In fact it does not say anything about the encoding. It is however reasonable to assume UTF-8 on those servers for the following reasons:

  • The later protocol versions require UTF-8
  • The SSH protocol, under which SFTP operates, already requires UTF-8
  • Even in version 3 of the protocol, some parts of the protocol already use UTF-8
  • The native character set on most modern Unix(-like) operating systems is UTF-8

In essence this means that wherever SFTP is available, the necessary infrastructure to use UTF-8 is in place.
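For a client talking to a version 3 server, assuming UTF-8 might look like the sketch below. The function names are hypothetical, and the use of Python's `surrogateescape` error handler is a design choice of this sketch (so that names that are not valid UTF-8 still round-trip byte for byte), not something the SFTP specifications prescribe.

```python
def decode_sftp_filename(raw: bytes) -> str:
    """Decode a filename received from an SFTP v3 server, assuming UTF-8.

    Bytes that are not valid UTF-8 are preserved as lone surrogates
    instead of raising an error.
    """
    return raw.decode("utf-8", errors="surrogateescape")

def encode_sftp_filename(name: str) -> bytes:
    """Encode a filename for sending, reversing decode_sftp_filename."""
    return name.encode("utf-8", errors="surrogateescape")

# A valid UTF-8 name decodes normally.
good = decode_sftp_filename("日本語.txt".encode("utf-8"))

# A Latin-1 encoded name is not valid UTF-8, but still survives a round trip.
legacy_bytes = "für.txt".encode("latin-1")
round_trip = encode_sftp_filename(decode_sftp_filename(legacy_bytes))
```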