Character Encoding: Difference between revisions

From FileZilla Wiki


== Technical details ==
The FTP protocol is specified in [http://filezilla-project.org/specs/rfc0959.txt <nowiki>RFC 959</nowiki>], which was published in 1985. The FTP protocol is designed on top of the original Telnet protocol, which is specified in [http://filezilla-project.org/specs/rfc0854.txt <nowiki>RFC 854</nowiki>]. The relevant sections of the Telnet specification regarding FTP are those covering the Network Virtual Terminal (NVT).
According to [http://filezilla-project.org/specs/rfc0854.txt RFC 854], the NVT requires the use of 7-bit ASCII as the character set. Use of any other character set requires explicit negotiation. ASCII contains only 128 characters: the unaccented English letters, digits, punctuation characters and control characters. Accented letters, umlauts and other scripts are not part of the ASCII character set.
In order to support non-English characters, the FTP specification was extended in 1999 by [http://filezilla-project.org/specs/rfc2640.txt <nowiki>RFC 2640</nowiki>]. This extension requires the use of UTF-8 as the character encoding. UTF-8 is a strict superset of ASCII: every valid ASCII character is encoded identically in UTF-8. UTF-8 can represent any valid Unicode character, including umlauts, accented letters and other scripts.
This extension is fully backwards compatible with [http://filezilla-project.org/specs/rfc0959.txt <nowiki>RFC 959</nowiki>].
As long as you use only English characters, it does not matter whether the software you are using supports RFC 2640. However, if you use non-English characters with software that is not RFC 2640 compatible, there will be problems, and those problems are entirely self-made by not obeying the specifications.
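
The superset relationship between ASCII and UTF-8 can be checked directly. The following Python sketch is only an illustration: the helper name and the sample filenames are made up for this example and do not come from the specifications. It shows that a pure ASCII name produces the same bytes in both encodings, while a name with an umlaut cannot be expressed in ASCII at all.

```python
def encodes_identically(name: str) -> bool:
    """Return True if `name` yields the same bytes as ASCII and as UTF-8."""
    try:
        return name.encode("ascii") == name.encode("utf-8")
    except UnicodeEncodeError:
        # The name contains characters outside the 7-bit ASCII range.
        return False


ascii_name = "readme.txt"      # illustrative sample name
umlaut_name = "Übersicht.txt"  # illustrative sample name

print(encodes_identically(ascii_name))
print(encodes_identically(umlaut_name))
# The umlaut takes two bytes in UTF-8; all other characters keep their ASCII byte.
print(umlaut_name.encode("utf-8"))
```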


=== UTF8 feature negotiation ===
An [http://filezilla-project.org/specs/rfc2640.txt <nowiki>RFC 2640</nowiki>] compliant server ''must'' support the FEAT command and ''must'' include a line containing UTF8 in its response:
Command:  FEAT
Response: 211-Features:
  [...]
Response:  UTF8
  [...]
Response: 211 End
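
A client could detect the feature along the following lines. This is a minimal Python sketch, not FileZilla's implementation: the function name is hypothetical, and it assumes the raw reply text is already available as a string in which each feature line starts with a space. Comparing the feature name case-insensitively is a lenient choice made for this example.

```python
def server_advertises_utf8(feat_reply: str) -> bool:
    """Check a raw multi-line FEAT reply for a UTF8 feature line."""
    for line in feat_reply.splitlines():
        # Skip the "211-Features:" and "211 End" status lines; feature
        # lines are assumed to begin with a space.
        if not line.startswith(" "):
            continue
        words = line.split()
        # The first word on a feature line is the feature name.
        if words and words[0].upper() == "UTF8":
            return True
    return False


# Illustrative reply; the other feature names are only filler.
reply = "211-Features:\r\n MDTM\r\n UTF8\r\n SIZE\r\n211 End"
print(server_advertises_utf8(reply))
```

With Python's standard ftplib, the string returned by <code>FTP.sendcmd("FEAT")</code> could be passed to such a function.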


=== Conflicting specification ===
There exists a long-expired [http://tools.ietf.org/html/draft-ietf-ftpext-utf-8-option-00 IETF draft] that is in conflict with [http://filezilla-project.org/specs/rfc2640.txt <nowiki>RFC 2640</nowiki>]. This draft also requires the FEAT response to include UTF8, but in addition requires the client to send '''OPTS UTF-8 ON''' to enable UTF-8 support.
If an [http://filezilla-project.org/specs/rfc2640.txt <nowiki>RFC 2640</nowiki>] compliant client sends '''OPTS UTF-8 ON''', it has to use UTF-8 regardless of whether '''OPTS UTF-8 ON''' succeeds or not.
[http://filezilla-project.org/specs/rfc2640.txt <nowiki>RFC 2640</nowiki>] compliant servers ''must not'' make UTF-8 dependent on '''OPTS UTF-8 ON'''.
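
The client-side rule can be sketched as follows. Everything named here is hypothetical: <code>send_command</code> stands for whatever function a client uses to send a command and read the reply, and <code>CommandRejected</code> for the error it raises on a failure reply. The command string is copied verbatim from the text above. The point of the sketch is only that the outcome of the command does not influence the encoding.

```python
from typing import Callable


class CommandRejected(Exception):
    """Hypothetical error raised when the server answers with a failure reply."""


def enable_utf8(send_command: Callable[[str], str]) -> str:
    """Send the draft's command for the benefit of old servers, then use UTF-8 anyway."""
    try:
        send_command("OPTS UTF-8 ON")
    except CommandRejected:
        # A failure reply must not change the result: the client still uses UTF-8.
        pass
    return "utf-8"


def accepting_server(command: str) -> str:
    return "200 OK"  # illustrative success reply


def rejecting_server(command: str) -> str:
    raise CommandRejected("501 Option not understood")  # illustrative failure reply


print(enable_utf8(accepting_server))
print(enable_utf8(rejecting_server))
```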


=== SFTP ===
The situation for SFTP is similar to the one for FTP. Current versions of the SFTP specifications (beginning with version 4) require filenames to be encoded as UTF-8.
However, the most commonly used SFTP protocol version is version 3 as implemented in OpenSSH. This version of the SFTP specifications does not require UTF-8. In fact it does not say anything about the encoding.
It is however reasonable to assume UTF-8 on those servers for the following reasons:
* The later protocol versions require UTF-8
* The SSH protocol, under which SFTP operates, already requires UTF-8
* Even in version 3 of the protocol, some parts of the protocol already use UTF-8
* The native character set on most modern Unix(-like) operating systems is UTF-8
In essence this means that wherever SFTP is available, the necessary infrastructure to use UTF-8 is in place.
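
For a client talking to a version 3 server, assuming UTF-8 could look like the following Python sketch. The function name and sample bytes are invented for illustration. The use of the <code>surrogateescape</code> error handler is an assumption of this example, not something the SFTP specifications prescribe: it keeps names that are not valid UTF-8 usable by letting the original bytes be recovered unchanged.

```python
def decode_sftp_filename(raw: bytes) -> str:
    """Decode a filename received from an SFTP version 3 server, assuming UTF-8."""
    # Bytes that are not valid UTF-8 are mapped to lone surrogates instead of
    # raising, so the original bytes can be restored when talking to the server.
    return raw.decode("utf-8", errors="surrogateescape")


def encode_sftp_filename(name: str) -> bytes:
    """Inverse of decode_sftp_filename."""
    return name.encode("utf-8", errors="surrogateescape")


utf8_bytes = b"\xc3\x9cbersicht.txt"  # valid UTF-8 (illustrative)
legacy_bytes = b"caf\xe9.txt"         # not valid UTF-8 (illustrative)

print(decode_sftp_filename(utf8_bytes))
# The legacy name survives a decode/encode round trip byte for byte.
print(encode_sftp_filename(decode_sftp_filename(legacy_bytes)) == legacy_bytes)
```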

Revision as of 00:35, 12 October 2023
