File name has � (invalid encoding) and CRLF issues

On Linux you sometimes get a “�” in a file name, followed by a trailing “(invalid encoding)” marker. This can happen when moving files from Windows to Ubuntu Linux. When uploading files to a Linux box you need two Linux tools to “repair” such incompatibilities: “convmv” and “dos2unix”. The following commands will install them (on a Debian-based Linux):

sudo apt-get install convmv
sudo apt-get install dos2unix

Character encoding

To remove the “(invalid encoding)” marker, use the “convmv” tool, which converts the character encoding used in file names. You can try converting file names from different character sets to UTF-8 using the following commands:

convmv -r -f windows-1252 -t UTF-8 .
convmv -r -f ISO-8859-1 -t UTF-8 .
convmv -r -f cp-850 -t UTF-8 .

These are the three most popular character encodings (for Western Europe). If you need another character encoding, use the “convmv --list” command for the full list of encodings that convmv supports (“locale -m” lists the character maps known to the system). Check out the Wikipedia character encoding page to find the characteristics of each of them. Without further flags, convmv only performs a dry run. After you have confirmed that the proposed conversion is correct, you can run the actual conversion by adding the “--notest” flag. A typical run looks like this (use “-r” for recursive):

$ convmv -r -f windows-1252 -t UTF-8 .
Your Perl version has fleas #37757 #49830
Starting a dry run without changes...
mv "./jag f�rst�r inte.txt"    "./jag förstår inte.txt"
No changes to your files done. Use --notest to finally rename the files.
$ convmv -r -f windows-1252 -t UTF-8 . --notest
Your Perl version has fleas #37757 #49830
mv "./jag f�rst�r inte.txt"    "./jag förstår inte.txt"
Ready!
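
Before running convmv it can help to see which names are actually affected. The following is a minimal sketch, not part of convmv: the helper names “is_valid_utf8” and “list_invalid_utf8_names” are made up for this example, and it assumes that the “iconv” command is installed and that no file name contains a newline.

```shell
#!/bin/sh
# Hypothetical helpers (not part of convmv) for spotting file names
# that are not valid UTF-8. Assumes iconv is available.

# Succeeds if the string given as $1 is valid UTF-8.
# iconv exits with a non-zero status on malformed input.
is_valid_utf8() {
    printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Prints every path below the directory $1 (default: current directory)
# that is not valid UTF-8. Paths containing newlines are not handled.
list_invalid_utf8_names() {
    find "${1:-.}" | while IFS= read -r path; do
        is_valid_utf8 "$path" || printf '%s\n' "$path"
    done
}
```

Calling “list_invalid_utf8_names .” prints the candidates for a convmv run. It only tells you that a name is not UTF-8, not which encoding it was written in, so the convmv dry run is still needed to pick the right “-f” value.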

Line endings

Different operating systems have different line endings. The line endings are marked by one or two ASCII characters. These are the common styles:

  • CRLF: for the DOS/Windows world
  • CR: for the pre-OSX Mac world
  • LF: for the Unix and Unix-like world (including OSX)

Where the CR and LF characters are defined as such:

  • CR: Carriage Return is ASCII character 13 (0x0D)
  • LF: Line Feed is ASCII character 10 (0x0A)

To detect what line endings a file has you can use “vi” and look for ^M (control-M) characters:

$ vi jag\ förstår\ inte.txt
Do you understand IT?^M
Yes I do!^M
~
~
"jag förstår inte.txt" 2 lines, 34 characters

Or you can use the “file” command:

$ file jag\ förstår\ inte.txt
jag förstår inte.txt: ASCII text, with CRLF line terminators
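
In a script, where opening “vi” is not practical, the lines that end in a CR can also be counted with “grep”. This is a small sketch; the function name “count_crlf_lines” is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: print the number of lines in file $1 that end
# with a carriage return (0x0D) directly before the line feed.
count_crlf_lines() {
    cr=$(printf '\r')          # a literal CR character
    grep -c "${cr}\$" "$1"     # -c prints the count of matching lines
}
```

A result of 0 means that no line has a Windows-style ending. A result equal to the total number of lines means that the whole file uses CRLF.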

To convert line endings from Windows to Linux, use the “dos2unix” tool:

$ file jag\ förstår\ inte.txt
jag förstår inte.txt: ASCII text, with CRLF line terminators
$ dos2unix jag\ förstår\ inte.txt
dos2unix: converting file jag förstår inte.txt to Unix format ...
$ file jag\ förstår\ inte.txt
jag förstår inte.txt: ASCII text
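
On a machine where “dos2unix” cannot be installed, the same effect can be approximated with “sed”. The sketch below is an assumption-laden fallback, not a replacement: the function name “crlf_to_lf” is made up, it writes the result to standard output instead of changing the file in place, and unlike dos2unix it does not skip binary files.

```shell
#!/bin/sh
# Hypothetical fallback for dos2unix: print file $1 with the CR removed
# from the end of every line. The input file itself is left untouched.
crlf_to_lf() {
    cr=$(printf '\r')          # a literal CR character
    sed "s/${cr}\$//" "$1"     # drop one trailing CR per line
}
```

A typical use would be “crlf_to_lf input.txt > output.txt”, followed by a check with the “file” command as shown above.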

Alternatively you can use an editor that supports conversion of line endings. Examples of open source text editors that support conversion of line endings are:

  • “TextMate” on OSX
  • “Notepad++” on Windows
  • “Gedit” on Ubuntu Linux

When I committed files from my OSX laptop to the Git repo, the “git diff” command showed way too many lines (since the line endings were changed). My colleagues showed me how to use the above commands to avoid any problems.

Source: https://labs-origin.leaseweb.com/labs/2013/12/file-name-%EF%BF%BD-invalid-encoding/