On Linux you sometimes get a “�” in a file name, followed by a trailing “(invalid encoding)”. This can happen when moving files from Windows to Ubuntu Linux. When uploading files to a Linux box you basically need two tools to “repair” such incompatibilities: “convmv” and “dos2unix”. The following commands will install them (on a Debian-based Linux):
    sudo apt-get install convmv
    sudo apt-get install dos2unix
Character encoding
To remove the “(invalid encoding)” you use the “convmv” tool. It converts the character encoding used in file names. You can try converting file names from different character sets to UTF-8 using the following commands:
    convmv -r -f windows-1252 -t UTF-8 .
    convmv -r -f ISO-8859-1 -t UTF-8 .
    convmv -r -f cp-850 -t UTF-8 .
These are the three most popular character encodings (for Western Europe). If you need another character encoding, “convmv --list” prints the encodings convmv supports (“locale -m” lists the character maps known to the system). Check out the Wikipedia character encoding page to find the characteristics of each of them. By default convmv only performs a dry run; after you have confirmed that the proposed renames are correct, you can run the actual conversion by adding the “--notest” flag. A typical run looks like this (use “-r” for recursive):
    $ convmv -r -f windows-1252 -t UTF-8 .
    Your Perl version has fleas #37757 #49830
    Starting a dry run without changes...
    mv "./jag f�rst�r inte.txt" "./jag förstår inte.txt"
    No changes to your files done. Use --notest to finally rename the files.
    $ convmv -r -f windows-1252 -t UTF-8 . --notest
    Your Perl version has fleas #37757 #49830
    mv "./jag f�rst�r inte.txt" "./jag förstår inte.txt"
    Ready!
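Before renaming anything, it can help to see which names are actually broken and how their bytes read under each candidate encoding. The sketch below is not part of convmv. It assumes an `iconv` command is installed (it ships with glibc on most Linux systems), and the function names `is_valid_utf8`, `list_invalid_names` and `preview_decodings` are made up for this example.

```shell
#!/bin/sh
# Assumption: an iconv command is available. The function names are
# hypothetical helpers, not part of convmv.

# Succeed only if the bytes on stdin form valid UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Print every path below a directory (default: .) that is not valid UTF-8.
# Limitation: paths containing newlines are not handled.
list_invalid_names() {
    find "${1:-.}" -print | while IFS= read -r path; do
        printf '%s' "$path" | is_valid_utf8 || printf '%s\n' "$path"
    done
}

# Show how the raw bytes of one name read under each candidate encoding.
preview_decodings() {
    for enc in WINDOWS-1252 ISO-8859-1 CP850; do
        decoded=$(printf '%s' "$1" | iconv -f "$enc" -t UTF-8 2>/dev/null) \
            || decoded='(conversion failed)'
        printf '%-12s %s\n' "$enc" "$decoded"
    done
}
```

Calling `list_invalid_names .` prints the offending paths, and passing one of them to `preview_decodings` shows which source encoding gives a readable name. That encoding is the one to hand to `convmv -f`. Note that iconv and convmv do not always spell encoding names the same way.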
Line endings
Different operating systems have different line endings. The line endings are marked by one or two ASCII characters. These are the common styles:
- CRLF: for the DOS/Windows world
- CR: for the pre-OSX Mac world
- LF: for the Unix and Unix-like world (including OSX)
The CR and LF characters are defined as follows:
- CR: Carriage Return is ASCII character 13 (0x0D)
- LF: Line Feed is ASCII character 10 (0x0A)
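To see these two bytes for yourself, you can dump a short piece of text with `od`. This is only a small illustration; the sample text and the function name `dump_crlf_sample` are made up for it.

```shell
#!/bin/sh
# Dump a line that ends in CR LF (the DOS/Windows style) byte by byte.
# od -c shows CR as \r and LF as \n; od -An -tx1 shows the same bytes in hex.
dump_crlf_sample() {
    printf 'Yes I do!\r\n' | od -c
    printf 'Yes I do!\r\n' | od -An -tx1
}

dump_crlf_sample
```

The last two bytes of the line show up as `\r` and `\n` in the first dump and as `0d 0a` in the second.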
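To detect what line endings a file has you can use “vi” and look for ^M (control-M) characters. Vim hides them when every line ends in CRLF and shows “[dos]” in the status line instead; opening the file with “vi -b” keeps them visible: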
    $ vi jag\ förstår\ inte.txt
    Do you understand IT?^M
    Yes I do!^M
    ~
    ~
    "jag förstår inte.txt" 2 lines, 34 characters
Or you can use the “file” command:
    $ file jag\ förstår\ inte.txt
    jag förstår inte.txt: ASCII text, with CRLF line terminators
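If you want the same answer in a script, the style can also be worked out by counting CR and LF bytes. The following is a rough sketch using only standard tools; `line_ending_style` is a hypothetical helper name, and it does not try to tell text files from binary ones.

```shell
#!/bin/sh
# Hypothetical helper: report the line-ending style of a file as
# CRLF, LF, CR, mixed or none, by counting CR and LF bytes.
line_ending_style() {
    cr=$(tr -cd '\r' < "$1" | wc -c)
    lf=$(tr -cd '\n' < "$1" | wc -c)
    # Lines whose last byte before the LF is a CR. grep -c exits non-zero
    # when the count is 0, hence the "|| true".
    crlf=$(grep -c "$(printf '\r')\$" < "$1" || true)
    # Arithmetic expansion strips the padding some wc versions print.
    cr=$((cr)); lf=$((lf)); crlf=$((crlf))

    if [ "$cr" -eq 0 ] && [ "$lf" -eq 0 ]; then
        echo none
    elif [ "$lf" -eq 0 ]; then
        echo CR
    elif [ "$cr" -eq 0 ]; then
        echo LF
    elif [ "$cr" -eq "$lf" ] && [ "$crlf" -eq "$lf" ]; then
        echo CRLF
    else
        echo mixed
    fi
}
```

For example, `line_ending_style "jag förstår inte.txt"` would print `CRLF` for the file shown above.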
To convert the line endings from Windows to Unix style, use the “dos2unix” command:

    $ file jag\ förstår\ inte.txt
    jag förstår inte.txt: ASCII text, with CRLF line terminators
    $ dos2unix jag\ förstår\ inte.txt
    dos2unix: converting file jag förstår inte.txt to Unix format ...
    $ file jag\ förstår\ inte.txt
    jag förstår inte.txt: ASCII text
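If dos2unix is not installed, the same effect can be approximated with `sed` and `tr`. This is a minimal sketch, not how dos2unix itself works; `crlf_to_lf` and `cr_to_lf` are hypothetical names, and both functions read standard input and write standard output.

```shell
#!/bin/sh
# Hypothetical stand-ins for dos2unix, reading stdin and writing stdout.

# CRLF -> LF: drop one CR at the end of each line.
crlf_to_lf() {
    sed "s/$(printf '\r')\$//"
}

# CR -> LF: for files from the pre-OSX Mac world, which contain no LF at all.
cr_to_lf() {
    tr '\r' '\n'
}
```

A call would look like `crlf_to_lf < windows.txt > unix.txt` (the file names are placeholders). Write to a different file than the one you read from, since redirecting onto the input file would empty it before it is read.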
Alternatively you can use an editor that supports conversion of line endings. Examples of open source text editors that support conversion of line endings are:
- “TextMate” on OSX
- “Notepad++” on Windows
- “Gedit” on Ubuntu Linux
When I committed files from my OSX laptop to the Git repo, the “git diff” command showed way too many lines (since the line endings were changed). My colleagues showed me how to use the above commands to avoid any problems.