UnicodeDecodeError #64

Open
tohka opened this issue Apr 18, 2020 · 6 comments

Comments


tohka commented Apr 18, 2020

Hi,

An error occurs when selecting a layer that contains Japanese characters in the file path.
The likely cause is that the R script is generated in UTF-8, but R tries to interpret it as Shift_JIS or CP932 (the Japanese encodings).
OS: Windows10 (locale: Japanese)

Sample code

##Test=group
##Layer structure=name
##Layer=vector
str(Layer)

and log

QGIS version: 3.12.0-București
QGIS code revision: cd141490ec
Qt version: 5.11.2
GDAL version: 3.0.4
GEOS version: 3.8.0-CAPI-1.13.1 
PROJ version: Rel. 6.3.1, February 10th, 2020
R version: 2.0.0
Processing algorithm…
Algorithm 'Layer structure' starting…
Input parameters:
{ 'Layer' : 'C:/Users/username/Documents/サンプル.gpkg|layername=サンプル' }

R execution commands
options("repos"="http://cran.at.r-project.org/")
.libPaths("C:/Users/username/AppData/Roaming/QGIS/QGIS3/profiles/default/processing/rlibs")
tryCatch(find.package("sf"), error = function(e) install.packages("sf", dependencies=TRUE))
library("sf")
tryCatch(find.package("raster"), error = function(e) install.packages("raster", dependencies=TRUE))
library("raster")
Layer <- st_read("C:/Users/username/Documents/サンプル.gpkg", layer = "サンプル", quiet = TRUE, stringsAsFactors = FALSE)
str(Layer)

R execution console output
[1] "C:/Users/username/bin/R-Portable/App/R-Portable/library/sf"
Linking to GEOS 3.6.1, GDAL 2.2.3, PROJ 4.9.3
[1] "C:/Users/username/AppData/Roaming/QGIS/QGIS3/profiles/default/processing/rlibs/raster"
要求されたパッケージ sp をロード中です
警告メッセージ:
パッケージ 'raster' はバージョン 3.6.3 の R の下で造られました
Traceback (most recent call last):
File "C:/Users/username/AppData/Roaming/QGIS/QGIS3\profiles\default/python/plugins\processing_r\processing\algorithm.py", line 317, in processAlgorithm
output = RUtils.execute_r_algorithm(self, parameters, context, feedback)
File "C:/Users/username/AppData/Roaming/QGIS/QGIS3\profiles\default/python/plugins\processing_r\processing\utils.py", line 258, in execute_r_algorithm
for line in iter(proc.stdout.readline, ''):
UnicodeDecodeError: 'cp932' codec can't decode byte 0xef in position 3: illegal multibyte sequence

Execution failed after 1.39 seconds

Loading resulting layers
Algorithm 'Layer structure' finished
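
The traceback shows that the failure happens on the Python side: `proc.stdout.readline` decodes the R output with the locale codec (cp932) and hits bytes that are not valid in it. A minimal sketch of a reader that cannot raise on undecodable bytes follows. It is not the plugin's code: `run_and_collect` is a hypothetical helper, UTF-8 is only an assumption about what the child process emits, and a Python child process stands in for Rscript so that the example is self-contained.

```python
import subprocess
import sys


def run_and_collect(command, encoding="utf-8"):
    """Run a command and return its output lines without raising on bad bytes."""
    proc = subprocess.Popen(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        encoding=encoding,  # decode explicitly instead of using the locale codec
        errors="replace",   # undecodable bytes become U+FFFD instead of raising
    )
    lines = [line.rstrip("\r\n") for line in proc.stdout]
    proc.wait()
    return lines


# Stand-in for Rscript (assumption): a child process writing UTF-8 bytes to stdout.
child = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.buffer.write('\\u30b5\\u30f3\\u30d7\\u30eb\\n'.encode('utf-8'))",
]
print(run_and_collect(child))
```

With `errors="replace"` a wrong guess about the encoding produces garbled log lines rather than an aborted algorithm, which may or may not be the right trade-off for the plugin.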

With the following change, the error no longer occurs.

--- utils.py.orig	Tue Apr 14 18:55:18 2020
+++ utils.py	Wed Apr 15 09:17:59 2020
@@ -238,6 +238,9 @@
 
         script_filename = RUtils.create_r_script_from_commands(script_lines)
 
+        script_lines = ['options(encoding = "UTF-8")', 'source("%s", encoding = "UTF-8")' % script_filename]
+        script_filename = RUtils.create_r_script_from_commands(script_lines)
+
         # run commands
         command = [
             RUtils.path_to_r_executable(script_executable=True),
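
The patch wraps the generated script in a second script that calls `source()` with an explicit encoding. A minimal sketch of building those two wrapper lines in isolation follows. `build_utf8_wrapper` is a hypothetical helper, not part of the plugin, and converting backslashes to forward slashes is an added assumption so that a Windows path does not introduce escape sequences into the R string literal.

```python
from pathlib import PureWindowsPath


def build_utf8_wrapper(script_filename):
    """Return R lines that source another script as UTF-8 (sketch only)."""
    # Assumption: forward slashes are acceptable to R on Windows and avoid
    # backslash escapes inside the double-quoted R string.
    r_path = PureWindowsPath(script_filename).as_posix()
    return [
        'options(encoding = "UTF-8")',
        'source("{}", encoding = "UTF-8")'.format(r_path),
    ]


for line in build_utf8_wrapper("C:\\Temp\\processing_script.R"):
    print(line)
```

The sketch does not escape double quotes in the path; a real implementation would need to handle that case.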
Collaborator

JanCaha commented Apr 19, 2020

Encodings in R are genuinely problematic. The solution you suggest is probably not a good one, because setting options(encoding = "UTF-8") can have unexpected side effects.

Could you check whether either of these scripts works? The first tests whether passing just the path works; the second explicitly converts the path to your native encoding.

##Test=group
##Layer structure=name
##pass_filenames
##Layer=vector
Layer = st_read(Layer, quiet = TRUE, stringsAsFactors = FALSE)
str(Layer)

##Test=group
##Layer structure=name
##pass_filenames
##Layer=vector
Encoding(Layer) <- "UTF-8"
Layer <- enc2native(Layer)
Layer = st_read(Layer, quiet = TRUE, stringsAsFactors = FALSE)
str(Layer)

Author

tohka commented Apr 30, 2020

Both of the presented scripts give an error.

Traceback (most recent call last):
File "C:/Users/username/AppData/Roaming/QGIS/QGIS3\profiles\default/python/plugins\processing_r\processing\algorithm.py", line 340, in processAlgorithm
output = RUtils.execute_r_algorithm(self, parameters, context, feedback)
File "C:/Users/username/AppData/Roaming/QGIS/QGIS3\profiles\default/python/plugins\processing_r\processing\utils.py", line 281, in execute_r_algorithm
for line in iter(proc.stdout.readline, ''):
UnicodeDecodeError: 'cp932' codec can't decode byte 0xef in position 3: illegal multibyte sequence

Execution failed after 1.38 seconds

Loading resulting layers
Algorithm 'Layer structure' finished

I ran the following script, saved once in Shift_JIS and once in UTF-8, with Rscript.

x <- "日本語"
print("end of script")

$ rscript test.sjis.R
[1] "end of script"

$ rscript test.utf8.R
Error: invalid multibyte character in parser at line 1
Execution halted

I only assigned a multibyte string and never evaluated its value, yet I got an error. This suggests that the error comes from an encoding mismatch while the R interpreter parses the script. That is why I think an encoding option is needed when the script file is loaded.
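
The same mismatch can be reproduced outside R. The sketch below encodes the one-line script both ways and checks which byte strings a strict cp932 decoder accepts. Python's decoder is used here only as an analogy for the R parser; whether R applies exactly the same validation is an assumption, and `decodes_as` is a hypothetical helper.

```python
# The script text from the experiment, encoded the two ways it was saved.
source_text = 'x <- "日本語"\n'
utf8_bytes = source_text.encode("utf-8")
sjis_bytes = source_text.encode("cp932")


def decodes_as(data, codec):
    """Return True if the bytes are valid in the given codec, False otherwise."""
    try:
        data.decode(codec)
        return True
    except UnicodeDecodeError:
        return False


# A reader that assumes cp932 accepts the Shift_JIS file but not the UTF-8 one.
print("Shift_JIS file read as cp932:", decodes_as(sjis_bytes, "cp932"))
print("UTF-8 file read as cp932:", decodes_as(utf8_bytes, "cp932"))
```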

Collaborator

JanCaha commented May 1, 2020

It's all rather strange. I tested the solution you proposed, and while it worked fine with non-ASCII characters from my language (Czech), it caused an error when I used Japanese characters.

Could you please test your solution (the changes to utils.py) with a layer named ěščř.gpkg? It is just a couple of specifically Czech characters that worked for me.

I think that there might be some R setting causing the problems, most likely the locale.

Author

tohka commented May 1, 2020

"ěščř" encoded in ISO 8859-2 is 0xEC 0xB9 0xE8 0xF8. Interpreted strictly as UTF-8, these bytes do not correspond to any characters, but the sequence contains none of the disallowed bytes. It does not raise an error, but I am not sure it works correctly.

"日本語" encoded in Shift-JIS (cp932) is 0x93 0xFA 0x96 0x7B 0x8C 0xEA. In UTF-8, bytes of the form 0x8X and 0x9X are not allowed as the leading byte of a sequence, so I guess that is why it fails.
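
The byte values quoted here can be checked with a short sketch, which also asks Python's strict UTF-8 decoder why it rejects each sequence. `hex_bytes` and `utf8_error` are hypothetical helpers. Python's decoder is stricter than what is reported for R above: it rejects the Czech bytes as well, though for a different reason than the Japanese ones.

```python
def hex_bytes(data):
    """Format bytes the way they are written in the comment, e.g. '0xEC 0xB9'."""
    return " ".join("0x{:02X}".format(b) for b in data)


def utf8_error(data):
    """Return the reason a strict UTF-8 decode fails, or None if it succeeds."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.reason
    return None


czech = "ěščř".encode("iso8859_2")
japanese = "日本語".encode("cp932")

for label, data in (("ISO 8859-2", czech), ("cp932", japanese)):
    print(label, hex_bytes(data), "->", utf8_error(data))
```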

Collaborator

JanCaha commented May 1, 2020

I would guess that it is related to the issue mentioned here: https://stackoverflow.com/questions/46946483/czech-encoding-in-r. Setting it in the .Rprofile file would make it permanent for the system. You can select one of the available code pages from here; unfortunately, the UTF encodings are not among them.

I don't see a way to solve it reasonably.

Collaborator

JanCaha commented May 15, 2020

Looks like I found a solution that might work while not breaking anything.

Could you try replacing r_templates.py in the plugin directory with this version: https://github.com/JanCaha/qgis-processing-r/blob/bug_utf-8/processing_r/processing/r_templates.py?

It works only for sf layers for now and would need a lot of polishing before it could be used, but it seems to work. It passes the paths encoded from Python and interprets them as UTF-8 in R. It works on my computer with Czech as the system language, even for Japanese characters.
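
For illustration only, one way to get a non-ASCII path from Python into an R script without depending on the script file's encoding is to write the path as an ASCII-only R string literal using R's `\u` escapes. This is a different technique from whatever the linked branch does, which is not reproduced here; `r_ascii_string_literal` is a hypothetical helper, and how R and sf treat the resulting string on a given Windows locale is not verified by this sketch.

```python
def r_ascii_string_literal(text):
    """Return an R double-quoted string literal that contains only ASCII."""
    parts = []
    for ch in text:
        code = ord(ch)
        if code < 0x20 or code == 0x7F:
            # Control characters are out of scope for this sketch.
            raise ValueError("control characters are not handled in this sketch")
        if ch in ('"', "\\"):
            parts.append("\\" + ch)                  # escape quote and backslash
        elif code < 0x7F:
            parts.append(ch)                         # printable ASCII passes through
        elif code <= 0xFFFF:
            parts.append("\\u{:04x}".format(code))   # R \unnnn escape
        else:
            parts.append("\\U{:08x}".format(code))   # R \Unnnnnnnn escape
    return '"' + "".join(parts) + '"'


literal = r_ascii_string_literal("C:/Users/username/Documents/サンプル.gpkg")
print("Layer <- st_read({}, quiet = TRUE, stringsAsFactors = FALSE)".format(literal))
```

Because the generated line is pure ASCII, it is the same sequence of bytes in UTF-8 and in cp932, so the parser cannot misread it.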
