#891 UTF-8 encoded files aren't displayed correctly in tree/file view
Closed: Fixed 7 years ago Opened 8 years ago by nphilipp.

Comparing RPM changelog authorship lines in the tree/file view and the corresponding commit diff, I noticed that UTF-8 encoded characters are displayed wrongly in the tree/file view (and correctly in the diff). In this case the former lists the author's last name as "Å abata" when it should be "Šabata".


hm, I can't reproduce this locally, so I'm going to wait for the next release since it brings quite a number of changes; let's check again after it.

hm, this is most annoying, I can't reproduce it locally but the new pagure doesn't fix it.

Question: in which locale was this file saved in git? I'm wondering if the user's locale could be something
other than UTF-8, which would mean the file is stored in a slightly different way.

I've just checked it in my local clone of the repo, and it's UTF-8 encoded.

...and how it's displayed makes me think it is misinterpreted as ISO-8859-1: the UTF-8 encoding of 'Š' is C5 A0; in ISO-8859-1, C5 is 'Å' and A0 is the non-breaking space.
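
That interpretation is easy to verify (a minimal Python sketch using the byte values mentioned above):

    # 'Š' (U+0160) encodes to the two bytes C5 A0 in UTF-8.
    data = "Šabata".encode("utf-8")
    print(data[:2].hex())             # 'c5a0'

    # Decoding those same bytes as ISO-8859-1 maps C5 -> 'Å' and A0 -> NBSP,
    # which matches the "Å abata" rendering seen in the tree/file view.
    print(data.decode("iso-8859-1"))  # 'Å\xa0abata'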

You should be able to look at the headers for the returned web page to see which character set is being used. That might help narrow down where the problem is.

The web page is sending utf-8. There doesn't seem to be any change of character set in the generated web page. I wonder if some sort of file/charset conversion got done on the file, breaking it.

Another data point is that getting the raw version, which also gets sent as utf-8, gets displayed correctly. So now it looks like the web server might be running some sort of conversion process on the file before displaying it. https://pagure.io/fm-howdoi-module/raw/77d0d996c50de461a51b21f0d4df924cd84d37fc/f/howdoi.spec
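
For reference, a quick way to check that from Python (a sketch using the Python 3 standard library; the URL is the raw link above):

    import urllib.request

    url = ("https://pagure.io/fm-howdoi-module/raw/"
           "77d0d996c50de461a51b21f0d4df924cd84d37fc/f/howdoi.spec")

    with urllib.request.urlopen(url) as resp:
        # The charset, if any, is carried in the Content-Type header,
        # e.g. "text/plain; charset=UTF-8".
        print(resp.headers.get("Content-Type"))
        print(resp.headers.get_content_charset())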

The file is definitely getting converted somewhere. Perhaps whatever Python code makes things pretty for display has a charset issue which results in a type conversion?

I think you need to do something special when you read in files that are unicode.
The following page has some info about python2 and unicode:
http://www.evanjones.ca/python-utf8.html
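
In practice that means decoding the file explicitly instead of mixing byte strings and unicode (a minimal sketch that works in Python 2, assuming the file really is UTF-8; the file name is just an example from this ticket):

    import io

    # io.open behaves the same in Python 2 and 3: with an explicit encoding
    # it returns unicode text rather than raw bytes, so later string
    # formatting does not trigger implicit (ASCII) conversions.
    with io.open("howdoi.spec", encoding="utf-8") as fh:
        content = fh.read()

    # Mixing this with literal strings is now safe as long as those
    # literals are unicode too (u"..." in Python 2).
    page = u"<pre>%s</pre>" % content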

There seems to be a more general problem, though. It doesn't look like git has a way to track the encoding of files. This makes some sense, since for committing changes you don't typically want conversions.
But for displaying files properly you do need to know which encoding to use, unless you are going to require everything to be UTF-8. While things are moving that way, it may not be appropriate for all projects.
I don't know that there would be a perfect way to track this metadata in Pagure. There are programs that can try to guess the encoding used in a file; I am not sure how practical they would be to use when displaying files.

It looks like uchardet or python-chardet could be used to try to guess a file's encoding before displaying a file on a web page.
This should be a separate issue for a feature request, as I don't think it is a solution for this bug.
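
For reference, the chardet API for that kind of guess looks roughly like this (a sketch; the file name is a placeholder and confidence values will vary per file):

    import chardet

    with open("somefile.spec", "rb") as fh:
        raw = fh.read()

    guess = chardet.detect(raw)
    # e.g. {'encoding': 'ISO-8859-2', 'confidence': 0.78, ...}
    print(guess["encoding"], guess["confidence"])

    # Fall back to utf-8 if chardet has no opinion at all.
    text = raw.decode(guess["encoding"] or "utf-8", errors="replace")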

I actually think this is related, I implemented it in: https://pagure.io/pagure/pull-request/1012

Ok, so the commit diff looks good but the file view still doesn't :(

I think there were two different problems and your fix was for the other one. The input file in this case has its data mixed with some constant strings, and I don't think that gets handled the way we want in Python 2.
This is one I was hoping to test with a local instance that I haven't set up yet.

An example file that Pagure currently displays with mojibake is this one: https://pagure.io/python-daemon/blob/master/f/daemon/_metadata.py

Rather than guessing encodings (which can fail), I think Pagure should simply default to UTF-8.

Just noticed this repeated regression, too (see, e.g., #494):
https://pagure.io/clufter/blob/b4b12614e1d243c0ed21b3be7896a153fb02d219/f/facts.py#_6

__author__ = "Jan Pokorný <jpokorny @at@ Red Hat .dot. com>"

I am definitely using UTF-8 in my environment (unless there would be some bad bug in vim/git/vte/bash/...).

The Problems

So here's the series of events that leads us to the problem in my development environment, which is running Flask's development server and using HTTP/1.0:

  1. mimetypes.guess_type fails to guess the type and encoding of the file

  2. the '\0' byte is not present in the file, so it's classified as 'text/plain' as opposed to 'application/octet-stream'

  3. chardet is used to guess the character encoding. It guesses ISO-8859-2 with 78% confidence because there isn't a BOM in the file (and therefore it can't be certain it's utf-8). Its second guess is utf-8 with 75% confidence. Womp womp. (Steps 1-3 are sketched in code after the reference list below.)

  4. The Content-Encoding header is set to ISO-8859-2. Based on my interpretation of the HTTP/1.1 RFC[0] this is incorrect. The Content-Encoding header is used to indicate the content has been encoded in some fashion (gzip, base64, etc) and after it is decoded the Content-Type header will hold true.

  5. Things now get dicey. Steps 1-4 don't matter since we didn't set the charset in the Content-Type header. According to RFC2616[1], the original RFC for HTTP/1.1, the default charset is (humorously) ISO-8859-1. However, RFC7231[2] obsoletes RFC2616 and indicates that the default charset is defined by the media type. RFC2046[3] indicates that for text/plain, this should be us-ascii. Therefore, I would expect results to vary/crash and burn depending on which RFC the client honors. It's also worth noting the client is free to ignore the declared type and encoding and attempt to guess the type.

[0] https://tools.ietf.org/html/rfc7231#section-3.1.2.2
[1] https://tools.ietf.org/html/rfc2616#section-3.7.1
[2] https://tools.ietf.org/html/rfc7231#appendix-B
[3] https://tools.ietf.org/html/rfc2046#section-4.1.3
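
A minimal sketch reproducing steps 1-3 above (the file name is a placeholder):

    import mimetypes
    import chardet

    path = "howdoi.spec"   # placeholder; any extension mimetypes doesn't know

    # Step 1: mimetypes.guess_type only looks at the file name, so for this
    # extension it returns (None, None) -- no type, no encoding.
    mimetype, transfer_encoding = mimetypes.guess_type(path)

    with open(path, "rb") as fh:
        data = fh.read()

    # Step 2: crude binary check -- without a NUL byte the file is treated
    # as 'text/plain' rather than 'application/octet-stream'.
    if mimetype is None:
        mimetype = "application/octet-stream" if b"\0" in data else "text/plain"

    # Step 3: chardet guesses the charset; with no BOM it can prefer
    # ISO-8859-2 over utf-8 by a few percent of confidence.
    guess = chardet.detect(data)
    print(mimetype, guess["encoding"], guess["confidence"])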

Note that the production deployment is being run by Apache, and I suspect (but have not yet verified) that it is responsible for setting the Content-Type: text/plain; charset=UTF-8 result I'm seeing production return that differs from my development environment.

The Possible Solutions

Ignoring for a moment the incorrect use of Content-Encoding, there is still a problem. From my reading of the pygit2 documentation[0] and my understanding of git, there's no way to know the encoding a file is using from a blob. We can only make guesses. So here are the options I've thought of:

  1. Keep accepting whatever chardet guesses and just use it.

  2. Keep using chardet, but if it contains utf-8 as a potential encoding, prefer it over other encodings.

  3. This is a crazy idea, but we could inspect the commit that introduced the file and since it does have a character encoding associated with it, we could default to that since it's likely that the file is encoded using the same character encoding (right?).

I think 1 or 2 is the most reasonable, but I lean towards 2 since I'm biased in my utf-8 filled world. No matter what we do, there are cases when we can guess incorrectly.

Edit: I should also clarify, this is for viewing the "raw" file, which I consider the "simple" case of detecting the file encoding and type, and then just shoveling the raw file out to the client.

Content encoding guessing is a very hard problem that is worth putting on your résumé if you solve it, especially within text that is almost ascii, like this particular sentence.

I have a general feeling that UTF-8 is still rising in popularity and is widely used[citation-needed]. One nice thing about UTF-8 (as opposed to latin-1) is that it is a bit (get it?) particular about certain bits within the bytes, which means that it sometimes fails to decode characters that were encoded with other encodings. To be clear, this doesn't mean that a successful decode means the original string was UTF-8, but a failure to decode does mean it wasn't (or that your data is corrupted, but that's a different discussion).

Due to that particularity, I'd be inclined towards the second option you listed above. Another option that occurred to me when considering your ideas is that we could do what you are suggesting, but give an unfair boost to the UTF-8 confidence that chardet gives us. For example, if chardet gave us a 78% guess that it's latin-1 and a 73% guess that it's utf-8, and we arbitrarily gave utf-8 an unfair 10% advantage, bam, now we think it's UTF-8. I'm assuming chardet does try to decode with utf-8 and returns no mention of utf-8 if it can't be decoded that way, in which case we would just ignore it.
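
Roughly what that boost could look like (a sketch only; the helper name and the 0.10 figure are made up, and it assumes a chardet new enough to provide detect_all()):

    import chardet

    UTF8_BOOST = 0.10   # arbitrary "unfair advantage" for utf-8

    def pick_encoding(data):
        """Pick an encoding from chardet's candidates, nudging utf-8 upward."""
        best_encoding, best_score = "utf-8", 0.0
        for guess in chardet.detect_all(data):
            encoding = (guess.get("encoding") or "").lower()
            score = guess.get("confidence") or 0.0
            if encoding in ("utf-8", "ascii"):
                score += UTF8_BOOST
            if score > best_score:
                best_encoding, best_score = guess["encoding"], score
        return best_encoding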

In general, this is an extremely difficult problem and I don't think we will find a "perfect" solution for it. I think you are proposing "good" solutions to the problem, and sometimes you just have to go with "good enough" ☺ More advanced systems can use machine learning and natural language processing techniques to look at the resulting text to determine which looks more like "human text", but even that isn't reliable due to amazing authors like Lewis Carroll, who wrote the Jabberwocky which isn't even English but somehow sure does look like it!

On Thu, 2016-10-13 20:25 +0000, pagure@pagure.io wrote:

So here's the series of events that leads us to the problem in my
development environment […]

Thank you for the excellent summary of a thorough investigation.

We can only make guesses. So here are the options I've
thought of: […]

2. Keep using chardet, but if it contains utf-8 as a potential encoding,
prefer it over other encodings.

I would recommend:

  • Assume 'utf-8' by default.

  • Ask 'chardet' its opinion; only if the response is certainty[0] of a
    different encoding, use that; otherwise stick to 'utf-8' as the default.

[0]: or whatever level of confidence is deemed to overrule the "assume
'utf-8' by default". It should be a very high level of confidence to
use a non-default encoding.

--
Ben Finney ben+fedoraproject@benfinney.id.au
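
In code, that recommendation might look roughly like this (a sketch; the 0.9 threshold and the decode_blob name are arbitrary stand-ins for "a very high level of confidence"):

    import chardet

    CONFIDENCE_THRESHOLD = 0.9   # "very high" -- placeholder value

    def decode_blob(data):
        """Decode a git blob, assuming utf-8 unless chardet is nearly certain."""
        guess = chardet.detect(data)
        encoding = guess.get("encoding")
        if (encoding and encoding.lower() != "utf-8"
                and guess.get("confidence", 0) >= CONFIDENCE_THRESHOLD):
            return data.decode(encoding, errors="replace")
        return data.decode("utf-8", errors="replace")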

Keep using chardet, but if it contains utf-8 as a potential encoding, prefer it over other encodings.

This sounds like a good approach, I am also biased toward the UTF-8 world and while we can't assume everything is, we might be right in doing so most of the time.

So :thumbsup_tone1: for the second solution, and worst case we can always reconsider if it turns out to be less effective than we thought :)

Looking at the original links, it seems that @jcline's PR #1440 fixed this issue :)

@pingou changed the status to Closed 7 years ago

Now, I observe instead:

__author__ = "Jan PokornĂ˝ <jpokorny @at@ Red Hat .dot. com>"

@jpokorny yes this will not change as the text there was saved as is, but the two links in the original comment are both showing fine :)

@pingou, can you elaborate on this, please?
Should I expect only the new commits will eventually heal this issue?

The fix should work (or not) regardless of when the commit was made.

The problem is Pagure is incorrectly identifying that file as ISO-8859-2 (you can tell by curling the raw file and looking at the charset portion of the Content-Type header: Content-Type: text/x-python; charset=ISO-8859-2). It decodes that file to a Python unicode object using that character encoding (which is what causes the incorrect characters), then encodes it to UTF-8 when it renders the HTML. The only thing I can think to do is further increase the bias to UTF-8 we added to chardet, but it'll still be wrong occasionally.
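
The round trip is easy to reproduce (a minimal sketch; the name is taken from the snippet quoted above):

    # The file's bytes are UTF-8, e.g. 'ý' (U+00FD) is C3 BD.
    raw = u"Jan Pokorný".encode("utf-8")

    # Pagure decodes them with chardet's wrong guess (ISO-8859-2):
    # C3 -> 'Ă', BD -> '˝' ...
    wrong = raw.decode("iso-8859-2")

    # ... and then re-encodes the result as UTF-8 when rendering the HTML,
    # which is why the page shows the mojibake below.
    print(wrong)   # Jan PokornĂ˝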

I think increasing the bias would be reasonable.
How many files are expected to be encoded with ISO-8859-2 these days?

As a side note, this encoding is suitable for the letters of the Czech
language, so I had an episode of using it heavily, but that dates back
10+ years.

I've also noticed that sometimes UTF-8 wins in a very similar scenario:
https://pagure.io/clufter/blob/8d6d8366032362c24a7bacb3d59fd206c89bee1d/f/__main__.py

What would also be possible, though I have no idea about the performance
impact, would be to add additional heuristics, e.g. for Python files
(which happen to be the most frequent in the discussed case) checking
whether the encoding is explicitly declared (which it is here).
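
For Python files that heuristic could key off the PEP 263 coding declaration (a sketch; whether the extra parsing is worth it is exactly the open question):

    import re

    # PEP 263: the declaration must appear on the first or second line,
    # e.g. "# -*- coding: utf-8 -*-".
    CODING_RE = re.compile(br"coding[:=]\s*([-\w.]+)")

    def declared_encoding(data):
        """Return the encoding declared in a Python source blob, if any."""
        for line in data.splitlines()[:2]:
            match = CODING_RE.search(line)
            if match:
                return match.group(1).decode("ascii")
        return None

In Python 3, the standard library's tokenize.detect_encoding performs essentially this check.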

On 29-Nov-2016, Jan Pokorný <pagure@pagure.io> wrote:

I think increasing the bias would be reasonable.

What do you think of the earlier suggestion: If "UTF-8" appears in the
set of suggested encodings, use that.

--
Ben Finney ben@benfinney.id.au

@bignose Also a good option.
Hopefully, automatically leaning toward UTF-8 hasn't been considered too aggressive for quite some time now :-)
