Property sun.stdout.encoding is undefined in recent Java versions#407

Open
wfouche wants to merge 12 commits into jython:master from wfouche:dev/console-utf8

@wfouche
Member

@wfouche wfouche commented Dec 15, 2025

This PR was tested with a simple script that displays a Unicode character.

C:\...> "activate JDK 8"
C:\...> ant installer

C:\...> chcp 65001
Active code page: 65001

C:\...> java -jar dist\jython-standalone.jar
>>> unicode_string = u"Hello World: \u2764"
>>> print unicode_string
Hello World: ❤

C:\...> "activate JDK 25.0.1"

C:\...> java -jar dist\jython-standalone.jar
>>> unicode_string = u"Hello World: \u2764"
>>> print unicode_string
Hello World: ❤

Both runs, on Java 8 and on Java 25, print the character correctly.

Fixes #404

- String output = Py.getCommandResultWindows("chcp");
+ String output = Py.getCommandResultWindows("chcp.com");
/*
* The output will be like "Active code page: 850" or maybe "Aktive Codepage: 1252." or
Member

@jeff5 jeff5 Dec 15, 2025

I'm explaining why the only reliable bit of the message is the number. ISTR this one is German.

Ja. So stimmt es. https://www.mikrocontroller.net/topic/561065

Member Author

Reverted back to Aktive.

@jeff5
Member

jeff5 commented Dec 15, 2025

As I commented on the original issue, testing console handling is hard because our CI doesn't really exercise it. So it's down to conscientious local testing.

Comment thread src/org/python/core/PySystemState.java Outdated
encoding = props.getProperty("sun.stdout.encoding");
if (encoding != null) {
// Windows: these versions of Java return "cp65001" for UTF-8
if (encoding.equals("cp65001")) {
Member

Shouldn't we only bother with this inside the "if it's Windows" clause?

Member Author

Yes, good catch. Fixed.

Member Author

Yes. Fixed.

Comment thread src/org/python/core/PySystemState.java Outdated
if (os != null && os.startsWith("Windows")) {
// Go via the Windows code page built-in command "chcp".
- String output = Py.getCommandResultWindows("chcp");
+ String output = Py.getCommandResultWindows("chcp.com");
Member

Why change?

Member Author

When running chcp from Git-Bash on Windows, it cannot find "chcp" and one has to specify the full name which is "chcp.com". I thought Jython might experience the same issue, but it does not. So will revert this change.

Member Author

C:\> where chcp
C:\Windows\System32\chcp.com

@wfouche
Member Author

wfouche commented Dec 16, 2025

@jeff5, the following code fragment is unlikely to be executed on Windows, given that stdout.encoding and sun.stdout.encoding seem to cover all bases.

        if (isWindows) {
            // Go via the Windows code page built-in command "chcp".
            String output = Py.getCommandResultWindows("chcp");
            /*
             * The output will be like "Active code page: 850" or maybe "Aktive Codepage: 1252." or
             * "활성 코드 페이지: 949". Assume the first number with 2 or more digits is the code page.
             */
            final Pattern DIGITS_PATTERN = Pattern.compile("[1-9]\\d+");
            Matcher matcher = DIGITS_PATTERN.matcher(output);
            if (matcher.find()) {
                encoding = "cp".concat(output.substring(matcher.start(), matcher.end()));
                if (encoding.equals("cp65001")) {
                    encoding = "utf-8";
                }
                return encoding;
            }

        }
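The digit-matching step in the fragment above can be tried in isolation. This is only a minimal sketch, not the Jython code: the class `ChcpParser` and the method `encodingFromChcp` are hypothetical names, and the sample strings are the ones quoted in the source comment (the Korean one is written with Unicode escapes so the file compiles under any source encoding).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Jython.
class ChcpParser {
    // First run of two or more digits that does not start with 0.
    private static final Pattern DIGITS_PATTERN = Pattern.compile("[1-9]\\d+");

    /** Returns e.g. "cp850" or "utf-8", or null if no code page number is found. */
    static String encodingFromChcp(String output) {
        Matcher matcher = DIGITS_PATTERN.matcher(output);
        if (!matcher.find()) {
            return null;
        }
        String encoding = "cp" + matcher.group();
        // Python has no codec called cp65001, so report it as utf-8.
        return encoding.equals("cp65001") ? "utf-8" : encoding;
    }

    public static void main(String[] args) {
        String[] samples = {
            "Active code page: 850",
            "Aktive Codepage: 1252.",
            // Korean "active code page" label followed by 949
            "\uD65C\uC131 \uCF54\uB4DC \uD398\uC774\uC9C0: 949",
            "Active code page: 65001",
        };
        for (String sample : samples) {
            System.out.println(encodingFromChcp(sample));
        }
    }
}
```

Only the number is inspected, so the localized label and any trailing punctuation do not matter.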

@wfouche
Member Author

wfouche commented Dec 16, 2025

@jeff5, I installed SDKMAN using Git-Bash on Windows, and this makes it a lot easier to do Jython console testing using different versions of Java. With SDKMAN I can easily install and switch between different JDK versions.

First install scoop - https://scoop.sh/

Then run these commands to install Git/Bash, curl, zip and unzip

  • scoop install git
  • scoop install curl zip unzip

Lastly, install SDKMAN

Now one can start bash.exe from inside the Jython project folder

cd IdeaProjects\jython

C:\...\IdeaProjects\jython> bash.exe
$ sdk install java 8.0.472-tem
$ sdk install java 25.0.1-tem
$ sdk use java 8.0.472-tem
$ ant installer
$ java -version
openjdk version "1.8.0_472"
OpenJDK Runtime Environment (Temurin)(build 1.8.0_472-b08)
OpenJDK 64-Bit Server VM (Temurin)(build 25.472-b08, mixed mode)
$ java -jar dist/jython-standalone.jar
Jython 2.7.5a1-SNAPSHOT (heads/dev/console-utf8:545df7b60, Dec 16 2025, 09:34:43)
[OpenJDK 64-Bit Server VM (Temurin)] on java1.8.0_472
Type "help", "copyright", "credits" or "license" for more information.
>>>
$ sdk use java 25.0.1-tem
$ java -version
openjdk version "25.0.1" 2025-10-21 LTS
OpenJDK Runtime Environment Temurin-25.0.1+8 (build 25.0.1+8-LTS)
OpenJDK 64-Bit Server VM Temurin-25.0.1+8 (build 25.0.1+8-LTS, mixed mode, sharing)
$ java -jar dist/jython-standalone.jar
Jython 2.7.5a1-SNAPSHOT (heads/dev/console-utf8:545df7b60, Dec 16 2025, 09:34:43)
[OpenJDK 64-Bit Server VM (Eclipse Adoptium)] on java25.0.1
Type "help", "copyright", "credits" or "license" for more information.
>>>
$ exit
C:\...\IdeaProjects\jython>

Bash is just a different Windows shell, so when Jython runs it still runs as it normally would as a Windows process.

Don't you think this would make a good addition to the README.md file?

@wfouche
Member Author

wfouche commented Dec 18, 2025

Results of tests performed on Linux, MacOS and Windows. We can see that for Java 21 and higher, property stdout.encoding is always defined. The older versions tested (8, 11 and 17) don't define property stdout.encoding, and may or may not define property sun.stdout.encoding.

Linux

  • java.version = 1.8.0_462, sun.stdout.encoding = null, stdout.encoding = null
  • java.version = 11.0.29, sun.stdout.encoding = null, stdout.encoding = null
  • java.version = 17.0.16, sun.stdout.encoding = UTF-8, stdout.encoding = null
  • java.version = 21.0.9, sun.stdout.encoding = null, stdout.encoding = UTF-8
  • java.version = 25, sun.stdout.encoding = null, stdout.encoding = UTF-8

MacOS

  • java.version = 1.8.0_472, sun.stdout.encoding = null, stdout.encoding = null
  • java.version = 11.0.29, sun.stdout.encoding = null, stdout.encoding = null
  • java.version = 17.0.17, sun.stdout.encoding = UTF-8, stdout.encoding = null
  • java.version = 21.0.9, sun.stdout.encoding = null, stdout.encoding = UTF-8
  • java.version = 25.0.1, sun.stdout.encoding = null, stdout.encoding = UTF-8

Windows

  • java.version = 1.8.0_472, sun.stdout.encoding = cp65001, stdout.encoding = null
  • java.version = 11.0.29, sun.stdout.encoding = UTF-8, stdout.encoding = null
  • java.version = 17.0.17, sun.stdout.encoding = UTF-8, stdout.encoding = null
  • java.version = 21.0.9, sun.stdout.encoding = null, stdout.encoding = UTF-8
  • java.version = 25.0.1, sun.stdout.encoding = null, stdout.encoding = UTF-8

test.cmd

call jbang run --java 8 test.kt
call jbang run --java 11 test.kt
call jbang run --java 17 test.kt
call jbang run --java 21 test.kt
call jbang run --java 25 test.kt

test.sh

jbang run --java 8 test.kt
jbang run --java 11 test.kt
jbang run --java 17 test.kt
jbang run --java 21 test.kt
jbang run --java 25 test.kt

test.kt

val javaVersion = System.getProperty("java.version") ?: "null"
val sun_stdout_encoding = System.getProperty("sun.stdout.encoding") ?: "null"
val stdout_encoding = System.getProperty("stdout.encoding") ?: "null"

fun main() {
    print("java.version = " + javaVersion)
    print(", sun.stdout.encoding = " + sun_stdout_encoding)
    println(", stdout.encoding = " + stdout_encoding)
}

@wfouche
Member Author

wfouche commented Dec 18, 2025

@jeff5 , testing has been completed. Please review once more.

Could we eventually release this as Jython 2.7.5?

@wfouche
Member Author

wfouche commented Mar 5, 2026

@Stewori, please review and, if possible, merge this PR next.

The way the two properties below are managed has changed in newer versions of Java:

  • sun.stdout.encoding
  • stdout.encoding

The old logic failed with Java 25, but with this PR it works again. This code has been extensively tested using different major versions of Java.

Comment thread src/org/python/core/PySystemState.java Outdated
// From Java 8 onwards, the answer may already be to hand in the registry:
String encoding = props.getProperty("sun.stdout.encoding");
String os = props.getProperty("os.name");
boolean isWindows = os != null && os.startsWith("Windows");
Member

Please move these two lines (String os = ... and boolean isWindows = ...) a bit further down, so they are created only if stdout.encoding is null.

Member Author

Fixed.

Comment thread src/org/python/core/PySystemState.java
Comment thread src/org/python/core/PySystemState.java Outdated
// Java 19+
String encoding = props.getProperty("stdout.encoding");
if (encoding != null) {
// Windows: cp65001 is automatically mapped to UTF-8, no additional processing is needed
Member

I verified this claim by googling, but struggle to find an authoritative source. I do not doubt this fact. However, if you have a link to an authoritative source for this claim, please add it as second line to this comment. E.g. something in the msdn docs or so.
This is not critical, just nice to have for reference.

Member

@Stewori Stewori Mar 5, 2026

(I mean that cp65001 is, or used to be, Windows' name for utf-8)

Member Author

@wfouche wfouche Mar 6, 2026

The low-level C code in JVM 19+ is the only documentation for this Windows-specific mapping.

https://github.com/naotoj/jdk/blob/a91fd9cab3a701a423ea01e3ed629ef279107e0b/src/java.base/windows/native/libjava/java_props_md.c

static char* getConsoleEncoding()
{
    char* buf = malloc(16);
    int cp;
    if (buf == NULL) {
        return NULL;
    }
    cp = GetConsoleCP();
    if (cp >= 874 && cp <= 950)
        sprintf(buf, "ms%d", cp);
    else if (cp == 65001)
        sprintf(buf, "UTF-8");
    else
        sprintf(buf, "cp%d", cp);
    return buf;
}
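For readers more at home in Java, the mapping in the C function above can be restated as follows. This is an illustrative transcription only: the class `ConsoleCodePage` and method `encodingName` are hypothetical names, and the code page number is passed in as a parameter rather than read from `GetConsoleCP()`.

```java
// Hypothetical Java restatement of the C logic quoted above; for illustration only.
class ConsoleCodePage {
    /** Maps a Windows console code page number to the name the JVM would report. */
    static String encodingName(int cp) {
        if (cp >= 874 && cp <= 950) {
            return "ms" + cp;   // e.g. ms932, ms949
        } else if (cp == 65001) {
            return "UTF-8";     // the only special case: 65001 is never reported as cp65001
        } else {
            return "cp" + cp;   // e.g. cp850, cp1252
        }
    }

    public static void main(String[] args) {
        for (int cp : new int[] {850, 932, 1252, 65001}) {
            System.out.println(cp + " -> " + encodingName(cp));
        }
    }
}
```

Under this logic, code page 65001 is reported as UTF-8, which matches the Windows results for Java 21 and 25 listed earlier.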

Member Author

Fixed.

@Stewori
Member

Stewori commented Mar 5, 2026

Re-reading #404 it seems that cp65001 might be more subtle than just aliasing it to utf-8. I'll have to do more background research to be able to make an educated decision on this.

In any case, please add a line to NEWS in bugs fixed 2.7.5, similar to

- [GH-404] Windows code page 65001 (UTF-8) not supported by Jython

@Stewori
Member

Stewori commented Mar 5, 2026

Okay, after a bit of research and AI-talk, what about using Charset.forName(...).name() for normalization instead of manually aliasing cp65001 to utf-8? Can you test this? It appears a bit safer and reflects Java's own aliasing mechanism. AI claims that already in Java 8 one would anyway get Charset.forName("cp65001").name() -> utf-8. Can you confirm this? If Java does this aliasing anyway, I think it's not so relevant whether there is a subtle difference on Windows. That said, issues are said to be more pre Windows 10, and I think we can nowadays well assume Windows 10+, since Windows 10 is already getting infamously phased out. If we normalize based on Charset.forName we also delegate responsibility to Java, which I think is an advantage.

@Stewori
Member

Stewori commented Mar 5, 2026

Checking more carefully, Charset.forName is probably not a good idea as it frequently replaces a leading "cp" with "IBM" or "windows-". So, forget that idea. All in all it seems to me that your solution of normalizing just cp65001 is probably the sweet spot.
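The renaming effect is easy to observe. The sketch below is a quick probe, not a proposal for the PR: `CharsetNames` and `canonicalName` are hypothetical names, and which aliases resolve at all depends on the JVM and its installed charset providers, so the helper returns null instead of throwing.

```java
import java.nio.charset.Charset;

// Hypothetical probe for illustration; not part of Jython.
class CharsetNames {
    /** Canonical Java name for an alias, or null if this JVM does not support it. */
    static String canonicalName(String alias) {
        try {
            return Charset.isSupported(alias) ? Charset.forName(alias).name() : null;
        } catch (IllegalArgumentException e) {
            // Raised for syntactically illegal charset names.
            return null;
        }
    }

    public static void main(String[] args) {
        // Results vary by JVM; null means the alias is not recognised here.
        for (String alias : new String[] {"cp1252", "cp850", "utf8", "cp65001"}) {
            System.out.println(alias + " -> " + canonicalName(alias));
        }
    }
}
```

On a standard JDK, cp1252 comes back as windows-1252, a name that is not a Python codec name as it stands, which is the drawback described above.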

@Stewori
Member

Stewori commented Mar 5, 2026

@jeff5 @tbpassin @ohumbel I would tend to merge this PR after the small change I requested is addressed and NEWS is handled. I'll let it rest for a couple of days to give you some time for veto. Let's say until next Wednesday at least.

@tbpassin

tbpassin commented Mar 5, 2026

I think the main drawback of using cp65001 is that some utf-8 characters can be misinterpreted or can't be displayed. By contrast, when using one of the common non-utf-8 code pages, a wrong character in the output may cause an error. This is for Windows 10+. One could still get errors pre-Windows 10 depending on the string, the Java version, etc.

There's no perfect solution when the shell itself isn't fully utf-8 compatible, especially the cmd.exe shell pre-Windows 10, which is limited by its long history.

@Stewori
Member

Stewori commented Mar 5, 2026

Okay, this is about representing cp65001 as utf-8, not about permitting or encouraging its use in the first place.
What problems may arise? I don't really know where and how the result of getConsoleEncoding is used precisely. Inside Java, cp65001 and utf-8 are aliased anyway, at least since Java 8. So no problem here. I think a benefit of the solution is that if this value should make it to user code, utf-8 is more commonly understood while cp65001 is really exotic. I believe that code which evaluates this is more likely to do the right thing given "utf-8" than "cp65001". According to AI, problems mainly arise if you use the value with the old Windows API, but that is often buggy for Unicode whatever the encoding is called. There is no way we can attempt to be bug-compatible with the old Windows API. It also claims that, for Windows 10 after some update, "utf-8" works quite well and is also recognized directly. So there would also be no harm if we rename cp65001 to utf-8. Finally, AI claims that other tools handle it similarly, including modern CPython. I know perfectly well that AI claims need not be true. It's a starting point.
All in all, I don't really see how representing cp65001 as utf-8 could break anything, except code that is bug-compatible with early Win 10 API or pre-10. That appears to be a rather hypothetical case. Much more likely is that the value is used within Python or Java, where it makes no difference or utf-8 is even the safer choice.
Please correct me if I overlook something.

@tbpassin

tbpassin commented Mar 5, 2026

According to Mr. Chatbot Claude, for whatever it is worth:

On older Windows versions (Vista through early Windows 8/8.1), CP65001 had a genuine bug: WriteFile/WriteConsoleA could return incorrect byte counts, which caused the C runtime (msvcrt) to misinterpret the result as an error and throw. This affected any process writing multi-byte UTF-8 sequences.
However, this was a C runtime / console host bug, not a Java bug per se. Java's System.out on Windows uses its own I/O path, not msvcrt directly.

If we say, as I think we agreed, that the next edition of Jython will only support Windows 10+, we ought to be all right.

@Stewori
Member

Stewori commented Mar 6, 2026

Thank you for cross-checking this. Looks like it aligns with my info, though more specific (at least the AIs agree here^^). I'm not really sure how official the Windows 10+ agreement is, but I think using Windows <10 has become so impracticable nowadays that we don't need to consider it much. Even then, distinguish "support" from "bug-compatibility". IMHO to support something does not mean to compensate for every single one of its bugs. So I wouldn't even frame this as "support for Win < 10 dropped". And then, the problematic API is not even used by Jython directly. This should be safe enough.
@wfouche When you adjust the few bits I requested, we're good to merge this.

@tbpassin

tbpassin commented Mar 6, 2026

I remember years ago, probably with Python 2+ rather than Jython, I used to monkey patch the code that selected the right encoder/decoder pair so that it would choose cp65001 when the encoding was utf-8. That simplified life quite a bit, encoding-wise. I had forgotten all about that until today. I think that recent versions of CPython 2.7 have changed how they work in that respect.

@wfouche
Member Author

wfouche commented Mar 6, 2026

@wfouche When you adjust the few bits I requested, we're good to merge this.

@Stewori , I've made the requested changes, and confirmed that Java 19+ detects "cp65001" and maps it to "UTF-8".

@ohumbel
Contributor

ohumbel commented Mar 9, 2026

The changes look good to me, technically.
I have to admit that I have almost no experience with windows code pages. But I trust the experts on this.

@wfouche
Member Author

wfouche commented Mar 11, 2026

@Stewori, these changes have been extensively tested on various versions of Java, and I'm confident they are correct.

Ready to be merged.

Member

@jeff5 jeff5 left a comment

Ok, nice (Java 8 & 25):

Jython-2>chcp 65001
Jython-2>dist\bin\jython -c"import sys; print sys.stdin.encoding, u'\U0001f40d caf\xe9'"
WARNING: A restricted method in java.lang.System has been called ... blah blah

utf-8 🐍 café

I think this should be simplified (and doesn't need to quote openjdk code) as suggested.

Now, it's nice on the old DOS command prompt cmd, but is not working for me in PowerShell because stdout.encoding always comes up with "cp850".

In PowerShell I get closer with Java 8 (where sun.stdout.encoding correctly reports cp65001), but the output doesn't seem to have got the message:

PS Jython-2> chcp 65001
Active code page: 65001
PS Jython-2> dist\bin\jython -c"import sys; print sys.stdin.encoding, u'\U0001f40d caf\xe9'"
utf-8 ­ƒÉì caf├®

It looks like cp850 again. I think that can't be our fault (and definitely not in scope for the PR) if it works on 25 and with 8 in cmd.
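The cp850 guess can be checked without a console. The sketch below encodes the test string as UTF-8 and deliberately decodes the bytes as cp850 (Java's canonical name is IBM850). `Mojibake` and `misdecode` are hypothetical names, and the code assumes the JVM ships the IBM850 charset, falling back to a message if it does not.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demonstration; not part of Jython.
class Mojibake {
    /** Encodes text as UTF-8, then decodes those bytes with a different charset. */
    static String misdecode(String text, Charset wrong) {
        return new String(text.getBytes(StandardCharsets.UTF_8), wrong);
    }

    public static void main(String[] args) {
        // Snake emoji (U+1F40D as a surrogate pair), a space, then "caf" and e-acute.
        String text = "\uD83D\uDC0D caf\u00E9";
        if (Charset.isSupported("IBM850")) {
            System.out.println(misdecode(text, Charset.forName("IBM850")));
        } else {
            System.out.println("IBM850 (cp850) is not available in this JVM");
        }
    }
}
```

The UTF-8 bytes F0 9F 90 8D and C3 A9 read as cp850 give a soft hyphen followed by ƒÉì, and ├®, which is the same garbling as in the PowerShell output above. That is consistent with correct UTF-8 bytes being written and then rendered as cp850 further downstream.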

// From Java 8 onwards, the answer may already be to hand in the registry:
String encoding = props.getProperty("sun.stdout.encoding");
String os = props.getProperty("os.name");
// Java 19+
Member

@jeff5 jeff5 Apr 10, 2026

There are really two opportunities being addressed in this PR:

  1. To prefer standard property stdout.encoding superseding the optional sun.stdout.encoding.
  2. To trust that cp65001 means UTF-8 now it has matured.

The first can be done rather simply. The second involves checking a string, but I don't think we need guard it with isWindows. If a non-windows OS claims cp65001 (in an emulator?) it probably means the same thing.

I am more sure that Python codec names are lowercase than I am that Java implementations will spell only UTF-8 in uppercase, so I think we can lowercase unconditionally.

We check cp65001 in two places, but it wouldn't take much change (and might be an improvement) to move all the returns to the end and do the check once there.

Suggested change
// Java 19+
// The answer may be in the registry:
String encoding = props.getProperty("stdout.encoding"); // Java 19+
if (encoding == null) {
    // Some Java versions define:
    encoding = props.getProperty("sun.stdout.encoding");
}
if (encoding != null) {
    encoding = encoding.toLowerCase();
    if (encoding.equals("cp65001")) {
        // There is no Python codec for cp65001 but Windows means UTF-8
        encoding = "utf-8";
    }
    return encoding;
}
// If the answer was not obtained from the registry, ask the OS.
String os = props.getProperty("os.name");
boolean isWindows = os != null && os.startsWith("Windows");
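The property-lookup part of this suggestion can be pulled out into a self-contained form that is easy to test. This is a sketch under assumptions, not the code in PySystemState: `ConsoleEncodingSketch` and `fromProperties` are hypothetical names, and `Locale.ROOT` is an addition here so that lowercasing does not depend on the default locale.

```java
import java.util.Locale;
import java.util.Properties;

// Hypothetical stand-alone version of the lookup; not the Jython implementation.
class ConsoleEncodingSketch {
    /** Returns a lowercase codec name from the properties, or null if neither is set. */
    static String fromProperties(Properties props) {
        // Prefer the standard property (Java 19+), then the older sun.* one.
        String encoding = props.getProperty("stdout.encoding");
        if (encoding == null) {
            encoding = props.getProperty("sun.stdout.encoding");
        }
        if (encoding == null) {
            return null; // a caller would go on to ask the OS
        }
        encoding = encoding.toLowerCase(Locale.ROOT);
        // There is no Python codec named cp65001; treat it as utf-8.
        return encoding.equals("cp65001") ? "utf-8" : encoding;
    }

    public static void main(String[] args) {
        // Prints whatever the running JVM reports, or null if neither property is defined.
        System.out.println(fromProperties(System.getProperties()));
    }
}
```

Because the cp65001 check happens once, after both properties have been consulted, it covers Java 8 on Windows (sun.stdout.encoding = cp65001) without an isWindows guard.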

Member

Don't commit from here. GitHub has misunderstood what I intended to replace, but you'll see what I mean.

@jeff5
Member

jeff5 commented Apr 10, 2026

Actually, I'm not getting consistent results from Java 25. I can see further up my console that this worked on Java 25, and now stdout.encoding is reporting cp850.

Even more weirdly, stdin.encoding is reporting UTF-8. From Eclipse debugger expressions on entry to org.python.core.PySystemState.getConsoleEncoding(Properties):

registry.getProperty("stdout.encoding")	cp850	
registry.getProperty("stdin.encoding")	UTF-8	
registry.getProperty("python.console.encoding")	null	
registry.getProperty("sun.stdout.encoding")	null	
registry.getProperty("sun.stdin.encoding")	null	

@jeff5
Member

jeff5 commented Apr 10, 2026

I have to admit that I have almost no experience with windows code pages. But I trust the experts on this.

You don't need a lot of experience to know you hate them.

Ok, in our present case, it seems you actually have to set the code page to 65001, not just check that it is 65001.

PS Jython-2> cmd
Microsoft Windows [Version 10.0.26200.8039]
(c) Microsoft Corporation. All rights reserved.

C:\Users\Jeff\Documents\Eclipse-Q\Jython-2>chcp
Active code page: 65001

C:\Users\Jeff\Documents\Eclipse-Q\Jython-2>dist\bin\jython -c"print u'\U0001f40d caf\xe9'"
WARNING: ...
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\Jeff\Documents\Eclipse-Q\Jython-2\dist\Lib\encodings\cp850.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\U0001f40d' in position 0: character maps to <undefined>

C:\Users\Jeff\Documents\Eclipse-Q\Jython-2>chcp 65001
Active code page: 65001

C:\Users\Jeff\Documents\Eclipse-Q\Jython-2>dist\bin\jython -c"print u'\U0001f40d caf\xe9'"
WARNING...

🐍 café

Windows bug?

@Stewori
Member

Stewori commented Apr 10, 2026

doesn't need to quote openjdk code

That was not actually my intention when I asked for a reference. What I actually meant was an authoritative reference (URL or some citation coordinates) to a source (e.g. in the MSDN docs) that really states that cp65001 means UTF-8, or at least is supposed to. From what I found online, this seems to be more "folklore", and an actual written official statement is hard to find. So if there is one, it would be valuable to preserve that reference in a comment for the future.

@yogi1967

See
https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-ucoderef/28fefe92-d66c-4b03-90a9-97b473223d43

Last line of table.

@tbpassin

As I have posted earlier, CP65001 is not full utf-8. It's a design limitation left over from the old, old design. There are not uncommon edge cases that fail. UTF-16 surrogate pairs are especially likely to cause trouble. Yes, there are bugs, and they apparently vary from one version of Windows to another.

Since Windows uses utf-16 internally, any code point above U+FFFF must be encoded as a surrogate pair.
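Java strings are UTF-16 too, so the surrogate-pair point can be shown with the snake character used in the tests earlier in this thread. This is a small illustration only; `SurrogateDemo` and `fromCodePoint` are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

// Illustration only; class and method names are hypothetical.
class SurrogateDemo {
    /** Builds a String holding a single Unicode code point. */
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String snake = fromCodePoint(0x1F40D);
        System.out.println("UTF-16 units: " + snake.length());
        System.out.println("code points:  " + snake.codePointCount(0, snake.length()));
        System.out.printf("surrogates:   %04X %04X%n",
                (int) snake.charAt(0), (int) snake.charAt(1));
        System.out.println("UTF-8 bytes:  " + snake.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

U+1F40D occupies two UTF-16 units (D83D DC0D) but is a single code point, and it becomes four bytes in UTF-8.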

@jeff5
Member

jeff5 commented Apr 10, 2026

Also https://learn.microsoft.com/en-us/windows/win32/intl/code-page-identifiers . When I looked a few years ago there seemed to be questions about how complete or reliable the implementation was, but I understand it as always intended to be UTF-8.

I found the reference into OpenJDK useful, but I think it belongs in a discussion like this. It is probably better to link a stable repo/branch. https://github.com/openjdk/jdk25u/blob/master/src/java.base/windows/native/libjava/java_props_md.c#L131 (to the discussion, I mean).

@jeff5
Member

jeff5 commented Apr 11, 2026

As I have posted earlier, CP65001 is not full utf-8. ... Yes, there are bugs, and they apparently vary from one version of Windows to another.

I wonder if my poor experience with PoSh is a hang-over from that? I have followed this through in the debugger now to the point where UTF-8 bytes disappear into a channel (which I think is our console ... not sure) and they still come out as utf-8 ­-ƒÉì caf├® on PoSh and 🐍 café on cmd.

Since Windows uses utf-16 internally, any code point above U+FFFF must be encoded as a surrogate pair.

I don't think this affects us at the Python level, or my \U0001f40d would not have printed as a snake. You might say the same about Java, which is our problem, but while our support for Unicode still has its problems [1], it passes a lot of tests it wouldn't if we ourselves hadn't largely dealt with Java's UTF-16.


Footnotes

  1. Problems exist internally to Jython where a method takes a String, and was written on the assumption it would represent bytes, but might now receive a full-range Java String that represents a Python 2 unicode.

@jeff5 jeff5 added this to the 2.7.5 milestone Apr 11, 2026
@wfouche
Member Author

wfouche commented Apr 12, 2026

@jeff5 , thanks for the review and feedback. I'll work on the PR updates in the coming days.

@jeff5
Member

jeff5 commented Apr 18, 2026

I think this also fixes #384 which I noticed, but forgot.

Development

Successfully merging this pull request may close these issues.

Windows code page 65001 (UTF-8) not supported by Jython

6 participants