Use Unicode properties for alnum, alpha, etc. by joshuamegnauth54 · Pull Request #7626 · RustPython/RustPython

joshuamegnauth54 · 2026-04-18T19:53:59Z

Rust and Python differ in which properties they use for alphanumeric, numeric, et cetera. Both languages list which properties are used which makes it easy to mimic Python's behavior in Rust.

My previous patch was a bit shortsighted because I filtered out combining characters from is_alphanumeric. Using properties is exact and also much cleaner. It also covers edge cases that my initial approach missed.

Besides isalnum, I also fixed isnumeric and isdigit in the same way by using properties.

This contributes a bit to #7527. Unicode issues for isalnum etc. should be solved now (and permanently too!!). Title casing looks harder to fix, but the basic solution of using properties more should apply there too. Also, the RegEx engine can benefit from using properties more too, but I'm terrible at RegEx so I didn't really try to fix issues there. 😆

For extra reading, here is Python's docs which lists the categories used. Look at isalpha(), for example.

Summary by CodeRabbit

Bug Fixes
- Improved Unicode classification for string methods (isalnum, isalpha, isdigit, isnumeric) for more accurate behavior across international characters and special Unicode categories.
Tests
- Added/expanded tests covering edge cases: superscript digits, fractions, Roman numerals, and additional non-Latin characters to validate classification behavior.

coderabbitai · 2026-04-18T19:54:14Z

📝 Walkthrough

Walkthrough

This PR replaces Rust/Python built-in Unicode classification in several locations with ICU-based checks (GeneralCategory / GeneralCategoryGroup / NumericType) and updates tests to cover additional Unicode edge cases. Changes touch the SRE engine, VM string builtins, and related tests.

Changes

Cohort / File(s)	Summary
SRE engine classification `crates/sre_engine/src/string.rs`	Replaced `char::is_alphanumeric()` + Canonical Combining Class filter with ICU `GeneralCategory` / `GeneralCategoryGroup` membership checks; updated imports.
VM str builtins `crates/vm/src/builtins/str.rs`	Rewrote `isalnum`, `isnumeric`, `isdigit`, and `isalpha` to use ICU `GeneralCategoryGroup` and `NumericType` checks; removed `CanonicalCombiningClass` usage and adjusted imports.
Tests — Unicode edge cases `extra_tests/snippets/builtin_str.py`	Added assertions for `str.isalpha()`/`str.isalnum()` and `str.isdigit()`/`str.isnumeric()` covering characters like `\u093f`, `⁰`, `½`, `Ⅻ`, and `\u3405`.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

fix: Python-Rust combining char diff in isalnum #7612: Modifies isalnum and related string classification logic; addresses similar areas but with different ICU property usage.
Replace unmaintained unic crates #7555: Converts string classification to use icu_properties and overlaps with the ICU-based refactor in crates/vm/src/builtins/str.rs.

Suggested reviewers

ShaharNaveh
youknowone

Poem

🐰
Hopping through Unicode, nimble and spry,
I swap old helpers for ICU's eye.
Letters and numbers now neatly aligned,
Tests stretched wider — no case left behind.
✨

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'Use Unicode properties for alnum, alpha, etc.' directly and clearly summarizes the main change: switching character classification methods to use Unicode properties instead of Rust/Python built-in helpers.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (3)

crates/vm/src/builtins/str.rs (2)

959-979: Prefer matches! over array .contains(&..) for small enum sets.

[NumericType::Decimal, NumericType::Digit, NumericType::Numeric].contains(&NumericType::for_char(c)) constructs a stack array and does a linear search on every character. matches! is more idiomatic, reads better, and the compiler lowers it to a single jump table / bitwise test. Same applies to isdigit on Line 977.

Also: the lingering comment on Line 974 ("python's isdigit also checks if exponents are digits, these are the unicode codepoints for exponents") referred to the old hardcoded superscript list; with NumericType::Digit handling those uniformly, the tail of that sentence is stale and could be trimmed.

♻️ Proposed cleanup

     #[pymethod]
     fn isnumeric(&self) -> bool {
         !self.data.is_empty()
             && self.char_all(|c| {
-                [
-                    NumericType::Decimal,
-                    NumericType::Digit,
-                    NumericType::Numeric,
-                ]
-                .contains(&NumericType::for_char(c))
+                matches!(
+                    NumericType::for_char(c),
+                    NumericType::Decimal | NumericType::Digit | NumericType::Numeric
+                )
             })
     }

     #[pymethod]
     fn isdigit(&self) -> bool {
-        // python's isdigit also checks if exponents are digits, these are the unicode codepoints for exponents
+        // Matches CPython: Numeric_Type=Digit or Numeric_Type=Decimal (covers superscripts, etc.).
         !self.data.is_empty()
             && self.char_all(|c| {
-                [NumericType::Digit, NumericType::Decimal].contains(&NumericType::for_char(c))
+                matches!(
+                    NumericType::for_char(c),
+                    NumericType::Digit | NumericType::Decimal
+                )
             })
     }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@crates/vm/src/builtins/str.rs` around lines 959 - 979, Replace the
small-array-based enum membership checks in the isnumeric and isdigit methods
with matches! pattern matching to avoid allocating/linear-search semantics: in
isnumeric (fn isnumeric) change the closure that currently uses
[NumericType::Decimal, NumericType::Digit,
NumericType::Numeric].contains(&NumericType::for_char(c)) to a matches! on
NumericType::for_char(c) matching Decimal | Digit | Numeric; similarly in
isdigit (fn isdigit) replace the [NumericType::Digit,
NumericType::Decimal].contains(&NumericType::for_char(c)) check with
matches!(NumericType::for_char(c), Digit | Decimal). Also trim the stale portion
of the comment inside isdigit that mentions explicit superscript codepoints so
the comment only documents that isdigit checks digit/exponent categories if
kept.

948-957: Consider consolidating isalnum to use GeneralCategoryGroup::Number for consistency.

GeneralCategoryGroup::Letter ∪ DecimalNumber + c.is_numeric() correctly covers L*, Nd, Nl, and No (matching CPython's requirements). However, this method mixes ICU and Rust stdlib classification, while sibling methods (isalpha, isdecimal, isnumeric) use ICU exclusively. GeneralCategoryGroup::Number (available in icu_properties v2) expands to exactly Nd | Nl | No, making the refactor below functionally equivalent while consolidating to a single source of truth:

♻️ Optional simplification

 fn isalnum(&self) -> bool {
     !self.data.is_empty()
         && self.char_all(|c| {
             GeneralCategoryGroup::Letter
-                .union(GeneralCategoryGroup::DecimalNumber)
+                .union(GeneralCategoryGroup::Number)
                 .contains(GeneralCategory::for_char(c))
-                || c.is_numeric()
         })
 }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@crates/vm/src/builtins/str.rs` around lines 948 - 957, The isalnum
implementation in the isalnum method should use ICU-only classification for
consistency: replace the current check that unions GeneralCategoryGroup::Letter
and GeneralCategoryGroup::DecimalNumber and also calls c.is_numeric() with a
single GeneralCategoryGroup::Number.contains(GeneralCategory::for_char(c))
combined with the existing Letter check (i.e., Letter ∪ Number) so char_all only
uses ICU properties; update the predicate inside str::isalnum (and keep the
!self.data.is_empty() guard) to use GeneralCategoryGroup::Number and remove the
c.is_numeric() call.

crates/sre_engine/src/string.rs (1)

444-454: Consider unifying numeric classification to ICU exclusively.

The || c.is_numeric() clause uses Rust's stdlib (which checks General_Category Nd|Nl|No), while the rest of the logic relies on ICU. Both approaches are functionally equivalent, but mixing classification sources makes invariants harder to reason about. For consistency with the ICU-based approach used elsewhere, replace DecimalNumber with Number in the union and remove the is_numeric() fallback:

Optional refactoring

-        .map(|c| {
-            GeneralCategoryGroup::Letter
-                .union(GeneralCategoryGroup::DecimalNumber)
-                .contains(GeneralCategory::for_char(c))
-                || c.is_numeric()
-        })
+        .map(|c| {
+            GeneralCategoryGroup::Letter
+                .union(GeneralCategoryGroup::Number)
+                .contains(GeneralCategory::for_char(c))
+        })

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@crates/sre_engine/src/string.rs` around lines 444 - 454, The is_uni_alnum
function mixes stdlib numeric classification (c.is_numeric()) with ICU-based
checks; change the ICU union to use GeneralCategoryGroup::Number instead of
DecimalNumber and remove the c.is_numeric() fallback so all numeric
classification comes from ICU (update the union expression in is_uni_alnum and
eliminate the || c.is_numeric() branch).

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@crates/sre_engine/src/string.rs`:
- Around line 444-454: The is_uni_alnum function mixes stdlib numeric
classification (c.is_numeric()) with ICU-based checks; change the ICU union to
use GeneralCategoryGroup::Number instead of DecimalNumber and remove the
c.is_numeric() fallback so all numeric classification comes from ICU (update the
union expression in is_uni_alnum and eliminate the || c.is_numeric() branch).

In `@crates/vm/src/builtins/str.rs`:
- Around line 959-979: Replace the small-array-based enum membership checks in
the isnumeric and isdigit methods with matches! pattern matching to avoid
allocating/linear-search semantics: in isnumeric (fn isnumeric) change the
closure that currently uses [NumericType::Decimal, NumericType::Digit,
NumericType::Numeric].contains(&NumericType::for_char(c)) to a matches! on
NumericType::for_char(c) matching Decimal | Digit | Numeric; similarly in
isdigit (fn isdigit) replace the [NumericType::Digit,
NumericType::Decimal].contains(&NumericType::for_char(c)) check with
matches!(NumericType::for_char(c), Digit | Decimal). Also trim the stale portion
of the comment inside isdigit that mentions explicit superscript codepoints so
the comment only documents that isdigit checks digit/exponent categories if
kept.
- Around line 948-957: The isalnum implementation in the isalnum method should
use ICU-only classification for consistency: replace the current check that
unions GeneralCategoryGroup::Letter and GeneralCategoryGroup::DecimalNumber and
also calls c.is_numeric() with a single
GeneralCategoryGroup::Number.contains(GeneralCategory::for_char(c)) combined
with the existing Letter check (i.e., Letter ∪ Number) so char_all only uses ICU
properties; update the predicate inside str::isalnum (and keep the
!self.data.is_empty() guard) to use GeneralCategoryGroup::Number and remove the
c.is_numeric() call.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: bed164de-3685-4fff-aa41-dfbe24da72da

📥 Commits

Reviewing files that changed from the base of the PR and between b80c2bd and 3a5438e.

📒 Files selected for processing (3)

crates/sre_engine/src/string.rs
crates/vm/src/builtins/str.rs
extra_tests/snippets/builtin_str.py

ShaharNaveh · 2026-04-18T20:51:38Z

UNEXPECTED SUCCESS: test_isdigit (test.test_str.StrTest.test_isdigit)

🥳

Rust and Python differ in which properties they use for alphanumeric, numeric, et cetera. Both languages list which properties are used which makes it easy to mimic Python's behavior in Rust. My previous patch was a bit shortsighted because I filtered out combining characters from is_alphanumeric. Using properties is exact and also much cleaner. It also covers edge cases that my initial approach missed. Besides isalnum, I also fixed isnumeric and isdigit in the same way by using properties.

joshuamegnauth54 · 2026-04-19T00:33:53Z

I force pushed to handle one of CodeRabbit's suggestions and enable the passing test. CodeRabbit's first suggestion (the one that says to use matches!) is incorrect because NumericType is not an enum, but the second suggestion cleans up the code and still works! 😁

I was very confused why that test kept "failing". 😆 It's awesome that it passes now.

coderabbitai

🧹 Nitpick comments (1)

crates/vm/src/builtins/str.rs (1)

949-977: Consider centralizing the character-class predicates.

The mappings look correct, but isalnum now encodes its own union while isnumeric, isdigit, and isalpha encode related rules separately. Small helpers would keep these Python semantics in lockstep and avoid repeated per-character set construction.

♻️ Proposed helper extraction

+    #[inline]
+    fn char_isalpha(c: char) -> bool {
+        GeneralCategoryGroup::Letter.contains(GeneralCategory::for_char(c))
+    }
+
+    #[inline]
+    fn char_isnumeric(c: char) -> bool {
+        matches!(
+            NumericType::for_char(c),
+            NumericType::Decimal | NumericType::Digit | NumericType::Numeric
+        )
+    }
+
+    #[inline]
+    fn char_isdigit(c: char) -> bool {
+        matches!(
+            NumericType::for_char(c),
+            NumericType::Decimal | NumericType::Digit
+        )
+    }
+
     #[pymethod]
     fn isalnum(&self) -> bool {
-        !self.data.is_empty()
-            && self.char_all(|c| {
-                GeneralCategoryGroup::Letter
-                    .union(GeneralCategoryGroup::Number)
-                    .contains(GeneralCategory::for_char(c))
-            })
+        !self.data.is_empty() && self.char_all(|c| Self::char_isalpha(c) || Self::char_isnumeric(c))
     }
 
     #[pymethod]
     fn isnumeric(&self) -> bool {
-        !self.data.is_empty()
-            && self.char_all(|c| {
-                [
-                    NumericType::Decimal,
-                    NumericType::Digit,
-                    NumericType::Numeric,
-                ]
-                .contains(&NumericType::for_char(c))
-            })
+        !self.data.is_empty() && self.char_all(Self::char_isnumeric)
     }
 
     #[pymethod]
     fn isdigit(&self) -> bool {
-        !self.data.is_empty()
-            && self.char_all(|c| {
-                [NumericType::Digit, NumericType::Decimal].contains(&NumericType::for_char(c))
-            })
+        !self.data.is_empty() && self.char_all(Self::char_isdigit)
     }
...
     #[pymethod]
     fn isalpha(&self) -> bool {
-        !self.data.is_empty()
-            && self
-                .char_all(|c| GeneralCategoryGroup::Letter.contains(GeneralCategory::for_char(c)))
+        !self.data.is_empty() && self.char_all(Self::char_isalpha)
     }

Also applies to: 1072-1077

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@crates/vm/src/builtins/str.rs` around lines 949 - 977, Centralize the
per-character predicates by extracting reusable helper functions (e.g.,
char_is_alnum, char_is_numeric, char_is_digit, char_is_alpha) and use those
inside the existing methods (isalnum, isnumeric, isdigit, isalpha) instead of
building arrays/unions inside each char_all closure; reference the existing
char_all method, NumericType::for_char, and
GeneralCategoryGroup/GeneralCategory::for_char when implementing the helpers so
the logic is shared and avoids repeated per-call set/array construction.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@crates/vm/src/builtins/str.rs`:
- Around line 949-977: Centralize the per-character predicates by extracting
reusable helper functions (e.g., char_is_alnum, char_is_numeric, char_is_digit,
char_is_alpha) and use those inside the existing methods (isalnum, isnumeric,
isdigit, isalpha) instead of building arrays/unions inside each char_all
closure; reference the existing char_all method, NumericType::for_char, and
GeneralCategoryGroup/GeneralCategory::for_char when implementing the helpers so
the logic is shared and avoids repeated per-call set/array construction.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: f2da8a0f-e31c-422a-a93a-d692db12dc36

📥 Commits

Reviewing files that changed from the base of the PR and between 3a5438e and c7b254b.

⛔ Files ignored due to path filters (1)

Lib/test/test_str.py is excluded by !Lib/**

📒 Files selected for processing (3)

crates/sre_engine/src/string.rs
crates/vm/src/builtins/str.rs
extra_tests/snippets/builtin_str.py

🚧 Files skipped from review as they are similar to previous changes (2)

extra_tests/snippets/builtin_str.py
crates/sre_engine/src/string.rs

github-actions · 2026-04-19T00:40:13Z

📦 Library Dependencies

The following Lib/ modules were modified. Here are their dependencies:

(module 'base64 test_cmath test_compile test_fractions test_grammar test_patma test_peepholer test_pep646_syntax test_str test_sys_settrace test_tstring test_typing' not found)

Legend:

[+] path exists in CPython
[x] up-to-date, [ ] outdated

coderabbitai bot reviewed Apr 18, 2026

View reviewed changes

joshuamegnauth54 force-pushed the use-icu-for-everything branch from 3a5438e to c7b254b Compare April 19, 2026 00:32

coderabbitai bot reviewed Apr 19, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use Unicode properties for alnum, alpha, etc.#7626

Use Unicode properties for alnum, alpha, etc.#7626
joshuamegnauth54 wants to merge 1 commit intoRustPython:mainfrom
joshuamegnauth54:use-icu-for-everything

joshuamegnauth54 commented Apr 18, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Apr 18, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

Uh oh!

coderabbitai bot left a comment

Uh oh!

ShaharNaveh commented Apr 18, 2026

Uh oh!

joshuamegnauth54 commented Apr 19, 2026

Uh oh!

coderabbitai bot left a comment

Uh oh!

github-actions bot commented Apr 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

joshuamegnauth54 commented Apr 18, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Apr 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

ShaharNaveh commented Apr 18, 2026

Uh oh!

joshuamegnauth54 commented Apr 19, 2026

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented Apr 19, 2026

📦 Library Dependencies

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

joshuamegnauth54 commented Apr 18, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Apr 18, 2026 •

edited

Loading