unicode_id_start/
lib.rs

//! [![github]](https://github.com/Boshen/unicode-id-start) [![crates-io]](https://crates.io/crates/unicode-id-start) [![docs-rs]](https://docs.rs/unicode-id-start)
//!
//! [github]: https://img.shields.io/badge/github-8da0cb?style=for-the-badge&labelColor=555555&logo=github
//! [crates-io]: https://img.shields.io/badge/crates.io-fc8d62?style=for-the-badge&labelColor=555555&logo=rust
//! [docs-rs]: https://img.shields.io/badge/docs.rs-66c2a5?style=for-the-badge&labelColor=555555&logo=docs.rs
//!
//! <br>
//!
//! Implementation of [Unicode Standard Annex #31][tr31] for determining which
//! `char` values are valid in programming language identifiers.
//!
//! [tr31]: https://www.unicode.org/reports/tr31/
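//!
//! For example, checking a few codepoints with the crate's `is_id_start` and
//! `is_id_continue` functions:
//!
//! ```rust
//! // Digits may continue an identifier but may not start one; `µ` (U+00B5)
//! // has the ID_Start property.
//! assert!(unicode_id_start::is_id_start('µ'));
//! assert!(!unicode_id_start::is_id_start('1'));
//! assert!(unicode_id_start::is_id_continue('1'));
//! ```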
//!
//! This crate is a better optimized implementation of the older `unicode-id`
//! crate. It uses less static storage and classifies both ASCII and non-ASCII
//! codepoints 2&ndash;10&times; faster than `unicode-id`.
//!
//! <br>
//!
//! ## Comparison of performance
//!
//! The following table shows a comparison between five Unicode identifier
//! implementations.
//!
//! - `unicode-id-start` is this crate, which is a fork of [`unicode-ident`];
//! - [`unicode-xid`] is a widely used crate run by the "unicode-rs" org; [`unicode-id`] is a fork of `unicode-xid`;
//! - `ucd-trie` and `fst` are two data structures supported by the
//!   [`ucd-generate`] tool;
//! - [`roaring`] is a Rust implementation of Roaring bitmap.
//!
//! The *static storage* column shows the total size of `static` tables that the
//! crate bakes into your binary, measured in 1000s of bytes.
//!
//! The remaining columns show the **cost per call** to evaluate whether a
//! single `char` has the ID\_Start or ID\_Continue Unicode property,
//! comparing across different ratios of ASCII to non-ASCII codepoints in the
//! input data.
//!
//! [`unicode-ident`]: https://github.com/dtolnay/unicode-ident
//! [`unicode-xid`]: https://github.com/unicode-rs/unicode-xid
//! [`unicode-id`]: https://github.com/Boshen/unicode-id
//! [`ucd-generate`]: https://github.com/BurntSushi/ucd-generate
//! [`roaring`]: https://github.com/RoaringBitmap/roaring-rs
//!
//! | | static storage | 0% nonascii | 1% | 10% | 100% nonascii |
//! |---|---|---|---|---|---|
//! | **`unicode-id-start`** | 9.68 K | 0.96 ns | 0.95 ns | 1.09 ns | 1.55 ns |
//! | **`unicode-id`** | 11.23 K | 1.88 ns | 2.14 ns | 3.48 ns | 15.63 ns |
//! | **`ucd-trie`** | 9.93 K | 1.29 ns | 1.28 ns | 1.36 ns | 2.15 ns |
//! | **`fst`** | 131 K | 55.1 ns | 54.9 ns | 53.2 ns | 28.5 ns |
//! | **`roaring`** | 66.1 K | 2.78 ns | 3.09 ns | 3.37 ns | 4.70 ns |
//!
//! Source code for the benchmark is provided in the *bench* directory of this
//! repo and may be repeated by running `cargo criterion`.
//!
//! <br>
//!
//! ## Comparison of data structures
//!
//! #### unicode-id
//!
//! The `unicode-id` crate uses a sorted array of character ranges, and does a
//! binary search to look up whether a given character lands inside one of
//! those ranges.
//!
//! ```rust
//! # const _: &str = stringify! {
//! static ID_Continue_table: [(char, char); 763] = [
//!     ('\u{30}', '\u{39}'),  // 0-9
//!     ('\u{41}', '\u{5a}'),  // A-Z
//! # "
//!     …
//! # "
//!     ('\u{e0100}', '\u{e01ef}'),
//! ];
//! # };
//! ```
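//!
//! A lookup is then a binary search over those ranges. The following is only an
//! illustrative sketch of that approach, not `unicode-id`'s actual code:
//!
//! ```rust
//! use core::cmp::Ordering;
//!
//! // Returns true if `ch` falls inside any of the sorted, disjoint ranges.
//! fn in_ranges(ranges: &[(char, char)], ch: char) -> bool {
//!     ranges
//!         .binary_search_by(|&(lo, hi)| {
//!             if ch < lo {
//!                 Ordering::Greater // range lies above `ch`
//!             } else if ch > hi {
//!                 Ordering::Less // range lies below `ch`
//!             } else {
//!                 Ordering::Equal // `ch` is inside this range
//!             }
//!         })
//!         .is_ok()
//! }
//!
//! let ranges = [('\u{30}', '\u{39}'), ('\u{41}', '\u{5a}'), ('\u{61}', '\u{7a}')];
//! assert!(in_ranges(&ranges, 'x'));
//! assert!(!in_ranges(&ranges, ' '));
//! ```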
//!
//! The static storage used by this data structure scales with the number of
//! contiguous ranges of identifier codepoints in Unicode. Every table entry
//! consumes 8 bytes, because it consists of a pair of 32-bit `char` values.
//!
//! Where identifier codepoints occur in long contiguous runs, this is quite a
//! compact representation &ndash; a single table entry can cover tens of
//! thousands of adjacent codepoints. In other places, the representation is
//! quite inefficient. A character like µ (U+00B5) which is surrounded by
//! non-identifier codepoints consumes 64 bits in the table, while it would be
//! just 1 bit in a dense bitmap.
//!
//! On a system with 64-byte cache lines, binary searching the table touches 7
//! cache lines on average. Each cache line fits only 8 table entries.
//! Additionally, the branching performed during the binary search is probably
//! mostly unpredictable to the branch predictor.
//!
//! Overall, the crate ends up being about 10&times; slower on non-ASCII input
//! compared to the fastest crate.
//!
//! A potential improvement would be to pack the table entries more compactly.
//! Rust's `char` type is a 21-bit integer padded to 32 bits, which means every
//! table entry is holding 22 bits of wasted space, adding up to 3.9 K. They
//! could instead fit every table entry into 6 bytes, leaving out some of the
//! padding, for a 25% improvement in space used. With some cleverness it may be
//! possible to fit in 5 bytes or even 4 bytes by storing a low char and an
//! extent, instead of low char and high char. I don't expect that performance
//! would improve much but this could be the most efficient for space across all
//! the libraries, needing only about 7 K to store.
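//!
//! As a rough sketch of the 6-byte idea (hypothetical, not part of any of the
//! crates compared here), each range could store its two 21-bit codepoints in
//! three bytes apiece:
//!
//! ```rust
//! // Pack a (low, high) range into 6 bytes instead of the 8 bytes taken by a
//! // pair of 32-bit `char` values. Purely illustrative.
//! fn pack(lo: char, hi: char) -> [u8; 6] {
//!     let (lo, hi) = (lo as u32, hi as u32);
//!     [
//!         lo as u8, (lo >> 8) as u8, (lo >> 16) as u8,
//!         hi as u8, (hi >> 8) as u8, (hi >> 16) as u8,
//!     ]
//! }
//!
//! assert_eq!(pack('\u{30}', '\u{39}'), [0x30, 0, 0, 0x39, 0, 0]);
//! ```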
//!
//! #### ucd-trie
//!
//! Their data structure is a compressed trie set specifically tailored for
//! Unicode codepoints. The design is credited to Raph Levien in
//! [rust-lang/rust#33098].
//!
//! [rust-lang/rust#33098]: https://github.com/rust-lang/rust/pull/33098
//!
//! ```rust
//! pub struct TrieSet {
//!     tree1_level1: &'static [u64; 32],
//!     tree2_level1: &'static [u8; 992],
//!     tree2_level2: &'static [u64],
//!     tree3_level1: &'static [u8; 256],
//!     tree3_level2: &'static [u8],
//!     tree3_level3: &'static [u64],
//! }
//! ```
//!
//! It represents codepoint sets using a trie to achieve prefix compression. The
//! final states of the trie are embedded in leaves or "chunks", where each
//! chunk is a 64-bit integer. Each bit position of the integer corresponds to
//! whether a particular codepoint is in the set or not. These chunks are not
//! just a compact representation of the final states of the trie, but are also
//! a form of suffix compression. In particular, if multiple ranges of 64
//! contiguous codepoints have the same Unicode properties, then they all map to
//! the same chunk in the final level of the trie.
//!
//! Being tailored for Unicode codepoints, this trie is partitioned into three
//! disjoint sets: tree1, tree2, tree3. The first set corresponds to codepoints
//! \[0, 0x800), the second \[0x800, 0x10000) and the third \[0x10000,
//! 0x110000). These partitions conveniently correspond to the space of 1 or 2
//! byte UTF-8 encoded codepoints, 3 byte UTF-8 encoded codepoints and 4 byte
//! UTF-8 encoded codepoints, respectively.
//!
//! Lookups in this data structure are significantly more efficient than binary
//! search. A lookup touches either 1, 2, or 3 cache lines based on which of the
//! trie partitions is being accessed.
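//!
//! A sketch of that layered lookup, using the field layout shown above (the
//! index arithmetic here is an assumption for illustration; the real
//! `ucd-trie` code differs in its details):
//!
//! ```rust
//! # struct TrieSet {
//! #     tree1_level1: &'static [u64; 32],
//! #     tree2_level1: &'static [u8; 992],
//! #     tree2_level2: &'static [u64],
//! #     tree3_level1: &'static [u8; 256],
//! #     tree3_level2: &'static [u8],
//! #     tree3_level3: &'static [u64],
//! # }
//! fn contains(set: &TrieSet, cp: u32) -> bool {
//!     // Each leaf chunk is a 64-bit bitmap covering 64 contiguous codepoints.
//!     let bit = |chunk: u64| chunk >> (cp % 64) & 1 != 0;
//!     if cp < 0x800 {
//!         // tree1: one level, leaf chunks stored directly.
//!         bit(set.tree1_level1[cp as usize / 64])
//!     } else if cp < 0x10000 {
//!         // tree2: a byte index selects the leaf chunk.
//!         let leaf = set.tree2_level1[(cp as usize - 0x800) / 64];
//!         bit(set.tree2_level2[leaf as usize])
//!     } else {
//!         // tree3: two index levels before reaching the leaf chunk.
//!         let child = set.tree3_level1[(cp as usize - 0x10000) / 4096];
//!         let leaf = set.tree3_level2[child as usize * 64 + (cp as usize / 64) % 64];
//!         bit(set.tree3_level3[leaf as usize])
//!     }
//! }
//! ```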
//!
//! One possible performance improvement would be for this crate to expose a way
//! to query based on a UTF-8 encoded string, returning the Unicode property
//! corresponding to the first character in the string. Without such an API, the
//! caller is required to tokenize their UTF-8 encoded input data into `char`,
//! hand the `char` into `ucd-trie`, only for `ucd-trie` to undo that work by
//! converting back into the variable-length representation for trie traversal.
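//!
//! Such an API might have a shape like the following (hypothetical; `ucd-trie`
//! does not provide it, and for brevity this sketch decodes to `char` anyway,
//! whereas a real implementation would traverse the trie directly from the
//! UTF-8 bytes):
//!
//! ```rust
//! // Report the property of the first codepoint of `input`, along with how
//! // many bytes that codepoint occupied.
//! fn is_id_start_str(input: &str) -> Option<(bool, usize)> {
//!     let ch = input.chars().next()?;
//!     Some((unicode_id_start::is_id_start(ch), ch.len_utf8()))
//! }
//!
//! assert_eq!(is_id_start_str("µx"), Some((true, 2)));
//! assert_eq!(is_id_start_str(""), None);
//! ```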
//!
//! #### fst
//!
//! The `fst` crate uses a [finite state transducer][fst]. This representation
//! is built into [ucd-generate] but I am not aware of any advantage over the
//! `ucd-trie` representation. In particular `ucd-trie` is optimized for storing
//! Unicode properties while `fst` is not.
//!
//! [fst]: https://github.com/BurntSushi/fst
//! [ucd-generate]: https://github.com/BurntSushi/ucd-generate
//!
//! As far as I can tell, the main thing that causes `fst` to have large size
//! and slow lookups for this use case relative to `ucd-trie` is that it does
//! not specialize for the fact that only 21 of the 32 bits in a `char` are
//! meaningful. There are some dense arrays in the structure with large ranges
//! that could never possibly be used.
//!
//! #### roaring
//!
//! This crate is a pure-Rust implementation of [Roaring Bitmap], a data
//! structure designed for storing sets of 32-bit unsigned integers.
//!
//! [Roaring Bitmap]: https://roaringbitmap.org/about/
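//!
//! For this use case, a property is just a set of codepoint values. A minimal
//! sketch of representing it that way (marked `ignore` because `roaring` is
//! not a dependency of this crate):
//!
//! ```ignore
//! use roaring::RoaringBitmap;
//!
//! // Build a bitmap containing every codepoint that has the property; the
//! // ID_Start / ID_Continue checks then become membership queries.
//! let mut id_start = RoaringBitmap::new();
//! id_start.insert('a' as u32);
//! id_start.insert('µ' as u32);
//!
//! assert!(id_start.contains('µ' as u32));
//! assert!(!id_start.contains('1' as u32));
//! ```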
//!
//! Roaring bitmaps are compressed bitmaps which tend to outperform conventional
//! compressed bitmaps such as WAH, EWAH or Concise. In some instances, they can
//! be hundreds of times faster and they often offer significantly better
//! compression.
//!
//! In this use case the performance was reasonably competitive but still
//! substantially slower than the Unicode-optimized crates. Meanwhile the
//! compression was significantly worse, requiring 6&times; as much storage for
//! the data structure.
//!
//! I also benchmarked the [`croaring`] crate which is an FFI wrapper around the
//! C reference implementation of Roaring Bitmap. This crate was consistently
//! about 15% slower than pure-Rust `roaring`, which could just be FFI overhead.
//! I did not investigate further.
//!
//! [`croaring`]: https://crates.io/crates/croaring
//!
//! #### unicode-ident
//!
//! This crate is most similar to the `ucd-trie` library, in that it's based on
//! bitmaps stored in the leaves of a trie representation, achieving both prefix
//! compression and suffix compression.
//!
//! The key differences are:
//!
//! - Uses a single 2-level trie, rather than 3 disjoint partitions of different
//!   depth each.
//! - Uses significantly larger chunks: 512 bits rather than 64 bits.
//! - Compresses the ID\_Start and ID\_Continue properties together
//!   simultaneously, rather than duplicating identical trie leaf chunks across
//!   the two.
//!
//! The following diagrams show the ID\_Start and ID\_Continue Unicode boolean
//! properties in uncompressed form, in row-major order:
//!
//! <table>
//! <tr><th>ID_Start</th><th>ID_Continue</th></tr>
//! <tr>
//! <td><img alt="ID_Start bitmap" width="256" src="https://user-images.githubusercontent.com/1940490/168647353-c6eeb922-afec-49b2-9ef5-c03e9d1e0760.png"></td>
//! <td><img alt="ID_Continue bitmap" width="256" src="https://user-images.githubusercontent.com/1940490/168647367-f447cca7-2362-4d7d-8cd7-d21c011d329b.png"></td>
//! </tr>
//! </table>
//!
//! Uncompressed, these would take 140 K to store, which is beyond what would be
//! reasonable. However, as you can see there is a large degree of similarity
//! between the two bitmaps and across the rows, which lends well to
//! compression.
//!
//! This crate stores one 512-bit "row" of the above bitmaps in the leaf level
//! of a trie, and a single additional level to index into the leaves. It turns
//! out there are 124 unique 512-bit chunks across the two bitmaps so 7 bits are
//! sufficient to index them.
//!
//! The chunk size of 512 bits is selected as the size that minimizes the total
//! size of the data structure. A smaller chunk, like 256 or 128 bits, would
//! achieve better deduplication but require a larger index. A larger chunk
//! would increase redundancy in the leaf bitmaps. 512-bit chunks are the
//! optimum for total size of the index plus leaf bitmaps.
//!
//! In fact, since there are only 124 unique chunks, we can use an 8-bit index
//! with a spare bit to index at the half-chunk level. This achieves an
//! additional 8.5% compression by eliminating redundancies between the second
//! half of any chunk and the first half of any other chunk. Note that this is
//! not the same as using chunks which are half the size, because it does not
//! necessitate raising the size of the trie's first level.
//!
//! In contrast to binary search or the `ucd-trie` crate, performing lookups in
//! this data structure is straight-line code with no need for branching.

#![no_std]
#![doc(html_root_url = "https://docs.rs/unicode-id-start/1.1.0")]
#![allow(clippy::doc_markdown, clippy::must_use_candidate)]

#[rustfmt::skip]
mod tables;

use crate::tables::{ASCII_CONTINUE, ASCII_START, CHUNK, LEAF, TRIE_CONTINUE, TRIE_START};

/// Check ASCII and Unicode for ID_Start
#[inline]
pub fn is_id_start(ch: char) -> bool {
    if ch.is_ascii() {
        // Fast path: dedicated boolean lookup table for the ASCII range.
        return ASCII_START.0[ch as usize];
    }
    is_id_start_unicode(ch)
}

/// Check Unicode only for ID_Start
#[inline]
pub fn is_id_start_unicode(ch: char) -> bool {
    // First trie level: map the codepoint's row of the bitmap (CHUNK bytes) to
    // a half-chunk index into LEAF.
    let chunk = *TRIE_START.0.get(ch as usize / 8 / CHUNK).unwrap_or(&0);
    // Second level: locate the codepoint's byte within that leaf row, then
    // test its bit.
    let offset = chunk as usize * CHUNK / 2 + ch as usize / 8 % CHUNK;
    unsafe { LEAF.0.get_unchecked(offset) }.wrapping_shr(ch as u32 % 8) & 1 != 0
}

/// Check ASCII and Unicode for ID_Continue
#[inline]
pub fn is_id_continue(ch: char) -> bool {
    if ch.is_ascii() {
        // Fast path: dedicated boolean lookup table for the ASCII range.
        return ASCII_CONTINUE.0[ch as usize];
    }
    is_id_continue_unicode(ch)
}

/// Check Unicode only for ID_Continue
#[inline]
pub fn is_id_continue_unicode(ch: char) -> bool {
    // Same two-level trie lookup as `is_id_start_unicode`, but through the
    // ID_Continue index.
    let chunk = *TRIE_CONTINUE.0.get(ch as usize / 8 / CHUNK).unwrap_or(&0);
    let offset = chunk as usize * CHUNK / 2 + ch as usize / 8 % CHUNK;
    unsafe { LEAF.0.get_unchecked(offset) }.wrapping_shr(ch as u32 % 8) & 1 != 0
}