Code Examples
package main

import (
	"fmt"

	"golang.org/x/text/unicode/norm"
)

func main() {
	s := norm.NFD.String("Mêlée")

	for i := 0; i < len(s); {
		d := norm.NFC.NextBoundaryInString(s[i:], true)
		fmt.Printf("%[1]s: %+[1]q\n", s[i:i+d])
		i += d
	}
}
package main

import (
	"bytes"
	"fmt"
	"io"
	"unicode/utf8"

	"golang.org/x/text/unicode/norm"
)

// EqualSimple uses a norm.Iter to compare two non-normalized
// strings for equivalence.
func EqualSimple(a, b string) bool {
	var ia, ib norm.Iter
	ia.InitString(norm.NFKD, a)
	ib.InitString(norm.NFKD, b)
	for !ia.Done() && !ib.Done() {
		if !bytes.Equal(ia.Next(), ib.Next()) {
			return false
		}
	}
	return ia.Done() && ib.Done()
}

// FindPrefix finds the longest common prefix of ASCII characters
// of a and b.
func FindPrefix(a, b string) int {
	i := 0
	for ; i < len(a) && i < len(b) && a[i] < utf8.RuneSelf && a[i] == b[i]; i++ {
	}
	return i
}

// EqualOpt is like EqualSimple, but optimizes the special
// case for ASCII characters.
func EqualOpt(a, b string) bool {
	n := FindPrefix(a, b)
	a, b = a[n:], b[n:]
	var ia, ib norm.Iter
	ia.InitString(norm.NFKD, a)
	ib.InitString(norm.NFKD, b)
	for !ia.Done() && !ib.Done() {
		if !bytes.Equal(ia.Next(), ib.Next()) {
			return false
		}
		if n := int64(FindPrefix(a[ia.Pos():], b[ib.Pos():])); n != 0 {
			ia.Seek(n, io.SeekCurrent)
			ib.Seek(n, io.SeekCurrent)
		}
	}
	return ia.Done() && ib.Done()
}

var compareTests = []struct{ a, b string }{
	{"aaa", "aaa"},
	{"aaa", "aab"},
	{"a\u0300a", "\u00E0a"},
	{"a\u0300\u0320b", "a\u0320\u0300b"},
	{"\u1E0A\u0323", "\x44\u0323\u0307"},
	// A character that decomposes into multiple segments
	// spans several iterations.
	{"\u3304", "\u30A4\u30CB\u30F3\u30AF\u3099"},
}

func main() {
	for i, t := range compareTests {
		r0 := EqualSimple(t.a, t.b)
		r1 := EqualOpt(t.a, t.b)
		fmt.Printf("%d: %v %v\n", i, r0, r1)
	}
}
Package-Level Type Names (total 18, in which 3 are exported)
A Form denotes a canonical representation of Unicode code points.
The Unicode-defined normalization and equivalence forms are:
NFC Unicode Normalization Form C
NFD Unicode Normalization Form D
NFKC Unicode Normalization Form KC
NFKD Unicode Normalization Form KD
For a Form f, this documentation uses the notation f(x) to mean
the bytes or string x converted to the given form.
A position n in x is called a boundary if conversion to the form can
proceed independently on both sides:
f(x) == append(f(x[0:n]), f(x[n:])...)
References: https://unicode.org/reports/tr15/ and
https://unicode.org/notes/tn5/.
Append returns f(append(out, b...)).
The buffer out must be nil, empty, or equal to f(out).
AppendString returns f(append(out, []byte(s)...)).
The buffer out must be nil, empty, or equal to f(out).
Bytes returns f(b). May return b if f(b) = b.
FirstBoundary returns the position i of the first boundary in b
or -1 if b contains no boundary.
FirstBoundaryInString returns the position i of the first boundary in s
or -1 if s contains no boundary.
IsNormal returns true if b == f(b).
IsNormalString returns true if s == f(s).
LastBoundary returns the position i of the last boundary in b
or -1 if b contains no boundary.
NextBoundary reports the index of the boundary between the first and next
segment in b or -1 if atEOF is false and there are not enough bytes to
determine this boundary.
NextBoundaryInString reports the index of the boundary between the first and
next segment in s or -1 if atEOF is false and there are not enough bytes to
determine this boundary.
Properties returns properties for the first rune in s.
PropertiesString returns properties for the first rune in s.
QuickSpan returns a boundary n such that b[0:n] == f(b[0:n]).
It is not guaranteed to return the largest such n.
QuickSpanString returns a boundary n such that s[0:n] == f(s[0:n]).
It is not guaranteed to return the largest such n.
Reader returns a new reader that implements Read
by reading data from r and returning f(data).
Reset implements the Reset method of the transform.Transformer interface.
Span implements transform.SpanningTransformer. It returns a boundary n such
that b[0:n] == f(b[0:n]). It is not guaranteed to return the largest such n.
SpanString returns a boundary n such that s[0:n] == f(s[0:n]).
It is not guaranteed to return the largest such n.
String returns f(s).
Transform implements the Transform method of the transform.Transformer
interface. It may need to write segments of up to MaxSegmentSize at once.
Users should either catch ErrShortDst and allow dst to grow or have dst be at
least of size MaxTransformChunkSize to be guaranteed of progress.
Writer returns a new writer that implements Write(b)
by writing f(b) to w. The returned writer may use an
internal buffer to maintain state across Write calls.
Calling its Close method writes any buffered data to w.
( T) doAppend(out []byte, src input, n int) []byte
( T) firstBoundary(src input, nsrc int) int
( T) nextBoundary(src input, nsrc int, atEOF bool) int
transform implements the transform.Transformer interface. It is only called
when quickSpan does not pass for a given string.
T : golang.org/x/text/transform.SpanningTransformer
T : golang.org/x/text/transform.Transformer
T : vendor/golang.org/x/text/transform.SpanningTransformer
T : vendor/golang.org/x/text/transform.Transformer
func (*Iter).Init(f Form, src []byte)
func (*Iter).InitString(f Form, src string)
const NFC
const NFD
const NFKC
const NFKD
An Iter iterates over a string or byte slice, while normalizing it
to a given Form.
asciiF iterFunc
buf [128]byte
// first character saved from previous iteration
// remainder of multi-segment decomposition
// implementation of next depends on form
// current position in input source
rb reorderBuffer
Done returns true if there is no more input to process.
Init initializes i to iterate over src after normalizing it to Form f.
InitString initializes i to iterate over src after normalizing it to Form f.
Next returns f(i.input[i.Pos():n]), where n is a boundary of i.input.
For any input a and b for which f(a) == f(b), subsequent calls
to Next will return the same segments.
Modifying runes are grouped together with the preceding starter, if such a starter exists.
Although not guaranteed, n will typically be the smallest possible n.
Pos returns the byte position at which the next call to Next will commence processing.
Seek sets the segment to be returned by the next call to Next to start
at position p. It is the responsibility of the caller to set p to the
start of a segment.
returnSlice returns a slice of the underlying input type as a byte slice.
If the underlying is of type []byte, it will simply return a slice.
If the underlying is of type string, it will copy the slice to the buffer
and return that.
(*T) setDone()
*T : io.Seeker
func doNormComposed(i *Iter) []byte
func doNormDecomposed(i *Iter) []byte
func nextASCIIBytes(i *Iter) []byte
func nextASCIIString(i *Iter) []byte
func nextCGJCompose(i *Iter) []byte
func nextCGJDecompose(i *Iter) []byte
func nextComposed(i *Iter) []byte
func nextDecomposed(i *Iter) (next []byte)
func nextDone(i *Iter) []byte
func nextHangul(i *Iter) []byte
func nextMulti(i *Iter) []byte
func nextMultiNorm(i *Iter) []byte
Properties provides access to normalization properties of a rune.
// leading canonical combining class (ccc if not decomposition)
// quick check flags
index uint16
// number of leading non-starters.
// start position in reorderBuffer; used in composition.go
// length of UTF-8 encoding of this rune
// trailing canonical combining class (ccc if not decomposition)
BoundaryAfter returns true if runes cannot combine with or otherwise
interact with this or previous runes.
BoundaryBefore returns true if this rune starts a new segment and
cannot combine with any rune on the left.
CCC returns the canonical combining class of the underlying rune.
Decomposition returns the decomposition for the underlying rune
or nil if there is none.
LeadCCC returns the CCC of the first rune in the decomposition.
If there is no decomposition, LeadCCC equals CCC.
Size returns the length of UTF-8 encoding of the rune.
TrailCCC returns the CCC of the last rune in the decomposition.
If there is no decomposition, TrailCCC equals CCC.
( T) combinesBackward() bool
( T) combinesForward() bool
( T) hasDecomposition() bool
( T) isInert() bool
( T) isYesC() bool
( T) isYesD() bool
( T) multiSegment() bool
( T) nLeadingNonStarters() uint8
( T) nTrailingNonStarters() uint8
func Form.Properties(s []byte) Properties
func Form.PropertiesString(s string) Properties
func compInfo(v uint16, sz int) Properties
func lastRuneStart(fd *formInfo, buf []byte) (Properties, int)
func lookupInfoNFC(b input, i int) Properties
func lookupInfoNFKC(b input, i int) Properties
formInfo holds Form-specific functions and tables.
// form type
// form type
form Form
info lookupFunc
nextMain iterFunc
quickSpan returns a boundary n such that src[0:n] == f(src[0:n]) and
whether any non-normalized parts were found. If atEOF is false, n will
not point past the last segment if this segment might become
non-normalized by appending other runes.
func lastBoundary(fd *formInfo, b []byte) int
func lastRuneStart(fd *formInfo, buf []byte) (Properties, int)
insertErr is an error code returned by insert. Using this type instead
of error improves performance up to 20% for many of the benchmarks.
const iShortDst
const iShortSrc
const iSuccess
nfcTrie. Total size: 10680 bytes (10.43 KiB). Checksum: a555db76d4becdd2.
lookup returns the trie value for the first UTF-8 encoding in s and
the width in bytes of this encoding. The size will be 0 if s does not
hold enough bytes to complete the encoding. len(s) must be greater than 0.
lookupString returns the trie value for the first UTF-8 encoding in s and
the width in bytes of this encoding. The size will be 0 if s does not
hold enough bytes to complete the encoding. len(s) must be greater than 0.
lookupStringUnsafe returns the trie value for the first UTF-8 encoding in s.
s must start with a full and valid UTF-8 encoded rune.
lookupUnsafe returns the trie value for the first UTF-8 encoding in s.
s must start with a full and valid UTF-8 encoded rune.
lookupValue determines the type of block n and looks up the value for b.
func newNfcTrie(i int) *nfcTrie
var nfcData *nfcTrie
nfkcTrie. Total size: 18768 bytes (18.33 KiB). Checksum: c51186dd2412943d.
lookup returns the trie value for the first UTF-8 encoding in s and
the width in bytes of this encoding. The size will be 0 if s does not
hold enough bytes to complete the encoding. len(s) must be greater than 0.
lookupString returns the trie value for the first UTF-8 encoding in s and
the width in bytes of this encoding. The size will be 0 if s does not
hold enough bytes to complete the encoding. len(s) must be greater than 0.
lookupStringUnsafe returns the trie value for the first UTF-8 encoding in s.
s must start with a full and valid UTF-8 encoded rune.
lookupUnsafe returns the trie value for the first UTF-8 encoding in s.
s must start with a full and valid UTF-8 encoded rune.
lookupValue determines the type of block n and looks up the value for b.
func newNfkcTrie(i int) *nfkcTrie
var nfkcData *nfkcTrie
buf []byte
rb reorderBuffer
w io.Writer
Close forces data that remains in the buffer to be written.
Write implements the standard write interface. If the last characters are
not at a normalization boundary, the bytes will be buffered for the next
write. The remaining bytes will be written on close.
*T : io.Closer
*T : io.WriteCloser
*T : io.Writer
We pack quick check data in 6 bits:
5: Combines forward (0 == false, 1 == true)
4..3: NFC_QC Yes (00), No (10), or Maybe (11)
2: NFD_QC Yes (0) or No (1). No also means there is a decomposition.
1..0: Number of trailing non-starters.
When all 6 bits are zero, the character is inert, meaning it is never
influenced by normalization.
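The bit layout listed above (bits 5 down to 0) can be unpacked with a few masks and shifts. The following is a minimal sketch; the type quickCheck, its field names, and the function unpack are hypothetical and only mirror the listed positions, not the package's internal representation.

```go
package main

import "fmt"

// quickCheck is a hypothetical unpacked view of the packed flags.
type quickCheck struct {
	combinesForward     bool  // bit 5
	nfcQC               uint8 // bits 4..3: 0 = Yes (00), 2 = No (10), 3 = Maybe (11)
	hasDecomposition    bool  // bit 2: NFD_QC is No
	trailingNonStarters uint8 // bits 1..0
}

// unpack splits the low six bits of flags according to the listed layout.
func unpack(flags uint8) quickCheck {
	return quickCheck{
		combinesForward:     flags&0x20 != 0,
		nfcQC:               (flags >> 3) & 0x03,
		hasDecomposition:    flags&0x04 != 0,
		trailingNonStarters: flags & 0x03,
	}
}

func main() {
	fmt.Printf("%+v\n", unpack(0x20)) // only "combines forward" set
	fmt.Printf("%+v\n", unpack(0x1D)) // NFC_QC Maybe, has decomposition, 1 trailing non-starter
	fmt.Printf("%+v\n", unpack(0x00)) // all listed bits zero
}
```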
reorderBuffer is used to normalize a single segment. Characters inserted with
insert are decomposed and reordered based on CCC. The compose method can
be used to recombine characters. Note that the byte buffer does not hold
the UTF-8 characters in order. Only the rune array is maintained in sorted
order. flush writes the resulting segment to a byte array.
// UTF-8 buffer. Referenced by runeInfo.pos.
f formInfo
flushF func(*reorderBuffer) bool
// Number of bytes.
// Number of runeInfos.
nsrc int
out []byte
// Per character info.
src input
// For limiting length of non-starter sequence.
tmpBytes input
appendRune inserts a rune at the end of the buffer. It is used for Hangul.
assignRune sets a rune at position pos. It is used for Hangul and recomposition.
bytesAt returns the UTF-8 encoding of the rune at position n.
It is used for Hangul and recomposition.
combineHangul algorithmically combines Jamo character components into Hangul.
See https://unicode.org/reports/tr15/#Hangul for details on combining Hangul.
compose recombines the runes in the buffer.
It should only be used to recompose a single segment, as it will not
handle alternations between Hangul and non-Hangul characters correctly.
decomposeHangul algorithmically decomposes a Hangul rune into
its Jamo components.
See https://unicode.org/reports/tr15/#Hangul for details on decomposing Hangul.
(*T) doFlush() bool
flush appends the normalized segment to out and resets rb.
flushCopy copies the normalized segment to buf and resets rb.
It returns the number of bytes written to buf.
(*T) init(f Form, src []byte)
(*T) initString(f Form, src string)
insertCGJ inserts a Combining Grapheme Joiner (0x034f) into rb.
insertDecomposed inserts an entry into the reorderBuffer for each rune
in dcomp. dcomp must be a sequence of decomposed UTF-8-encoded runes.
It flushes the buffer on each new segment start.
insertFlush inserts the given rune in the buffer ordered by CCC.
If a decomposition with multiple segments is encountered, the leading
ones are flushed.
It returns a non-zero error code if the rune was not inserted.
insertOrdered inserts a rune in the buffer, ordered by Canonical Combining Class.
It returns false if the buffer is not large enough to hold the rune.
It is used internally by insert and insertString only.
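Ordering by Canonical Combining Class can be pictured as a stable insertion into a small slice. The following is a minimal sketch, assuming a simplified buffer; the type mark and this free-standing insertOrdered are hypothetical stand-ins, and the capacity check described above is omitted. The combining classes used in main (230 for U+0300, 220 for U+0320) come from the Unicode Character Database.

```go
package main

import "fmt"

// mark pairs a rune with its canonical combining class (hypothetical type).
type mark struct {
	r   rune
	ccc uint8
}

// insertOrdered inserts m so that non-zero combining classes stay in
// non-decreasing order. Runes with equal class keep their input order,
// and a rune with class 0 is simply appended.
func insertOrdered(buf []mark, m mark) []mark {
	i := len(buf)
	if m.ccc > 0 {
		for i > 0 && buf[i-1].ccc > m.ccc {
			i--
		}
	}
	buf = append(buf, mark{})
	copy(buf[i+1:], buf[i:])
	buf[i] = m
	return buf
}

func main() {
	var buf []mark
	buf = insertOrdered(buf, mark{'a', 0})
	buf = insertOrdered(buf, mark{'\u0300', 230}) // COMBINING GRAVE ACCENT
	buf = insertOrdered(buf, mark{'\u0320', 220}) // COMBINING MINUS SIGN BELOW
	for _, m := range buf {
		fmt.Printf("%U ccc=%d\n", m.r, m.ccc)
	}
}
```

The mark with class 220 ends up before the one with class 230 even though it arrived later, so both input orders of the two marks yield the same sequence.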
insertSingle inserts an entry in the reorderBuffer for the rune at
position i. info is the runeInfo for the rune at position i.
insertUnsafe inserts the given rune in the buffer ordered by CCC.
It is assumed there is sufficient space to hold the runes. It is the
responsibility of the caller to ensure this. This can be done by checking
the state returned by the streamSafe type.
reset discards all characters from the buffer.
runeAt returns the rune at position n. It is used for Hangul and recomposition.
(*T) setFlusher(out []byte, f func(*reorderBuffer) bool)
func appendFlush(rb *reorderBuffer) bool
func appendQuick(rb *reorderBuffer, i int) int
func cmpNormalBytes(rb *reorderBuffer) bool
func decomposeSegment(rb *reorderBuffer, sp int, atEOF bool) int
func decomposeToLastBoundary(rb *reorderBuffer)
func doAppend(rb *reorderBuffer, out []byte, p int) []byte
func doAppendInner(rb *reorderBuffer, p int) []byte
func flushTransform(rb *reorderBuffer) bool
func patchTail(rb *reorderBuffer) bool
offset []uint16
values []valueRange
lookupValue determines the type of block n and looks up the value for b.
For n < t.cutoff, the block is a simple lookup table. Otherwise, the block
is a list of ranges with an accompanying value. Given a matching range r,
the value for b is given by r.value + (b - r.lo) * stride.
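The range formula above can be illustrated with a simplified lookup. The following is a minimal sketch, assuming a linear scan over ranges and a zero result when no range matches; the valueRange shape and the function lookupRange are hypothetical and are not claimed to match the package's internal layout.

```go
package main

import "fmt"

// valueRange is a hypothetical range entry: bytes in [lo, hi] map to
// value plus a per-step increment.
type valueRange struct {
	value  uint16
	lo, hi byte
}

// lookupRange returns r.value + (b - r.lo) * stride for the first range r
// that contains b. Returning 0 for a miss is an assumption of this sketch.
func lookupRange(ranges []valueRange, stride uint16, b byte) uint16 {
	for _, r := range ranges {
		if r.lo <= b && b <= r.hi {
			return r.value + uint16(b-r.lo)*stride
		}
	}
	return 0
}

func main() {
	ranges := []valueRange{{value: 100, lo: 0x80, hi: 0x8F}}
	fmt.Println(lookupRange(ranges, 2, 0x83)) // stride 2: each byte advances the value by 2
	fmt.Println(lookupRange(ranges, 0, 0x83)) // stride 0: the whole range shares one value
	fmt.Println(lookupRange(ranges, 2, 0x90)) // outside every range
}
```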
var nfcSparse
var nfkcSparse
ssState is used for reporting the segment state after inserting a rune.
It is returned by streamSafe.next.
const ssOverflow
const ssStarter
const ssSuccess
streamSafe implements the policy of when a CGJ should be inserted.
backwards is used for checking for overflow and segment starts
when traversing a string backwards. Users do not need to call first
for the first rune. The state of the streamSafe retains the count of
the non-starters loaded.
first inserts the first rune of a segment. It is a faster version of next if
it is known p represents the first rune in a segment.
( T) isMax() bool
insert returns a ssState value to indicate whether a rune represented by p
can be inserted.
combine returns the combined rune or 0 if it doesn't exist.
The caller is responsible for calling
recompMapOnce.Do(buildRecompMap) sometime before this is called.
compInfo converts the information contained in v and sz
to a Properties. See the comment at the top of the file
for more information on the format.
decomposeHangul writes the decomposed Hangul to buf and returns the number
of bytes written. len(buf) should be at least 9.
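The 9-byte requirement follows from the arithmetic decomposition in UAX #15: a syllable yields at most three Jamo, each taking 3 bytes in UTF-8. The following is a minimal sketch of that arithmetic; the function decompose and the constant names are hypothetical, and returning 0 for a rune outside the syllable range is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// Constants from the Hangul section of UAX #15.
const (
	hangulBase  = 0xAC00
	jamoLBase   = 0x1100
	jamoVBase   = 0x1161
	jamoTBase   = 0x11A7
	jamoVCount  = 21
	jamoTCount  = 28
	hangulCount = 19 * jamoVCount * jamoTCount
)

// decompose writes the Jamo for the precomposed syllable r to buf and
// returns the number of bytes written, or 0 if r is not such a syllable.
// buf must hold at least 9 bytes.
func decompose(buf []byte, r rune) int {
	if r < hangulBase || r >= hangulBase+hangulCount {
		return 0
	}
	idx := r - hangulBase
	l := jamoLBase + idx/(jamoVCount*jamoTCount)
	v := jamoVBase + (idx%(jamoVCount*jamoTCount))/jamoTCount
	t := jamoTBase + idx%jamoTCount
	n := utf8.EncodeRune(buf, l)
	n += utf8.EncodeRune(buf[n:], v)
	if t != jamoTBase { // an index of 0 means there is no trailing consonant
		n += utf8.EncodeRune(buf[n:], t)
	}
	return n
}

func main() {
	buf := make([]byte, 9)
	n := decompose(buf, '\uD55C') // HANGUL SYLLABLE HAN
	fmt.Printf("%d bytes: %+q\n", n, buf[:n])
}
```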
decomposeSegment scans the first segment in src into rb. It inserts 0x034f
(Grapheme Joiner) when it encounters a sequence of more than 30 non-starters
and returns the number of bytes consumed from src or iShortDst or iShortSrc.
decomposeToLastBoundary finds an open segment at the end of the buffer
and scans it into rb. Returns the buffer minus the last segment.
nextMulti is used for iterating over multi-segment decompositions
for decomposing normal forms.
nextMultiNorm is used for iterating over multi-segment decompositions
for composing normal forms.
patchTail fixes a case where a rune may be incorrectly normalized
if it is followed by illegal continuation bytes. It returns the
patched buffer and whether the decomposition is still in progress.
Package-Level Variables (total 18, none are exported)
Package-Level Constants (total 52, in which 8 are exported)
GraphemeJoiner is inserted after maxNonStarters non-starter runes.
MaxSegmentSize is the maximum size of a byte buffer needed to consider any
sequence of starter and non-starter runes for the purpose of normalization.
MaxTransformChunkSize indicates the maximum number of bytes that Transform
may need to write atomically for any Form. Making a destination buffer at
least this size ensures that Transform can always make progress and that
the user does not need to grow the buffer on an ErrShortDst.
Indicates a rune caused a segment overflow and a CGJ should be inserted.
Indicates a rune starts a new segment and should not be added.
Indicates a rune was successfully added to the segment.
The pages are generated with Golds v0.3.2. (GOOS=linux GOARCH=amd64)