u7/README.md at 54da5b77aaa84c46bd4bc01ca6ced3c1be7caaa4

set/u7

Files

Saleem Edah-Tally 45c5877d22 Allow displaying the result of an operation only.

This concerns the 'representation' and 'properties' modes.

Set the environment variable 'UTF8UTIL_RESULT_ONLY' to any value to
display the result only.

The goal is to achieve consistent usage in scripts.

2023-11-18 21:46:44 +01:00

3.7 KiB

Raw Blame History

U7

This project links to the utf8proc library to perform some operations on an UTF-8 input.

$ utf8util --help
A mode of operation is required: unaccent, normalize, representation, properties, about.
Pass '--help' for more information in each mode.

$ utf8util unaccent --help
This operational mode removes character markings, control characters, default ignorable
characters and unassigned codepoints from an UTF-8 input.The utf8proc library, on which
this utility is based, refers to character markings as 'non-spacing, spacing and
enclosing (accents)' marks, and to default ignorable characters 'such as SOFT-HYPHEN or
ZERO-WIDTH-SPACE'. Control characters are stripped or converted to spaces.

By default, every removable byte is stripped, the output characters are decomposed and
Unicode Versioning Stability is enforced.

  -i, --ignore: do not strip 'default ignorable characters'
  -c, --control: do not handle 'control characters'
  -m, --mark: do not strip 'character markings'
  -n, --na: do not strip 'unassigned codepoints'
  -r, --recompose: output recomposed characters
  -h, --help: show this message

The input can be piped in or read from stdin. It must be a single NULL terminated line.

$ utf8util normalize --help
This operational mode normalizes the input string according to the specified type, the
default being NFC.

  -t, --type: one of NFC, NFD, NFKC, NFKD, NFKC_Casefold
  -h, --help: show this message

The input can be piped in or read from stdin. It must be a single NULL terminated line.

$ utf8util representation --help
This operational mode displays representations of the first identified codepoint.

  -p, --codepoint: hexadecimal representation of the codepoint
  -e, --utf8: hexadecimal representation of each byte
  -s, --utf16: hexadecimal representation of each surrogate
  -b, --binary: binary representation of each byte
  -o, --octal: octal representation of each byte
  -d, --decimal: decimal representation of each byte
  -x, --xml: XML decimal representation of each byte
  -L, --tolower: displays the codepoint as a lower-case character if existent
  -U, --toupper: displays the codepoint as an upper-case character if existent
  -T, --totitle: displays the codepoint as a title-case character if existent
  -h, --help: show this message

The input can be piped in or read from stdin. Pass in a single character for simplicity.
If the environment variable 'UTF8UTIL_RESULT_ONLY' is set, only the result is printed
on stdout.

$ utf8util properties --help
This operational mode displays properties of the first identified codepoint.

  -l, --islower: displays 1 if the codepoint refers to a lower-case character,
  0 otherwise
  -u, --isupper: displays 1 if the codepoint refers to an upper-case character,
  0 otherwise
  -c, --category: determines the category of a codepoint (Letter, Number, Symbol...)
  -d, --direction: determines the bidirectional class of a codepoint; see utf8proc.h
  -i, --decompositiontype: determines the decomposition type of a codepoint;
  see utf8proc.h
  -b, --boundclass: determines the boundclass property of a codepoint; see utf8proc.h
  -h, --help: show this message

The input can be piped in or read from stdin. Pass in a single character for simplicity.
If the environment variable 'UTF8UTIL_RESULT_ONLY' is set, only the result is printed
on stdout.

$ utf8util about            
Author: Saleem Edah-Tally [Surgeon, Hobbyist developer]
Version: 2
License: CeCILL

Disclaimer

Use at your own risks.

3.7 KiB Raw Blame History

U7

Disclaimer

3.7 KiB

Raw Blame History