Code:
ssc install chartab
If you are using an older version of Stata (version 13 or earlier), each character is encoded using a single byte, which allows for 256 distinct values. char(0) to char(127) are the ASCII codes, but there is no single standard for what char(128) to char(255) represent: the character displayed depends on the encoding used by the operating system or application.
If you are using Stata 14 or higher, each character is encoded in UTF-8. This is a storage-efficient Unicode encoding where the 128 ASCII characters are encoded using a single byte (using the same ASCII byte code). All other Unicode characters are encoded using a multi-byte sequence (from two to four bytes, with each byte code >= 128). So by design, UTF-8 is backwards compatible with ASCII.
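To see the difference between bytes and characters for yourself, here is a minimal sketch using Stata's built-in string functions. It assumes Stata 14 or later (where strings are UTF-8) and that é is typed as a single precomposed character; it is not part of chartab, and the letter é is just an illustrative choice.
Code:
* an ASCII letter occupies one byte: displays 1
display strlen("e")
* é is a single Unicode character: displays 1
display ustrlen("é")
* ... but it is stored as a two-byte sequence: displays 2
display strlen("é")
* byte values as escaped decimals: displays \d195\d169 (both >= 128)
display tobytes("é")
* Unicode code point in escaped hex: displays \u00e9
display ustrtohex("é")
strlen() counts bytes while ustrlen() counts Unicode characters, so the two agree only for pure ASCII text.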
Both chartab and chartabb can process text from any combination of string variables, files, string scalars, and string literals in a single run. Here's an example with a string literal:
Code:
. chartab , literal("j'ai hâte à l'été")

   decimal  hexadecimal   character |     frequency    unique name
------------------------------------+------------------------------------------------------
        32       \u0020             |             3    SPACE
        39       \u0027      '      |             2    APOSTROPHE
        97       \u0061      a      |             1    LATIN SMALL LETTER A
       101       \u0065      e      |             1    LATIN SMALL LETTER E
       104       \u0068      h      |             1    LATIN SMALL LETTER H
       105       \u0069      i      |             1    LATIN SMALL LETTER I
       106       \u006a      j      |             1    LATIN SMALL LETTER J
       108       \u006c      l      |             1    LATIN SMALL LETTER L
       116       \u0074      t      |             2    LATIN SMALL LETTER T
       224       \u00e0      à      |             1    LATIN SMALL LETTER A WITH GRAVE
       226       \u00e2      â      |             1    LATIN SMALL LETTER A WITH CIRCUMFLEX
       233       \u00e9      é      |             2    LATIN SMALL LETTER E WITH ACUTE
------------------------------------+------------------------------------------------------

                                      freq. count   distinct
ASCII characters              =                13          9
Multibyte UTF-8 characters    =                 4          3
Unicode replacement character =                 0          0
Total Unicode characters      =                17         12

.
Code:
. chartabb , literal("j'ai hâte à l'été")

   decimal  hexadecimal   character |     frequency
------------------------------------+--------------------------------------------------------------------
        32           20             |             3
        39           27      '      |             2
        97           61      a      |             1
       101           65      e      |             1
       104           68      h      |             1
       105           69      i      |             1
       106           6A      j      |             1
       108           6C      l      |             1
       116           74      t      |             2
       136           88      à      |             1
       137           89      â      |             1
       142           8E      é      |             2
------------------------------------+--------------------------------------------------------------------

ASCII control characters      =                 0
ASCII printable characters    =                13
Extended characters           =                 4
Total characters (bytes)      =                17

.
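Note that the two listings count different things: chartab counts Unicode characters, while chartabb counts bytes. The chartabb listing shows one byte per accented letter (136, 137, 142), which suggests it was produced with text in a single-byte encoding; those values happen to match Mac OS Roman, though that is my inference from the output, not something stated above. As a rough cross-check, here is a sketch that assumes Stata 14 or later, where the same literal is stored as UTF-8 with precomposed accented letters.
Code:
* number of Unicode characters: displays 17, matching the chartab total
display ustrlen("j'ai hâte à l'été")
* number of bytes in UTF-8: displays 21 (13 ASCII bytes + 4 accented letters x 2 bytes)
display strlen("j'ai hâte à l'été")
So under UTF-8 the byte count for this literal would be 21 and not the 17 shown in the single-byte listing, because each of à, â, and é takes two bytes.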