New -chartab- package on SSC to tabulate character frequency counts

Thanks to Kit Baum, the chartab package is now available on SSC. To install, type in Stata's Command window:

Code:

ssc install chartab

This installs two commands that tabulate character frequency counts. The chartab command tabulates Unicode characters (requires Stata 14 or higher) and the chartabb command tabulates byte codes (requires Stata 10 or higher).

In older versions of Stata (version 13 or earlier), each character is stored as a single byte, which allows for 256 distinct values. char(0) to char(127) are the ASCII codes, but there is no single standard for what char(128) to char(255) represent: their meaning depends on the encoding in use.

If you are using Stata 14 or higher, each character is encoded in UTF-8. This is a storage-efficient Unicode encoding where the 128 ASCII characters are encoded using a single byte (using the same ASCII byte code). All other Unicode characters are encoded using a multi-byte sequence (from two to four bytes, with each byte code >= 128). So by design, UTF-8 is backwards compatible with ASCII.
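The UTF-8 properties described above can be checked outside Stata. The following is only a minimal sketch in Python, not part of the chartab package; the helper name utf8_bytes and the sample characters are my own choices for illustration.

```python
# Illustrative check of the UTF-8 properties described above (not part of chartab).
# The samples are a 1-, 2-, 3- and 4-byte character, chosen for illustration.
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]

def utf8_bytes(ch):
    """Return the list of byte codes that encode one character in UTF-8."""
    return list(ch.encode("utf-8"))

for ch in samples:
    codes = utf8_bytes(ch)
    print(f"U+{ord(ch):04X}: {len(codes)} byte(s) {codes}")

# Pure ASCII text is byte-for-byte identical in ASCII and in UTF-8.
print("chartab".encode("ascii") == "chartab".encode("utf-8"))
```

The ASCII letter takes one byte with its usual code, while every byte of the longer sequences is 128 or higher, so an ASCII byte can never be mistaken for part of a multi-byte character.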

Both chartab and chartabb can process text from any combination of string variables, files, string scalars, and string literals in a single run. Here's an example with a string literal:

Code:

. chartab , literal("j'ai hâte à l'été")

   decimal  hexadecimal   character |     frequency    unique name
------------------------------------+------------------------------------------------------
        32       \u0020             |             3    SPACE
        39       \u0027       '     |             2    APOSTROPHE
        97       \u0061       a     |             1    LATIN SMALL LETTER A
       101       \u0065       e     |             1    LATIN SMALL LETTER E
       104       \u0068       h     |             1    LATIN SMALL LETTER H
       105       \u0069       i     |             1    LATIN SMALL LETTER I
       106       \u006a       j     |             1    LATIN SMALL LETTER J
       108       \u006c       l     |             1    LATIN SMALL LETTER L
       116       \u0074       t     |             2    LATIN SMALL LETTER T
       224       \u00e0       à     |             1    LATIN SMALL LETTER A WITH GRAVE
       226       \u00e2       â     |             1    LATIN SMALL LETTER A WITH CIRCUMFLEX
       233       \u00e9       é     |             2    LATIN SMALL LETTER E WITH ACUTE
------------------------------------+------------------------------------------------------

                                    freq. count   distinct
ASCII characters              =              13          9
Multibyte UTF-8 characters    =               4          3
Unicode replacement character =               0          0
Total Unicode characters      =              17         12


.
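As a cross-check of the table above, the same tally can be sketched outside Stata. This is a rough Python equivalent for illustration only, not how chartab is implemented; it assumes the literal is in precomposed (NFC) form, which the code enforces explicitly.

```python
import unicodedata
from collections import Counter

# NFC guarantees precomposed letters (one code point per accented letter).
text = unicodedata.normalize("NFC", "j'ai hâte à l'été")

# Count code points, then list them in code point order.
counts = Counter(text)
for ch in sorted(counts):
    print(f"{ord(ch):>6}  \\u{ord(ch):04x}  {ch}  {counts[ch]:>3}  {unicodedata.name(ch)}")

# Split the totals the same way the summary lines do.
ascii_total = sum(n for ch, n in counts.items() if ord(ch) < 128)
multibyte_total = sum(n for ch, n in counts.items() if ord(ch) >= 128)
print(ascii_total, multibyte_total, len(text))
```

The totals should agree with the summary lines of the chartab output: 13 ASCII characters, 4 multibyte characters, and 17 characters overall.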

The same literal can be tabulated in Stata 10 using chartabb. Since this is an older version of Stata, each character is stored as a single byte code. I'm on a Mac, so characters are encoded using the Mac OS Roman encoding.

Code:

. chartabb , literal("j'ai hâte à l'été")

   decimal  hexadecimal   character |     frequency
------------------------------------+--------------------------------------------------------------------
        32           20             |             3
        39           27       '     |             2
        97           61       a     |             1
       101           65       e     |             1
       104           68       h     |             1
       105           69       i     |             1
       106           6A       j     |             1
       108           6C       l     |             1
       116           74       t     |             2
       136           88       à     |             1
       137           89       â     |             1
       142           8E       é     |             2
------------------------------------+--------------------------------------------------------------------
ASCII control characters     =               0
ASCII printable characters   =              13
Extended characters          =               4
Total characters (bytes)     =              17


.
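The byte codes 136, 137 and 142 in the table above are specific to Mac OS Roman. As a minimal sketch outside Stata, the Python snippet below encodes the same literal with Python's mac_roman codec and then reads the three extended bytes under Windows-1252 (cp1252) for comparison. The choice of cp1252 as the second encoding is mine, purely for illustration.

```python
import unicodedata

# NFC guarantees precomposed letters (one code point per accented letter).
text = unicodedata.normalize("NFC", "j'ai hâte à l'été")

# Mac OS Roman is a single-byte encoding: one byte per character.
mac_bytes = text.encode("mac_roman")
print(len(mac_bytes), sorted(set(mac_bytes)))

# The same extended byte codes read under two different single-byte encodings.
for code in (136, 137, 142):
    raw = bytes([code])
    print(code, f"{code:02X}", raw.decode("mac_roman"), raw.decode("cp1252"))
```

The byte that displays as é under Mac OS Roman displays as a different character under Windows-1252, which is why byte codes above 127 cannot be interpreted without knowing the encoding.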
