I just ran an example from the SJ paper about boottest ("Fast and Wild") and was surprised at its poor performance. I'm fortunate enough to be working in Stata/MP, and I discovered that the more processors I let it use, the slower it got. I'm using a Dell XPS 17 9700, which has a pretty good cooling system, 64GB of RAM, and Windows 10 Pro. Its CPU is an Intel i7-10875H, which has 8 cores and hyperthreading. I'm running the 12-core edition of Stata/MP 16.1.

Here is the log from a distilled demonstration. It sets the number of cores to 1, 2, ..., 12. On each iteration it calls a program that creates a 2500 x 1 matrix X and then computes X + X :* X 10,000 times. Simplifying that calculation to X + X or X :* X makes the problem go away.

I'm wondering if anyone else with access to Stata/MP gets similar results, or has insights. Possibly it doesn't happen on all computers. I understand that implementing invisible parallelization in a compiler is a tricky business. But Stata/MP doesn't come cheap!

Code:
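cap mata mata drop demo()

mata
mata set matastrict on
mata set matalnum off
mata set mataoptimize on

// Build a 2500 x 1 vector and evaluate X + X :* X 10,000 times, discarding each result
void demo() {
    real matrix X; real scalar i
    X = runiform(2500,1)
    for (i=10000; i; i--)
        (void) X + X :* X
}
end

// Time one call to demo() for each processor count from 1 to 12;
// the timer number equals the number of processors in use
timer clear
forvalues p=1/12 {
  qui set processors `p'
  set seed 1202938431
  timer on `p'
  mata demo()
  timer off `p'
}
timer list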
Output:
Code:
. timer list
   1:      0.14 /        1 =       0.1390
   2:      0.16 /        1 =       0.1640
   3:      1.63 /        1 =       1.6330
   4:      2.02 /        1 =       2.0150
   5:      2.47 /        1 =       2.4680
   6:      2.92 /        1 =       2.9210
   7:      3.38 /        1 =       3.3780
   8:      3.84 /        1 =       3.8370
   9:      4.26 /        1 =       4.2640
  10:      4.70 /        1 =       4.7040
  11:      5.21 /        1 =       5.2100
  12:      5.63 /        1 =       5.6260
That's right: using 1 core takes 0.14 seconds. Using 8 cores takes 3.84 seconds, more than 27 times as long. Using 12 (with hyperthreading) takes 5.63 seconds, about 40 times as long.

Here's the output I get from Stata 15.0. It's actually better, but still bad:
Code:
   1:      0.13 /        1 =       0.1280
   2:      0.14 /        1 =       0.1440
   3:      1.09 /        1 =       1.0900
   4:      1.35 /        1 =       1.3460
   5:      1.55 /        1 =       1.5540
   6:      1.78 /        1 =       1.7810
   7:      2.10 /        1 =       2.1010
   8:      2.35 /        1 =       2.3540
   9:      2.59 /        1 =       2.5920
  10:      2.89 /        1 =       2.8890
  11:      3.16 /        1 =       3.1570
  12:      3.41 /        1 =       3.4120
I monitored CPU usage during these tests and saw no evidence of throttling.

I'm worried that my Mata-based programs are getting seriously slowed down.

If you've got MP and can run this test, I'd be interested in the results.
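
If it helps anyone check the point above that X + X and X :* X on their own don't show the slowdown, here is a rough sketch that times all three expressions at 1 processor and at the maximum available. It assumes Stata/MP with more than one processor licensed. demo_variant() and the timer numbering are hypothetical and not part of the original demonstration. I'm not including results for it.

```stata
cap mata mata drop demo_variant()

mata
mata set matastrict on

// Hypothetical helper: evaluates one of three expressions 10,000 times
// on a 2500 x 1 vector, discarding each result
void demo_variant(real scalar which) {
    real matrix X
    real scalar i
    X = runiform(2500,1)
    if (which==1) {
        for (i=10000; i; i--) (void) X + X :* X
    }
    else if (which==2) {
        for (i=10000; i; i--) (void) X + X
    }
    else {
        for (i=10000; i; i--) (void) X :* X
    }
}
end

// Largest processor count Stata will accept on this machine and license
local maxp = c(processors_max)

// Timers 1-3: the three expressions with 1 processor
// Timers 4-6: the same three expressions with the maximum number of processors
timer clear
foreach p in 1 `maxp' {
    qui set processors `p'
    forvalues v = 1/3 {
        local t = 3*(`p' > 1) + `v'
        timer on `t'
        mata: demo_variant(`v')
        timer off `t'
    }
}
timer list
```

If the combined expression is the only one affected, timer 4 should stand out against timers 5 and 6, while timers 1 to 3 stay close to one another.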