Performance
This page is about computer performance in the context of Raku.
First, profile your code
Make sure you're not wasting time on the wrong code: start by identifying your "critical 3%" by profiling your code's performance. The rest of this document shows you how to do that.
Time with now - INIT now
Expressions of the form now - INIT now, where INIT is a phase in the running of a Raku program, provide a great idiom for timing code snippets.
Use the m: your code goes here #raku channel evalbot to write lines like:
m: say now - INIT now
rakudo-moar abc1234: OUTPUT«0.0018558␤»
The now to the left of INIT runs 0.0018558 seconds later than the now to the right of the INIT because the latter occurs during the INIT phase.
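The same idiom works in a local script. The following is a minimal sketch, not a benchmark harness: the workload (summing squares) and the variable names are placeholders chosen for illustration. It also shows the plain "remember a start time" variant for timing a single section rather than the whole run.

```raku
# Placeholder workload: build a list of squares.
my @squares = (1..100_000).map(* ** 2);

# Time from the INIT phase (program start-up) up to this point.
my $since-init = now - INIT now;
say "since INIT: $since-init seconds";

# Time just one section by remembering when it started.
my $start = now;
my $sum   = [+] @squares;
my $took  = now - $start;
say "summing took $took seconds (sum is $sum)";
```

Single measurements like these are noisy, so repeat a run a few times before drawing conclusions from the numbers.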
Profile locally
When using the MoarVM backend, the Rakudo compiler's --profile command line option writes the profile data to an HTML file.
This file will open to the "Overview" section, which gives some overall data about how the program ran, e.g., total runtime and time spent doing garbage collection. One important piece of information you'll get here is the percentage of the total call frames (i.e., blocks) that were interpreted (slowest, in red), speshed (faster, in orange), and jitted (fastest, in green).
The next section, "Routines", is probably where you'll spend the most time. It has a sortable and filterable table of routine (or block) name+file+line, the number of times it ran, the inclusive time (time spent in that routine + time spent in all routines called from it), exclusive time (just the time spent in that routine), and whether it was interpreted, speshed, or jitted (same color code as the "Overview" page). Sorting by exclusive time is a good way to know where to start optimizing. Routines with a filename that starts like SETTING::src/core/ or gen/moar/ are from the compiler. A good way to see just the stuff from your own code is to put the filename of the script you profiled in the "Name" search box.
The "Call Graph" section gives a flame graph representation of much of the same information as the "Routines" section.
The "Allocations" section tells you how many objects of each type were allocated, as well as which routines did the allocating.
The "GC" section gives you detailed information about all the garbage collections that occurred.
The "OSR / Deopt" section gives you information about On Stack Replacements (OSRs), which is when routines are "upgraded" from interpreted to speshed or jitted. Deopts are the opposite, when speshed or jitted code has to be "downgraded" to being interpreted.
If the profile data is too big, it could take a long time for a browser to open the file. In that case, output to a file with a .json extension using the --profile=filename option, then open the file with the Qt viewer.
To deal with even larger profiles, output to a file with a .sql extension. This will write the profile data as a series of SQL statements, suitable for opening in SQLite.
# create a profile
raku --profile=demo.sql -e 'say (^20).combinations(3).elems'
# create a SQLite database
sqlite3 demo.sqlite
# load the profile data
sqlite> .read demo.sql
# the query below is equivalent to the default view of the "Routines" tab in the HTML profile
sqlite> select
case when r.name = '' then '<anon>' else r.name end as name,
r.file,
r.line,
sum(entries) as entries,
sum(case when rec_depth = 0 then inclusive_time else 0 end) as inclusive_time,
sum(exclusive_time) as exclusive_time
from
calls c,
routines r
where
c.routine_id = r.id
group by
r.id
order by
inclusive_time desc
limit 30;
The in-progress, next-gen profiler is moarperf, which can accept .sql or SQLite files and has a bunch of new functionality compared to the original profiler. However, it has more dependencies than the relatively stand-alone original profiler, so you'll have to install some modules before using it.
To learn how to interpret the profile info, use the prof-m: your code goes here evalbot (explained above) and ask questions on the IRC channel.
Profile compiling
If you want to profile the time and memory it takes to compile your code, use Rakudo's --profile-compile or --profile-stage options.
Create or view benchmarks
Use perl6-bench.
If you run perl6-bench for multiple compilers (typically, versions of Perl, Raku, or NQP), results for each are visually overlaid on the same graphs, to provide for quick and easy comparison.
Share problems
Once you've used the above techniques to identify the code to improve, you can then begin to address (and share) the problem with others:
For each problem, distill it down to a one-liner or a gist and either provide performance numbers or make the snippet small enough that it can be profiled using prof-m: your code or gist URL goes here.
Think about the minimum speed increase (or RAM reduction or whatever) you need/want, and think about the cost associated with achieving that goal. What's the improvement worth in terms of people's time and energy?
Let others know if your Raku use-case is in a production setting or just for fun.
Solve problems
This bears repeating: make sure you're not wasting time on the wrong code. Start by identifying the "critical 3%" of your code.
Line by line
A quick, fun, productive way to try to improve code line-by-line is to collaborate with others using the #raku evalbot camelia.
Routine by routine
With multi-dispatch, you can drop in new variants of routines "alongside" existing ones:
# existing code generically matches a two arg foo call:
multi foo(Any $a, Any $b) { ... }
# new variant takes over for a foo("quux", 42) call:
multi foo("quux", Int $b) { ... }
The call overhead of having multiple foo definitions is generally insignificant (though see discussion of where below), so if your new definition handles its particular case more efficiently than the previously existing set of definitions, then you probably just made your code that much more efficient for that case.
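To see which variant handles which call, here is a runnable version of the snippet above. The routine bodies are placeholders invented for illustration; only the signatures come from the original example.

```raku
# Generic candidate: matches any two-argument call.
multi foo(Any $a, Any $b)  { "generic: $a, $b" }

# Narrower candidate: takes over when the first argument is the
# string "quux" and the second is an Int.
multi foo("quux", Int $b)  { "special-cased quux with $b" }

say foo("bar", 1);      # generic: bar, 1
say foo("quux", 42);    # special-cased quux with 42
say foo("quux", "x");   # generic: quux, x
```

Note that a literal such as "quux" in a signature acts as a value constraint on that parameter, so it is checked at run time in the same way as a where clause.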
Speed up type-checks and call resolution
Most where clauses (and thus most subsets) force dynamic (runtime) type checking and call resolution for any call they might match. This is slower, or at least later, than compile-time.
Method calls are generally resolved as late as possible (dynamically at runtime), whereas sub calls are generally resolved statically at compile-time.
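As a rough illustration of the difference, the sketch below calls one routine constrained by a subset with a where clause and one constrained only by a nominal type. The names Even, half-even and half-int are hypothetical, and the loop count is arbitrary. How large the gap is (if any) depends on your Rakudo version and on how the optimizer treats the call, so treat the timings as something to measure rather than assume.

```raku
# Subset with a where clause: the constraint is checked at run time.
subset Even of Int where * %% 2;
sub half-even(Even $n) { $n div 2 }

# Plain nominal type constraint: no where clause to run on each call.
sub half-int(Int $n) { $n div 2 }

my $t0 = now;
half-even(10) for ^100_000;
say "subset with where: { now - $t0 } seconds";

my $t1 = now;
half-int(10) for ^100_000;
say "nominal type only: { now - $t1 } seconds";
```

If the check is only needed at the edges of your program, one option is to validate once there and use plain nominal types on the hot path.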
Choose better algorithms
One of the most reliable techniques for making large performance improvements, regardless of language or compiler, is to pick a more appropriate algorithm.
A classic example is Boyer-Moore. To match a small string in a large string, one obvious way to do it is to compare the first character of the two strings and then, if they match, compare the second characters, or, if they don't match, compare the first character of the small string with the second character in the large string, and so on. In contrast, the Boyer-Moore algorithm starts by comparing the *last* character of the small string with the correspondingly positioned character in the large string. For most strings, the Boyer-Moore algorithm is close to N times faster algorithmically, where N is the length of the small string.
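The idea of comparing from the end of the small string and skipping ahead can be sketched as follows. This is the simplified Boyer-Moore-Horspool variant, not the full Boyer-Moore algorithm, and the routine name horspool-index is made up for this example. It is meant to show the technique; for real code, the built-in index method is the more sensible choice.

```raku
# Returns the position of the first match of $needle in $haystack, or Nil.
sub horspool-index(Str $haystack, Str $needle) {
    my @h = $haystack.comb;
    my @n = $needle.comb;
    my $m = @n.elems;
    return 0   if $m == 0;
    return Nil if $m > @h.elems;

    # Shift table: for every needle character except the last, how far the
    # window can move when that character sits under the window's end.
    my %shift;
    %shift{@n[$_]} = $m - 1 - $_ for ^($m - 1);

    my $pos = 0;
    while $pos <= @h.elems - $m {
        # Compare right to left, starting with the last needle character.
        my $i = $m - 1;
        $i-- while $i >= 0 && @h[$pos + $i] eq @n[$i];
        return $pos if $i < 0;

        # Characters that never occur in the needle allow a full-length skip.
        $pos += %shift{@h[$pos + $m - 1]} // $m;
    }
    Nil
}

say horspool-index("hello world", "world");   # 6
say horspool-index("abcabcabd", "abcabd");    # 3
```

The speed-up comes from the last line of the loop: when the character under the end of the window does not occur in the small string at all, the search jumps forward by the whole length of the small string instead of by one.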
The next couple of sections discuss two broad categories for algorithmic improvement that are especially easy to accomplish in Raku. For more on this general topic, read the Wikipedia page on algorithmic efficiency, especially the 'See also' section near the end.
Change sequential/blocking code to parallel/non-blocking
This is another very important class of algorithmic improvement.
See the slides for Parallelism, Concurrency, and Asynchrony in Raku and/or the matching video.
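As a small taste of what this looks like, the sketch below runs the same work sequentially, then in parallel with hyper, then as promises created with start. The routine slow-square is a hypothetical stand-in for real work (it just sleeps briefly), and the batch and degree values are arbitrary choices for illustration.

```raku
# Hypothetical stand-in for an expensive or blocking operation.
sub slow-square(Int $n) { sleep 0.05; $n * $n }

# Sequential: each call waits for the previous one to finish.
my $t0  = now;
my @seq = (1..8).map(&slow-square);
say "sequential: { now - $t0 } seconds";

# Parallel map: hyper spreads the work over worker threads
# and keeps the results in their original order.
my $t1  = now;
my @par = (1..8).hyper(:batch(1), :degree(4)).map(&slow-square);
say "hyper:      { now - $t1 } seconds";

# Non-blocking: start returns a Promise right away; await collects results.
my $t2       = now;
my @promises = (1..8).map: -> $n { start slow-square($n) };
my @async    = await @promises;
say "start:      { now - $t2 } seconds";
```

Parallelism has its own overhead, so for very cheap operations the sequential version can still win. Profile before and after.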
Use existing high performance code
There are plenty of high performance C libraries that you can use within Raku and NativeCall makes it easy to create wrappers for them. There's experimental support for C++ libraries, too.
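Here is a minimal sketch of what a NativeCall wrapper looks like. It assumes a Unix-like system where the C library's getpid is already loaded into the process and where pid_t fits in a 32-bit integer. On other platforms you would name a library explicitly in the native trait.

```raku
use NativeCall;

# Declare the C function's signature. With no library name given,
# `is native` looks the symbol up among those already loaded.
sub getpid() returns int32 is native { * }

say getpid();            # process ID as reported by the C library
say getpid() == $*PID;   # True if it agrees with Raku's own view
```

The Raku-side declaration is all there is to the wrapper: no C glue code and no separate compile step.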
If you want to use Perl modules in Raku, mix in Raku types and the Metaobject Protocol.
More generally, Raku is designed to smoothly interoperate with other languages and there are a number of modules aimed at facilitating the use of libs from other langs.
Make the Rakudo compiler generate faster code
To date, the focus for the compiler has been correctness, not how fast it generates code or how fast or lean the code it generates runs. But that's expected to change, eventually... You can talk to compiler devs on the libera.chat IRC channels #raku and #moarvm about what to expect. Better still, you can contribute yourself:
Rakudo is largely written in Raku. So if you can write Raku, then you can hack on the compiler, including optimizing any of the large body of existing high-level code that impacts the speed of your code (and everyone else's).
Most of the rest of the compiler is written in a small language called NQP that's basically a subset of Raku. If you can write Raku, you can fairly easily learn to use and improve the mid-level NQP code too, at least from a pure language point of view. To dig into NQP and Rakudo's guts, start with NQP and internals course.
If low-level C hacking is your idea of fun, check out MoarVM and visit the libera.chat IRC channel #moarvm (logs).
Still need more ideas?
Some known current Rakudo performance weaknesses not yet covered in this page include the use of gather/take, junctions, regexes, and string handling in general.
If you think some topic needs more coverage on this page, please submit a PR or tell someone your idea. Thanks. :)
Not getting the results you need/want?
If you've tried everything on this page to no avail, please consider discussing things with a compiler dev on #raku, so we can learn from your use-case and what you've found out about it so far.
Once a dev knows of your plight, allow enough time for an informed response (a few days or weeks, depending on the exact nature of your problem and potential solutions).
If that hasn't worked out, please consider filing an issue about your experience at our user experience repo before moving on.
Thanks. :)