Journey toward a fast and good minifier

kdy1,
Back

This blog post will tell you about the techniques involved in making the minifier 100x faster, especially the name mangler. For an arbitrary large input, it now takes 2.5s while using one cpu core instead of 13s while using 10 cpu cores.

SWC has a minifier. For now, the SWC minifier shows good performance while having similar or better compression rate than terser. See https://github.com/privatenumber/minification-benchmarks for current scores.

But SWC minifier was quite slow in the past. The main reason is that, no one in the planet, including me, knows every characteristics of hygiene-based variable management system, which is something I invented for SWC. It may sound strange, but at the moment of inventing it, I didn't even know how to calculate big-O. At the moment, I was too young to be a CS student, and more importantly, SWC is my first serious project.

My journey

I had hard time making minifier faster. I learned lots of things while working on the performance issue of the minifier. I'll write them down on this post. Some of them are related to hygiene-based identifier management system, and now I think I should a write a formal assay about it. It allows dropping time complexity by a margin, so I think it can be applied to some passes in other compilers.

For the curious, the title will be

Scopeless scoping with span-hygiene

Reversed hash, but with iteration? (#3633)

I profiled the minifier using cargo profile instruments --release, which is a subcommand added by my cargo-profile. Also, I used three.js benchmark from esbuild for profiling.

Actually, I was working on performance issue of bundler, not the minifier.

In swc_bundler, I ran

MINIFY=1 ./scripts/instrument.sh ~/path/to/three.js/bundle

instrument.sh builds the example named bundle, and runs xcode instrument using the built executable and three.js bundle.

Bundled as 1 modules
Bundler.bundle() took 541ms
   INFO Done in 83ns, kind: "perf"
    at crates/swc_timer/src/lib.rs:32
    in inline global defs
    in minify

   INFO Done in 77.599583ms, kind: "perf"
    at crates/swc_timer/src/lib.rs:32
    in analyze
    in precompress
    in minify

   INFO Done in 97.278541ms, kind: "perf"
    at crates/swc_timer/src/lib.rs:32
    in precompress
    in minify

   INFO Running minifier in the normal mode
    at crates/swc_ecma_minifier/src/metadata/mod.rs:183
    in minify

   INFO Done in 73.851541ms, kind: "perf"
    at crates/swc_timer/src/lib.rs:32
    in analyze
    in compress ast
    in minify

   INFO compress: Running expression simplifier (pass = 1)
    at crates/swc_ecma_minifier/src/compress/mod.rs:275
    in optimize with , pass: 1
    in compress ast
    in minify

   INFO compress: expr_simplifier took 25.204666ms (pass = 1)
    at crates/swc_ecma_minifier/src/compress/mod.rs:297
    in optimize with , pass: 1
    in compress ast
    in minify

   INFO Done in 27.640625ms, kind: "perf"
    at crates/swc_timer/src/lib.rs:32
    in apply pure optimize
    in optimize with , pass: 1
    in compress ast
    in minify

   INFO Done in 77.462958ms, kind: "perf"
    at crates/swc_timer/src/lib.rs:32
    in analyze
    in apply full optimizer
    in optimize with , pass: 1
    in compress ast
    in minify

   INFO Done in 267.659041ms, kind: "perf"
    at crates/swc_timer/src/lib.rs:32
    in apply full optimizer
    in optimize with , pass: 1
    in compress ast
    in minify

   INFO Done in 320.593083ms, kind: "perf"
    at crates/swc_timer/src/lib.rs:32
    in optimize with , pass: 1
    in compress ast
    in minify

   INFO Done in 0ns, kind: "perf"
    at crates/swc_timer/src/lib.rs:32
    in optimize with , pass: 2
    in compress ast
    in minify

   INFO Done in 444.3625ms, kind: "perf"
    at crates/swc_timer/src/lib.rs:32
    in compress ast
    in minify

   INFO Done in 18.426625ms, kind: "perf"
    at crates/swc_timer/src/lib.rs:32
    in postcompress
    in minify

   INFO Done in 12.344981416s, kind: "perf"
    at crates/swc_timer/src/lib.rs:32
    in mangle names
    in minify

   INFO Done in 12.944418041s, kind: "perf"
    at crates/swc_timer/src/lib.rs:32
    in minify

Minification took 12974ms
Created output.js (5914kb)
Took 13587ms

Uh... So name mangling uses 12 seconds? I was sure that there was some serious problem in the name mangler.

instrument

Yeah, instrument is proving it. So I looked at the source code of name mangler.

fn can_rename(&self, id: &Id, symbol: &JsWord, renamed: &AHashMap<Id, JsWord>) -> bool {
    if cfg!(target_arch = "wasm32")
        || self
            .data
            .usages
            .iter()
            .chain(self.data.decls.iter())
            .count()
            <= 64
    {
        for used_id in self.data.usages.iter().chain(self.data.decls.iter()) {
            if *used_id == *id {
                continue;
            }

            if let Some(renamed_id) = renamed.get(used_id) {
                if renamed_id == symbol {
                    return false;
                }
            }
        }
    } else {
        if self
            .data
            .usages
            .par_iter()
            .chain(self.data.decls.par_iter())
            .any(|used_id| {
                if *used_id == *id {
                    return false;
                }

                if let Some(renamed_id) = renamed.get(used_id) {
                    if renamed_id == symbol {
                        return true;
                    }
                }

                false
            })
        {
            return false;
        }
    }

    if cfg!(target_arch = "wasm32") {
        for c in self.children.iter() {
            if !c.can_rename(id, symbol, renamed) {
                return false;
            }
        }

        true
    } else {
        self.children
            .par_iter()
            .all(|c| c.can_rename(id, symbol, renamed))
    }
}

This is the problematic code. While looking at the code, I found that it's doing something crazy. Actually it was doing reversed hash operation, but by the iterating over a vector (in parallel).

Ah... Fix was simple. I created a type named RenameMap, which tracks the mapping between Id and JsWord. As it also trakced the reverse direction, (JsWord -> Vec<Id>), it became much faster.

Result:

   INFO Done in 4.501971666s, kind: "perf"
    at crates/swc_timer/src/lib.rs:32
    in mangle names
    in minify

   INFO Done in 5.107085375s, kind: "perf"
    at crates/swc_timer/src/lib.rs:32
    in minify

Minification took 5137ms
Created output.js (5914kb)
Took 5727ms

Now it's much better.

Multiple hashmaps, but with identical role? (#3983)

The performance of the name mangler is quite improved, but it was still too slow. I ran profiling again, with the same command.

MINIFY=1 ./scripts/instrument.sh ~/path/to/three.js/bundle
   INFO  Done in 4.49787s, kind: "perf"
    at crates/swc_timer/src/lib.rs:32
    in mangle names
    in minify

   INFO  Done in 5.278398666s, kind: "perf"
    at crates/swc_timer/src/lib.rs:32
    in minify

Minification took 5309ms
Created output.js (5889kb)
Took 5891ms

instrument

Okay, checking if a hash map contains an element is O(1), so we are performing doing it too many times. How can we reduce it? While looking at code, I found that usages and decls in ScopeData are always used in identical way. All operations performed on two hashmaps are identical, so I merged it into all.

And I profiled after changing the code.


INFO Done in 2.942127541s, kind: "perf"
at crates/swc_timer/src/lib.rs:32
in mangle names
in minify

INFO Done in 3.730316875s, kind: "perf"
at crates/swc_timer/src/lib.rs:32
in minify

Minification took 3760ms
Created output.js (5889kb)
Took 4344ms

It worked!

Span hygiene is cool (#3988)

In this time, I used instrument.sh in swc_ecma_minifer, not swc_bundler. I used typescript as the input for the minifier.

I ran

./scripts/instrument.sh path/to/typescript.js

Result is

   INFO  Done in 360.108708ms, kind: "perf"
    at crates/swc_timer/src/lib.rs:32
    in compress ast
    in minify

   INFO  Done in 304.238125ms, kind: "perf"
    at crates/swc_timer/src/lib.rs:32
    in postcompress
    in minify

   INFO  Done in 2.227133208s, kind: "perf"
    at crates/swc_timer/src/lib.rs:32
    in mangle names
    in minify

   INFO  Done in 2.976443833s, kind: "perf"
    at crates/swc_timer/src/lib.rs:32
    in minify

Let's look at instrument result.

instrument

can_rename is using lots of time. To know why, I expanded it.

instrument expanxded

The recursion was the problem. Can we avoid it? The answer is surprisingly yes, and more surprisingly it's not hard to do.

Instead of checking hash maps in child scopes, we can clone all Ids in child scopes to the parent scope, before checking.


pub(super) fn prepare_renaming(&mut self) {
    self.children.iter_mut().for_each(|child| {
        child.prepare_renaming();

        self.data.all.extend(child.data.all.iter().cloned());
    });
}

Recursion again

We are going to remove recursion in hot execution path, by cloning all required information recursively in single-execution path.

Let's look at the result.

   INFO  Done in 361.856375ms, kind: "perf"
    at crates/swc_timer/src/lib.rs:32
    in compress ast
    in minify

   INFO  Done in 300.18725ms, kind: "perf"
    at crates/swc_timer/src/lib.rs:32
    in postcompress
    in minify

   INFO  Done in 841.693541ms, kind: "perf"
    at crates/swc_timer/src/lib.rs:32
    in mangle names
    in minify

   INFO  Done in 1.588692708s, kind: "perf"
    at crates/swc_timer/src/lib.rs:32
    in minify

Again, it worked!

This optimization is power of span-hygiene. I can store all variables within a single hashmap. Those identifiers can be still distinguished by their hygiene.

25M+ calls to HashSet#contains? Why? (#4342)

Let's start with profiling.

./scripts/instrument.sh path/to/typescript.js

Same command this time, from ./crates/swc_ecma_minifier.

Although it became much faster, it's quite strange. HashSet#contains is O(1). But 25% of total runtime? So... It means we are calling it too many times.

To know exact number, I added println!("all:contains") to the above of

if self.data.all.contains(left) {
    return false;
}

and counted it using the search feature of iTerm2. And I stopped it after 25M.

Obviously, this is not a number I expected. Not even close. Why the hell is it called 25M times?

After some thinking, I found that after mification, some symbols are used lots of time. e.g. a is used everywhere, because it's short. Lots of identifiers are mapped to a, because it's short. And we use one global reverse hashmap, As a result, we iterate over all of those identifiers which is mapped to same symbol.

We don't need to check all of them. Checking identifiers in parent scope is enough. So I decided to clone hash map for each scope.

Let's look at the result.

   INFO  Done in 382.930791ms, kind: "perf"
    at crates/swc_timer/src/lib.rs:32
    in compress ast
    in minify

   INFO  Done in 280.970916ms, kind: "perf"
    at crates/swc_timer/src/lib.rs:32
    in postcompress
    in minify

   INFO  Done in 188.407666ms, kind: "perf"
    at crates/swc_timer/src/lib.rs:32
    in mangle names
    in minify

   INFO  Done in 946.077ms, kind: "perf"
    at crates/swc_timer/src/lib.rs:32
    in minify

Cool! Finally the name mangler is fast enough.

Wait, 10 passes? (#4373, #4378)

I couldn't reproduce the numbers in https://github.com/privatenumber/minification-benchmarks

Finally, I configured an integration benchmark in the swc crate, and I could reproduce the number. But the benchmark in swc_ecma_minifer shows quite different result. I couldn't reproduce the number in swc/minify from swc_ecma_minifier/full.

While debugging it, I dumped various variables, copied source code, and did some lots of things. But I found one stange thing in swc/minify benchmark.

   INFO  Done in 241.057958ms, kind: "perf"
    at crates/swc_timer/src/lib.rs:32
    in optimize with pass: 10
    in compress ast
    in minify
    in Compiler.minify
    in minify

   INFO  Done in 2.593402583s, kind: "perf"
    at crates/swc_timer/src/lib.rs:32
    in compress ast
    in minify
    in Compiler.minify
    in minify

   INFO  Done in 44.142583ms, kind: "perf"
    at crates/swc_timer/src/lib.rs:32
    in postcompress
    in minify
    in Compiler.minify
    in minify

   INFO  Done in 189.090166ms, kind: "perf"
    at crates/swc_timer/src/lib.rs:32
    in mangle names
    in minify
    in Compiler.minify
    in minify

   INFO  Done in 2.922542875s, kind: "perf"
    at crates/swc_timer/src/lib.rs:32
    in minify
    in Compiler.minify
    in minify

   INFO  Done in 49.478041ms, kind: "perf"
    at crates/swc_timer/src/lib.rs:32
    in Compiler.print
    in Compiler.minify
    in minify

   INFO  Done in 3.210377125s, kind: "perf"
    at crates/swc_timer/src/lib.rs:32
    in Compiler.minify
    in minify

It performs full AST compression 10 times. What? But default pass limit should be 1. Those options are ported from terser... So I think I made 1 default?

It turns out that the default value of passes was 1 in CompressOptions, while the default value of passes in TerserCompressorOptions was 0. passes: 0 means no limitation, and that was the cause.

I changed the default to 3 instead of 1, because I think it's more reasonable considering the performance of swc minifier.

Conclusion

While working on this, I learned the importance of time complexity, and the way to optimize it. Interpreting the result of instrumented data requires considering time complexity. e.g. If an operation which is O(1) takes too much time, it's being called enormous amount of time.

Also, always profile. You can't guess performance.

Efforts like this are helping accelerate builds for projects like Next.js while making the Web faster for end users.

© DongYoon Kang.RSS