Improving the performance of set
There are a few tickets in Clojure 1.7 around the scope and performance of the set
and vec
functions, which create new sets and vectors from other collections - CLJ-1384, CLJ-1618, and CLJ-1546.
Note: The work in these tickets is included in Clojure 1.7.0-alpha5.
We’ll look here specficially at set
as it’s a bit simpler. First, there has been a long-standing request that set
should check whether the incoming collection is already a set, and if so avoid constructing a new set with the same elements. This is an important and common use case when we are trying to normalize an input parameter.
One reason this has not been done prior is that until now, calling set
on a set collection was guaranteed to return a new set instance. There is a small chance that existing code relies on this property. Another subtle side effect is that if a set has metadata, calling set on it will return a new set without that metadata. Returning the identical instance would also alter this behavior.
However, one option we eventually stumbled on is to avoid creating a new set and instead just replace the meta on the existing set with nil (creating a new instance of the set as well). This change retains the existing behavior while still providing a substantial boost in performance.
CLJ-1384 has previously been committed in Clojure 1.7 and improved the code inside the PersistentHashSet create() methods to use transient collections. However, we now have an even faster possibility for incoming collections that can reduce themselves (still with transients). In general, reduce should always give us the best path. Here, the updated version of set contains the above set check and the alternate implementation based on reduce:
If you’re wondering about the weird calls to .reduce
and reduce1
, this function is defined in core.clj before the protocol version of reduce
has been defined. Bootstrapping in core.clj gets weird sometimes.
We can see some significant performance improvement with this version of set
. The first column marks the time for doing (count (into #{} coll))
for comparison. For a long time the conventional wisdom has been to prefer into
over set
or vec
for performance. The other columns version the time for the equivalent (count (set coll))
. All times were measured with Criterium quick-bench.
coll | 1.6.0 into | 1.6.0 | 1.7.0-alpha4 | 1.7.0-alpha4 + patch |
---|---|---|---|---|
(set (range 100)) | 15.4 µs | 17.0 µs | 11.4 µs | 0.0 µs |
(vec (range 1000000)) | 360.7 ms | 702.5 ms | 391.1 ms | 358.6 ms |
(doall (range 1000000)) | 363.6 ms | 736.9 ms | 387.5 ms | 371.0 ms |
(doall (range 5)) | 404.9 ns | 612.3 ns | 481.9 ns | 445.9 ns |
As we can see, the times for set
vs into
are now similar, except for the special case where a set is passed as input (the normalization case). In this use case, set
should be strongly preferred.