Inside Clojure's Collection Model
What are the Clojure collection types? Most Clojurists would say lists, vectors, maps, and sets are all you need. You might add seqs too - they are collection abstractions (views) which can be backed by a collection, but might also be backed by something else. (See Sequences for lots more on this.)
But in Clojure there are also sorted maps, sorted sets, queues, primitive vectors and support for Java arrays, Java collections and more, not to mention custom types that you might create yourself using deftype. Things seem a bit messier now. And the answers to common predicates like
coll? may give answers that you find confusing for some of these collection types.
If you look a bit deeper, you’ll find that there is more to the collection model than is immediately apparent. Clojure provides a few different layers that form the foundation for what you use on a daily basis. The neatest part is that almost nothing in Clojure (compiler, runtime, or standard lib) actually depends directly on collection implementations, rather everything depends on abstractions that are open to extension. As a beginning Clojurist, you rarely tap into these extensions, but you will find them increasingly useful in more sophisticated programs.
Traits and Predicates
Perhaps the most important layer in the collection library are the traits (my term). Ideally traits would be defined as Clojure protocols, but their implementation dates back to the earliest days of Clojure, well before protocols existed and as such they are defined by Java interfaces.
Some of the key trait classes (all in
Counted- countable collections
Counted, allows index-based lookup
nth(int i, Object notFound)
Sequential- marker interface for sequential collections
assoc(Object key, Object val)
- via parent
valAt(Object key, Object notFound)
Sorted- marker interface for sorted
Seqable- for collections that can produce a sequence
Reversible- for collections that can produce a reversed sequence
Each of these interfaces describes a very narrow slice of capability that a collection might have. You’ll find that most of the Clojure predicates are simply checks for whether a class implements one of these interfaces:
counted?- checks whether instance of
indexed?- checks whether instance of
sequential?- checks whether instance of
associative?- checks whether instance of
sorted?- checks whether instance of
reversible?- checks whether instance of
The ones missing above are
ILookup is really a narrow subset of
Associative so that’s the more common predicate.
Seqable is kind of tricky as Clojure goes some distance to patch “seq-ability” into things not otherwise seqable, like strings and arrays. Again, if Clojure had protocols at the time of its creation, a “seqable” protocol could be extended to closed types like these directly. See the sequences post for more on
The important thing to note here is that most of the collection functions in clojure.core or clojure.lang.RT (the Clojure runtime) operate in terms of these traits, NOT in terms even of collection interfaces or ever in terms of concrete collection types. Tracing this requires a bit of hopping. Most of the standard collection functions (
seq) have obvious mappings to the interface methods above.
If you look at the source for most Clojure core functions (like
count), you’ll see that they largely just forward to a method on the
RT class. If you then look at RT.count(), you’ll see that most of these invoke the method above on the Java interface as their first implementation (with some backups to alternate implementations for special cases). Again, protocols would have made this cleaner.
The benefit of this approach is that you can create a new type that implements any set of these traits, and it will work with the existing Clojure library with no changes to Clojure!
The core Clojure collections are also implemented first in terms of interfaces. Most of these interfaces primarily define the set of traits a persistent collection type should include. (I’m glossing over the details a bit because how this happens via the class hierarchy is a little confusing.) A good example though is IPersistentVector, which extends
Indexed. There are similar interfaces for
IPersistentList, and a higher level interface,
IPersistentCollection. I’m not going to talk about it, but for completeness, there is also
IPersistentStack which is visible in clojure.core via
These interfaces also have corresponding predicates that just check the type:
map?. And similarly, functions in the core library or runtime never check a concrete type and only depend on these interfaces (other than for the purposes of needing to construct a concrete instance).
Finally, there are a number of concrete implementations (actually several per type):
PersistentList- typical list
PersistentList$EmptyList- special-cased empty list
PersistentQueue- a queue implementation
PersistentVector- typical vector
MapEntry- a map entry acts as a 2-element persistent vector
SubVector- created via
subvec, really a view over a source vector
clojure.core/Vec- the primitive vector implementation
PersistentHashSet- typical set
PersistentTreeSet- sorted set
PersistentArrayMap- array-based map used for 0-8 pairs
PersistentHashMap- typical hash map
PersistentTreeMap- sorted map
- all defrecords produce instances of
But what about sequences - how do they play in here? This is where things get a little tricky.
Sequences represent a logical immutable collection. When sequences are obtained from collections they are backed by the data in the collection (I think of them as “views” of the collection) but they can also be generated from a function on demand or obtained from a variety of other sources (strings, arrays, iterators, etc).
The important thing though is that sequences are again defined by interfaces in the Java implementation:
ISeq- the sequence abstraction, checked via the
Seqable(mentioned above) - something that can produce a sequence
Here’s the twist:
IPersistentCollection. So sequences are also treated as (virtual) collections, but not treated as
IPersistentLists. This can be confusing because both sequences and lists print with parentheses as if they are lists.
Just like with collections, there are lots of sequence implementations:
LazySeq- a lazy sequence
Cons- the result of a
ChunkedCons- the result of a chunked lazy sequence like
ArraySeq- sequence over an array
StringSeq- sequence over a string
- and many, many more!
Clojure collections define equality based on a set of equality partitions: sequentials, maps, and sets. Equality is always based on a value comparison - is the collection in the same partition, and does the collection contain the same values?
This is a place where Clojure departs from many other languages. Clojure’s collections are 1) immutable and 2) check equality by value. These factors allow you to treat Clojure’s collections as composite values, and retaining all the valuable properties of immutable simple values (like numbers, keywords, etc).
Sequential collections (lists, vectors, sequences, and anything marked sequential) are compared from beginning to end, element by element for equality. Sequential collections do not include collection type as a comparison criteria.
This is occasionally not what you want (for example, when testing a return value). However, making a single partition for all sequential collections is one of the great (pragmatic) choices in Clojure for improving ease of use, particularly in blurring the lines between sequences and collections.
Sets are equal if they contain exactly the same set of elements with no extra elements in either set. [By “same”, I mean “= returns true”, not an identity comparison.]
Maps are equal if they contain exactly the same set of keys and each key returns the same element.
Hopefully this provided a good overview of how the Clojure collections are implemented under the hood and integrated with the Clojure runtime via abstractions. Those abstractions are also available for you to plug into, and perhaps that’s a good subject for a follow-up post.