When Modules Are Not Just Namespaces

July 25, 2022 Modules, Types, Inheritance

It is time for Cone to get a proper module system. This design space is complex and rife with historical missteps. In order to distill the topography of the landscape and clarify the key requirements, I felt it necessary to begin the journey with cross-language research and contemplation.

I wound up pursuing three rabbit-chasing adventures in module wonderland:

What do programmers want from modularity? You can read about this adventure in my earlier post on modularity, which captures what we want and summarizes how the three modularity capabilities are surfaced across different layers of programming language features (and program decomposition). For modules, the key takeaway is that namespacing and encapsulation alone are insufficient: interface-based substitution and multi-use generation also offer value. Few languages offer this.
What does Graydon Hoare mean by: Most languages have module systems with serious weaknesses? I capture the start of this adventure in my Reddit post. This led to researching various dialects of SML, including OCaml, MixML, and 1ML, as well as understanding how ML modules differ from the type class approach of Haskell¹. Understanding prior art in this design space helped clarify for me how to enrich modules in ways that avoid known potholes.
What role should modules play in the build system? It is sometimes tempting to consider modules as “just” a namespace mechanism that distinguishes named collections of unique names. This simplistic understanding doesn’t just ignore the power of module polymorphism, it also ignores the reality that modules are the only high-level language feature that explicitly spills out from the compiler into the build system. A good design for modules needs clarity on the role modules play with respect to packages, source files, separate compilation, and public interface publishing.

It’s crazy how hard I wrestled with these issues. I wrestled because the implementation effort turns out to be surprisingly invasive. I searched, in vain, for a simpler approach. There isn’t one. Cone must take the same laborious path taken by languages like Java, C# and Rust. I do not look forward to the work and complexity this brings to the compiler and build system.

In the next two posts, I want to share what I learned, what decisions I made, and why. This post focuses on foundational design decisions surrounding modules and the build system. The next post will drill into design decisions around module polymorphism.

What kind of namespace does a Cone module offer?

As with most languages that offer modules or namespaces (e.g., C++, Rust, C#, Java), a Cone module is a named namespace owner for some collection of named, top-level entities, such as:

Global variables
Functions
Types
Macros

This is in contrast to Cone’s nominal types (e.g., struct) which are also namespaces. Nominal type namespaces carry a somewhat different collection of named entities: fields, methods, etc.

Every named entity (except, notably, overloaded functions) has a unique name within the module’s namespace. Defining a duplicate name triggers a compile-time error.

Like C++ and Rust, module name qualification makes it possible for one module to refer to an entity in another module. For example, std::print would be used to refer to the print function in a module named std. Namespace-folding (discussed later) can be used to avoid explicit module name qualification.
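
As a rough illustration, here is a minimal sketch in Rust rather than Cone (Cone's own syntax may differ). It shows two modules that each own a function named `shout` without colliding, and one module reaching into the other through a qualified name. The module names `text` and `report` and their functions are hypothetical.

```rust
// Hypothetical module standing in for a library module such as `std`.
mod text {
    // Each name is unique within this module's namespace; defining a
    // second `shout` here would be rejected at compile time.
    pub fn shout(s: &str) -> String {
        s.to_uppercase()
    }
}

mod report {
    // The same name can exist in another module without any collision.
    pub fn shout(s: &str) -> String {
        format!("{}!", s)
    }

    pub fn headline(s: &str) -> String {
        // Qualified name: this is `shout` from module `text`,
        // not the `shout` defined just above.
        super::text::shout(s)
    }
}
```

Calling `report::headline("cone")` goes through `text::shout`, while `report::shout("cone")` uses the local definition.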

What is the relationship between modules and packages?

Packages are libraries of code that another piece of code can retrieve, import and use via its public interface. A package is typically composed of multiple source files.

Where do Cone modules fit in this scheme? The simplest approach, which Cone takes, is that every package is a module. Packages are a natural inflection point for modularity, where we clearly need a well-crafted public interface for logic that we want packaged together, and whose ongoing maintenance is separately handled by different teams of people. Indeed, the predominant use case for modules in other languages is building and using library packages.

From this perspective, it is clearly a blessing that every package has its own namespace. This way, when we combine multiple packages together into a program or library, the public names of one package won’t collide with objects of the same name in other packages.

Viewed the other way, a Cone module is always essentially a package, even if never formally and publicly published. By viewing modules as not just namespaces, but also packages, several corollary decisions flow naturally:

Module >= Source File. A module is implemented by one or more source files. For example, it is a common pattern to implement each of a module’s types in its own source file. To state this idea in reverse: A source file is wholly owned by exactly one module, it cannot provide logic for multiple modules.
Modules are single-level. A module cannot be composed of sub-modules. As namespaces, they go only one level deep. Although it can be helpful for a module name to suggest some sort of multi-part semantic taxonomy (e.g., System.Collections.Concurrent), forcing package management and build software to manage multi-level package names would add a lot of unnecessary complexity.
DAG module dependencies. Module dependencies should form a directed acyclic graph (DAG). From a package management perspective, recursive modules introduce problems that are largely avoidable with good design.
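
One practical consequence of the DAG rule is that a build tool can always find an order in which every module is built after its dependencies, and can reject a cycle outright. The sketch below shows one way to do that with Kahn's topological sort. It is written in Rust as an illustration only; the function name `build_order` and the shape of the `deps` map are assumptions, not part of Cone's build system.

```rust
use std::collections::{BTreeMap, VecDeque};

/// Returns a build order (dependencies first) if the module graph is a DAG,
/// or `None` if some modules depend on each other in a cycle.
/// `deps` maps each module name to the modules it depends on (hypothetical shape).
pub fn build_order(deps: &BTreeMap<&str, Vec<&str>>) -> Option<Vec<String>> {
    // Count unbuilt dependencies per module, and record who depends on whom.
    let mut pending: BTreeMap<&str, usize> = BTreeMap::new();
    let mut dependents: BTreeMap<&str, Vec<&str>> = BTreeMap::new();
    for (&module, needs) in deps {
        *pending.entry(module).or_insert(0) += needs.len();
        for &dep in needs {
            // Register modules that only ever appear as a dependency.
            pending.entry(dep).or_insert(0);
            dependents.entry(dep).or_default().push(module);
        }
    }

    // Start with the modules that depend on nothing.
    let mut ready: VecDeque<&str> = pending
        .iter()
        .filter(|(_, n)| **n == 0)
        .map(|(&m, _)| m)
        .collect();

    let mut order = Vec::new();
    while let Some(module) = ready.pop_front() {
        order.push(module.to_string());
        if let Some(users) = dependents.get(module) {
            for &user in users {
                let n = pending.get_mut(user).expect("every dependent was registered");
                *n -= 1;
                if *n == 0 {
                    ready.push_back(user);
                }
            }
        }
    }

    // Any module that never became ready sits on a dependency cycle.
    if order.len() == pending.len() {
        Some(order)
    } else {
        None
    }
}
```

With `app` depending on `std` and `json`, and `json` depending on `std`, the function yields `std`, `json`, `app`. If `a` and `b` depend on each other, it returns `None`.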

… rejecting Rust’s (sub-)source file modules

In making the above design decisions, I explored and ultimately rejected Rust’s different approach to modules. With Rust, modules are primarily just namespaces, but can (sometimes) be packages. In particular,

Module <= Source File. A source file can define multiple modules, but a module is strictly never larger than a source file.
Modules are multi-level. A module can be composed of multiple submodules, and so on.
Modules can be recursive, in that module A can refer to module B which refers to module A.

Since Rust packages are implemented as modules, does this mean a package must be implemented wholly within a single source file? Fortunately, no. Rust makes it possible to create packages whose logic is spread out across multiple source files. But since every source file in a package is itself a module, we must use various language features (e.g., mod, use) and patterns (facades and preludes) to stitch multi-level modules together into a single top-level module that becomes the package.
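
The stitching described above can be sketched as follows. In a real crate each inner `mod` would usually live in its own source file; the sketch writes them inline so that it stays self-contained. The names `shapes`, `circle`, `square`, and the re-exported aliases are hypothetical.

```rust
// Top-level module that acts as the package's public face.
pub mod shapes {
    // Each inner module stands in for one source file of the package.
    mod circle {
        pub fn area(radius: f64) -> f64 {
            std::f64::consts::PI * radius * radius
        }
    }

    mod square {
        pub fn area(side: f64) -> f64 {
            side * side
        }
    }

    // Facade: re-export selected items so that users see one flat module
    // instead of the internal module-per-file layout.
    pub use self::circle::area as circle_area;
    pub use self::square::area as square_area;
}
```

Callers write `shapes::square_area(3.0)` and never see that `circle` and `square` exist. The cost is that the package author has to maintain the re-export lines by hand.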

Rust’s module system is more versatile than Cone’s. It is also more confusing and complicated to use. To learn more about this, read “Revisiting Rust’s modules” or examine the complicated nature of Rust’s evolving module paths, which it needs to facilitate module recursivity, which in turn is needed to overcome the problems that arise from limiting a module to a single source file.

I ended up rejecting Rust’s approach for Cone simply because I do not see the value of forcing every source file to be a separate module. Further: How often do I need modules to be multi-level? Is it even good design to support recursive modules? I could never find enough incremental value in Rust’s module versatility to justify this much added user confusion and complexity.

Should modules independently define their public interface?

C and C++ explicitly separate declarations (public interface) and definitions (implementation) for the same entities:

Declarations are found in include files (.h). They provide summary information about public entities, such as public global variables and their type, public types/classes (specifying their structure and the signatures of all methods), and public macros.
Definitions are found in source files (.c, .cpp). They reiterate (nearly) all of the above, while providing additional implementation details for all public and private entities, such as the initial value for global variables and implementation logic for functions and methods.

There are benefits from being this explicit and concise about a module’s public interface (API). Since include files are much smaller than implementation source files (for obvious reasons), it is easier for a human programmer to learn about the module’s public interface. It is also easier on the compiler, as it allows the compiler to compile one source file at a time, using the smaller include files to obtain type and other information about public entities defined elsewhere that this source file depends on.

However, this approach impedes developer productivity, because most of the information captured in include files must be redundantly specified in the corresponding source files. Programmers waste time during code creation and follow-on maintenance specifying the same information twice, only to correct it when the compiler tells them they did so inconsistently.

A long time ago, I determined that Cone, like many other languages, should not define public interfaces via separate declaration (include) files. Instead, public interface information would be conveyed by marking every named entity in module source files as public or private. A module’s public API would then implicitly correspond to all its public entities.
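
Rust follows the same general idea, so it can serve as a minimal sketch of what an implicit public interface looks like (Cone's own markers may be spelled differently). Everything marked `pub` forms the API; everything else stays private, and nothing is declared twice. The `counter` module and its contents are hypothetical.

```rust
// The module's public interface is simply the set of items marked `pub`.
pub mod counter {
    // Public type with a private field: callers can hold a `Counter`
    // but cannot read or write `count` directly.
    pub struct Counter {
        count: u32,
    }

    impl Counter {
        pub fn new() -> Counter {
            Counter { count: 0 }
        }

        pub fn increment(&mut self) -> u32 {
            self.count = clamp(self.count + 1);
            self.count
        }
    }

    // Private helper and constant: no `pub`, so they are not part of the
    // API and need no separate declaration anywhere.
    fn clamp(n: u32) -> u32 {
        n.min(MAX)
    }

    const MAX: u32 = 3;
}
```

Here the public API is `Counter`, `Counter::new`, and `Counter::increment`. The signature of each is written exactly once, next to its implementation.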

Unfortunately, implicit public interfaces on modules can slow down the compiler. Without smaller include files, compiling some source file S requires that the compiler load, parse and type-check larger source files (with all their implementation details) to extract interface information that S depends on to compile correctly. Although obtaining this information from implementation files is doable, it is significantly more computationally expensive than obtaining the information from interface files.

Is there a way to speed this up? For inter-package dependencies (as implemented by C#, Rust and others), best practice is for the compiler to extract and separately save every package’s public interface as part of the process of building each package, much as C/C++ pre-compiled headers do. Compiling S then needs to load only each package’s pre-built public interface.
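
The extraction step can be pictured with a deliberately tiny model. The sketch below is written in Rust and is not how any particular compiler stores its interface data: the `Entity` struct, its fields, and the one-line-per-entity text format are all assumptions made for illustration. A real interface file would carry far richer semantic content.

```rust
/// A toy stand-in for a parsed top-level entity (hypothetical; a real
/// compiler would hold IR nodes rather than strings).
pub struct Entity {
    pub name: &'static str,
    pub signature: &'static str,
    pub is_public: bool,
    /// Implementation detail; never copied into the interface.
    pub body: &'static str,
}

/// Extracts a package's public interface as text: one line per public
/// entity, sorted by name so the output is stable from build to build.
pub fn public_interface(entities: &[Entity]) -> String {
    let mut lines: Vec<String> = entities
        .iter()
        .filter(|e| e.is_public)
        .map(|e| format!("{}: {}", e.name, e.signature))
        .collect();
    lines.sort();
    lines.join("\n")
}
```

The saved summary leaves out private entities and all bodies, which is what makes it cheaper to load than the implementation source.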

Although this approach improves developer productivity twice over, the compiler writer (me) bears a heavy cost for this convenience. It is not a trivial effort to add to the compiler the ability to ingest, preserve, and re-ingest public interface information from source files, serialized in a rather complex format. What makes this format so complex is that it carries all the considerable semantic content of a parsed public interface, roughly equivalent to the full collection of IR nodes generated after parsing.

What is the compilation unit: source file or module?

In C and C++, each source file is compiled separately. This facilitates these benefits:

Incremental builds. Often, one only needs to recompile the source files that were modified, not all of them. This speeds up re-builds.
Selective linking. When a program links in logic from libraries, it includes only those compiled object files that it actually depends on. This keeps the program executable as small as possible. For example, if the program uses only basic floating-point arithmetic, it need not include separately-compiled logic for calculating logarithmic or trigonometric values.

What makes it possible for each C++ source file to be compiled separately? None of them depend on each other for compile-time information. Everything they need to know about their dependencies on other source file(s) can be found in #included header files.

But if we don’t have separate include files, then how do we handle a source file that refers to a type or other entity implemented in another source file in the same package? Attempting to inject sufficient “import” information into each source file for all its external dependencies would only worsen our maintenance and redundancy issues.

The only sensible approach that allows multiple source files in the same module to refer to entities in each other is to compile all of a module’s source files together, at the same time. In effect, the module itself becomes the compilation unit: all its source files are loaded, parsed, type-checked with each other, and generated into separate object files.

Doing this means we lose some degree of incremental build capability at the source file level. However, as long as we still generate separate object files for each source file, we do not lose the selective linking benefit described earlier.
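
To see why compiling the whole module together makes cross-file references work, consider a toy two-pass check, sketched here in Rust. The `SourceFile` model, the `check_module` function, and the file names are hypothetical. A real compiler resolves far more than bare names, but the ordering is the point: gather every file's definitions first, then check every file's references against the combined set.

```rust
use std::collections::BTreeSet;

/// A toy source file: the names it defines and the names it refers to
/// (hypothetical model, not Cone's actual data structures).
pub struct SourceFile {
    pub path: &'static str,
    pub defines: Vec<&'static str>,
    pub uses: Vec<&'static str>,
}

/// Treats the module as one compilation unit and returns every
/// unresolved reference as a (file path, name) pair.
pub fn check_module(files: &[SourceFile]) -> Vec<(&'static str, &'static str)> {
    // Pass 1: collect definitions from all files before checking any of them.
    let defined: BTreeSet<&str> = files
        .iter()
        .flat_map(|f| f.defines.iter().copied())
        .collect();

    // Pass 2: a reference may resolve to a definition in any file of the module.
    let mut unresolved = Vec::new();
    for file in files {
        for &name in &file.uses {
            if !defined.contains(name) {
                unresolved.push((file.path, name));
            }
        }
    }
    unresolved
}
```

Two files that refer to each other's types both pass, with no include files and no per-file import lists. Only a name that no file in the module defines is reported.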

How should namespace folding work?

As mentioned earlier, a module can use qualified names to uniquely refer to entities defined in another module. However, code can look cluttered if use of imported names must always be fully qualified. When names are unambiguous, why require them to be namespace-qualified?

Namespace folding is a solution to this problem. It allows the programmer to explicitly “fold in” all the names of an imported package via a ‘using’ statement, and verify there are no ambiguities. The benefit of the ‘using’ statement is further improved by supporting the re-aliasing or exclusion of folded names.
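
Rust's `use` declaration offers a comparable facility, so it can illustrate folding and re-aliasing. This is a sketch only: Cone's ‘using’ statement may look and behave differently, and the modules `geometry`, `plots`, and `survey` are hypothetical.

```rust
mod geometry {
    pub fn area(w: u32, h: u32) -> u32 {
        w * h
    }
    pub fn perimeter(w: u32, h: u32) -> u32 {
        2 * (w + h)
    }
}

mod plots {
    // Same name as `geometry::area`, different meaning.
    pub fn area(plot_count: u32) -> u32 {
        plot_count * 100
    }
}

mod survey {
    // Fold in a name so it can be used unqualified.
    use super::geometry::perimeter;
    // Re-alias folded names to sidestep the clash between the two `area`s.
    use super::geometry::area as rect_area;
    use super::plots::area as plots_area;

    pub fn fence_length(w: u32, h: u32) -> u32 {
        perimeter(w, h)
    }

    pub fn covers(w: u32, h: u32, plot_count: u32) -> bool {
        rect_area(w, h) >= plots_area(plot_count)
    }
}
```

Inside `survey`, the folded names read as if they were local, and the aliases make it clear which `area` each call means.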

As with many languages, Cone supports namespace-folding at the source file level. Although ‘using’ statements could be processed more efficiently once per package, that would force both the programmer and the IDE to keep a single, separate set of ‘using’ statements accurate and applicable across multiple source files.

As a side note, I find it intriguing that applying a similar name-folding capability to types yields the delegated inheritance capability planned for Cone.

Summary

Here is a summary of the foundational design decisions for Cone’s module system:

A module is a top-level namespace for global variables, functions, macros, types, etc.
A module is package-sized, not source file-sized.
- Module >= Source File.
- Modules are single-level.
- Module dependencies should form a directed acyclic graph (DAG).
A module’s declarations are inferred from its definitions.
- Building a package creates a “pre-compiled” public interface file.
A module is the compilation unit, not a source file.
A module supports namespace-folding at the source file level.

Looking across all the design decisions described so far, you will find no innovation. The approach I have selected for Cone’s modules has already been pioneered and battle-tested by other languages (e.g., C#). At least now I understand how modules are more than just namespaces, and what I must do to implement the features described so far.

This is not the end of Cone’s module story, as I have not yet talked about polymorphic modules. That is the subject of my next post.

¹I am particularly grateful for the helpful guidance and resources offered by typesanitizer, razzberry and EashenHatti.