Sat, 24 May 2025 10:13:10 -0500
What's Next? 2025

mr's Preposter.us Blog

I happened across an article with a few quotes from Seymour Cray and it reminded me that I was put here to make computers fast.  It's hard to think about the future of anything right now, but I have to remind myself that I grew up with Shiva and I somehow managed to imagine the future then, so there's no excuse not to do it now.

I've been fascinated with high-performance computing my entire life, and in my professional career I've leveraged this passion to the degree possible while holding down jobs to make a living.  I've built several small-scale or model supercomputers in my own time, but I never executed a design that leverages everything I've learned to build a Top500-ranking machine, and as of late I've lost interest because the two current applications for high-performance computing are hyperscaling (AWS, Google, etc., a la "the cloud"), which is boring, and LLMs, generative models, and other forms of harmful AI.

So here I'll describe my design for a next-generation supercomputer.  The machine I describe is for the future, a time when these contemporary HPC applications fade into obscurity (and are perhaps recognized as mistakes).  What applications will this new machine be for?  I'm not sure, but likely the kinds of applications needed to rebuild failed nations and collapsed ecosystems.


Dynamic Application-Specific Logic (DASL)
Most, if not all, computers you're familiar with use one or more general-purpose CPUs.  These processors work by executing a fixed set of instructions encoded in logic and microcode inside the processor.  Lists of these instructions are read from memory, decoded, and applied to data stored in other areas of memory.  The set of instructions the processor has available is fixed at the time it is designed, so the instructions tend to be fairly generic (compared to the wide range of things people do with computers), and any software running on these computers has to go through a number of translations to fit within this limited set of commands.  This is a very clever way to solve the problem of using one processor design to run an endless number of applications (thanks, Turing!), but from a performance perspective it's not ideal.
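
To make that concrete, here's a toy sketch of the fetch/decode/execute loop every program must be squeezed through.  The two-instruction set, register names, and program are invented for illustration; Python is just a stand-in for the hardware:

```python
# A toy general-purpose processor: a fixed instruction set,
# and every program must be translated down to it.

def run(program, memory):
    pc = 0  # program counter: which instruction to fetch next
    while pc < len(program):
        op, a, b, dest = program[pc]          # fetch
        if op == "ADD":                       # decode + execute
            memory[dest] = memory[a] + memory[b]
        elif op == "MUL":
            memory[dest] = memory[a] * memory[b]
        elif op == "HALT":
            break
        else:
            raise ValueError(f"unknown instruction: {op}")
        pc += 1

mem = {"x": 3, "y": 4, "t": 0, "r": 0}
# r = (x + y) * x, expressed as a list of generic instructions
run([("ADD", "x", "y", "t"),
     ("MUL", "t", "x", "r"),
     ("HALT", None, None, None)], mem)
print(mem["r"])  # 21
```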

So for demanding applications, specialized processors have been designed to more closely fit the application.  A common example is the GPU, used to make graphics-intensive applications like video games faster.

However, the instructions of both CPUs and GPUs are "fixed" at the factory, and creating new processors to match a wider range of applications is so expensive that it's almost never practical.  But what if we created a specialized processor for every application, where each instruction needed by the program matched 1:1 with an electrical path in the processor, and the processor contained no additional, unused instructions?  That's Dynamic Application-Specific Logic, or DASL.
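
As a stand-in for synthesized hardware, here's the same r = (x + y) * x computation from the sketch above, this time as dedicated logic: each operation exists because the program needs it, and nothing else exists.

```python
# Sketch: r = (x + y) * x as application-specific logic.  There is
# no instruction fetch or decode, and no unused operations sit idle
# anywhere in the "fabric".

adder = lambda x, y: x + y        # one dedicated adder path
multiplier = lambda a, b: a * b   # one dedicated multiplier path

def circuit(x, y):
    # Wiring is fixed at "synthesis" time; data just flows through.
    return multiplier(adder(x, y), x)

print(circuit(3, 4))  # 21
```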

The hardware to do this exists today in the form of Field-Programmable Gate Arrays (FPGAs), and they are currently used to accelerate particularly time-consuming tasks in a way similar to how GPUs are used for graphics.  However, "synthesizing" FPGAs is very different from the kind of programming most programmers are used to doing, and existing FPGA hardware is more expensive and more limited than general-purpose CPUs, so it is only used for very specialized tasks and requires specialized skills.

DASL, as implemented by this new machine, addresses these problems in two ways.  The first is by designing new hardware that works like an FPGA but is dramatically larger, allowing the devices to synthesize enough logic to contain entire applications.  The second is by providing a programming environment which makes the synthesis of such logic completely automatic and transparent to the programmer.  I'll go into detail about both the hardware and software responsible for this later.


Asynchronous Clockless Design (ACD)
The general-purpose CPU described above marches along to the beat of a clock.  Each time the clock ticks, the CPU executes a command.  If there's no command to execute, or if the CPU is waiting for some other device (memory, a display, a disk, etc.) the CPU waits, but the clock keeps ticking, consuming energy.

An asynchronous circuit design, on the other hand, has no clock.  When input to the "processor" is ready, it processes the input and provides the output.  When no input is ready, the processor is essentially stopped.  This design has the advantage of being much more energy efficient than a traditional clock-sequenced design, as well as having more predictable throughput.  The reason these designs are not widely used is twofold: they are not as flexible as general-purpose CPUs, and they require specialized design skills to implement.  As discussed above, specialized processor designs are expensive to make at the factory and are therefore only justifiable for particularly demanding (and financially valuable) uses.
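
Here's a minimal sketch of that data-driven behavior (the Node class and its wiring are invented for illustration): each node fires the moment its last input arrives and is otherwise inert, so nothing "ticks" while it waits.

```python
# Sketch: data-driven (clockless) execution.  A node holds no clock;
# it fires when its last input arrives, and is otherwise idle.

class Node:
    def __init__(self, fn, n_inputs, downstream=None):
        self.fn = fn
        self.inputs = [None] * n_inputs
        self.downstream = downstream  # (node, port) to forward the result to

    def receive(self, port, value):
        self.inputs[port] = value
        if all(v is not None for v in self.inputs):   # all inputs ready?
            result = self.fn(*self.inputs)            # fire once, then go idle
            self.inputs = [None] * len(self.inputs)
            if self.downstream:
                node, port = self.downstream
                node.receive(port, result)
            else:
                print("output:", result)

mul = Node(lambda a, b: a * b, 2)
add = Node(lambda a, b: a + b, 2, downstream=(mul, 0))

# Inputs can arrive at any time, in any order; nothing ticks in between.
add.receive(0, 3)   # x
mul.receive(1, 3)   # x again, waiting at the multiplier
add.receive(1, 4)   # y arrives -> adder fires -> multiplier fires -> 21
```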

DASL addresses both of those reservations, unlocking the power of a clockless processor for general-purpose applications.  By allowing software development tools to drive the physical processor implementation, that implementation can take any form, making the adoption of a clockless processor completely transparent to the programmer.

Processors of this design are capable of delivering the result of a function as quickly as input can be supplied, subject only to propagation delays through the physical logic (limited by the speed at which signals propagate through the material, plus some additional resistance, unless superconductors are involved).  Compared to traditional clocked CPU implementations, which can require more than one clock cycle for a single CPU instruction, functions implemented this way can be orders of magnitude faster.
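
A back-of-the-envelope comparison makes the point; every number below is an illustrative assumption, not a measurement of any real part:

```python
# Back-of-the-envelope: clocked vs. clockless latency for one small function.
# All numbers here are illustrative assumptions, not measurements.

clock_hz = 3e9                 # a typical 3 GHz clocked CPU
cycle_ns = 1e9 / clock_hz      # ~0.33 ns per cycle
cycles_needed = 4              # assume the function costs a few instructions,
                               # each taking at least one cycle
clocked_ns = cycles_needed * cycle_ns

gate_delay_ns = 0.02           # assume ~20 ps per logic stage
logic_depth = 10               # assume a 10-gate-deep combinational path
clockless_ns = logic_depth * gate_delay_ns

print(f"clocked:   {clocked_ns:.2f} ns")   # ~1.33 ns
print(f"clockless: {clockless_ns:.2f} ns") # ~0.20 ns, bounded only by propagation
```

The gap grows with the number of instructions a clocked machine needs per function, which is where the orders-of-magnitude claim comes from.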


Combined Data and Processing (CDP)
Existing designs separate instruction processing (CPU) and data storage (RAM) with an interconnect (bus), typically in completely separate physical packages.  Some multiprocessor designs group these pairs closely (NUMA nodes) to keep the data a particular processor is working with physically nearby, reducing latency between the CPU and RAM, and these CPUs make extensive use of cache memory to make the most commonly-used data appear inside the CPU.  But these machines still spend a lot of time moving data back and forth over these connections, and in the case of cluster-based supercomputers, these connections are completely external to the chassis, incurring orders of magnitude more latency and reduced bandwidth.

The idea behind data locality is to keep the data a program is using close to the processor that is working with it.  There has been a lot of brilliant work done in this area since the dawn of cluster-based supercomputers, because it is critical to getting the most performance out of cluster architectures.  But however clever these systems and tools are, they will always be limited by the CPU->bus->RAM path at best and the CPU->bus->RAM->network->CPU->RAM path at worst.
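
For a rough sense of scale, here are commonly cited ballpark figures for what each hop costs.  These are assumed orders of magnitude, not measurements of any specific machine:

```python
# Ballpark latencies for a single data access, in nanoseconds.
# Rough orders of magnitude only; real numbers vary widely
# by hardware and interconnect.
approx_latency_ns = {
    "on-die cache (L1)":                        1,
    "last-level cache":                        10,
    "local RAM over the bus":                 100,
    "RAM on a remote NUMA node":              200,
    "another node over a fast cluster link": 2_000,
}
for hop, ns in approx_latency_ns.items():
    print(f"{hop:<40} ~{ns:>6,} ns")
```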

This new machine solves this problem by eliminating the separation of processor and data through its ability to dynamically synthesize the hardware to fit the application.  The memory used to store data can be implemented on the same physical die as the logic used to process it.  When the application's needs change, the implementation of the processor/memory layout can change, eliminating the need to "fix" this relationship at design time in the factory.
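
As a rough sketch of the idea (the Cell abstraction here is invented for illustration), imagine every piece of application data stored inside the one piece of logic that operates on it, so there is no bus to cross:

```python
# Sketch: combined data and processing.  Each "cell" pairs a piece of
# application data with the one operation that works on it; nothing
# generic sits idle and nothing travels over a shared bus.

class Cell:
    def __init__(self, value, fn):
        self.value = value   # the data, stored "inside" the logic
        self.fn = fn         # the one operation synthesized next to it

    def step(self, arg):
        self.value = self.fn(self.value, arg)  # compute in place
        return self.value

running_total = Cell(0, lambda acc, x: acc + x)
for reading in [3, 4, 5]:
    running_total.step(reading)
print(running_total.value)  # 12
```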

DASL-optimized applications can precisely match processor instructions to application functions and memory to application data, and co-locate both in whatever way is optimal to maximize runtime efficiency.  When application usage patterns don't allow for ideal optimization, the programmer is given the ability to consciously choose trade-offs, or defer them to the user of the program.  The result is no wasted logic, no wasted memory, and no wasted storage: a perfectly-optimized program will utilize all available hardware to complete application execution in the shortest time possible and return to an essentially halted state as soon as all input is consumed.


Hardware
While there is a lot of research and experimentation to do in this area, a basic high-level description of the hardware can be sketched based on the above.

The building block of the machine (colloquially referred to as TUB, the un-carved block) is the largest possible die of FPGA (or equivalent) fabric.  For high-performance computing applications, lower-than-normal chip yield is acceptable, so the size of such a die can be much larger than what is typically used in contemporary designs.  The ideal form is a mesh large enough to encompass entire applications, but failing that, a rich and fast interconnect between dies will be necessary (even if most applications can fit on a single die, it's a good idea to design an interconnect for future use).

Since memory and processing are co-located throughout the fabric, there's no need to pre-specify the typical data/address/I/O buses; instead, a wide, general-purpose interconnect between dies (potentially a 3D mesh) will be used.  To the extent that it is possible, multiple dies are encased in a single module, minimizing connection length, noise, etc.
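
One way to picture such an interconnect (the 4x4x4 dimensions and six-neighbor topology below are assumptions for illustration): address each die by (x, y, z) and link it only to its face-adjacent neighbors, keeping every connection short and uniform.

```python
# Sketch: a 3D mesh of dies.  Each die at (x, y, z) links only to its
# six face-adjacent neighbors, so wire lengths stay short and uniform.

def neighbors(x, y, z, dims):
    nx, ny, nz = dims
    for dx, dy, dz in [(1,0,0), (-1,0,0), (0,1,0), (0,-1,0), (0,0,1), (0,0,-1)]:
        a, b, c = x + dx, y + dy, z + dz
        if 0 <= a < nx and 0 <= b < ny and 0 <= c < nz:
            yield (a, b, c)

# A 4x4x4 stack of 64 dies: a corner die has 3 links, an interior die 6.
print(list(neighbors(0, 0, 0, (4, 4, 4))))       # [(1,0,0), (0,1,0), (0,0,1)]
print(len(list(neighbors(1, 1, 1, (4, 4, 4)))))  # 6
```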

If multiple modules are required, a similar mesh of connectivity is used, and the physical packaging of the modules is arranged to maintain density, likely in a stacked configuration to keep communication latencies under control.  These stacked modules are close together but allow enough clearance for a cooling fluid to circulate as needed.

TUB I/O is physically connected to the mesh at the base of a stack.  Since everything is dynamic, there need not be a factory-prescribed assignment for what I/O is available, but a default or standard configuration will be provided to get things started.  Pins from the bottom module can be brought out and connected to standard I/O interfaces like Ethernet, USB, I2C, etc.  Ideally this is realized in a modular fashion, allowing the programmer/operator to customize the I/O configuration of the machine to suit the application, the user's preferences, etc.

Cooling, if necessary (the clockless design makes this a less constant problem than a clocked CPU), will seek to maximize the thermal efficiency of the machine, cycling cooling fluid through the chassis to pick up heat from the processor and drain it into a device that converts a portion of the heat back into electricity to offset the power use of the TUB.


Software
Earlier I alluded to the software development tools that allow a programmer to synthesize the hardware to match each application perfectly.  Historically this has been a stumbling block for many innovative hardware designs, resulting in compromises that reduced the potential of the hardware to accommodate the limitations of available programmers.  While much work remains to be done in this area, there are a few key ideas that will allow this machine's capability to be maximized by a wide range of programmers.

The two primary challenges of maximizing the performance of a DASL-based machine are becoming comfortable with the event-driven, asynchronous nature of the clockless design, and acquiring the specialized knowledge needed to design, synthesize, and utilize application-specific logic.  It is tempting to solve this problem by writing a translation layer that converts an existing programming language into DASL (I have attempted this myself), and in some specialized cases this may be the right choice, but for the primary software development tools for this machine, I think an entirely different approach is better.

To make the most of this machine, we need to make the most of the people who program it.  People are primarily visual animals; large parts of our anatomy are dedicated to the detection, analysis, and manipulation of visual objects, symbols, and ideas.  The benefits of visual computing have been demonstrated through the evolution of graphical interfaces for the people who use software, but this is a largely unexplored area for the people who create software.

Most software today is created by writing code.  Written language is excellent for capturing and recalling sequential information, and in applications where sequential processing is needed, programming languages are well suited to the task.  But when applications require non-sequential, parallel, or out-of-order processing, written languages begin to struggle, and while much great software for such applications has been created this way, it is a common source of errors and of inefficiency in making the most valuable use of the available hardware.

By providing a visual programming experience, this machine not only addresses this limitation of language-based programming, but unlocks the potential to leverage humans' enormous visual-cognitive capabilities.

Visual programming is not new; in fact it has been used in many public and commercial systems for decades, but these are generally aimed at beginners or domain-specific applications.  As such, they are "hemmed in" to specific tasks, or limited in size or scale to applications suitable for single-purpose use.  During the '80s some attempts were made to provide visual programming tools paired with graphical user interfaces, but programming in general seems to have regressed from that point back to most software development being done through writing code, with most software development tools focused on manipulating code.

To overcome these limitations and tap into the maximum potential of software developers, the programming tools for this computer will be visual.  While the exact nature of these tools will need to be discovered through experimentation, they will begin from a node-graph architecture.  Not only does this provide a visual, physical interface for creating and connecting the components of a program, it also makes obvious the flow of data between components and how logic and data can be processed simultaneously.  In language/code-based programming these relationships are notoriously difficult to see.  Given their importance in making the most of a DASL-based machine, providing a natural, obvious way to see these connections is critical.
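
As a hint of what sits underneath such a tool, here's a minimal sketch of a node graph and its evaluator (the names and structure are invented for illustration).  The same graph a programmer would see and wire up visually is also the structure DASL could map straight onto logic:

```python
# Sketch: the data structure behind a node-graph editor.  Nodes are
# operations, edges are visible data connections; evaluation follows
# the drawn wires, so data flow and parallelism are explicit.

graph = {
    "x":   {"op": lambda: 3,          "inputs": []},
    "y":   {"op": lambda: 4,          "inputs": []},
    "sum": {"op": lambda a, b: a + b, "inputs": ["x", "y"]},
    "out": {"op": lambda a, b: a * b, "inputs": ["sum", "x"]},
}

def evaluate(graph, node, cache=None):
    cache = {} if cache is None else cache
    if node not in cache:  # each node computes once, wherever its wires lead
        args = [evaluate(graph, i, cache) for i in graph[node]["inputs"]]
        cache[node] = graph[node]["op"](*args)
    return cache[node]

print(evaluate(graph, "out"))  # 21

# "sum" depends on x and y; "out" depends on sum and x -- relationships a
# node editor shows at a glance, and that a compiler can map onto logic.
```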


What's Next?
I have a few more ideas, and I could elaborate on each area endlessly, but I think I'll stop here for now.  I may revisit this to add some visuals (in the spirit of visual programming!) or to fix errors, omissions, etc.

In the meantime I'll keep working on these ideas, conducting experiments and building prototypes so that with any luck when the world needs a new supercomputer, I'll have one ready.