Sep 10 2025

I’ve been developing a template for building cross-platform, hybrid cloud ready applications. The key concepts are:

  • Use a browser for the GUI because browsers are robust, flexible, and ubiquitous.
  • Use a programming language that is cross-platform so there is only one codebase for all targets.
  • Include in “cross-platform” the idea that the application can easily scale from a single desktop to a local LAN (like a logging program on Field Day, or a federated SDR control application) all the way up to “the cloud,” be that home-lab, private, hybrid, or public.
  • Include an open API so that the application can be extended by other interfaces, and automated via other applications and scripts.

The keys to this solution are: Chrome (or Chrome-based browsers) and the Go programming language. Chrome-based browsers are essentially “the standard” on the ‘web; and Go is fast, flexible, cross-platform, and contains in its standard libraries all of the machinery you need for creating efficient micro-services, APIs, and web-based applications. (You can create a basic web service in just a few lines of code!)
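
For instance, here is about the smallest possible service using nothing but the standard library (a toy example, not GoUI itself), and it listens only on the local machine:

package main

import (
    "fmt"
    "log"
    "net/http"
)

func main() {
    // One handler, one listener; the standard library does the rest.
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        fmt.Fprintln(w, "Hello from a tiny Go web service")
    })
    // Binding to the loopback address means only the local machine can connect.
    log.Fatal(http.ListenAndServe("127.0.0.1:8080", nil))
}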

Working on the GoUI framework, adding a “Protect” middleware for basic rate limiting and dynamic allow/block listing.

It’s good practice to presume that any application that will live “online” will also be abused and so should have some defenses against that. A good starting point is rate limiting and source limiting… For example, a desktop application that runs its GUI in the browser should only listen to the local machine; and even if that’s the only device that ever connects to the application, some sanity checks should be in place about those connections. To be sure, if you also want such an application to be shareable on a LAN or scalable up to the cloud it must be even more robust…

While researching this I discovered that all of the rate limiting strategies I’ve come upon are quite literal and often complicated ways of actually counting requests/tokens in various buckets and then adding and removing counts from these buckets.

I think these miss the point… which is “rate” limiting. That is — how fast (at what rate) and from where are requests coming in. Token counting mechanisms also struggle with the bursty nature of normal requests where loading a page might immediately trigger half a dozen or more (at minimum) immediate requests associated with the UI elements on that page.

I’m going to simplify all of that by inverting the problem. Rather than count events and track some kind of sliding window or bucket scheme I’m going to measure the time between requests and store that in a sliding weighted average. Then, for any particular user I only need two numbers — the time of the last request (to compute how long since the previous request) and the running average.

Then the tuning parameters are fairly simple… The weights for the sliding average, the threshold for limiting, and a TTL so that users that go away for a while are forgotten.
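
Before walking through the numbers, here is a rough sketch in Go of what that per-user state and update could look like (simplified and illustrative; not the actual Protect code):

package protect // illustrative sketch; not the actual GoUI Protect middleware

import "time"

const (
    averageWeight = 10                     // weight given to the existing average
    requestWeight = 1                      // weight given to the newest gap
    limit         = 100 * time.Millisecond // roughly 10 requests per second
)

// tracker is everything kept per client: the time of the last request and
// the running weighted average of the gaps between requests.
type tracker struct {
    last    time.Time
    average time.Duration
}

// newTracker gives a new (or forgotten) client the benefit of the doubt:
// a starting average of one full second between requests.
func newTracker(now time.Time) *tracker {
    return &tracker{last: now.Add(-time.Second), average: time.Second}
}

// observe folds the gap since the previous request into the weighted average
// and reports whether the client is now over the limit.
func (t *tracker) observe(now time.Time) (limited bool) {
    gap := now.Sub(t.last)
    t.last = now
    t.average = (t.average*averageWeight + gap*requestWeight) /
        (averageWeight + requestWeight)
    return t.average < limit // a smaller average means faster requests
}

The TTL isn’t shown here; it only comes into play when deciding whether a returning client gets a fresh tracker.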

Say the rate you want is at most 10 requests per second. That’s a simple enough fraction 10/1 … which, when inverted, means that your target is 1/10th of a second on average between each request.

Average_Weight = 10
Request_Weight = 1
Limit = 100ms (1/10th of a second is 100ms)

Suppose that when a new user arrives (or returns after being forgotten) we give them the benefit of the doubt on the first request:

Average = 1000ms
TimeSinceLastRequest = 1000ms
Average = ((Average * 10) + (TimeSinceLastRequest * 1)) / 11
1000 = (10000 + 1000) / 11, so no change

Then the page they loaded triggers 5 more requests immediately… I’ll show fractions to spare you the weird upshift into integer space.

909.091 = (10000 + 0) / 11
826.446 = (9090.91 + 0) / 11
751.315 = (8264.46 + 0) / 11
683.013 = (7513.15 + 0) / 11
620.921 = (6830.13 + 0) / 11

620ms is still bigger than (slower than) 100ms, so they’re still good and don’t get limited… Even better, when they wait a second (or so) before making a new request (like an ordinary user might) they get credit for that.

655.383 = (6209.21 + 1000) / 11 (see, the average went up…)

Now see what happens when an attacker launches a bot to abuse the service — maybe hitting it once every 10ms or so.

1000 = (10000 + 1000) / 11 (benefit of the doubt on the first hit)
910.000 = (10000 + 10) / 11
828.181 = (9100.00 + 10) / 11
753.802 = (8281.81 + 10) / 11
686.183 = (7538.02 + 10) / 11
624.712 = (6861.83 + 10) / 11
568.829 = (6247.12 + 10) / 11
518.027 = (5688.29 + 10) / 11
471.842 = (5180.27 + 10) / 11
429.857 = (4718.42 + 10) / 11
391.688 = (4298.57 + 10) / 11
356.989 = (3916.88 + 10) / 11
325.445 = (3569.89 + 10) / 11
296.768 = (3254.45 + 10) / 11
270.698 = (2967.68 + 10) / 11
246.998 = (2706.98 + 10) / 11
225.453 = (2469.98 + 10) / 11
205.866 = (2254.53 + 10) / 11
188.060 = (2058.66 + 10) / 11
171.873 = (1880.60 + 10) / 11
157.157 = (1718.73 + 10) / 11
143.779 = (1571.57 + 10) / 11
131.618 = (1437.79 + 10) / 11
120.561 = (1316.18 + 10) / 11
110.510 = (1205.61 + 10) / 11
101.373 = (1105.10 + 10) / 11
93.066 = (1013.73 + 10) / 11 (rate limited at faster than 100ms !!)

After 27 requests (270 ms) the bot gets a 429 from the server (Too Many Requests). If it’s silly enough to continue, it gets worse. Let’s say that we have a ban threshold at 50ms (the default would be anything twice as fast as the rate limit).

85.515 = (930.66 + 10) / 11
78.649 = (855.15 + 10) / 11
72.409 = (786.49 + 10) / 11
66.735 = (724.09 + 10) / 11
61.578 = (667.35 + 10) / 11
56.889 = (615.78 + 10) / 11
52.626 = (568.89 + 10) / 11
48.751 = (526.26 + 10) / 11 (gets 418 response (I am a teapot))

At this point the IP/session/tracker is added to the block-list, probably with an extended expiration (maybe 10 minutes, maybe the next day, maybe forever until administratively removed) depending upon policy, etc.

Any further requests while block-listed will receive:

503 – I am a combined coffee/tea pot that is temporarily out of coffee (Hyper Text Coffee Pot Control Protocol)
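
Wired into the middleware, the whole decision comes down to a couple of comparisons. Roughly (a sketch that builds on the tracker sketch earlier; it would need "net/http" in the imports, and the names are illustrative):

// banLimit defaults to anything twice as fast as the rate limit.
const banLimit = limit / 2 // 50ms

// status maps a client's current state to the response codes described above.
func status(t *tracker, blockListed bool) int {
    switch {
    case blockListed:
        return http.StatusServiceUnavailable // 503 while on the block-list
    case t.average < banLimit:
        return http.StatusTeapot // 418, and the client is added to the block-list
    case t.average < limit:
        return http.StatusTooManyRequests // 429, rate limited
    default:
        return http.StatusOK // pass the request through to the real handler
    }
}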

Here is a screen shot of this up and running… In this case Seinfeld fans will recognize the 503 response “No soup for you!”; and the automated block-list entry lasts for 10 minutes…

In the code above, if you look closely, you’ll see that I use a cycle-stealing technique to do maintenance on the block/allow list and request rate tracking. This avoids the complexity of having a background process running in another thread. The thinking is: since I already have the resources locked in a mutex I might as well take care of everything at once; and if there are no requests coming in then it really doesn’t matter if I clean up the tables until something happens… This is simpler than setting up a background process that runs in its own thread, has to be properly started up and shut down, and must compete for the mutex at random times.
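
The shape of that cycle stealing is roughly this (again an illustrative sketch continuing the earlier one, needing "sync" in the imports; the names and fields are hypothetical):

// protector holds the shared tables behind one mutex. Since every request
// already takes the lock, expired entries are swept right here instead of
// by a background goroutine.
type protector struct {
    mu        sync.Mutex
    trackers  map[string]*tracker  // keyed by client IP (or session)
    blocked   map[string]time.Time // block-list entry and its expiration
    ttl       time.Duration        // forget idle clients after this long
    lastSweep time.Time
}

func newProtector(ttl time.Duration) *protector {
    return &protector{
        trackers: make(map[string]*tracker),
        blocked:  make(map[string]time.Time),
        ttl:      ttl,
    }
}

func (p *protector) check(ip string, now time.Time) (limited bool) {
    p.mu.Lock()
    defer p.mu.Unlock()

    // Steal a few cycles for maintenance, at most once per TTL interval.
    if now.Sub(p.lastSweep) > p.ttl {
        for k, t := range p.trackers {
            if now.Sub(t.last) > p.ttl {
                delete(p.trackers, k) // forgotten; they start over next time
            }
        }
        for k, expires := range p.blocked {
            if now.After(expires) {
                delete(p.blocked, k)
            }
        }
        p.lastSweep = now
    }

    t, ok := p.trackers[ip]
    if !ok {
        t = newTracker(now)
        p.trackers[ip] = t
    }
    return t.observe(now)
}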

The configuration so far is very straightforward…

When a requester goes away for a while and is forgotten, they start over with a “benefit of the doubt” rate of 1 request per second… and the system tracks them from there. I created a UI widget to inspect the state of the rate limiter so I could watch it work…

Aug 16 2025

It had been a bad year for computing and storage in the lab. On the storage side I had two large Synology devices fail within weeks of each other. One was supposed to back up the other. When they both failed almost simultaneously, their proprietary hardware made it prohibitively difficult and expensive to recover and maintain. That’s a whole other story that perhaps ends with replacing a transistor in each… But suffice it to say I’m not happy about that, and the experience reinforced my position that proprietary hardware of any kind is not welcome here anymore… (so if it is here, it’s because there isn’t a better option, and any such device from now on is always looking for a non-proprietary replacement…)

On the computing side, the handful of servers I had were all power hungry, lacking “grunt,” and a bit “long in the tooth.” Not to mention, the push in my latest research is strongly toward increasing the ratio of Compute to RAM and Storage, so the popular enterprise trend toward ever larger monolithic devices really wouldn’t fit the profile.

I work increasingly on resilient, self-organizing distributed systems, so the solution needs to look like that… some kind of “scalable fabric” composed of easily replaceable components, with no single points of failure (to the extent possible) but also with a reasonably efficient power footprint.

I looked at a number of main-stream high performance computing scenarios and they all rubbed me the wrong way — the industry LOVES proprietary lock-in (blade servers); LOVES computing “at scale” which means huge enterprise grade servers supported by the presumption of equally huge infrastructure and support budgets (and all of the complexity that implies). All of this points in the wrong direction. It’s probably correct for the push into the cloud where workloads are largely web servers or microservices of various generic types to be pushed around in K8s clusters and so forth. BUT, that’s not what I’m doing here and I don’t have those kinds of budgets… (nor do I think anyone should lock themselves into that kind of thinking without looking first).

I have often looked to implement SCIFI notions of highly modular computing infrastructure… think: banks of “isolinear chips” on TNG, or the glowing blocks in the HAL 9000. The idea being that the computing power you need can be assembled by plugging in a generic collection of self-contained, self-configuring system components that are easily replaced and reasonably powerful, but together can be merged into a larger system that is resilient, scalable, and easy to maintain. Perfect for long space journeys. The hardware required for achieving this vision has always been a bit out of reach up to now.

Enter the humble NUC. Not a perfect rendition of the SCIFI vision; but pretty close given today’s world.

I drew up some basic specifications for what one of these computing modules might feature using today’s hardware capabilities and then went hunting to see if such a thing existed in real life. I found that there are plenty of small self-contained systems out there; but none of them are perfect for the vision I had in mind. Then, in the home-lab space, I stumbled upon a rack component designed to build clusters of NUCs… this started me looking at whether I could get NUCs configured to meet my specifications. It turns out that I could!

I reached out to my long-time friends at Affinity Computers (https://affinity-usa.com/). Many years ago I began outsourcing my custom hardware tasks to these folks – for all but the most specialized cases. They’ve consistently built customized highly reliable systems for me; and have collaborated on sourcing and qualifying components for my crazy ideas (that are often beyond the edge of convention). They came through this time as well with all of the parts I needed to build a scalable cluster of NUCs.

System Design:

Overall specs: 128 cores, 512G RAM, 14T storage

Each computing device has a generic interface, connecting only via power and Ethernet. Not quite like sliding a glowing rectangle into a backplane; but close enough for today’s standards.

Each NUC has 16 cores, 64G of RAM, two 1G+ Network ports, and two 2T SSDs.

One network port connects the cluster to other systems and to itself.

The other network port connects the cluster ONLY to itself for distributed storage tasks using CEPH.

One of the SSDs is for the local OS.

The other SSD is part of a shared storage cluster.

Each device is a node in a ProxMox cluster that implements CEPH for shared storage.

Each has its own power supply.

There are no single points of failure, except perhaps the network switches and mains power.

Each device is serviceable with off-the-shelf components and software.

Each device can be replaced or upgraded with a similar device of any type (NUC isn’t the only option.)

Each device is a small part of the whole, so if it fails the overall system is only slightly degraded.

Initial Research:

Before building a cluster like this (not cheap; but not unlike a handful of larger servers), I needed to verify that it would live up to my expectations. There were several conditions I wanted to verify. First, would the storage be sufficiently resilient (so that I would never have to suffer the Synology fiasco again)? Second, would it be reasonable to maintain with regard to complexity? Third, would I be able to distribute my workloads into this system as a “generic computing fabric” as increasingly required by my research? Fourth… fifth… etc. There are many things I wanted to know… but if the first few were satisfied the project would be worth the risk.

I had a handful of ACER Veriton devices from previous experiments. I repurposed those to create a simulation of the cluster project. One has since died… but two of the three are still alive and part of the cluster…

I gave each of these an external drive (to simulate the second SSD in my NUCs) and loaded up ProxMox. That went surprisingly well… then I configured CEPH on each node using the external drives. That also went well.

I should note here that I’d previously been using VMWare and also had been using separate hardware for storage. The switch to ProxMox had been on the way for a while because VMWare was becoming prohibitively expensive, and also because ProxMox is simply better (yep, you read that right. Better. Full stop.)

Multiple reasons ProxMox is better than other options:

  • ProxMox is intuitive and open.
  • ProxMox runs both containers and VMs.
  • CEPH is effectively baked into ProxMox, as is ZFS.
  • ProxMox is lightweight, scalable, affordable, and well supported.
  • ProxMox supports all of the virtual computing and high availability features that matter.

Over the course of several weeks I pushed and abused my tiny ACER cluster to simulate various failure modes and workloads. It passed every test. I crashed nodes with RF (we do radio work here)… reboot to recover, no problem. Pull the power cable, reboot to recover. Configure with bad parts, replace with good parts, recover, no problem. Pull the network cable, no problem; plug it back in, all good. Intentionally overload the system with bad/evil code in VMs and containers, all good. Variations on all of this kind of chaos, both intended and by accident… honestly, I couldn’t kill it (at least not in any permanent way).

I especially wanted to abuse the shared storage via CEPH because it was a bit of a non-intuitive leap to go from having storage in a separate NAS or SAN to having it fully integrated within the computing fabric.

I had attempted to use CEPH in the past with mixed results – largely because it can be very complicated to configure and maintain. I’m pleased to say that these problems have been all but completely mitigated by ProxMox’s integration of CEPH. I was able to prove the integrated storage paradigm is resilient and performant even in the face of extreme abuse. I did manage to force the system into some challenging states, but was always able to recover without significant difficulty; and found that under any normal conditions the system never experienced any notable loss of service nor data. You gotta love it when something “just works” tm.

Building The Cluster:

Starting with the power supply: I thought about creating a resilient, modular, battery-backed power supply with multiple fail-over capabilities (since I have the knowledge and parts available to do that); but that’s a lot of work and moves further away from being off-the-shelf. I may still do something along those lines in order to make power more resilient, perhaps even a simple network of diode (MOSFET) bridges to allow devices to draw from neighboring sources when individual power supplies drop out, but for now I opted to simply use the provided power supplies in a 1U drawer and an ordinary 1U power distribution switch.

The rack unit that accommodates all of the NUCs holds 8 at a time in a 3U space. Each is anchored with a pair of thumb screws. Unfortunately, the NUCs must be added or removed from the back. I’m still looking for a rack mount solution that will allow the NUCs to pull out from the front for servicing, as that would be more convenient and less likely to disturb other nodes [by accident]… but the rear access solution works well enough for now if one is careful.

Each individual NUC has fully internal components for reliability’s sake. A particular sticking point was getting a network interface that could live inside of the NUC, as opposed to using a USB-based Ethernet port as would be common practice with NUCs. I didn’t want any dangling parts with potentially iffy connectors and gravity working against them; and I did want each device to be a self-contained unit. Both SSDs are already available as internal components, though that was another sticking point for a time.

Another requirement was a separate network fabric to allow the CEPH cluster to work unimpeded by other traffic. While the original research with the ACER devices used a single network for all purposes, it is clear that allowing the CEPH cluster to work on its own private network fabric provides an important performance boost. This prevents public and private network traffic from workloads from impeding the storage network in any way, and similarly in reverse. Should a node drop out or be added, significant network traffic will be generated by CEPH as it replicates and relocates blocks of data. In the case of workloads that implement distributed database systems, the nodes will talk to each other over the normal network for distributing query tasks and results without interfering with raw storage tasks. An ordinary (and importantly, simple) D-Link switch suffices for that local isolated network fabric.

Loading up ProxMox on each node is simple enough. Connect it to a suitable network, monitor, mouse, keyboard, and boot it from an ISO on a USB drive. At first I did all of this from the rear but eventually used one of the convenient front panel USB ports for the ISO image.

Upon booting each device in turn, inspect the hardware configuration to make sure everything is reporting properly, then boot the image and install ProxMox. Each device takes only a few minutes to configure.

Pro tip: Name each device in a way that is related to where it lives on the network. For example, prox24 lives on 10.10.0.24. Mounting the devices in a way that is physically related to their name and network address also helps. Having laid hands on a lot of gear in data centers I’m familiar with how annoying it can be to have to find devices when they could literally be anywhere… don’t do that! As much as possible, make your physical layout coherent with your logical layout. In this case, that first device in the rack is the first device in the cluster (prox24). The one after that will be prox25 and so on…

Installing the cluster in the rack:

Start by setting up power. The power distribution switch goes in first and then directly above that will go the power supply drawer. I checked carefully to make sure that the mains power cables reach over the edge of the drawer and back to the switch when the drawer is closed. It’s almost never that way in practice, but knowing that it can be was an important step.

Adjust the rails on the drawer to match the rack depth…

Mount the drawer and drop in a power supply to check cable lengths…

When all power supplies are in the drawer it’s a tight fit that requires them to be staggered. I thought about putting them on their sides in a 2U configuration but this works well enough for now. That might change if I build a power matrix (described briefly above) in which case I’ll probably make 3D printed blocks for each supply to rest in. These would bolt through the holes in the bottom of the drawer and would contain the bridging electronics to make the power redundant and battery capable.

With the power supply components installed the next step was to install the cluster immediately above the power drawer. In order to make the remaining steps easier, I rested the cluster on the edge of the power drawer and connected the various power and network cables before mounting the cluster itself into the rack. This way I could avoid a lot of extra trips reaching through the back of the rack… both annoying and prone to other hazards.

Power first…

Then primary network…

Then the storage network…

With all of the cabling in place then the cluster could be bolted into the rack…

Finally, the storage network switch can be mounted directly above the cluster. Note that the network cables for the cluster are easily available in the gap between the switches. Note also that the switch ports for each node are selected to be consistent between the two switches. This way the cabling and indicators associated with each node in the cluster make sense.

All of the nodes are alive and all of the lights are green… a good start!

The cluster has since taken over all of the tasks previously done by the other servers and the old Synology devices. This includes resilient storage, a collection of network and time services, small scale web hosting, and especially research projects in the lab. Standing up the cluster (joining each node to it) only took a few minutes. Configuring CEPH to use the private network took only a little bit of research and tweaking of configuration files. Getting that right took just a little bit longer as might be expected for the first time through a process like that.
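
For reference, the tweak mostly amounts to pointing CEPH’s cluster network at the private fabric in /etc/pve/ceph.conf, something along these lines (the subnets here are placeholders, not my actual addressing):

[global]
    public_network  = 10.10.0.0/24    # the ordinary/primary network
    cluster_network = 10.10.10.0/24   # the private CEPH-only storage fabric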

Right now the cluster is busy mining all of the order-32 maximal-length taps for some new LFSR-based encryption primitives I’m working on… and after a short time I hope to also deploy some SDR tasks to restart my continuous all-band HF WSPR station. (Some antenna work required for that.)

Finally: the cluster works as expected and is a very successful project! It’s been up and running for more than a year now, having suffered through power and network disruptions of several types without skipping a beat. I look forward to making good use of this cluster, extending it, and building similar clusters for production workloads in other settings. It’s a winning solution worth expanding!