Madsci

Husband, Father, Musician, Engineer, Teacher, Thinker, Pilot, Mad, Scientist, Writer, Philosopher, Poet, Entrepreneur, Busy, Leader, Looking for ways to do something good in a sustainable way,... to be his best,... and to help others to do the same. The universe is a question pondering itself... we are all a part of the answer.

Aug 16 2025
 

It had been a bad year for computing and storage in the lab. On the storage side I had two large Synology devices fail within weeks of each other. One was supposed to back up the other. When they both failed almost simultaneously, their proprietary hardware made it prohibitively difficult and expensive to recover and maintain. That’s a whole other story that perhaps ends with replacing a transistor in each… But, suffice it to say I’m not happy about that and the experience reinforced my position that proprietary hardware of any kind is not welcome here anymore… (so if it is here, it’s because there isn’t a better option and such devices from now on are always looking for non-proprietary replacements…)

On the computing side, the handful of servers I had were all of: power hungry, lacking “grunt,” and a bit “long in the tooth.” Not to mention, the push in my latest research is strongly toward increasing the ratio of Compute to RAM and Storage so any of the popular enterprise trends toward ever larger monolithic devices really wouldn’t fit the profile.

I work increasingly on resilient self organizing distributed systems; so the solution needs to look like that… some kind of “scalable fabric” composed of easily replaceable components, no single points of failure (to the extent possible) but also with a reasonably efficient power footprint.

I looked at a number of main-stream high performance computing scenarios and they all rubbed me the wrong way — the industry LOVES proprietary lock-in (blade servers); LOVES computing “at scale” which means huge enterprise grade servers supported by the presumption of equally huge infrastructure and support budgets (and all of the complexity that implies). All of this points in the wrong direction. It’s probably correct for the push into the cloud where workloads are largely web servers or microservices of various generic types to be pushed around in K8s clusters and so forth. BUT, that’s not what I’m doing here and I don’t have those kinds of budgets… (nor do I think anyone should lock themselves into that kind of thinking without looking first).

I have often looked to implement SCIFI notions of highly modular computing infrastructure… think: banks of “isolinear chips” on TNG, or the glowing blocks in the HAL 9000. The idea being that the computing power you need can be assembled by plugging in a generic collection of self-contained, self-configuring system components that are easily replaced and reasonably powerful; but together can be merged into a larger system that is resilient, scalable, and easy to maintain. Perfect for long space journeys. The hardware required for achieving this vision has always been a bit out of reach up to now.

Enter the humble NUC. Not a perfect rendition of the SCIFI vision; but pretty close given today’s world.

I drew up some basic specifications for what one of these computing modules might feature using today’s hardware capabilities and then went hunting to see if such a thing existed in real life. I found that there are plenty of small self-contained systems out there; but none of them are perfect for the vision I had in mind. Then, in the home-lab space, I stumbled upon a rack component designed to build clusters of NUCs… this started me looking at whether I could get NUCs configured to meet my specifications. It turns out that I could!

I reached out to my long-time friends at Affinity Computers (https://affinity-usa.com/). Many years ago I began outsourcing my custom hardware tasks to these folks – for all but the most specialized cases. They’ve consistently built customized highly reliable systems for me; and have collaborated on sourcing and qualifying components for my crazy ideas (that are often beyond the edge of convention). They came through this time as well with all of the parts I needed to build a scalable cluster of NUCs.

System Design:

Overall specs: 128 cores, 512G RAM, 14T Storage

Each generic computing device has a generic interface connecting only via power and Ethernet. Not quite like sliding a glowing rectangle into a backplane; but close enough for today’s standards.

Each NUC has 16 cores, 64G of RAM, two 1G+ Network ports, and two 2T SSDs.

One network port connects the cluster to other systems and to itself.

The other network port connects the cluster ONLY to itself for distributed storage tasks using CEPH.

One of the SSDs is for the local OS.

The other SSD is part of a shared storage cluster.

Each device is a node in a ProxMox cluster that implements CEPH for shared storage.

Each has its own power supply.

There are no single points of failure, except perhaps the network switches and mains power.

Each device is serviceable with off-the-shelf components and software.

Each device can be replaced or upgraded with a similar device of any type (NUC isn’t the only option.)

Each device is a small part of the whole, so if it fails the overall system is only slightly degraded.
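As a quick sanity check on the overall specs, here is a minimal sketch that adds up the per-node figures. It assumes eight nodes (which is what the stated totals imply), and the `replicas` parameter is a hypothetical CEPH replica count that is not stated in this post. The TB to TiB conversion is one plausible reading of the "14T" figure, not a confirmed one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    cores: int = 16        # per-node figures taken from the post
    ram_gb: int = 64
    os_ssd_tb: int = 2     # local OS drive, not part of shared storage
    ceph_ssd_tb: int = 2   # drive contributed to the shared storage pool


def cluster_totals(nodes, replicas=3):
    """Sum per-node resources; `replicas` is an assumed replica count."""
    raw_tb = sum(n.ceph_ssd_tb for n in nodes)
    raw_tib = raw_tb * 10**12 / 2**40  # decimal terabytes -> binary tebibytes
    return {
        "cores": sum(n.cores for n in nodes),
        "ram_gb": sum(n.ram_gb for n in nodes),
        "raw_shared_tb": raw_tb,
        "raw_shared_tib": round(raw_tib, 2),
        "usable_tib": round(raw_tib / replicas, 2),
    }


totals = cluster_totals([Node() for _ in range(8)])  # 8 nodes is an assumption
print(totals)
```

With replicated pools every block is stored `replicas` times, so the usable figure is always a fraction of the raw one. Change the parameter to match your own pool settings.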

Initial Research:

Before building a cluster like this (not cheap; but not unlike a handful of larger servers), I needed to verify that it would live up to my expectations. There were several conditions I wanted to verify. First, would the storage be sufficiently resilient (so that I would never have to suffer the Synology fiasco again). Second, would it be reasonable to maintain with regard to complexity. Third, would I be able to distribute my workloads into this system as a “generic computing fabric” as increasingly required by my research. Fourth… fifth… etc. There are many things I wanted to know… but if the first few were satisfied the project would be worth the risk.

I had a handful of ACER Veriton devices from previous experiments. I repurposed those to create a simulation of the cluster project. One has since died… but two of the three are still alive and part of the cluster…

I fitted each of these with an external drive (to simulate the second SSD in my NUCs) and loaded up ProxMox. That went surprisingly well… then I configured CEPH on each node using the external drives. That also went well.

I should note here that I’d previously been using VMWare and also had been using separate hardware for storage. The switch to ProxMox had been on the way for a while because VMWare was becoming prohibitively expensive, and also because ProxMox is simply better (yep, you read that right. Better. Full stop.)

Multiple reasons ProxMox is better than other options:

  • ProxMox is intuitive and open.
  • ProxMox runs both containers and VMs.
  • CEPH is effectively baked-into ProxMox as is ZFS.
  • ProxMox is lightweight, scalable, affordable, and well supported.
  • ProxMox supports all of the virtual computing and high availability features that matter.

Over the course of several weeks I pushed and abused my tiny ACER cluster to simulate various failure modes and workloads. It passed every test. I crashed nodes with RF (we do radio work here)… reboot to recover, no problem. Pull the power cable, reboot to recover. Configure with bad parts, replace with good parts, recover, no problem. Pull the network cable, no problem, plug it back in, all good. Intentionally overload the system with bad/evil code in VMs and containers, all good. Variations on all of this kind of chaos both intended and by accident… honestly, I couldn’t kill it (at least not in any permanent way).

I especially wanted to abuse the shared storage via CEPH because it was a bit of a non-intuitive leap to go from having storage in a separate NAS or SAN to having it fully integrated within the computing fabric.

I had attempted to use CEPH in the past with mixed results – largely because it can be very complicated to configure and maintain. I’m pleased to say that these problems have been all but completely mitigated by ProxMox’s integration of CEPH. I was able to prove the integrated storage paradigm is resilient and performant even in the face of extreme abuse. I did manage to force the system into some challenging states, but was always able to recover without significant difficulty; and found that under any normal conditions the system never experienced any notable loss of service nor data. You gotta love it when something “just works” tm.

Building The Cluster:

Starting with the power supply. I thought about creating a resilient, modular, battery backed power supply with multiple fail-over capabilities (since I have the knowledge and parts available to do that); but that’s a lot of work and moves further away from being off-the-shelf. I may still do something along those lines in order to make power more resilient, perhaps even a simple network of diode (mosfet) bridges to allow devices to draw from neighboring sources when individual power supplies drop out, but for now I opted to simply use the provided power supplies in a 1U drawer and an ordinary 1U power distribution switch.

The rack unit that accommodates all of the NUCs holds 8 at a time in a 3U space. Each is anchored with a pair of thumb screws. Unfortunately, the NUCs must be added or removed from the back– I’m still looking for a rack mount solution that will allow the NUCs to pull out from the front for servicing as that would be more convenient and less likely to disturb other nodes [by accident]… but the rear access solution works well enough for now if one is careful.

Each individual NUC has fully internal components for reliability’s sake. A particular sticking point was getting a network interface that could live inside of the NUC as opposed to using a USB based Ethernet port as would be common practice with NUCs. I didn’t want any dangling parts with potentially iffy connectors and gravity working against them; and I did want each device to be a self contained unit. Both SSDs are already available as internal components though that was another sticking point for a time.

Another requirement was a separate network fabric to allow the CEPH cluster to work unimpeded by other traffic. While the original research with the ACER devices used a single network for all purposes, it is clear that allowing the CEPH cluster to work on its own private network fabric provides an important performance boost. This prevents public and private network traffic from workloads from impeding the storage network in any way; and similarly in reverse. Should a node drop out or be added, significant network traffic will be generated by CEPH as it replicates and relocates blocks of data. In the case of workloads that implement distributed database systems, each node will talk to each other over the normal network for distributing query tasks and results without interfering with raw storage tasks. An ordinary (and importantly simple) D-link switch suffices for that local isolated network fabric.

Loading up ProxMox on each node is simple enough. Connect it to a suitable network, monitor, mouse, keyboard, and boot it from an ISO on a USB drive. At first I did all of this from the rear but eventually used one of the convenient front panel USB ports for the ISO image.

Upon booting each device in turn, inspect the hardware configuration to make sure everything is reporting properly, then boot the image and install ProxMox. Each device takes only a few minutes to configure.

Pro tip: Name each device in a way that is related to where it lives on the network. For example, prox24 lives on 10.10.0.24. Mounting the devices in a way that is physically related to their name and network address also helps. Having laid hands on a lot of gear in data centers I’m familiar with how annoying it can be to have to find devices when they could literally be anywhere… don’t do that! As much as possible, make your physical layout coherent with your logical layout. In this case, that first device in the rack is the first device in the cluster (prox24). The one after that will be prox25 and so on…
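The naming rule is simple enough to put in a small helper so that scripts and inventory notes stay consistent with it. This is only a sketch: the `/24` prefix length and the function names are my assumptions here; only the `prox24` at `10.10.0.24` pairing comes from the example above.

```python
import ipaddress
import re

SUBNET = ipaddress.ip_network("10.10.0.0/24")  # prefix length is an assumption
PREFIX = "prox"


def host_to_ip(hostname):
    """Map a name like 'prox24' to the address ending in .24."""
    match = re.fullmatch(rf"{PREFIX}(\d+)", hostname)
    if match is None:
        raise ValueError(f"unexpected host name: {hostname!r}")
    addr = SUBNET.network_address + int(match.group(1))
    if addr not in SUBNET:
        raise ValueError(f"{hostname!r} does not fit in {SUBNET}")
    return addr


def ip_to_host(ip):
    """Inverse mapping: '10.10.0.25' becomes 'prox25'."""
    addr = ipaddress.ip_address(ip)
    if addr not in SUBNET:
        raise ValueError(f"{ip} is outside {SUBNET}")
    return f"{PREFIX}{int(addr) - int(SUBNET.network_address)}"


# Rack positions in order, starting with the first device in the cluster.
rack = [f"{PREFIX}{n}" for n in range(24, 32)]
for slot, name in enumerate(rack, start=1):
    print(slot, name, host_to_ip(name))
```

Because the mapping runs both ways, a blinking light on a given address tells you which rack slot to walk to.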

Installing the cluster in the rack:

Start by setting up power. The power distribution switch goes in first and then directly above that will go the power supply drawer. I checked carefully to make sure that the mains power cables reach over the edge of the drawer and back to the switch when the drawer is closed. It’s almost never that way in practice, but knowing that it can be was an important step.

Adjust the rails on the drawer to match the rack depth…

Mount the drawer and drop in a power supply to check cable lengths…

When all power supplies are in the drawer it’s a tight fit that requires them to be staggered. I thought about putting them on their sides in a 2U configuration but this works well enough for now. That might change if I build a power matrix (described briefly above) in which case I’ll probably make 3D printed blocks for each supply to rest in. These would bolt through the holes in the bottom of the drawer and would contain the bridging electronics to make the power redundant and battery capable.

With the power supply components installed the next step was to install the cluster immediately above the power drawer. In order to make the remaining steps easier, I rested the cluster on the edge of the power drawer and connected the various power and network cables before mounting the cluster itself into the rack. This way I could avoid a lot of extra trips reaching through the back of the rack… both annoying and prone to other hazards.

Power first…

Then primary network…

Then the storage network…

With all of the cabling in place then the cluster could be bolted into the rack…

Finally, the storage network switch can be mounted directly above the cluster. Note that the network cables for the cluster are easily available in the gap between the switches. Note also that the switch ports for each node are selected to be consistent between the two switches. This way the cabling and indicators associated with each node in the cluster make sense.

All of the nodes are alive and all of the lights are green… a good start!

The cluster has since taken over all of the tasks previously done by the other servers and the old Synology devices. This includes resilient storage, a collection of network and time services, small scale web hosting, and especially research projects in the lab. Standing up the cluster (joining each node to it) only took a few minutes. Configuring CEPH to use the private network took only a little bit of research and tweaking of configuration files. Getting that right took just a little bit longer as might be expected for the first time through a process like that.

Right now the cluster is busy mining all of the order 32 maximal length taps for some new LFSR based encryption primitives I’m working on… and after a short time I hope to also deploy some SDR tasks to restart my continuous all-band HF WSPR station. (Some antenna work required for that).
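For readers who haven’t met the term: a tap set is "maximal length" when the LFSR visits every non-zero state before repeating. The toy sketch below shows the idea by brute force on a tiny register. It is not the search code running on the cluster. The Galois form, the function names, and the order 4 example are all assumptions chosen for illustration.

```python
def lfsr_period(taps, width, seed=1):
    """Steps for a Galois LFSR to return to `seed`.

    Returns None if it does not come back within 2**width - 1 steps.
    """
    state = seed
    for steps in range(1, 2**width):
        lsb = state & 1
        state >>= 1
        if lsb:
            state ^= taps  # apply the feedback taps when a 1 shifts out
        if state == seed:
            return steps
    return None


def maximal_taps(width):
    """Brute-force every tap mask of this order and keep the maximal ones."""
    full = 2**width - 1       # every non-zero state visited exactly once
    top = 1 << (width - 1)    # the top tap must be set for a true order-`width` LFSR
    return [t for t in range(top, 2 * top) if lfsr_period(t, width) == full]


print([hex(t) for t in maximal_taps(4)])
```

Stepping through the whole cycle costs on the order of 2^32 steps per candidate at order 32, which is why a real search leans on smarter primitivity tests and plenty of cores.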

Finally: The cluster works as expected and is a very successful project! It’s been up and running for more than a year now having suffered through power and network disruptions of several types without skipping a beat. I look forward to making good use of this cluster, extending it, and building similar clusters for production workloads in other settings. It’s a winning solution worth expanding!

Feb 13 2025
 

I observed at one of my haunts that after a series of forced platform and tooling migrations with unreasonable deadlines the dev teams found themselves buried in tech debt… for example, the need to migrate from labrats to gitlab build pipelines and so forth.

Personally, I don’t run things this way… so it’s extremely painful to watch and creates a toxic environment.

While working with the team to try and wrangle the situation an image occurred to me of the team fighting a desperate battle against tech debt (and not necessarily winning). I described what flashed in my mind to gemini and it made a pretty good rendition of it. If nothing else, generative AI is a good way to rapidly express ideas…

Jan 21 2015
 

During an emergency, communication and coordination become both more vital and more difficult. In addition to the chaos of the event itself, many of the communication mechanisms that we normally depend on are likely to be degraded or unavailable.

The breakdown of critical infrastructure during an emergency has the potential to create large numbers of isolated groups. This fragmentation requires a bottom-up approach to coordination rather than the top-down approach typical of most current emergency management planning. Instead of developing and disseminating a common operational picture through a central control point, operational awareness must instead emerge through the collaboration of the various groups that reside beyond the reach of working infrastructure. This is the “last klick” problem.

For a while now my friends and I have been discussing these issues and brainstorming solutions. What we’ve come up with is the MCR (Modular Communications Relay). A communications and coordination toolkit that keeps itself up to date and ready to bridge the gaps that exist in that last klick.

Using an open-source model and readily available components we’re pretty sure we can build a package that solves a lot of critical problems in an affordable, sustainable way. We’re currently seeking funding to push the project forward more quickly. In the meantime we’ll be prototyping bits and pieces in the lab, war-gaming use cases, and testing concepts.

Here is a white-paper on MCR and the “last klick” problem: TheLastKlick.pdf

Jan 08 2014
 

Certainly the climate is involved, but this does happen from time to time anyway so it’s a stretch to assign a causal relationship to this one event.

Global warming doesn’t mean it’s going to be “hot” all the time. It means there is more energy in the atmosphere and so all weather patterns will tend to be more “excited” and weather events will tend to be more violent. It also means that wind, ocean currents, and precipitation patterns may radically shift into new patterns that are significantly different from what we are used to seeing.

All of these effects are systemic in nature. They have many parts that are constantly interacting with each other in ways that are subtle and complex.

In contrast, people are used to thinking about things with a reductionist philosophy — breaking things down into smaller pieces with the idea that if we can explain all of the small pieces we have explained the larger thing they belong to. We also, generally, like to find some kind of handle among those pieces that we can use to represent the whole thing — kind of like an on-off switch that boils it all down to a single event or concept.

Large chaotic systems do not lend themselves to this kind of thinking because the models break down when one piece is separated from another. Instead, the relationships and interactions are important and must be analyzed in the context of the whole system. This kind of thinking is so far outside the mainstream that even describing it is difficult.

The mismatch between reductionist and systemic thinking, and the reality that most people are used to thinking in a reductionist way makes it very difficult to communicate effectively about large scale systems like earth’s climate. It also makes it very easy for people to draw erroneous conclusions by taking events out of context. For example: “It’s really cold today so ‘global warming’ must be a hoax!”; or “It’s really hot today so ‘global warming’ must be real!”

Some people like to use those kinds of errors to their political advantage. They will pick an event out of context that serves their political agenda and then promote it as “the smoking gun” that proves their point. You can usually spot them doing this when they also tie their rhetoric to fear or hatred since those emotions tend to turn off people’s brains and get them either nodding in agreement or shaking their heads in anger without any deeper thought.

The realities of climate change are large scale and systemic. Very few discrete events can be accurately assigned to it. The way to think about climate change is to look at the large scale features of events overall. As for this polar vortex in particular, the correct climate questions are:

  • Have these events (plural not singular) become more or less frequent or more or less violent?
  • How does the character of this event differ from previous similar events and how do those differences relate to other climate factors?
  • What can we predict from this analysis?

Aug 26 2013
 

One of the problems with machine learning in an uncontrolled environment is lies. Bad data, noise, and intentional or unintentional misinformation complicate learning. In an uncontrolled environment any intelligence (synthetic or otherwise) is faced with the extra task of separating truth from fiction.

Take GBUdb, for example. Message Sniffer’s GBUdb engine learns about IP behaviors by watching SNF’s scan results. Generally if a message scan matches a spam or malware rule then the IP that delivered the message gets a bad mark. If the scanner does not find spam or malware then the IP that sent the message is given the benefit of the doubt and gets a good mark.
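A rough sketch of that marking scheme follows. The real GBUdb engine isn’t shown in this post, so the class names, the probability scale, and the example address are illustrative assumptions rather than the actual implementation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class IpRecord:
    good: int = 0
    bad: int = 0

    def probability(self):
        """Illustrative scale: -1.0 means always clean, +1.0 means always bad."""
        total = self.good + self.bad
        return 0.0 if total == 0 else (self.bad - self.good) / total


class ReputationDb:
    """Hypothetical stand-in for an IP reputation store."""

    def __init__(self):
        self.records = defaultdict(IpRecord)

    def record_scan(self, ip, matched_rule):
        """Give the delivering IP a bad mark on a rule match, else a good mark."""
        rec = self.records[ip]
        if matched_rule:
            rec.bad += 1
        else:
            rec.good += 1
        return rec


db = ReputationDb()
for matched in (False, False, False, True):  # three clean scans, one rule match
    rec = db.record_scan("192.0.2.10", matched)
print(rec, rec.probability())
```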

In a perfect world this simple algorithm generates reliable statistics about what we can expect to see from any given IP address. As a result we can use these statistics to help Message Sniffer perform better. If GBUdb can predict spam and malware from an IP with high confidence then we can safely stop looking inside the message and tag it as bad.

Similarly if GBUdb can predict that an IP address only sends us good messages then we can let the message through. Even better than that — if the message matches a new spam or malware rule then most likely we’ve made a mistake. In that case we can turn off the troublesome rule, let the message through, and raise a flag so bigger brains can take a look and fix the error.

Right?

Not always!

Message Sniffer’s Auto-Panic feature does a fantastic job of helping us catch problems before they can cause trouble, but Auto-Panic can also be tricked into letting more spam through the filters.

When a new pre-tested spam campaign is launched on a new bot-net there is some period of time where completely unknown IP addresses are sending messages that are guaranteed (pre-tested) not to match any recognizable patterns. All of these IPs end up gathering good marks for sending “apparently” clean messages… and since they are churning out messages as fast as they can they gain a good reputation quickly.

Back at the lab the SortMonsters and RuleBots are hard at work analyzing samples and creating rules to recognize the new campaign. This takes a little bit of time and during that time GBUdb can’t help but become convinced that some of these IPs are good sources. The statistics prove it, after all.

When the new pattern rules get out to the edges the Auto-Panic feature begins to work against us. When the brand new pattern rules find spam or malware coming from one of these new IPs it looks like a mistake. So, Auto-Panic fires and turns off the new rules!

For a time the gates are held wide open. As new bots come online they get extra time to sneak their messages through while the new rules are suppressed by Auto-Panic. Not only that but all of the new IPs quickly gain a reputation for sending good messages so that they too can trigger the Auto-Panic feature.

In order to solve this problem we’ve introduced a new behavior into the learning engine. We’ve made it skeptical of new, clean IPs. We call it White-Guard.

White-Guard singles out IPs that are new to GBUdb and possibly pretending to be good message sources. Instead of taking the new statistics at face value the system decides immediately not to trust them and not to distrust them either. The good and bad counts are artificially set to the same moderately high value.

It’s like a stranger arriving in a small town. The town folk won’t treat the stranger badly, but they also won’t trust them either. They withhold judgement for a while to see what the stranger does. Whatever opinion is ultimately formed about the stranger they are going to have to earn it.

In GBUdb, the White-Guard behavior sets up a neutral bias that must be overcome by new data before any actions will be triggered by those statistics. Eventually the IP will earn a good or bad reputation but in the short term any new “apparently” clean IPs will be unable to trigger Auto-Panic.
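To see why equal, moderately high counts slow things down, here is a small sketch of the neutral bias. The starting value, the trust threshold, and all of the names are hypothetical and picked only to make the arithmetic visible. The production parameters are not given in this post.

```python
from dataclasses import dataclass

NEUTRAL_COUNT = 32       # hypothetical "moderately high" starting value
TRUST_THRESHOLD = -0.5   # hypothetical: at or below this an IP looks trusted


@dataclass
class IpRecord:
    good: int = 0
    bad: int = 0

    def probability(self):
        """Illustrative scale: -1.0 means always clean, +1.0 means always bad."""
        total = self.good + self.bad
        return 0.0 if total == 0 else (self.bad - self.good) / total


def white_guard_record():
    """Neutral starting point: neither trusted nor distrusted."""
    return IpRecord(good=NEUTRAL_COUNT, bad=NEUTRAL_COUNT)


def can_trigger_auto_panic(rec):
    return rec.probability() <= TRUST_THRESHOLD


def clean_scans_needed(rec):
    """Count the clean scans required before the record looks trusted."""
    rec = IpRecord(rec.good, rec.bad)  # work on a copy
    scans = 0
    while not can_trigger_auto_panic(rec):
        rec.good += 1
        scans += 1
    return scans


print(clean_scans_needed(IpRecord()))            # unguarded new IP
print(clean_scans_needed(white_guard_record()))  # guarded new IP
```

With these made-up numbers an unguarded stranger looks trustworthy after a single clean message, while a guarded one has to send dozens first. That delay is the window the new rules need.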

With Auto-Panic temporarily out of reach for these sources new pattern rules can take effect more quickly to block the new campaigns. This earns most of the new bot-net IPs the bad reputations they deserve and helps to increase early capture rates.

Since we’ve implemented this new learning behavior we have seen a significant increase in the effectiveness of the GBUdb system as well as an improvement in the accuracy of our rule conflict instrumentation and sampling rates. All of these outcomes were predicted when modeling the dynamics of this new behavior.

It is going to take a little while before we get the parameters of this new feature dialed in for peak performance, but early indications are very good and it’s clear we will be able to apply the lessons from this experiment to other learning scenarios in the future.

 

Jun 07 2013
 

The new blackhatzes on the scene:

In the past few weeks we’ve seen a lot of heavy new spam coming around, and most of it is pre-tested against existing filters. This has caused everybody to leak more spam than usual. Message Sniffer is leaking too because the volume and variability are so much higher than usual. That said, we are a bit better than most at stopping some of the new stuff.

The good thing about SNF is that instead of waiting to detect repeating patterns or building up statistics on sender behaviors our system concentrates on finding ways to capture new spam before it is ever released by reverse engineering the content of the messages we do see.

Quite often this means we’ve got rules that predict new spam or malware hours or even days before they get into the wild. Some pre-tested spam will always get through though because the blackhatzes test against us too, and not all systems can defend against that by using a delay technique like gray-listing or “gauntlet.”

What about the little guys?

This can be particularly hard on smaller systems that don’t process a lot of messages and perhaps don’t have the resources to spend on filtering systems with lots of parts.

I was recently asked: “what can I do to improve SNF performance in light of all the new spam?” This customer has a smaller system in that it processes < 10000 msg / day.

One of the challenges with systems like this is that if a spammer sends some new pre-tested spam through an old bot, GBUdb might have forgotten about the IP by the time the new message comes through. This is because GBUdb will “condense” its database once per day by default… so, if an IP is only seen once in a day (like it might on a system like this) then by the next day it is forgotten.

Tweaking GBUdb:

The default settings for GBUdb were designed to work well on most systems and to stay out of the way on all of the others. The good news is that these settings were also designed to be tweaked.

On smaller systems we recommend using a longer time trigger for GBUdb.

Instead of the default setting which tells SNF to compress GBUdb once per day:

<time-trigger on-off='on' seconds='86400'/>

You can adjust it to compress GBUdb once every 4 days:

<time-trigger on-off='on' seconds='345600'/>

That will generally increase the number of messages that are captured based on IP reputation by improving GBUdb’s memory.

It’s generally safe to experiment with these settings to extend the time further… although that may have diminishing returns because IPs are usually blocked by blacklists after a while anyway.

Even so, it’s a good technique because some of these IPs may not get onto blacklists that you are using – and still more of them might come from ISPs that will never get onto blacklists. GBUdb builds a local impression of IP reputations as it learns what your system is used to seeing. If all you get is spam from some ISP then that ISP will be blacklisted for you even if other systems get good messages from there. If those other systems also use GBUdb then their IP reputations would be different so the ISP would not be blocked for them.
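If you want to try other intervals, the seconds value is just days multiplied by 86400. The tiny helper below is a convenience sketch, not part of SNF. The function name is made up, and it only reproduces the element format shown above.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def time_trigger(days):
    """Build the time-trigger element for a compression interval in days."""
    seconds = int(days * SECONDS_PER_DAY)
    return f"<time-trigger on-off='on' seconds='{seconds}'/>"


print(time_trigger(1))  # the default, once per day
print(time_trigger(4))  # the 4 day setting suggested above
print(time_trigger(7))  # a longer experiment
```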

If you want to be adventurous:

There is another way to compress GBUdb data that is not dependent on time, but rather on the amount of memory you want to allocate to it. By default the size-trigger is set to about 150 megabytes. This setting is really just a safety. But on today’s systems this really isn’t much memory so you could turn off the time trigger if you wish and then just let GBUdb remember what it can in 150 MBytes. If you go this route then GBUdb will automatically keep track of all the IPs that it sees frequently and will forget about those that come and go. On systems that have the memory to spare I really like this method the most.

You can find complete documentation about these GBUdb settings on the ARM site.

 

Apr 09 2013
 

Yet another family meeting:
convoluted, confused, and intertwined with
friends not usually seen, but heard in hazy,
non-descript one sided conversations to which
you’re not usually privy.

I phased in and out of this mysterious world,
painted an important cordial greeting upon my
face and drifted with the din of a multitude
of cute little cherubs, their brethren and
sisterhood hooting the crisp childhood greetings
of simpler times I only envision now in my dreams.

Drifting in and out, to and fro, on waves of mystical
chaos: warm in the glow that is family even if it is
somehow distant and even unfamiliar to my typically
ordered and precise state of mind.

Strangers, now not strange, flow into my personal
universe as if they were ghosts appearing in the dark
grey corridors of some tall and mystical hall to present
tidings of terror, or fear, or joy, or bliss; and we
engage in mindless conversation to comfort us in our
naked vulnerability.

Then as our strangeness fades into a comfortable enveloping
mist we become our own small army against the unknown
and begin to speak of thoughts, beliefs, and dreams…
the kinds of words usually reserved for only the closest
of kin and those you see every day; but now is an open
opportunity to collect a new ally in a potentially
dangerous fold, that of life in extended family where
the dragon in the dark is every aging skeleton you hide
in the closet of your mind – with you now locked in
close proximity to excited peers all curious to see and
know, and all armed with the keys of ignorance and open
questions.

“Keep your wits” you think when you are awake, but the
soothing chaos seems friendlier and warmer as time wears
on and you find yourself lulled to sleep, somehow comforted
by the incomprehensible din.

Away, across the room and a sea of jumbled souls all
embroiled in senseless conversations you see your anchor.
That one familiar face that you arrived with. That one
who dragged you to this forsaken alien world now more
familiar with each moment, and you realize the reason for
your peace isn’t a foolish sleepy tonic of calming chaos,
like the warm darkness of shock obliged to an animal once
caught in the jaws of its predator awaiting the final
passage from vital form to fodder.

This pleasant face, and its glow, this love, this other
soul to whom you are inexorably linked. This one has brought
you here again, and here is not so unfamiliar as it is an
extension of whatever was, what is, and what will always
be: family.

So roast the beast and sing the songs and contemplate the
murmur of countless hours in this company. It is a gift,
for the only true desert and dangerous ground for we
mortal beings of flesh and mind and soul and gifts of
spirit, the only true place of perishing in untempered,
unbearable rages of tempest and furies, the most horrible
wasteland which could cease our breath and silence our
voices in the loudest agonizing screams of pain and
terror is not here: that place is empty and alone, and
now, if you are here with me, you know there is nothing
so fearful for you, for you are not alone.

– Original (c) 1999 Pete McNeil

Apr 09 2013
 

On terrors and trials and troubles we tumble – so easily lost in this world’s wild jumble of chaos and rumors and strife and the hundreds of pointless distractions that cost us our marbles, and yet there’s a way if you manage to find it to keep from the fray all your virtues and kindness – a way to find joy, even bliss and good morrows. Your own private stock of fond memories to follow.

This path to good fortune is not for the timid and those who are on it could tell you some tales. Amid crisis and horror, between tears and sorrow, there’s monsters and fears and dark nights on each side of this quaint, narrow road full of light and bright moments – its peace and warm comfort in stark brilliant contrast to all of the dark scary places it goes past.

‘Tis love that I speak of and not simply friendship, but kinship – the kind that you find when you tarry along on your first timid step on this path that can be so uplifting or so very bad… but then once you have found them, this singular spirit, that follows you on and pretends not to fear it, you find you’re together and somehow the darkness is farther away and not nearly as heartless.

Your steps intertwine, there is dancing and wine and good words and good song and good cheer goes along and you find that no matter how hard the wind blows and no matter how scary the outside world grows, and no matter how shaky your next step may be, that your partner can help you, and does, day to day, in their subtle sweet magical spiritual ways.

You’re both stronger, and braver, and more fleet of foot and the sharp narrow path that upon you first took becomes broader and wider with each every day – soon as broad as each moment you have in your way and with each tender kiss and each loving caress you can light up the darkness – force evil’s regress.

Lonely souls fear to tread here and well so they should for this place is not for them – it does them no good and the road doesn’t widen and so they fall off. It is sad, but it happens more often than not. And it also is true for more than one in two that do venture this road that their love is not true and they find they’re apart in this harsh frightening place and they find that they can’t stand the look on their face and they stumble and cry as the frightening beasts beat them, then scurry away as their partners retreat, and they lose all the joy and tell sad sorry stories that frighten young children and prosper young lawyers.

So hold fast your other! You dare not let go. There is truth to this story. It’s not just for show! I’ve been down this long path more than once don’t you see and found many fierce terrible things in my spree. These beasties I speak of were you and were me as we fell off the path and made bitter decrees to get justice from all those around: our just share! when instead we were missing the love that was there.

So hold tight to your lover, make strong your belief, and find comfort in each other’s arms, and sow peace, and you’ll find after all that true love does survive all the slings and the arrows that life can provide, and in fact it repels them. It really is true… just remember this magic will take both of you.

– Original (c) 1998 Pete McNeil

Feb 18 2013
 

MicroNeil has always been interested in the application of synthetic intelligence to real-world problems. So, when we were presented with the challenge of protecting messaging systems (and specifically email) from abuse, we applied machine learning and other AI techniques as part of the solution.

Email processing, and especially filtering, presents a number of challenges:

  • The Internet is increasingly a hostile environment.
  • Any systems that are exposed to the Internet must be hardened against attack.
  • The value of the Internet is derived from its openness. This openness tends to be in conflict with protecting systems from attack. Therefore, security measures must be carefully crafted so that they offer protection from abuse without compromising desirable and appropriate operations.
  • The presence of abuse and the corresponding need for sophisticated countermeasures sets up an environment that is constantly evolving and growing in complexity.
  • There is disagreement on: what constitutes abuse, the design of countermeasures and safeguards, what risks are acceptable, and what tactics are appropriate.
  • All of these conditions change over time.

As a consequence of these circumstances any successful filtering system must be extremely efficient, flexible, and dynamic. At the same time it must respond to this complexity without becoming too complex to operate. This sounds like a perfect place to apply synthetic intelligence but in order to do that we need to use a framework that models an intelligent entity interacting with its environment.

The progressive evaluation model provides precisely that kind of framework while preserving both flexibility and control. This is accomplished by mapping a synthetic environment and the potential responses of an intelligent automaton (agent) onto the state map of the SMTP protocol and the message delivery process.

Each state in the message delivery process potentially represents a moment in the life of the agent where it can experience the conditions present at that moment and determine the next action it should take in response to those conditions. The default action may be to proceed to the next natural step in the protocol but under some conditions the agent might choose to do something else. It may initiate some kind of analysis to gather additional information or it might execute some other intermediate step that manipulates the underlying protocol.

The collection of steps that have been taken at any point and the potential steps that are possible from that point forward represent various “filtering strategies.” Filtering strategies can be selected and adjusted by the agent based on the changing conditions it perceives, successful patterns that it has learned, and the preferences established by administrators and users of the system.

The filtering strategies made available to the agent can be restrictive so that the system’s behavior is purely deterministic; or they can be flexible to allow the agent to learn, grow, and adapt. The constraints and parameters that are established represent the system policy and ultimately define what degrees of freedom are provided to the agent under various conditions. The agent works within these restrictions to optimize the performance of the system.

In a highly restrictive environment the agent might only be allowed to determine which DNSBLs to check based on their speed and accuracy. Suppose there are several blacklists that are used to reject new connections. If one of these blacklists were to become slow to respond or somehow inaccurate (undesirable) then the agent might be allowed to exclude that test from the filtering strategy for a time. It might also be allowed to change the order in which the available blacklists are checked so that faster, less comprehensive blacklists are checked first. This would have the effect of reducing system loads and improving performance by rejecting many connections early in the process and applying slower tests only after the majority of connections have been eliminated.
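As a rough illustration of that restricted case, the sketch below orders a set of blacklists by observed speed and leaves out any that are currently too slow or too inaccurate. The `BlacklistStats` record, the example list names, and the two policy limits are assumptions made for this example; they are not taken from any particular product.

```python
from dataclasses import dataclass


@dataclass
class BlacklistStats:
    """Hypothetical running statistics the agent keeps for one DNSBL."""
    name: str
    avg_latency_ms: float   # observed mean response time
    accuracy: float         # fraction of listings judged correct, 0..1


def plan_blacklist_checks(stats, max_latency_ms=500.0, min_accuracy=0.9):
    """Return the DNSBL names to check, fastest first.

    Lists that are too slow or too inaccurate are excluded for now.
    Both limits are assumed policy values set by the administrator.
    """
    usable = [
        s for s in stats
        if s.avg_latency_ms <= max_latency_ms and s.accuracy >= min_accuracy
    ]
    return [s.name for s in sorted(usable, key=lambda s: s.avg_latency_ms)]


observed = [
    BlacklistStats("slow-but-thorough.example", 320.0, 0.99),
    BlacklistStats("fast-and-light.example", 15.0, 0.95),
    BlacklistStats("timing-out.example", 2400.0, 0.98),
    BlacklistStats("noisy.example", 20.0, 0.60),
]
order = plan_blacklist_checks(observed)
print(order)
```

Re-running the planner as the statistics change is what would let an excluded list return to the strategy once it recovers.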

A conservative administrator might also permit the agent to select only cached results from some blacklists that are otherwise too slow. The agent might make this choice in order to gain benefit from blacklists that would otherwise degrade the performance of the system. In this scenario the cached results from a slow but accurate blacklist would be used to evaluate each message and the blacklist would be queried out of band for any cache misses. If the agent perceived an improvement in the speed of the blacklist then it could elect to use the blacklist normally again.
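The cache-only arrangement might look something like the following sketch. The class name, the `slow_lookup` callable, and the convention of returning `None` on a cache miss are all assumptions chosen for illustration.

```python
from collections import deque


class CachedBlacklist:
    """Answer from cache only; queue misses for an out-of-band lookup."""

    def __init__(self, slow_lookup):
        self._slow_lookup = slow_lookup   # callable: ip -> bool (listed?)
        self._cache = {}
        self._pending = deque()

    def check(self, ip):
        """Return True/False from the cache, or None on a miss."""
        if ip in self._cache:
            return self._cache[ip]
        if ip not in self._pending:
            self._pending.append(ip)      # remember to ask later
        return None                       # no opinion for this message

    def refresh(self):
        """Run the slow queries outside the message delivery path."""
        while self._pending:
            ip = self._pending.popleft()
            self._cache[ip] = self._slow_lookup(ip)


listed = {"192.0.2.10"}                   # stand-in for the real blacklist
blacklist = CachedBlacklist(lambda ip: ip in listed)

first = blacklist.check("192.0.2.10")     # miss: not yet cached
blacklist.refresh()                       # out-of-band query fills the cache
second = blacklist.check("192.0.2.10")    # hit
print(first, second)
```

A real cache would also need entries to expire; that detail is omitted here to keep the idea visible.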

ProgressiveEvaluationFramework

Figure 1 – Basic Progressive Response Model for Email Processing

Refer to Figure 1. Generally the agent “lives” in a sequence of events that are triggered by new information. At the most basic level it is either acting or waiting for new information to arrive. When information is added to its local context (red arrows) then that new information is applied (green arrows) to the current state of the agent (blue boxes).

If the new information is relevant and sufficient then it will trigger the agent to take action again and change its state thus moving the process forward. Each action is potentially guided by all of the information that is available in the local context including a complete history of all previous actions.

In Figure 1, an agent waiting asleep is prompted by its local context to let it know that a new connection has occurred. Let’s assume that this particular system is designed so that each agent is assigned a single connection. The agent acts by waking up and changing its state. That action (a change in its own state) triggers the next action which is to issue a command to test the local blacklist with the new IP. Then, the agent changes its state again to a branching state where it will respond to the local blacklist result once it is available. At this point the agent goes back to a waiting state until new information arrives from the test because it is unable to continue without new information.

Next, the local blacklist result arrives in the local context. This prompts the agent again causing it to evaluate the local blacklist result. Depending upon that result it will choose one of two filtering strategies to use moving forward: either rejecting the connection or proceeding to another test.
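The first few steps of that walk-through can be sketched as a small event-driven state machine. The state names, context keys, and action names below are invented for this example, and the wake-up step is folded into the first action for brevity.

```python
class ConnectionAgent:
    """Minimal agent: it acts only when new information makes a step possible."""

    def __init__(self):
        self.state = "asleep"
        self.context = {}    # local context: everything learned so far
        self.history = []    # complete record of actions taken

    def notify(self, key, value):
        """The local context adds information, then prompts the agent."""
        self.context[key] = value
        self._act()

    def _do(self, action, next_state):
        self.history.append(action)
        self.state = next_state

    def _act(self):
        if self.state == "asleep" and "connect_ip" in self.context:
            # Wake up and ask for the local blacklist test.
            self._do("test_local_blacklist", "await_local_blacklist")
        elif (self.state == "await_local_blacklist"
              and "local_blacklist_listed" in self.context):
            # Branch: pick one of two filtering strategies.
            if self.context["local_blacklist_listed"]:
                self._do("reject_connection", "rejected")
            else:
                self._do("start_next_test", "await_next_test")
        # Otherwise the information was not sufficient, so keep waiting.


agent = ConnectionAgent()
agent.notify("connect_ip", "192.0.2.10")
state_after_connect = agent.state
agent.notify("local_blacklist_listed", True)
print(agent.state, agent.history)
```

Information that is not relevant to the current state is still stored in the context, but it does not move the agent forward.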

This process continues with the agent receiving new stimuli and responding to those stimuli according to the conditions it recognizes. Each stimulus elicits a response and each response is itself a stimulus. The chain of stimuli and responses causes the agent to interact with the process following a path through the states made available to it by progressively selecting filtering strategies as it goes.

As each step is taken additional information about the session and each message builds up. Each new piece of information becomes part of the local environment for the agent and allows it to make more sophisticated choices. In addition to conventional test data the agent also builds up other information about its operating environment such as performance statistics about the server, other sessions that are active, partial results from its own calculations, and references to previous “experiences” that are “interesting” to its learning algorithms.

Agents might also communicate with each other to share information. This would allow these agents to form a kind of group intelligence by sharing their experiences and the performance of their filtering strategies. Each agent would gain more comprehensive access to test data and the workload of devising and evaluating new strategies would be divided among a larger and more diverse collection of systems.

The level of sophistication that is possible is limited only by the sophistication of the agent software and the restrictions imposed by system policies. This framework is also flexible enough to accommodate additional technologies as they are developed so that the costs and risks associated with future upgrades are reduced.

Typically any new technologies would appear to the agent as optional tools for new filtering strategies. Existing filtering strategies could be modified to test the qualities of the new tools before allowing them to affect the message flow. This testing might be performed deterministically by the system administrator or the agent might be allowed to adapt to the presence of the new tool and integrate it automatically once it learns how to use it through experimentation.

So far the description we have used is strictly mechanical. Even in an intelligent system it would be possible and occasionally desirable for the system administrator to specify a completely deterministic set of filtering strategies. However, on a system that is not as restrictive there are two opportunities for the intelligence of the agent to emerge.

Parametric adaptation might allow the agent to respond with some flexibility within a given filtering strategy. For example, if the local blacklist test were replaced by a local IP reputation test then the agent might have a variable threshold that it uses to judge whether the connecting IP “failed.” As a result it would be allowed to select filtering strategies based upon learning algorithms that adjust IP reputation thresholds and develop the IP reputation statistics.
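One very simple way to picture parametric adaptation is a threshold that is nudged whenever the agent later learns that a decision was wrong. The sketch below assumes a reputation score between 0 and 1 where higher is better, a fixed step size, and administrator-set bounds; a real learning algorithm would likely be more sophisticated.

```python
class ReputationThreshold:
    """Adjustable pass/fail threshold for an IP reputation test."""

    def __init__(self, threshold=0.5, step=0.05, low=0.2, high=0.8):
        self.threshold = threshold
        self.step = step
        self.low = low      # policy floor (assumed, set by the administrator)
        self.high = high    # policy ceiling (assumed, set by the administrator)

    def failed(self, reputation):
        """An IP fails when its reputation is below the threshold."""
        return reputation < self.threshold

    def feedback(self, reputation, was_abusive):
        """Nudge the threshold once the truth about a session is known."""
        rejected = self.failed(reputation)
        if rejected and not was_abusive:
            # False positive: be more lenient, within policy.
            self.threshold = max(self.low, self.threshold - self.step)
        elif was_abusive and not rejected:
            # Missed abuse: be stricter, within policy.
            self.threshold = min(self.high, self.threshold + self.step)


gate = ReputationThreshold()
gate.feedback(0.45, was_abusive=False)   # a good sender was rejected
after_false_positive = gate.threshold
gate.feedback(0.60, was_abusive=True)    # an abusive sender got through
print(after_false_positive, gate.threshold)
```

The bounds are what keep the agent inside the degrees of freedom that system policy allows.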

Structural adaptation might allow the agent to swap out components of filtering strategies. Segments of filtering strategies might be represented in a genetic algorithm. After each session is complete the local context would contain a complete record of the strategies that were followed and the conditions that led to those strategies. Each of these sessions could be added to a pool, evaluated for fitness (out of band), and the most successful strategies could then be selected to produce a new population of strategies for trial. A more sophisticated system might even simulate new strategies using data recorded in previous sessions so that the fitness of new filtering strategies could be predicted before testing them on live messages.
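A toy version of that genetic approach is sketched below. Strategies are represented as plain lists of test names, fitness scores are assumed to have been computed out of band from recorded sessions, and the selection, crossover, and mutation rules are the simplest ones that illustrate the idea. All names and numbers are hypothetical.

```python
import random

TESTS = ["local_bl", "dnsbl_fast", "dnsbl_slow", "helo_check", "content_scan"]


def crossover(a, b, rng):
    """Splice the front of one strategy onto the back of another."""
    cut = rng.randint(1, min(len(a), len(b)) - 1)
    return a[:cut] + b[cut:]


def mutate(strategy, rng, rate=0.1):
    """Occasionally swap a segment for a randomly chosen test."""
    return [rng.choice(TESTS) if rng.random() < rate else t for t in strategy]


def next_generation(scored, size, rng):
    """Breed a new population from the fitter half of (strategy, fitness) pairs."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    parents = [s for s, _ in ranked[: max(2, len(ranked) // 2)]]
    children = []
    while len(children) < size:
        a, b = rng.sample(parents, 2)
        children.append(mutate(crossover(a, b, rng), rng))
    return children


# Fitness values here are made up; they stand in for out-of-band evaluation.
scored = [
    (["local_bl", "dnsbl_fast", "content_scan"], 0.92),
    (["dnsbl_slow", "helo_check", "content_scan"], 0.61),
    (["helo_check", "dnsbl_fast", "dnsbl_slow"], 0.78),
    (["content_scan", "local_bl", "helo_check"], 0.40),
]
new_population = next_generation(scored, size=4, rng=random.Random(1))
for strategy in new_population:
    print(strategy)
```

This sketch does nothing to prevent a child from repeating the same test twice; a practical system would need rules for which segments can legally follow one another.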

Structural and parametric adaptation allow an agent to explore a wide range of strategies and tuning parameters so it can adopt strategies that produce the best performance across a range of potentially conflicting criteria. In order to balance the need for both speed and accuracy the agent might evolve a progressive filtering strategy that leverages lightweight tests early in the process in order to reduce the cost of performing more sophisticated tests later in the process. It might also improve accuracy by combining the scores of less accurate tests using various tunable weighting schemes in order to refine the results.
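To make the idea of a progressive strategy concrete, here is a minimal sketch that runs the cheapest tests first, blends their scores with tunable weights, and stops as soon as the blended score is decisive. The test names, costs, weights, and the two cut-off values are assumptions for illustration only.

```python
def progressive_evaluate(tests, weights, reject_at=0.8, accept_at=0.2):
    """Run tests cheapest first and stop once the blended score is decisive.

    tests:   list of (name, cost, run) where run() returns a score in 0..1
             (higher means more likely abusive).
    weights: dict of name -> weight used to blend the scores.
    Returns (verdict, blended_score, total_cost_spent).
    """
    scores, spent, score = {}, 0, 0.5
    for name, cost, run in sorted(tests, key=lambda t: t[1]):
        scores[name] = run()
        spent += cost
        total_weight = sum(weights[n] for n in scores)
        score = sum(scores[n] * weights[n] for n in scores) / total_weight
        if score >= reject_at:
            return "reject", score, spent
        if score <= accept_at:
            return "accept", score, spent
    return "unsure", score, spent


tests = [
    ("content_scan", 50, lambda: 0.90),   # expensive, accurate
    ("ip_reputation", 1, lambda: 0.75),   # cheap, rough
    ("helo_check", 2, lambda: 1.00),      # cheap, rough
]
weights = {"ip_reputation": 2.0, "helo_check": 1.0, "content_scan": 3.0}

verdict, score, spent = progressive_evaluate(tests, weights)
print(verdict, round(score, 3), spent)
```

In this run the two lightweight tests are enough to reach a decision, so the expensive content scan is never performed.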

Another interesting adaptation might depend on session specific parameters such as the connecting system address range, HELO, and MAIL FROM: information, header structure, or even the timing and sequence of the events in the underlying protocol. Over time the agent might learn to use different strategies for messages that appear to be from banks, online services, or dynamic address ranges.

Given enough flexibility and sensitivity it could learn to recognize early clues in the message delivery process and select from dozens of highly tuned filtering strategies that are each optimized for their own class of messages. For example it might learn to recognize and distrust systems that stall on open connections, attempt to use pipelining before asking permission, or attempt to guess recipient addresses through dictionary attacks.

It might also learn to recognize that messages from particular senders always include specific features. Any messages that disagree with the expected models would be tested by filtering strategies that are more “careful” and apply additional tests.

Systems with intelligent agents have the ability to adapt automatically as operating conditions change, new tests are made available, and test qualities change over time. This ability can be extended if collections of agents are allowed to exchange some of their more successful “formulas” with each other so that all of the participating agents can learn best practices from each other. Agents that share information tend to converge on optimal solutions more quickly.

There are also potential benefits to sharing information between systems of different types. Intelligent intrusion detection systems, application servers, firewalls, and email servers could collaborate to identify attackers and harden systems against new attack vectors in real time. Specialized agents operating in client applications could further accelerate these adaptations by contributing data from the end user’s point of view.

Of course, optimizing system performance and responding to external threats are only parts of the solution. In order to be successful these systems must also be able to adapt to changing stakeholder preferences.

Consider that a large scale filtering system needs to accommodate the preferences of system administrators in charge of managing the infrastructure, group administrators in charge of managing groups of email domains and/or email users, power users who desire a high degree of control, and ordinary users who simply want the system to work reliably and automatically.

In a scenario like this various parts of the filtering strategy might be modified or swapped into place at various stages based on the combined preferences of all stakeholders.

StructuralAdaptationPerUser

Figure 2 – Structural Adaptation Per User

At any point during the progressive evaluation process it is possible to change the remaining steps in the filtering strategy. The change might be in response to the results of a test, results from an analysis tool, a change in system performance data, or new information about the message.

In Figure 2 we show how the filtering strategy established by the administrator is followed until the first recipient is established. The first recipient is interpreted as the primary user for this message. Once the user is known the remainder of the filtering strategy is selected and adjusted based on the combined user, domain, group, and administrator preferences that apply.

Beginning with settings established by the system administrator each successively more specific layer is allowed to modify parts of the filtering strategy so that the composite that is ultimately used represents a blend of all relevant preferences. The higher, more general layers determine the default settings and establish how preferences can be modified by the lower, more specific layers.

BlendedPreferencesSelection

Figure 3 – Blended Profile Selection

Refer to Figure 3. The applicable layers are selected from the bottom up. The specific user belongs to a domain, a domain belongs to a group, a group may belong to another group, and all top level groups belong to the system administrator. Once a specific user (recipient) is identified then the applicable layers can be selected by following a path through the parent of each layer until the top (administrator) layer is reached. Then, the defaults set by the administrator are applied downward and modified by each layer along the same path until the user is reached again. The resulting preferences contain a blend of the preferences defined at each layer.
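The bottom-up selection and top-down blending can be sketched in a few lines. The layer names, the preference keys, and the `locked` set (used here as one possible way for a higher layer to limit what lower layers may change) are assumptions made for this example.

```python
def blended_preferences(layers, user):
    """Blend preferences along the path from the administrator down to a user.

    layers: dict of name -> {"parent": name or None,
                             "prefs": {...},
                             "locked": keys lower layers may not change}
    """
    # Select the applicable layers from the bottom up.
    path = []
    node = user
    while node is not None:
        path.append(node)
        node = layers[node]["parent"]

    # Apply defaults from the top down, honoring locks set by higher layers.
    blended, locked = {}, set()
    for name in reversed(path):
        layer = layers[name]
        for key, value in layer.get("prefs", {}).items():
            if key not in locked:
                blended[key] = value
        locked |= set(layer.get("locked", ()))
    return blended


layers = {
    "admin": {"parent": None,
              "prefs": {"reject_threshold": 0.8, "use_dnsbl": True,
                        "quarantine": False},
              "locked": {"use_dnsbl"}},
    "group-a": {"parent": "admin", "prefs": {"quarantine": True}},
    "example.com": {"parent": "group-a", "prefs": {"reject_threshold": 0.7}},
    "pat@example.com": {"parent": "example.com",
                        "prefs": {"use_dnsbl": False,
                                  "reject_threshold": 0.9}},
}
prefs = blended_preferences(layers, "pat@example.com")
print(prefs)
```

Here the user's attempt to turn off the DNSBL check is ignored because the administrator layer locked that setting, while the user's own threshold wins over the domain's.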

CompositeStrategyInteraction

Figure 4 – Composite Strategy Interaction

It is important to note that these drawings are potentially misleading in that they may appear to show that the agent is responsible for executing the SMTP protocol and all that is implied. In practice that would not be the case. Some of the key states in the illustrated filtering strategies have been named for states in the SMTP protocol because the agent is intended to respond to those specific conditions. However the machinery of the protocol itself is managed by other parts of the software – most likely embedded in the machinery that builds and maintains the local context.

You could say that the local context provides the world where the intelligent agent lives. The local context implements an API that describes what the agent can know and how it can respond. The agent and the local context interact by passing messages to each other through this API.

Typically the local context and the agent are separate modules. The local context module contains the machinery for interacting with the real world, interpreting its conditions, and presenting those conditions to the agent in a form it can understand. The agent module contains the machinery for learning and adapting to the artificial world presented by the local context. Both of these modules can be developed and maintained independently as long as the API remains stable.
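A bare-bones picture of that separation might look like the sketch below. The event shape, the method names, and the set of permitted responses are hypothetical; the point is only that the agent sees events and returns responses, while the protocol machinery stays on the other side of the API.

```python
from typing import Protocol


class Agent(Protocol):
    """What the local context expects from any agent module."""

    def on_event(self, event: dict) -> str: ...


class LocalContext:
    """Owns the protocol machinery; shows the agent a simplified world."""

    ALLOWED = {"continue", "reject", "run_test"}   # what the agent may do

    def __init__(self, agent: Agent):
        self.agent = agent
        self.log = []

    def publish(self, kind, **data):
        """Describe a condition to the agent and carry out its response."""
        event = {"kind": kind, **data}
        action = self.agent.on_event(event)
        if action not in self.ALLOWED:
            action = "continue"    # fall back to the protocol's default step
        self.log.append((kind, action))
        return action


class CautiousAgent:
    """A trivial agent that rejects mail addressed to unknown users."""

    def on_event(self, event):
        if event["kind"] == "rcpt_to" and event.get("unknown_user"):
            return "reject"
        return "continue"


context = LocalContext(CautiousAgent())
helo_action = context.publish("helo", name="mail.example")
rcpt_action = context.publish("rcpt_to", unknown_user=True)
print(context.log)
```

Any other agent that implements `on_event` could be dropped in without touching the context module, which is the practical benefit of keeping the API stable.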

It should be noted that this kind of framework can be applied broadly to many kinds of systems – not just email processing and other systems on the Internet. It is possible to map synthetic intelligence like this into any system that has sufficiently structured protocols and can tolerate some inconsistency during adaptation. The protocols provide a foundation upon which an intelligent agent can “grow” its learning processes. A tolerance for adaptation provides a venue for intelligent experimentation and optimization to occur.

Further, the progressive evaluation model is also not limited to large-scale processes like message delivery. It can also inform the development of smaller applications and even specialized functions embedded in other programs. A lightweight implementation of this technique underpins the design of the pattern matching engine used in Message Sniffer. Unlike conventional pattern matching engines, Message Sniffer uses a swarm of lightweight intelligent agents that explore their data set collaboratively in the context of an artificial “world” that is structured to represent their collective knowledge. Each of these agents progressively evaluates its location in the world, its location in the data set, its own state, and the locations and states of its peers. This approach allows the engine to be extremely efficient and virtually immune to the number of patterns it must recognize simultaneously.

Broadly speaking, this technique can be applied to a wide range of tasks such as automated network management, data systems provisioning, process control and diagnostics, interactive help desks, intelligent data mining, logistics, robotics, flight control systems, and many others.

Of course, email processing is a natural fit for applications that implement the Progressive Evaluation Model as a way to leverage machine learning and other AI techniques. The Internet community has already demonstrated a willingness to “bend” the SMTP protocol when necessary, SMTP provides a good foundation upon which to build intelligent interactive agents, and messaging security is a complex, dynamic problem in search of strong solutions.

Nov 09 2012
 

I recently read a few posts that suggest any computer language that requires an IDE is inherently flawed. If I understood the argument correctly the point was that all of the extra tools typically found in IDEs for languages like C++ and Java are really crutches that help the developer cope with the language’s failings.

On the one hand I suppose I can see that point. If it weren’t for all of the extra nudging and prompting provided by these tools then coding a Java application of any complexity would become much more difficult. The same could be said for serious C++ applications; and certainly any application with mixed environments and multiple developers.

On the other hand these languages are at the core of most heavy-lifting in software development and the feature list for most popular IDEs continues to grow. There must be a reason for that. The languages that can be easily managed with an ordinary editor (perhaps one with good syntax highlighting) are typically not a good fit for large scale projects, and if they were, a more powerful environment would be a must for other reasons.

This got me thinking that perhaps all of this extra complexity is part of the ongoing evolution of software development. Perhaps the complexity we are observing now is a temporary evil that will eventually give way to some truly profound advancements in software development. Languages with simpler constructs and syntax are more likely throw-backs to an earlier paradigm while the more complex languages are likely straining against the edges of what is currently possible.

The programming languages we use today are still rooted in the early days of computing when we used to literally hand-wire our systems to perform a particular task. In fact the term “bug” goes all the way back to actual insects that would occasionally infest the circuitry of these machines and cause them to malfunction. Once upon a time debugging really did mean what it sounds like!

As the hardware of computing became more powerful we were able to replace physical wiring with machine-code that could virtually rewire the computing hardware on the fly. This is still at the heart of computing. Even the most sophisticated software in use today eventually breaks down into a handful of bits that flip switches and cause one logic circuit to connect to another in some useful sequence.

In spite of the basic task remaining the same, software development has improved significantly over time. Machine-code was better than wires, but it too was still very complicated and hardware specific. Remembering op codes and their numeric translations is challenging for wetware (brains) and in any case isn’t portable from one type of machine to another. So machine-code eventually evolved into assembly language which allowed programmers to use more familiar verbs and register names to describe what they wanted to do. For example you can probably guess that “add ax, bx” instructs the hardware to add a couple of numbers together and that “ax” and “bx” are where those numbers can be found. Even better than that, assembly language offered some portability between one chunk of hardware and another because the assembler (a hardware specific chunk of software) would keep track of the specific op codes so that software developers could more easily reuse and share chunks of code.

From there we evolved to languages like C that were just barely more sophisticated than assembly language. In the beginning, C was little more than a handy syntax that could be expanded into assembly language in an almost cut-and-paste fashion. It was not uncommon to actually use assembly language inside of C programs when you wanted to do something specific with your hardware and you didn’t have a ready-made library for it.

That said, the C language and others like it did give us more distance from the hardware and allowed us to think about software more abstractly. We were better able to concentrate on algorithms and concepts once we loosened our grip on the wiring under the covers.

Modern languages have come a long way from those days but essentially the same kind of translation is happening. It’s just that a lot more is being done automatically and that means that a lot more of the decisions are being made by other people, by way of software tools and libraries, or by the machinery itself, by way of memory managers, signal processors, and other specialized devices.

This advancement has given us the ability to create software that is profoundly complex – sometimes unintentionally! Our software development languages and development tools have become more sophisticated in order to help us cope with this complexity and the lure of creating ever more powerful software.

Still, fundamentally, we are stuck in the dark ages of software development. We’re still working from a paradigm where we tell the machine what to do and the machine does it. On some level we are still hand-wiring our machines. We hope that we can get the instructions right and that those instructions will accomplish what we have in mind but we really don’t have a lot of help with those tasks. We write code, we give it to the machine, we watch what the machine does, we make adjustments, and then we start again. The basic cycle has sped up quite a bit but the process of software development is still a very one-way endeavor.

What we are seeing now in complex IDEs could be a foreshadowing of the next revolution in software development where the machines will participate on a more equal footing in the process. The future is coming, but our past is holding us back. Right now we make educated guesses about what the machine will do with our software and our IDEs try to point out obvious errors and give us hints that help our memory along the way. In fact they are straining at the edges of the envelope to do this and the result is a kind of information overload.

The problem has become so bad that switching from one IDE to another is a lot like changing countries. Even if the underlying language is the same, everything about how that language is used can be different. It is almost as if we’ve ended up back in the machine-code days where platform specific knowledge was a requirement. The difference is that instead of knowing how to rewire a chunk of hardware we must know how to rewire our tool stack.

So what would happen if we took the next step forward and let go of the previous paradigm completely? Instead of holding on to the idea that we’re rewiring the computer to do our bidding and that we are therefore completely responsible for all of the associated details, we could collaborate with the computer in a way that allows us to bring our relative strengths together and achieve a superior result.

Wetware is good at creativity, abstraction, and the kind of fuzzy thinking that goes into solving new problems and exploring new possibilities. Hardware is good at doing arithmetic, keeping track of huge amounts of data, and working very quickly. This seems like two sides of a great team because each partner brings something that the other is lacking. The trick is to create an environment where the two can collaborate efficiently.

Working with a collaborative IDE would be more like having a conversation than editing code. The developer would describe what they are trying to do using whatever syntax they understand best for that task and the machine would provide a real-time simulation of the result. Along the way the machine would provide recommendations about the solution they are developing through syntax highlighting and co-editing, hints about known algorithms that might be useful, and simulations of potential solutions.

The new paradigm takes the auto-complete, refactoring, and object browser features built into current IDEs and extends that model to reach beyond the code base for any given project. If the machine understands that you are building a particular kind of algorithm then it might suggest a working solution from the current state-of-the-art. This suggestion would be custom fitted to the code you are describing and presented as a complete simulation along with an analysis (if you want it) of the benefits. If the machine is unsure of what you are trying to accomplish then it would ask you questions about the project using a combination of natural language and the syntax of the code you are using. It would be very much like working side by side with an expert developer who has the entire world of computer science at top of mind.

The end result of this kind of interaction would be a kind of intelligent, self-documenting software that understands itself on a very deep level. Each part of the code base would carry with it a complete simulation of how the code should operate so that it can be tested automatically on various target platforms and so that new modifications can be regression tested during the development process.

The software would be _almost_ completely proven by the time it was written because unit tests would have been performed in real-time as various chunks of code were developed. I say _almost_ because there are always limits to how completely any system can be tested and because there are always unknowns and unintended consequences when new software is deployed.

Intelligent software like this would be able to explain the design choices that were made along the way so that new developers could quickly get a full understanding of the intentions of the previous developers without having to hunt them down, embark on deep research efforts, or make wild guesses.

Intelligent software could also update and test itself as improved algorithms become available, port itself to new platforms automatically as needed, and provide well documented solutions to new projects when parts of the code base are applicable.

So, are strong IDEs a sign of weak languages? I think not. Instead, they are a sign that our current software development paradigm is straining at the edges as we reach toward the next revolution in computing: Intelligent Development Environments.