Srs Bzns Lawlz

Yes we have no bananas. When queried about recent, inexcusable logistics issues, the vendor in question replied with “yes yes they will be there soon.” As a result, much drama ensued, and the broader hierarchy has yet to acknowledge culpability in triggering the subsequent fail-cascade.

Similarly, a major enterprise in the process of engaging an arch-rival in a target-leveraged merger used the occasion to liquidate all of the existing partner networks that enabled customer usage by the principal party, then announced a ‘new strategic venture’ with all the same partners, to take effect after a lengthy hiatus. The net effect is to severely impact customers’ ability to generate revenue for the innovators.

Three more incidents from an IT-centric perspective. First, an individual wants to know how to order enterprise system drives from different batches from his OEM vendor. Having dealt with incidents where pallet lots had common issues, I can appreciate the concern for statistical dispersion. I may not have the maths at hand to fully prove the case from a strictly statistical standpoint, but the issue here is that the wrong perspective is being used as the foundation.

The engineer rather needs to make a suits-based series of determinations: What is the business criticality of this system? What is the cost of temporary vs permanent outages? What holistic DR/BC plans have been made comprising platforms, processes, and procedures? Has the delivered system been validated and tested?

Physical modules being a constantly variable commodity, the expectation is that they will fail; therefore the whole system must be designed with that in mind. As an exercise, 20 people x $20 per hour x 1 business day of outage alone roughly equals the cost of a dual-controller RAID system chassis; base storage costs are separate at this point of the DR scope. Most business-grade storage systems now shipping support configurations such as redundant-pathed RAID 60 etc. at the SMB tiers. The solution isn’t to mix and match pallet batches; the solution is for the pointy-haired departments to do an honest assessment of true operations costs and the true risk of interruptions of business.
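For anyone who wants to plug in their own numbers, here is a rough sketch of that back-of-envelope exercise. The headcount and hourly rate come from the paragraph above; the 8-hour business day and the function name outage_labor_cost are my assumptions for illustration, and the figure covers idle labor only, not lost revenue or recovery work.

```python
def outage_labor_cost(headcount, hourly_rate, outage_hours):
    """Direct labor cost of staff idled for the duration of an outage."""
    return headcount * hourly_rate * outage_hours


# Figures from the text: 20 people at $20 per hour.
# Assumption (not in the text): one business day is 8 working hours.
one_day = outage_labor_cost(headcount=20, hourly_rate=20, outage_hours=8)
print(f"One-day outage, idle labor only: ${one_day:,}")
```

Under those assumptions the one-day figure comes to $3,200, which is the number to hold up against the price of the redundant hardware.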

The second issue for discussion requires a quote from the authoring individual:

[W]e are working on a new internal network for a small UAV and will base it on 100mbit Ethernet instead of serial coms that’s usually used for these types of aircraft. …

The flight computer is connected to a switch that distributes data to all other devices on the network. If this switch goes down all internal coms on the aircraft fails, which is less than ideal!

It should be obvious to the casual reader that this inquiry, if not an outright troll, is so grossly inappropriate that the individual should justly be censured from the project. More critically, it demonstrates a systemic incompetence on the part of management as this should never have seen the light of day.

For the lay reader, the inquirer is wondering how to replace a validated, mission-critical-certified system with an ad hoc commodity one for the sake of ease of use. I first draw your attention to the phrase ‘usually used’ in the first paragraph. The only rational reply that ought to be made here is “Why?” (are those systems usually used). Those systems are specifically designed at every level to be completely fault-tolerant, fail-safe, and redundantly scalable. The underlying question is “what is the cost if it breaks?” From an engineering standpoint, if the system is physically large enough to accommodate and utilize this kind of equipment, we’re not talking about a little hand-tossed pizzabox UAV; we’re talking about a regular aircraft subject to (now or eventually) full-on regular aircraft flight control regulations.

Again, the cost is “what happens when this goes kapunk?” Will the aircraft fall out of the sky and make a hole somewhere severely inconvenient? What’s the cost of replacing an entire deliverable vs this module on all deliverables? What external processes or regulations apply, or may come to apply, directly to this deliverable? How does this interact with indirect systems?

This ties into the ongoing hubris surrounding SCADA and similar systems. Supposing you have a fully certified fault tolerant controls system, what about your monitoring system? Is that built to a comparable level? What happens to your operational process if the non-fault-tolerant monitoring system reports data that is contrary to the data in the controls system?

If you look at the academic course offerings in the Engineering professions, I have yet to see any courses on ‘Whole-system operations/environments/management’ big-picture stuff. And as a consequence planes fall out of the sky, servers lose emails, and drama ensues.