Tech Tuesday: Auto Code
This Tech Tuesday is a bit different, since it’s a collaboration. The genesis of this is an article that Michael Cain sent me about the struggles of embedded software in cars. It’s a multifaceted problem, and not just for cars, but for IoT[1] home appliances, airplanes, pipeline control systems, the electrical grid, etc.
Michael has contributed the first half of this. I will fill it out with my own comments at the end.
So, without further ado, here’s Michael!
Forty-some years ago, I was the poor schmuck in Bell Labs systems engineering who went around and told managers “It’s a software world. In the future your projects are much more likely to fail because of software than hardware.” That did not make me popular, as most project managers at Bell Labs at the time had hardware backgrounds. It didn’t make me wrong, though. A year or so after I’d been on my soapbox a senior vice president lost his job because his huge project’s software was a disaster. His replacement’s first action was to announce a major redesign and a one year slip in the schedule. This month’s issue of IEEE Spectrum has a fascinating article on the same basic idea titled “How Software Is Eating the Car.” The whole thing is worth reading. I’m going to pick out some of the more quotable parts of the article.
Electronics systems (including software) account for an increasing share of the total car cost. Like so many other devices, cars have become a collection of networked processors equipped with some specialized peripherals. In 2010 about 35% of the cost to produce a car was electronics and associated software. The article estimates that figure will reach 50% by 2030. Why? The ultrasonic sensors used to provide parallel parking assistance cost roughly $1,300 to replace. The radar sensors used to monitor blind spots and cross traffic about $2,000.
Automotive processors are specialized. The automotive environment is difficult to deal with. Large voltage spikes in the power system. Noisy in the radio-frequency sense. Vibration. Communication protocols not used anywhere else. The market for critical automotive processors is dominated by one company who makes its chips in Japan. From January through May 2021, the global automotive industry produced 4.1 million units fewer than they wanted to because that supplier could not meet demand. The problem is not that the chips themselves are difficult to make. The company has accumulated years of in-house solutions to the specific sub-problems in the automotive environment. Processor supply problems are not unique to the automotive world; personal computer manufacturers cringe any time they hear rumors that Intel or AMD are having production problems.
Even base model cars are approaching 100 separate processors running some subset of approximately 100 million lines of source code. Much of that software is written by subcontractors, and companies like GM or Toyota are simply integrators. Billions of combinations are possible. It is no longer possible for the auto manufacturers to build a test vehicle for each combination. It may be impossible for them to even simulate each one within reasonable time and cost. There is a non-zero chance that you will purchase a new car in the near future and be the first person to test that particular software configuration. This used to be referred to as the “Facebook model” for testing: roll the new version(s) out to 10% of the user base and see how loudly they scream.
Senior management at the car companies are simply not prepared. Peter Mertens is the former head of Audi’s R&D effort and a member of the board of directors. In a recent interview he said, “Run a job assessment with all top managers at VW, Audi, Porsche, BMW, and Daimler tomorrow and ask them to code a small game or a simple but working virus. If they are not able to do so, fire them immediately, because they are not fit for the job.” I think that’s rather extreme, but it’s a problem companies need to worry about. Not just automotive companies, either. The F-35 fighter jet project has become infamous for its software difficulties. Boeing’s unmanned test flight of its Starliner capsule failed to reach the International Space Station because of a software bug that should have been identified years earlier (approximate launch cost, $400 million).
Now that Michael has given you the highlights, let me fill in some more of the nitty gritty. For those of us who have dipped our toes[2] into the software development world, I expect much of this will have you all nodding right along. For the rest of you, let me just explain that software development is not at all like hardware development, and software consistently lags behind hardware, by a lot. And there are reasons for that. Some of them are ‘good’ (for certain values of good), and some of them are face-palmingly stupid, but they are the current state of things.
Let’s start with an analogy. It’s not perfect, but I think it will help. Think about a bolt with a right handed thread pitch. It’s a piece of hardware. If it had a software component, it would be basically two instructions:
if (tighten) turn(right); else if (loosen) turn(left); else exit();
Now, we could expand that with some error checking to make sure the diameter and thread pitch of the hole or nut is correct, but that’s it. Also, all that software is in your[3] head. Back when the world was all simple machines and hardware, life was good, life was simple.
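To make the "expand that with some error checking" idea concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `Thread` fields, the `turn_direction` name, and the choice to raise an error on a mismatch are not from any real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thread:
    """Thread spec for a bolt or its mating hole/nut (hypothetical fields)."""
    diameter_mm: float
    pitch_mm: float
    right_handed: bool = True


def turn_direction(bolt: Thread, hole: Thread, action: str) -> str:
    """Return 'right' or 'left' for the requested action, after checking the fit."""
    # Error checking: the bolt only works if it matches the hole or nut.
    if bolt.diameter_mm != hole.diameter_mm:
        raise ValueError("diameter mismatch")
    if bolt.pitch_mm != hole.pitch_mm:
        raise ValueError("thread pitch mismatch")
    if bolt.right_handed != hole.right_handed:
        raise ValueError("thread handedness mismatch")

    if action not in ("tighten", "loosen"):
        raise ValueError(f"unknown action: {action!r}")

    # A right-handed thread tightens to the right; a left-handed one is mirrored.
    tightening = "right" if bolt.right_handed else "left"
    loosening = "left" if bolt.right_handed else "right"
    return tightening if action == "tighten" else loosening
```

Even for a bolt, the checks outnumber the two lines of actual logic, which is roughly how it goes in real software too.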
Then the electrical engineers had to go and screw it all up with transistors and Integrated Circuitry (IC).
We don’t really need to go into the nuts and bolts of how the hardware, the IC chips and controllers, execute the logic, or how the machine code interfaces with those transistors. We need to go higher level, up the stack, as it were, because the problem is how those instructions are written. And because those instructions are written by humans, there will be errors. And because the humans writing those instructions are managed by people who have competing incentives[4] when it comes to controlling for those errors, we get more errors. Controlling for errors is a never-ending struggle, one complicated by the fact that high level development[5] lacks a certain degree of rigor. Why? It’s a trade-off. The higher level languages allow for more abstract thinking[6] and for a larger population to be proficient at programming, but in turn, you sacrifice the rigor of the languages that work more directly at the machine level.
So, let’s talk about how software gets developed, the making of the sausage, as it were. We’ll use a simple embedded system, an ultrasonic parking sensor for a car.
1. Build the req. This is the high level plan for what the software is supposed to do. For our sensor, it needs to be able to turn the sensor on, turn it off, and do something with the information from the sensor. Here is the first major decision that can cause problems, what is done with the information the sensor produces. Does it output pure signal response and leave it to the user to figure out what it means, or does it interpret the signal internally and output something like distance information? Or does it do both, and the user chooses based upon a hardware or a software switch?
2. Figure out how much memory you have to work with, and if it can be flashed (over-written). If it can be flashed, you have to plan for that. Because now it’s a code base that must be maintained.
3. Develop an architecture – in the simplest terms, this is the flowchart for the software, although software architecture is more complex than a simple flow chart. What are the inputs and outputs, how does the information flow, how is it changed, etc.
4. Review all of this with the development team, and the hardware engineering team, and possibly the marketing team, and whatever other business team feels like it should have a say in any of this and can get themselves on the calendar. Do this multiple times until you are in real danger of the schedule slipping. Then watch the schedule slip as the customer, or marketing, or some regulatory body, decides to stack additional features into the req, necessitating a review of the hardware by the engineering team, and a rethink of the software architecture, and round and round you go.
5. If you are smart, start writing your tests. If you are not smart, just skip straight to 6.
6. Start laying down code. Go back to 4. Repeat. Repeat so, so many times…
7. *A miracle happens! Code is in a testable state!*
8. Testing!
   8.1. Load the code onto the hardware and test according to the conditions in the req.
   8.2. Then debug the code.
   8.3. Or the hardware.
   8.4. Repeat 8.1 – 8.3 until testing passes. Be prepared at any time to return to 4 (Review).
9. Ship the product! Happy Day!
10. Learn that the customer has found a new and exciting way to use that hardware and software that no one thought of; OR! Learn that the customer has found a way to use the hardware and software that everyone on the hardware team and the development team was fully aware of, and had warned management about, but no one allocated the time or resources to fix that, because management could not grok that the user is just as inventive and smart as everyone on your team and was convinced that the user would never stumble across that issue.
11. Write and deploy a patch. Go back to 8.
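To give a feel for how small the "real" logic is compared to the process around it, here is a rough Python sketch of the parking sensor req from the first step, along with the kind of tests the "if you are smart" step is asking for. This is a toy under stated assumptions: the `ParkingSensor` class, its method names, the "raw" versus "distance" switch, and the fixed speed of sound are all invented for illustration and do not describe any actual automotive part.

```python
SPEED_OF_SOUND_M_PER_S = 343.0  # approximate, dry air at about 20 °C


class ParkingSensor:
    """Toy model of the req: turn on, turn off, report what the sensor saw."""

    def __init__(self, output_mode: str = "distance"):
        # The software switch from the req: raw signal or interpreted distance.
        if output_mode not in ("raw", "distance"):
            raise ValueError(f"unknown output mode: {output_mode!r}")
        self.output_mode = output_mode
        self.enabled = False

    def turn_on(self) -> None:
        self.enabled = True

    def turn_off(self) -> None:
        self.enabled = False

    def report(self, echo_time_s: float) -> float:
        """Return the raw echo time (seconds) or the derived distance (metres)."""
        if not self.enabled:
            raise RuntimeError("sensor is off")
        if echo_time_s < 0:
            raise ValueError("echo time cannot be negative")
        if self.output_mode == "raw":
            return echo_time_s
        # The pulse travels to the obstacle and back, so halve the path length.
        return SPEED_OF_SOUND_M_PER_S * echo_time_s / 2.0


# Tests written against the req, ideally before the code above exists.
def test_distance_mode() -> None:
    sensor = ParkingSensor("distance")
    sensor.turn_on()
    # A 10 ms round trip works out to 343 * 0.010 / 2 = 1.715 m.
    assert abs(sensor.report(0.010) - 1.715) < 1e-9


def test_raw_mode_passes_signal_through() -> None:
    sensor = ParkingSensor("raw")
    sensor.turn_on()
    assert sensor.report(0.010) == 0.010


def test_off_sensor_refuses_to_report() -> None:
    sensor = ParkingSensor()
    try:
        sensor.report(0.010)
    except RuntimeError:
        return
    raise AssertionError("expected RuntimeError while the sensor is off")
```

Note what the sketch leaves out: temperature compensation, noise, timeouts, multiple echoes, and whatever the vehicle bus expects. Each of those is another trip back to the review step.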
Now, do that for every single computer module the car has. Remember, it’s close to 100, and many of them are much, much more complex than a simple parking sensor. And because all embedded systems these days use flash memory, rather than having the instructions burned onto the chip, it’s entirely possible to have someone flash your device and have it do other things, like return crap data, or dangerous data, or something totally unexpected, because the hardware can probably do a whole lot more[7] than just run a parking sensor. So, a clever person could examine a sensor, and realize that there are more features that can be added to the software to do all manner of interesting things. Things that may not be healthy for the car. Or its operating system, or passenger.
Oh, and every device has different inputs and outputs, because you are not coordinating your vendors.
And if the device is more complex, and capable of more than on/off/send data, you have to start thinking about security, which, frankly, no one seriously does for embedded systems, because security adds a whole lot of complexity at multiple levels, and it requires a great deal of coordination between development teams. And if all your development is farmed out to multiple vendors, and you are just an integrator, that coordination is a hell of a problem. And it’s one that if the people managing the coordination are not dialed in to the scope and severity of the problem…
Worried yet? No (What is wrong with you?!)? Yes? Just say yes, for the love of…
Good, but we aren’t done yet.
Look back to item two in my list, and notice the last sentence:
Because now it’s a code base that must be maintained.
Oh, code maintenance, how I loathe thee…
Not my code, oh no! My code is well structured, and clean, and fully documented with clear, descriptive comments everywhere.
Not like your code, which has maybe six lines of comments for every 1000 lines of code, and functions that are only called once, while other functions are 1000 lines of repetitive code that should be worked into a handful of function calls, and it is all full of clever bits of compact logic that take a half hour to untangle, and conditionals that go nowhere and can never actually be executed and serve as nothing more than a visual distraction while I am trying to figure out where your bug is and how to fix it[8]. Because the team didn’t enforce any kind of commenting or documentation standards, or management was lax with code reviews because the schedule was slipping, or it was from a development house that thought ‘Agile Development’ meant being quick to market, or ‘Clean Code’ meant the file server got dusted once a week with a can of air, etc.
So now we don’t just have the difficulty of developing the software for an embedded system, and making it secure, and coordinating the integration, but we also have to maintain that software, across versions, and recalls, and bug fixes, and outsourcing, and in-sourcing it back, and acquisitions and mergers. Don’t discount those last bits. Inheriting someone else’s code from the same team is bad enough, inheriting code from a completely different company, especially if that company was in a different country… Sometimes it’s better to just re-write the whole thing from scratch.
Seriously, there are days I sincerely think about just going back to being an aerospace engineer; I hear SpaceX is hiring…
So, let’s sum this up. We have:
- Machinery that is using more and more computers
- To manage more and more of the subsystems
- Requiring more and more software to operate the computers
- Where the companies who are designing and assembling this machinery
- Do not have, and do not want, any expertise with regards to software development and management
- And they are trusting that the general public will remain largely ignorant of the mess they are putting themselves in[9].
Well, until Boeing exposed the problem with MCAS. No, not the problem with MCAS, the failure of MCAS exposed the problem of companies like Boeing who want to pretend they are simply systems integrators and don’t have to concern themselves with the software of the systems they are integrating. It boils down to this:
If you are a company, and you are installing into your product any electronic device with software that exists on a flash memory module, you have to start acting like a software company, even if you don’t write a single line of code[10]. You have to understand how the software is developed, and tested, so you can competently work with suppliers and understand the limitations of their hardware and software. You have to understand how patches work, and how security plays into all this. Management, and executives, have to have a clue; they can’t just shove this off on the engineering team. Because when the fit hits the shan, no one is going to care that OCP is the one who made the defective module for GM[11], or that the avionics control system on a 737 was written by a subcontractor in India.
By the way, EVs will be even more dependent on computers and embedded software. Guess which company is actually doing better at this (if not perfectly)? Yep, Tesla. They write everything in-house.
Oh, and just to cue up the next problem, all the EV batteries and their source materials are made or mined in China.[12]
[1] Internet of Things
[2] Or our whole selves.
[3] The person with the wrench.
[4] Release schedules, bonuses, ego, etc.
[5] Pretty much anything above working directly on the chip instruction set or memory registers.
[6] Compilers take care of translating the higher level languages to machine code.
[7] Hardware always outpaces software.
[8] Yes, I have a tool like this I am responsible for, why do you ask?
[9] Wait, I’ve heard this before… Oh Yeah! – Structural Engineering is the Art of molding materials we do not wholly understand into shapes we cannot precisely analyze, so as to withstand forces we cannot really assess, in such a way that the community at large has no reason to suspect the extent of our ignorance.
[10] And the more computers you have in your product, the sooner you will be writing lines of code for it, so get a jump on the mindset early.
[11] This is what happens when you repurpose the ED-209 system for pedestrian collision avoidance; it was only a matter of time before a farmer’s market became a bloodbath.
[12] I do enjoy using footnotes, thank you for noticing!
“Guess which company is actually doing better at this (if not perfectly)? Yep, Tesla. They write everything in-house.” So did my old company. Know what? They still couldn’t find test documentation given to the client a year ago. They still had people reinvent the wheel. They still had zero or little documentation created/saved in a location on a network where more than 1 or 2 guys knew where it was. We still had people leave who did the work and the remaining engineers couldn’t find the code, the test results, the original code, etc….and that’s been going on for years….
Just because they write their own code doesn’t mean that the company or the employees know where it is or what it does. People leave, get fired, die, retire, etc. If you’re not spending time documenting and saving it and controlling the info, you’re still hosed.
Oh, sure, you can still screw it up by the numbers even if you do everything in-house, and a lot of companies do just that*. It’s just orders of magnitude worse if you don’t have control of any of it because all of your development is farmed out.
In the past, GM could order up a fuel pump from a supplier and put it through some simple stress testing to see if it passes muster. And they could do that because the engineers at GM understood how to stress test a pump to failure, and they could take that pump apart and examine its components carefully, etc. That parking sensor… They might know how to stress test it (maybe, although maybe not), but can they test for security?
*Our core product is developed competently, but some of our tools… I mean, footnote 8 is there for a reason. The number of software products that have failed because the company could not maintain the code are legion.
Decades ago, when I worked at Bell Labs, performance review consisted of the department head and the supervisors who reported to them sitting down and discussing what each employee had accomplished. Lists were prepared in advance by the supervisors. One infrequent occurrence was a supervisor asking about a particular claim with, “I don’t recall seeing the technical memo for that.” When it turned out that there was no technical memo, everyone scratched that line off the accomplishment list — if it wasn’t documented and in Engineering Records’ hands, it didn’t count.
Bell Labs’ Engineering Records was amazing. In 1980 I requested a copy of something that had been written up in 1957. It took two days to get it because someone had to go off site, pull the microfiche, and produce a paper copy. But they knew exactly where it was archived.
I can’t imagine companies getting by with that these days. I saw an RFP the other day from the US Air Force. The work involved taking two copies of a printed circuit board that did some important function in one model of airplane still in service, reverse engineering it, and providing the necessary information so more could be manufactured. Apparently the documentation had gone missing…
“The work involved taking two copies of a printed circuit board that did some important function in one model of airplane still in service, reverse engineering it, and providing the necessary information so more could be manufactured. Apparently the documentation had gone missing…”
Vernor Vinge, in one of his deep future sci-fi novels, invented the concept of “programmer-archeologist”. His sole job was digging into the vast software libraries of a running starship (the slow kind, in a universe with no FTL) and figuring out if they already had solutions to given problems.
It was documented, in the book, as him getting bored one shift (a few years unfrozen) and digging back into how they handled “What time is it” and going back through the layers for calculating time based on their relativistic flight, back through layers and layers of software that would generate the time, all the way back to a primitive method that simply counted the seconds since some arbitrary date (i.e., Unix time, which started 1/1/70).
Thousands of years of software creation, hundreds of layers of code, and at the bottom — something some guy knocked off in a few hours.
Yep. Until recently, my state’s unemployment insurance system core ran on whatever the contemporary equivalent of an IBM System/370 is. At some point the state bought a “wrapper” for it that provided additional functionality by translating stuff between what the core could handle and a new external interface. Then they bought another wrapper. Then they had someone hack on a web interface that talked to the second wrapper, which talked to the first wrapper, which talked to the core that managed the official database. The whole system finally got replaced when the cost of maintaining an environment where that antique piece of binary core code would execute properly became prohibitive.
States generally buy their major software systems in combination with a maintenance contract from one of a small number of qualified vendors and get no access to the source code. At least in my state, this occasionally leads to the situation where the limiting factor in making some statutory change is how soon and at what price the vendor will implement a necessary modification in the software.
I wonder how many levels of compatibility wrapping the core part ran under. Back in the 80s, when I was a student, I had a summer job at a shop that still ran (amusingly, given the title of this post) some jobs written in Autocoder, an assembler for IBM 1401s, running on 370s. I’m pretty sure this was a 370 running a 360 emulator running a 1401 emulator.
My understanding was it was a S/370 binary originally running on the S/370 that the state owned at the time.
I’m tickled you caught that. I may not be old enough to remember Autocoder, but back when I was an undergrad, my boss was. There were tales told.
We are still in touch. The current joke is that he’s old enough to remember when packets had to be delivered by hand and the parity was checked with a slide rule.
Even back then, there was only one guy who could debug Autocoder problems, and he was a contractor, since his rare specialty put him way outside the salary structure.
People use things for not-designed uses all the time.
If I went down to the engineering department and said “make me a tool that moves something almost exactly 1/32 of an inch” then it would take them two months and cost about twenty thousand dollars just to design the thing. But a number-10 fine-thread bolt has 32 threads per inch, which means each turn of the bolt moves the end almost exactly 1/32 of an inch, and that means a bolt is actually a tool for moving something almost exactly 1/32 of an inch, and it costs twenty-five cents at Home Depot and there’s a big bin full of them.
******
The thing about Boeing’s MCAS is that the crashes were not actually a failure of function. The software was performing exactly as intended. It was behaving exactly as it was supposed to. It was doing what it was designed to do. The failure was in Boeing’s insistence (and the FAA’s agreement) that there was no special training needed for the new aircraft, that the software’s functions made the new aircraft behave exactly like the older one. And therefore nobody trained the aircrew with “when the plane starts going like this, it means the software is doing that, and here’s how you fix it”, and they crashed.
******
People really like to point to software as this Weird Evil Black-Magic Thing that’s just gonna rise-of-the-machines one day and kill us all in our beds. That doesn’t happen. What happens is people not understanding how to use the tools they’re using. “But that’s not their fault!” sure, but that doesn’t mean it’s some kind of weird not-predictable not-understandable software issue, it’s more like you tried to drive a screw with a hammer. “That’s silly, nobody does that!” right, because we know it doesn’t work, we don’t say that modern tools are too complex to understand and the hammer experienced a mysterious failure mode that the designer failed to identify due to incompetence or malfeasance.
*****
Sudden Acceleration Incidents have been around for decades, and every so often there’s a call for a Big Government Investigation, and it always, always turns out that someone stepped on the gas instead of the brake, and they always, always refuse to just come out and say this. Toyota got in trouble over it a few years ago, and there were several court cases over the issue, where someone insisted that they couldn’t not deny that they hadn’t not tested every potential not-optimal non-standard operational mode and that meant every Toyota was perpetually a half-second away from immediately and irrevocably going to maximum acceleration and driving directly off a cliff–or, rather that Toyota couldn’t prove that wasn’t true.
The problem is when the thing gets used for a non-designed purpose while it’s supposed to be doing its designed purpose. Nobody really cares if you take a COTS ultrasonic sensor, flash the EEPROM with a new bit of software that adjusts the emitter to a range dogs can hear, and use it to drive your neighbors nuts. But if you do it to the sensors on their new car so all the neighborhood dogs are attacking them every time they try to park…
Partly true. Remember, the problems didn’t really arise until angle-of-attack sensors failed. The plane has two angle-of-attack sensors, and the software did not have adequate error checking. One sensor was designated as primary, and one as secondary, and the software did not look at the data from the secondary unless the primary was reporting that it was in a failed state. That assumes that the sensor would know it has failed, which is a hell of an assumption. The software should have always been looking at both sensors, and if they were not within a certain agreement, flag it and disable MCAS.
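For the curious, the kind of cross-check I mean is not exotic. Here is a minimal Python sketch of the idea. To be clear, this is not Boeing's code or anything resembling it: the function names and the 5-degree threshold are made up for illustration, and a real system would also have to deal with filtering, timing, and what the crew gets told.

```python
import math
from typing import Optional

MAX_DISAGREEMENT_DEG = 5.0  # hypothetical threshold, for illustration only


def cross_checked_reading(sensor_a_deg: float, sensor_b_deg: float,
                          max_disagreement_deg: float = MAX_DISAGREEMENT_DEG
                          ) -> Optional[float]:
    """Return the averaged reading, or None if the two sensors cannot be trusted."""
    # A failed sensor may not know it has failed, so never rely on a self-report.
    # Reject readings that are not even valid numbers.
    if not (math.isfinite(sensor_a_deg) and math.isfinite(sensor_b_deg)):
        return None
    # Always compare both sensors; disagreement means at least one is wrong.
    if abs(sensor_a_deg - sensor_b_deg) > max_disagreement_deg:
        return None
    return (sensor_a_deg + sensor_b_deg) / 2.0


def automation_allowed(sensor_a_deg: float, sensor_b_deg: float) -> bool:
    """Flag the disagreement by refusing to let the automated system act on it."""
    return cross_checked_reading(sensor_a_deg, sensor_b_deg) is not None
```

With two sensors you can only detect a disagreement, not tell which one is lying, which is why the safe response is to flag it and stand down rather than pick a winner.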
It’s not, but the functionality is not transparent. If you did not understand how an IC piston engine worked, popping the hood of your car would not grant you the understanding you lack. As for complexity of tools, there was a Condo near Miami until just recently…
If the only connection between the fuel injection and the driver is the mechanical linkage to the foot pedal, then you are right. If there is anything else that can also manipulate that linkage, or the injector directly (or the power feeds, for EVs) – say, an adaptive cruise system, or an autopilot, then you could have a problem if that system is poorly designed.
…and that means a bolt is actually a tool for moving something almost exactly 1/32 of an inch…
More accurate than that if you have a well-marked circle attached to the bolt head so you can do accurate fractional turns. Some years back I got curious about something mechanical and it led me down a rabbit hole of old books learning how we got from large hand-carved wooden screws to Ramsden’s 125 threads per inch steel screws for precision survey and laboratory instruments in the 1790s (basically: use the screw to make a slightly more accurate gear, use the gear to make a slightly more accurate screw, repeat a few thousand times). Ramsden’s lab stuff could measure better than one ten-thousandth of an inch…
We have thousands of people dead this year because of a single, solitary “software malfunction.”
You’ve heard about the incident, but there were more news articles about dead mutton than dead people.
Re item 10: Back in the day, we had a saying, “No program is foolproof because fools are so ingenious.”
I always heard “Every time we create a fool-proof solution, nature invents a better fool”
Better yet: create a fool-generator, in computer-time.
Then plug all the holes that finds.
My problem is not that my users are fools, but that they are very clever and actively trying to make it do something that was not intended.
I’m fairly tangential to the meat and potatoes of this discussion…
But I can say I’m grateful to all the people in the 90’s and 00’s who were correct in the theory that they *could* code their own Data Warehouse, but overlooked the fact that they couldn’t maintain the Data Warehouse once they built it.
And a shout out to all the folks in the 10’s who were correct that you *could* code your own Data Science projects, but overlooked the fact that you can’t scale your Data Science projects to keep up with Business Requirements.
… on the topic of Systems Integrators (or companies acting as them)… will the component producers even allow them access to the code to test? I could see where they should, but I could also see where the component producers would tell them to pound sand. The ‘moat’ around fuel injectors is the cost/machinery of building them… the moat around code is cut/paste.
Speaking of software doing bad things because the developers and the hardware people don’t talk much…
https://en.wikipedia.org/wiki/Therac-25