In this episode, we speak with Sander Mertens, principal applications engineer at RTI. Find out what Sander discovered when he compared performance metrics between RTI Connext DDS and open source DDS implementations. Also, learn how to best consider these metrics when architecting and scaling your own distributed IIoT systems.
In Episode 21 of The Connext Podcast:
- [2:23] - What inspired you to measure Connext DDS and compare it to other open source DDS Implementations?
- [4:22] - What are the most common metrics to measure?
- [7:02] - Why is it important for distributed systems to be as consistent and predictable as they are fast?
- [7:55] - How do we consider latency and throughput when you start to scale your systems?
- [9:30] - What value does Connext DDS provide in regards to QoS?
- [Blog] Benchmarking Connext DDS vs. Open Source DDS
- [Blog] Announcing the Latest RTI Perftest for Connext DDS
- [Datasheet] RTI Connext DDS Professional
Steven Onzo: Hi and welcome to episode 21 of the Connext Podcast. In today's episode, we speak with principal applications engineer, Sander Mertens. We'll discuss measuring the performance of RTI Connext DDS against Open Source DDS implementations. Sander explains what particular metrics he wanted to measure and what interesting discoveries he made while doing so. Also, learn why it's important for distributed systems to be as consistent and predictable as they are fast. We hope you enjoy this episode.
Niheer Patel: Hi there. Welcome to another episode of the Connext Podcast. This is your host Niheer Patel and today we're going to talk about performance: what it means and how we measure it. To help me with this, I have a guest here, Sander Mertens from our professional services team. Sander is a principal applications engineer and has recently written a blog on measuring performance and actually comparing Connext DDS to a couple other implementations you might find out there. Sander, thank you for your time here.
Sander Mertens: No problem.
Niheer Patel: What are the types of things you do on the professional services team?
Sander Mertens: Right. Yeah, so I work in RTI professional services and so what we do is we help customers be successful essentially by providing things like trainings, architecture workshops and quick starts. Basically anything to get customers to get up and running with building their systems.
Niheer Patel: We have a theme here at RTI where we are about making our customers successful and I hear this all the time from the professional services team. Your goal is customer success.
Sander Mertens: Yes, it is. Yes. We actually recently started a new initiative where we are now let's say preemptively onboarding customers with some initial training just so that they know who we are and what we do and that they also know something about the basics of our technology.
Niheer Patel: So you provide free training.
Sander Mertens: Well, so yeah, that's a good question. So we do make the distinction between the onboardings and the trainings. So onboardings are basically three sessions of two hours, whereas trainings are really much more involved and they span multiple days, full time. So obviously a training provides much more detail, but the onboardings do give you a nice introduction.
Niheer Patel: Yeah. I've actually sat in on one of the onboardings and I've seen how even in a few sessions you can tailor some of the initial discussions to customer use cases and what their needs are. You've written a blog and just give us a highlight of what you were trying to accomplish with writing this blog.
[2:23] - What inspired you to measure Connext DDS and compare it to other open source DDS Implementations?
Sander Mertens: Right. Yeah. So one of the things that we typically do in services is we help customers optimize their system and people select DDS because one of the reasons is because it's really fast, right? So they want to get the best performance out of their system. So, one of the questions that we regularly get is, "How do you do that?" So I was measuring RTI Connext performance a couple of weeks ago and I was actually putting it against some of the other Open Source implementations out there because why not, and I decided to write a blog about it because I thought that what I found was pretty interesting and also, what we learn is that when customers measure performance, there are so many variables to control that it gets ... It's very easy to measure things that aren't right or to draw conclusions that aren't exactly correct based on what you've measured. So I figured it would be a good thing to write a blog about that.
Niheer Patel: Yeah. You're really comparing apples to apples. There's-
Sander Mertens: Exactly.
Niheer Patel: Things like the round trip latency and we'll get into all of this, but that's comparing DDS to DDS and then when you're comparing DDS to other technologies and you don't make the distinction whether it's a protocol or connectivity framework, the performance comparisons become meaningless and could lead you down the wrong path.
Sander Mertens: Yes, that's right. I mean, so the blog focused on comparing one DDS against another DDS, which you would think is simple, but that already turns out to be pretty complicated. So yeah, if you start comparing DDS to other protocols, I mean it's possible, but you have to make sure that you're measuring the right things, even more so than when you're comparing DDS to DDS.
Niheer Patel: Okay. So let's maybe start with some of the fundamentals of performance, right? What are the types of things we'd want to measure? The things that come to my mind are latency, throughput and jitter and these are words that are often casually thrown around. So maybe you can enlighten us on what that means for DDS.
[4:22] - What are the most common metrics to measure?
Sander Mertens: Right, yeah. I mean, there are many metrics but those are the most common ones. So latency is basically the time that it takes for one message to go from one application to the next application. So latency is typically measured by taking the times at the two applications or doing like round trip measurements to make sure that you don't have to factor in things like the differences in the clock between applications, but basically it means how fast is my system.
Niheer Patel: It's important to make that distinction that it's application to application as opposed to when it's put on the wire versus when it's read off the wire, right? So there's these subtleties in how you're interpreting that performance measurement.
Sander Mertens: Right.
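The round-trip approach described above can be sketched in a few lines of Python. This is a minimal illustration, not RTI Perftest and not a DDS API: the function names (`recv_exact`, `echo`, `measure_round_trips`) are hypothetical, and a local socket pair with an echo thread stands in for the two applications. Because both timestamps are taken on the sending side, no clock synchronization between applications is needed.

```python
import socket
import threading
import time


def recv_exact(sock, size):
    """Read exactly `size` bytes, since a stream socket may return less."""
    data = b""
    while len(data) < size:
        chunk = sock.recv(size - len(data))
        if not chunk:
            raise ConnectionError("socket closed before the message was complete")
        data += chunk
    return data


def echo(sock, count, size):
    """Stand-in for the remote application: send each message straight back."""
    for _ in range(count):
        sock.sendall(recv_exact(sock, size))


def measure_round_trips(count=100, size=64):
    """Return `count` round-trip times in microseconds over a local socket pair."""
    near, far = socket.socketpair()
    worker = threading.Thread(target=echo, args=(far, count, size))
    worker.start()
    payload = b"x" * size
    samples_us = []
    try:
        for _ in range(count):
            start = time.perf_counter_ns()  # both timestamps come from one clock
            near.sendall(payload)
            recv_exact(near, size)
            samples_us.append((time.perf_counter_ns() - start) / 1000)
    finally:
        near.close()   # unblocks the echo thread if the loop failed early
        worker.join()
        far.close()
    return samples_us


if __name__ == "__main__":
    rtts = measure_round_trips()
    # Halving the round trip assumes both directions take equally long.
    print(f"estimated one-way latency: {min(rtts) / 2:.1f} us (best of {len(rtts)})")
```

Note that the timer starts and stops in application code, so the result is application-to-application time rather than time on the wire.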
Niheer Patel: How about throughput? What does that mean for DDS?
Sander Mertens: Right, yeah. So when latency tells you something about how fast can I send data from one side to the other side, throughput is more about how much data can I send in how much time. So here, I'm trying to really saturate the network, put as much data on the network as possible and then see how much of the data is actually being received by the other application.
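As a rough sketch of that calculation, throughput is the amount of data the receiving application actually got, divided by the time it took. The helper names and the numbers below are hypothetical and only illustrate the arithmetic.

```python
def throughput_mbps(bytes_received, seconds):
    """Throughput seen by the receiver, in megabits per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return bytes_received * 8 / seconds / 1_000_000


def delivery_ratio(messages_sent, messages_received):
    """Fraction of the sent messages that reached the receiving application."""
    if messages_sent <= 0:
        raise ValueError("messages_sent must be positive")
    return messages_received / messages_sent


if __name__ == "__main__":
    # Made-up run: 1.25 GB received in 10 seconds, 950,000 of 1,000,000 messages delivered.
    print(throughput_mbps(1_250_000_000, 10))
    print(delivery_ratio(1_000_000, 950_000))
```

Counting bytes on the receiving side matters here: when the network is saturated, what was sent and what arrived can differ.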
Niheer Patel: Is there any kind of relationship we should take away between latency and throughput? Is it typically as your throughput increases, we should expect latency to likely decrease or is that-
Sander Mertens: Right. Yeah, you're kind of stealing my thunder but that's exactly it. So latency and throughput are unfortunately mutually exclusive because when you want very low latency, that means that you have to make some decisions in your configuration that mean that throughput will go down, because the middleware can be less efficient in how it packages certain things. So it has to send out everything as soon as possible. Yeah, therefore your throughput will suffer and the other way around as well.
Niheer Patel: Okay. Okay. Let's see if I can give some more thunder back.
Sander Mertens: No, you're fine.
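The trade-off between sending everything immediately and packaging messages together can be illustrated with a toy model. This sketch is an assumption-laden simplification and does not describe how Connext DDS or any other middleware actually batches data: `simulate_batching`, the fixed arrival interval, and the payload and header sizes are all hypothetical. The sender holds messages until the oldest one has waited for a given budget, then sends them in one packet.

```python
def simulate_batching(n_messages, interval_us, budget_us,
                      payload_bytes=32, header_bytes=64):
    """Toy sender: hold messages until the oldest has waited `budget_us`."""
    if n_messages <= 0:
        raise ValueError("n_messages must be positive")
    packets = 0
    total_wait_us = 0
    batch = []  # arrival times of the messages waiting to be sent

    def flush():
        nonlocal packets, total_wait_us
        send_time = batch[0] + budget_us
        total_wait_us += sum(send_time - arrival for arrival in batch)
        packets += 1
        batch.clear()

    for i in range(n_messages):
        arrival = i * interval_us
        if batch and arrival - batch[0] > budget_us:
            flush()
        batch.append(arrival)
    if batch:
        flush()

    payload_total = n_messages * payload_bytes
    return {
        "packets": packets,
        "mean_wait_us": total_wait_us / n_messages,
        # Share of the bytes on the wire that are application payload.
        "wire_efficiency": payload_total / (payload_total + packets * header_bytes),
    }


if __name__ == "__main__":
    for budget in (0, 100):
        print(budget, simulate_batching(110, interval_us=10, budget_us=budget))
```

With these made-up numbers, a zero budget sends 110 packets with no added wait, while a 100-microsecond budget sends 10 packets with an average added wait of 50 microseconds. Fewer packets means less per-packet overhead, which is the efficiency that is given up when everything has to go out as soon as possible.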
Niheer Patel: So we talked about latency, throughput and jitter I would assume it applies to both.
Sander Mertens: Right.
Niheer Patel: So when we think about jitter-
Sander Mertens: So jitter is the variation between one measurement and the next. So for example, when you're measuring latency, you want your latency to be very consistent, like ideally a system always gives you the same latency for every message that you send and obviously that's not going to happen, but you want that variation to be as low as possible and that is what you measure with jitter.
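One way to put a number on that variation is sketched below. The choice of statistics is an assumption: the mean absolute difference between consecutive samples follows the "one measurement and the next" wording, and the standard deviation is a common alternative that tools may report instead. The function name and the sample values are hypothetical.

```python
import statistics


def jitter_stats(latencies_us):
    """Summarize how much a series of latency samples varies."""
    if len(latencies_us) < 2:
        raise ValueError("need at least two samples")
    steps = [abs(b - a) for a, b in zip(latencies_us, latencies_us[1:])]
    return {
        "mean_us": statistics.mean(latencies_us),
        "stdev_us": statistics.pstdev(latencies_us),
        "mean_step_us": statistics.mean(steps),  # change from one sample to the next
        "max_us": max(latencies_us),
    }


if __name__ == "__main__":
    steady = [100, 101, 99, 100, 100]
    spiky = [50, 50, 50, 50, 300]
    # Both series average 100 microseconds, but only one of them is predictable.
    print(jitter_stats(steady))
    print(jitter_stats(spiky))
```

The two made-up series have the same average latency, so an average alone would not tell them apart; the variation figures do.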
Niheer Patel: When we think about industrial control systems or any type of really distributed control system where we want high performance, right? We think, okay, we want it fast and it has to be fast and as long as it's as fast as it can be, we're good to go. But what I'm hearing from you is that there's an aspect of consistency in the timing of the delivery. It may not necessarily be the case that it has to be fast, but it should come at the same time every time if possible.
[7:02] - Why is it important for distributed systems to be as consistent and predictable as they are fast?
Sander Mertens: Right. This is where measuring these things starts to get a little bit hairy because ... Exactly like you said, you don't just want your system to be occasionally fast. You want it to be consistently fast. Or if not fast, at the very least you want it to be predictable, which is something that is very important in real-time systems. You don't want one message to take like two microseconds, which is great, but then the next message takes a second, which is not so great, right? So.
Niheer Patel: Yeah, I've worked on a defense program where we're running at these really tight loops and these systems are so tightly configured so that if you miss that sample by a microsecond or two microseconds, you've just blown your control loop and this could mean a fault in the system. This could mean you missed your critical deadlines and you've failed your mission. So we've talked about latency, we've talked about throughput and jitter. Fast is good but consistent is equally important. Now we're talking point to point, right? So you're sending it from one, you're doing a round trip to another point. How do we consider latency and throughput when we start to scale our system? So instead of two devices talking to each other, we have 100 devices talking to each other and in any form of data model that we could think of.
[7:55] - How do we consider latency and throughput when you start to scale your systems?
Sander Mertens: Right. This is what many customers find: that the tests that they are doing aren't really representative of the systems that they want to build. That's okay. You have to do some testing and you never can do a full test of your system when you're just figuring out what the middleware can do for you, but it's one of the variables that you have to consider. So the number of applications, your data model, it does affect your performance. So for example, when you have a system with like 20 nodes and they're sending data to each other and you want to have the best throughput, well, if that system, for example, is using unicast, then every message is going to incur some load on the network, and so sending it out 20 times basically decreases your maximum achievable throughput by a factor of 20. So these are things definitely to take into consideration. Yeah.
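The unicast arithmetic in that example can be written down as a back-of-the-envelope calculation. This is an idealized sketch: it assumes the sender's link is the only bottleneck and ignores protocol overhead, and the function name and the 1000 Mbps figure are hypothetical.

```python
def max_data_rate_mbps(link_mbps, receivers, multicast=False):
    """Highest rate of new data a sender can publish to every receiver.

    With unicast, each message crosses the sender's link once per receiver.
    With multicast, the sender puts each message on the wire once.
    """
    if receivers <= 0:
        raise ValueError("receivers must be positive")
    copies = 1 if multicast else receivers
    return link_mbps / copies


if __name__ == "__main__":
    print(max_data_rate_mbps(1000, 20))                  # unicast
    print(max_data_rate_mbps(1000, 20, multicast=True))  # multicast
```

For a 1000 Mbps link and 20 receivers, the unicast case leaves 50 Mbps of distinct data, against 1000 Mbps in the idealized multicast case.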
Niheer Patel: Okay. Thanks for that. When we talk about our value, the value that DDS provides to customers looking to scale systems, we talk about reliability and scalability and of course performance and security. What are some of the trade-offs we could consider making in our quality of service configuration to balance between something like throughput or latency or driving a scaled system?
[9:30] - What value does Connext DDS provide in regards to QoS?
Sander Mertens: Right. That's a very, very, very broad topic and that is why we have RTI services to help you with that. No, but the simple answer to that question is there is actually a QoS setting that is called latency budget and it provides you with a single dial to balance between latency and throughput. So with that latency budget, you can essentially specify I want my message to go out at least under like 100 microseconds or something like that. Then DDS knows how long it's going to take before sending out the message. Now there's also other things, for example, if you turn on some of the other quality of service settings like reliability, they will obviously have an impact on both throughput and latency. So what we typically see customers do is test the latency and throughput both with and without reliability.
Then there's obviously, and this is maybe even more important than the other two is the data model. So a data model is, I mean obviously if you have very large messages, then that is going to impact your latency and throughput as well. Besides the size, what is the layout of your data model? So if you have a very complex data model with strings and collections and all of that, then that might actually increase your latency as well because that requires you to spend more CPU time, right? So there's all these considerations that come into play when doing these kinds of measurements, which makes it a challenge.
Niheer Patel: Yeah, it certainly is not an easy task but that's why we have services. One of the lessons here is as you are building out your distributed system and scaling it upwards, continue to assess performance along the way: check the performance before and after you tweak a QoS setting, before you switch from unicast to multicast or before you add another node. And there's a tool that services can help with and that's our RTI Perftest. So that's available for free, you should certainly be using that every step of the way measuring your performance and again, contact professional services to help you with that. Are there any parting words you'd like to share with the audience?
Sander Mertens: So I would definitely urge people that when you're comparing different DDS implementations or protocols, one of the things to really make sure is that you're measuring the same statistics because that's also like a whole different world of complexity and different ways you can measure things, and so even when you have two tools that seem to report the same thing, make sure that they actually are, because more often than not they are not.
Niheer Patel: That's a good point. You don't want to compare an average from one tool to a median from another tool, right? These are completely different statistical attributes so we'll probably dig into that in a subsequent episode. I think it would be really important to touch on and explain why the fundamentals of statistics are really important for performance. Thank you, Sander. To the audience, contact professional services if you want to know more about scaling your system, measuring performance. Check out the RTI Perftest tool on our GitHub and we'll provide a link to that and thank you for your time.
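The difference between an average and a median is easy to see with a small made-up data set. The sketch below reuses the numbers from earlier in the conversation (most messages at two microseconds, one that takes a full second); the `summarize` helper and the sample count are hypothetical.

```python
import statistics


def summarize(latencies_us):
    """Report several statistics that tools sometimes present as 'the latency'."""
    return {
        "mean_us": statistics.mean(latencies_us),
        "median_us": statistics.median(latencies_us),
        "max_us": max(latencies_us),
    }


if __name__ == "__main__":
    # Nine messages at 2 microseconds and one that takes a full second.
    samples_us = [2] * 9 + [1_000_000]
    print(summarize(samples_us))
```

For this series the median is 2 microseconds while the mean is roughly 100,000 microseconds, so a tool that reports one cannot be compared directly with a tool that reports the other.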
Steven Onzo: Thanks for listening to this episode of the Connext Podcast. Stay tuned for our next episode where we talk to RTI CEO Stan Schneider about his most recent eBook, The Rise of the Robot Overlords. If you have any questions or suggestions for future interviews, please be sure to reach out to us either on social media or at email@example.com. Thanks and have a great day.