AWS re:Inforce 2023 - Build code signing and crypto PKI tooling, featuring NVIDIA (DAP303)

Captions
- [Dmitry] Hello and welcome. How's everyone doing tonight? - [Audience] Good. - Wonderful. Well, if you're in this room, then you must be really curious about code signing, hardware security modules, or crypto PKI tooling, because in this session you will learn how NVIDIA built one of their code signing services using AWS CloudHSM. My name is Dmitry Kovalev. I'm an Account Manager with AWS, and I'm delighted to introduce this team of NVIDIANs: Daniel Major, Distinguished Security Architect, and Karthik Jayaraman, Senior Software Engineer, who will tell you about their journey on AWS, as well as about their CloudHSM use cases, which may very well be applicable to your particular scenarios too. So here is our plan for tonight, here is our agenda. In the best tradition of layered security, we will tell the story in layers. I'll first kick it off with a brief introduction where I will talk at a high level about the problem NVIDIA was trying to solve, and do a quick recap of CloudHSM just to level set everyone. I'll then pass it over to Daniel, who will go one step deeper into the problem: he'll do a code signing security primer and talk about NVIDIA's Simple Signing Service and how it addressed their challenges. We'll then transition to Karthik, who will do a true 300-level deep dive. In fact, he will talk about five mini problems, or use cases, that NVIDIA faced and how they leveraged AWS CloudHSM for their code signing service. We'll wrap it up with some closing thoughts and key learnings at the very end. That will be our plan for tonight. I hope you're all as excited as we are for this discussion, so let's go ahead and get started. I'd like to kick it off with a question: what is the first thing that comes to your mind when you hear the word NVIDIA? Just shout it out loud. GPUs, graphics. Yeah, all great starting points, but I think many of you might be surprised that it's actually way more than that. For the past 30 years, NVIDIA has been heavily investing in some serious innovations revolving around artificial intelligence and machine learning. They reinvented modern graphics with technologies such as path tracing and deep learning super sampling. They created their own version of the industrial metaverse, called Omniverse. They turbocharged science by creating a whole suite of GPU-accelerated applications and libraries for genomics research. They pretty much became the modern engine of artificial intelligence with technologies such as CUDA and the advance of foundational large language models. Last but not least, NVIDIA has been reshaping the future of autonomous vehicles with their DRIVE platform. Now, why is all of that important in the context of our discussion? If you look at the areas on the slide, all of them require software capabilities. That means NVIDIA is writing a lot of software, and when I say a lot, I really mean it. Today NVIDIA has a very wide software portfolio that ranges from Windows drivers and applications, to Linux kernel modules and packages, to GPU firmware, to networking firmware and the associated applications and operating systems. NVIDIA writes autonomous vehicle software. They also do a lot of attestation packages. And the list goes on and on. Today NVIDIA has more than 400 SDKs that they bring to this world. And all this software needs to be protected.
NVIDIA customers need to have confidence that the software has indeed been developed by NVIDIA and hasn't been altered or tampered with along the way. How do we do that? Through code signing. We digitally, cryptographically sign the software to ensure it hasn't been modified. Now I have a question for you. If you were to build a secure, highly available, highly performant code signing service capable of scaling to more than 60,000 signing events per day, how would you do it? And how would you manage the cryptographic keys in this situation? One way of doing it is by leveraging AWS CloudHSM. AWS CloudHSM is a hardware security module service that allows you to generate and use cryptographic keys in the AWS cloud. It helps you meet your corporate, contractual, and regulatory compliance requirements for data security by giving you complete control over your keys. And there are multiple reasons why CloudHSM might be a good choice for your deployments. Maybe you're looking for a solution that is highly performant, that meets your applications' performance requirements while at the same time addressing your reliability and high availability goals. Maybe you're looking for a solution that supports cloud elasticity by adding and removing HSM instances, securely replicating the keys between them, and load balancing between them to provide higher durability as well as improved capacity. Or maybe you're looking for a way to demonstrate compliance with key security regulations such as PCI, GDPR, HIPAA, or FedRAMP. Or maybe you're simply looking for an open solution that supports a wide range of cryptographic algorithms and standards such as PKCS #11, the Java Cryptography Extensions (JCE), OpenSSL, or CryptoAPI/CNG, just to name a few. At the end of the day, the main benefit AWS CloudHSM gives you is low latency access to a dedicated, secure root of trust that is completely under your control. So again, there are multiple reasons why CloudHSM might be beneficial for your deployment, and many of the reasons you see on this slide are why NVIDIA picked CloudHSM in their case. But I think nobody would tell the story better than the NVIDIANs themselves. So Daniel, why don't you take it over from here? - Thank you, Dmitry. Hello. My colleague Karthik and I are pleased to share with you today the tremendous work of a team of very talented engineers at NVIDIA. As Dmitry mentioned, NVIDIA has a very large software portfolio, and as our industry's security maturity increases, much of that software needs to be signed before it's released to customers or loaded onto production devices. This means that organizations such as yours and NVIDIA maintain a very large set of software signing keys, as well as internal systems to sign software with those keys. A little later in the presentation, Karthik and I will describe one of those systems at NVIDIA and some of the problems we solved along the way that might be relevant in your context. But enough about NVIDIA for the moment. Let's talk about you. How many of you here in the audience think you know, or can hazard a guess as to, where your organization stores your code signing keys? Hands up if you think you can hazard a guess. Awesome, there are some knowledgeable people in the audience. Of those people, how many think that your organization stores them in a hardware security module, or HSM? Cool. Good to see.
How many store some of your code signing keys in a secrets manager like KMS, or HashiCorp Vault, or CyberArk, or Secrets Manager? Even more. How many of your organizations store your code signing keys in a VM somewhere, on a signing server, in plain text? And how many think that you have copies of code signing keys in source control somewhere? I was hoping not to see that. It can be difficult for organizations to perform code signing securely. But just to make sure we're all on the same page, let's do a quick primer on what we mean by code signing and what it means to do it securely. Code signing is when a file, usually some binary code, which is why we call it code signing, is processed by a signing program. The signing program calculates a digest over the file and then enciphers the digest with a private signing key. This results in a signature for the file. And what use is the signature? Well, the recipient can use a verifier program to compare the file and its signature to see if they match, and it needs the public key corresponding to the private signing key to do this. If it verifies, that is, the verifier's calculation of the digest over the relevant portions of the file matches the signed digest in the signature block, we know that the file is three things. Authentic: it was published by somebody who knows the private signing key. Integral: it hasn't been subsequently modified. And, if the publisher only signs some, quote, "good," unquote, files, it meets some quality bar as per their release process. Because security maturity is improving, the demand for signed files, and hence code signing, is increasing quickly. This is driven by hardware manufacturers enabling secure boot in their devices, and hence the need for signed firmware; ecosystem evolution in operating systems like Windows, which is imposing ever more stringent requirements on what software is signed; and interest in software supply chain security by end customers who want to validate things like packages and containers before they ingest them into their own environments. Which means we need to do this at scale and securely. Sounds easy. But because you are at a data protection talk at AWS re:Inforce, you already know that cryptography merely changes what you have to worry about from a data integrity problem to a key management problem. And key management in the context of code signing can go very, very wrong. Here's the obligatory newspaper headline slide in a security talk. I'm not going to go over this, but it appears that in these cases either the private signing key was disclosed and leaked, or the key was otherwise used to sign a malicious or otherwise unauthorized binary. And these can lead to really bad days if you maintain signing keys. So in order to prevent bad days, permit me to suggest a few informally stated best practices for protecting your code signing keys when designing or adopting a code signing system. First, and we alluded to this before, key protection comes in two flavors: disclosure protection and destruction protection. What I mean by disclosure protection is simply keeping copies of the key from walking away, keeping the plain text of the private key from being accessible to a malicious actor, an attacker. Because if an attacker obtains the plain text of your private key, they can sign files that appear to be authentic, and this negates the security assurance of code signing in the first place.
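Before turning to HSM-backed protection, here is a minimal sketch of the digest-sign-verify flow from the primer above, using Python's cryptography package. The key, file contents, and algorithm choices are illustrative only; real signing tools such as SignTool or GPG wrap the same idea in format-specific containers.

```python
# Minimal sketch of the sign/verify primer: hash-and-sign a file's bytes with a
# private key, then verify with the corresponding public key. The key is
# generated locally here purely for illustration; in a real service the private
# key would live in an HSM and never leave it.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa

file_bytes = b"example firmware image contents"  # stand-in for the file to sign

private_key = rsa.generate_private_key(public_exponent=65537, key_size=3072)
public_key = private_key.public_key()

pss = padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                  salt_length=padding.PSS.MAX_LENGTH)

# Publisher side: produce the signature that ships alongside the file.
signature = private_key.sign(file_bytes, pss, hashes.SHA256())

# Recipient side: verify authenticity and integrity with only the public key.
try:
    public_key.verify(signature, file_bytes, pss, hashes.SHA256())
    print("verified: authentic and unmodified")
except InvalidSignature:
    print("verification failed: file or signature was altered")
```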
One way to assure yourself, or raise your assurance of private key protection in terms of disclosure, is to use a hardware security module, or HSM. These are physical appliances which provide assurance that the plain text of the private key is not going to leave the cryptographic boundary of the HSM. Other, degraded alternatives might be, as we chatted about earlier, keeping the private key in some rich computing environment such as a VM or a server somewhere, but then the private key is exposed to the entire trusted computing base of that environment. Another best practice: prevent destruction of the private key. That simply means keep backups. Ensuring that there's an adequate backup strategy for all your keys helps prevent some really awkward conversations when you can no longer sign your software, especially if your keys can't be rotated. For hardware manufacturers this can be especially acute if they've enabled secure boot on their devices, because destruction of their code signing keys can mean either no more software updates or a very expensive recall to bring the devices back and reprogram the key that's used to validate software. Even if you've protected the plain text of the private key, you still need to implement some usage control. Each key will have different authorized signers or service accounts, and different operations, sign, verify, encrypt, decrypt, whatever, that you can perform with the signing key. Some use cases may require multi-party approval, that is, a group of people who have to come to some kind of consensus about whether a file is signed or not. Multi-party approval can be used to implement some quality checks. Another best practice is a signing quality bar: putting some scans and checks in the pipeline before signing. If your organization has a release management process, it has some criteria for what files can be released, and this is a convenient enforcement point, because after signing, the files can walk out the door. Finally, ensure you have logging and monitoring. Having a durable log of what file was signed at what time, with which keys, by whom, can be useful for audit and compliance, and it can also be really helpful in incident response when your organization's name shows up in newspaper headlines like we saw on the previous slide. Note that I'm not getting into topics like key rotation, certificate lifetimes, revocation, that kind of thing, but if you want to come talk to me about those topics after the talk, please do. So, building a system like this is non-trivial, and usually outside the scope of any one development team that has a different mission. Hence, NVIDIA's product security division maintains several different code signing services that address varying requirements of varying sensitivities. One of these, the Simple Signing Service, is aimed at lower sensitivity keys where signing at cloud scale is interesting and important. We wanted to call it S3, by the way, the Simple Signing Service, but apparently S3 was taken by some important large organization or something, so we flipped it around and call it 3S. And since 3S was aimed at lower sensitivity at scale and implemented in AWS, we explored using AWS CloudHSM for 3S's HSM backend. In the third part of this talk, Karthik will describe some of the more advanced requirements and problems we encountered along the implementation journey. But first I'll set the stage by describing 3S in general and some of the relevant features of CloudHSM. So 3S is a simple signing service. It provides a simple signing pipeline.
An authorized submitter logs in. They're authenticated by our IDP. They submit an asset in a project that they have rights to sign in. The file is dispatched to that project's backend, and after some quality checks and sanity checks, the signing program on the backend gets the file's digest signed with the keys held in CloudHSM. The asset is returned to the submitter and a copy is retained. I'll pause here and just note: you'll notice multiple per-project signing backends. Each of these backends is customizable per project or per file type. So you might have one that knows how to sign Windows applications, one that knows how to sign Mac applications, one that knows how to sign Linux packages, and so on. And they're all isolated from each other. As a turnkey service, 3S tries to cover the best practices that we discussed a few slides ago. Keys are backed up in an HSM and not extracted during signing. Each project is isolated from the others, as we saw on the previous slide with the multiple backends. Usage control: authorization is tied to NVIDIA's IDP, and there's fairly tight rule-based authorization for both users and service accounts that need to sign. Multi-party approval is also implemented, and Karthik will go over its design in detail. It's fairly flexible: there's a diverse set of signing environments. As you might appreciate, if you need to sign a Windows binary, you're typically going to use Microsoft SignTool, and SignTool needs to run in a Windows environment. Correspondingly, the Mac signing tool wants to run in a Mac environment, and GPG will typically run best in a Linux environment, and so on. So 3S has the flexibility to support different environments for backends. And it's scalable to large volumes. As Dmitry mentioned, we sign over 60,000 files a day, and we've signed over a million assets over the last two years. Finally, it's monitored 24-7 by the NVIDIA Security Operations Center, and all files are retained for subsequent analysis. We talked about how using an HSM is a best practice, and 3S is implemented in AWS, so we looked at using AWS CloudHSM. Here are some of the features that were relevant for us for code signing. But first, just to reinforce: CloudHSM is a bare metal offering. You're effectively renting a dedicated HSM appliance in an AWS data center. There's no shared tenancy, there's no virtualization, nor does AWS have access to the appliance during your tenancy. Instead, what's provided is a series of management services around the HSM appliance, which you rent exclusively. So, some of the features that might be relevant to you if you're considering a code signing service. Scalability: tens, hundreds, millions of signing operations per day are supported, and CloudHSM operates on a cluster model. You can automatically scale up the number of automatically duplicated instances in the cluster. If an instance fails, it's replaced and its state is restored. We have dozens of HSMs across various clusters. Second, and this is often overlooked when you're considering an HSM solution, is SDK support. We mentioned the various signing tools: SignTool, the Mac one, Linux, whatever. They need to send a digest to a remote HSM; they don't have direct access to the private signing key. And so an HSM vendor will provide you with a series of plugins, PKCS #11, KSP, OpenSSL engine, JCE, that enable your signing tool to interact with the HSM across the wire. Third, the CloudHSM service provides automatic backups: every 24 hours the HSM state is backed up.
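As a concrete illustration of that SDK-support point, a signing backend typically reaches the cluster through the vendor's PKCS #11 library. Below is a minimal sketch using the python-pkcs11 bindings; the library path, key label, and crypto user credentials are assumptions for illustration, not the values 3S actually uses.

```python
# Sketch: a signing backend asking CloudHSM to sign a digest over PKCS #11.
# The private key stays inside the HSM; only the signature comes back.
import pkcs11
from pkcs11 import Mechanism, ObjectClass

# Path to the CloudHSM PKCS #11 library (assumed; may differ per SDK version).
lib = pkcs11.lib("/opt/cloudhsm/lib/libcloudhsm_pkcs11.so")
token = lib.get_slots(token_present=True)[0].get_token()

data_to_sign = b"digest or file contents produced by the signing tool"

# The PIN carries the crypto user's credentials ("user:password" form assumed).
with token.open(user_pin="cu_windows_signing:example-password") as session:
    key = session.get_key(object_class=ObjectClass.PRIVATE_KEY,
                          label="windows-driver-signing-key")  # illustrative label
    signature = key.sign(data_to_sign, mechanism=Mechanism.SHA256_RSA_PKCS)
    print(f"received {len(signature)}-byte signature from the HSM")
```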
It's encrypted with keys held inside the HSM that are not available to AWS, and there's a forced retention period of seven days; I'd suggest you implement additional mitigations around retention. Finally, there's a reasonable set of features to address insider threats and insider protection, which Karthik will go over in a few moments. Karthik? - Thank you, Daniel. Thanks, Dmitry. Daniel and Dmitry gave us a good walkthrough of what CloudHSM is and how we built our code signing service, AKA 3S. I'm here to walk you through some of the interesting problems that we encountered along the code signing service journey and how we solved them for a subset of NVIDIA's code signing needs. Here are some of the problems, and I'll walk you through each of them and tell you how we resolved them. And this set of problems is pretty relevant for any organization or anyone who has to build a code signing service; these are key problems that you will have to solve one way or the other. Let's get to the first one: isolating projects. What do I mean by isolation? As Daniel mentioned, our code signing service supports 30 different, or 40-plus, signing workflows. Now, we're at the point where we've chosen CloudHSM as our key storage, but the key problem is: how do we make sure we isolate the keys that correspond to different signing projects? And you might ask, why do we need isolation? Think about the doomsday scenarios, threat actors and bad day scenarios. If an attacker gets access to a specific signing key or a specific signing workflow, we want to make sure we contain the blast radius. That is exactly why we isolate the signing workflows. And how do we do that? Fortunately, CloudHSM offers a feature called crypto users. What is a crypto user? A crypto user in CloudHSM is a user that you can create to hold and handle keys. Let's take an example. In the Windows signing project, we want to sign Windows artifacts, and let's take another example where we want to sign Linux artifacts. What we do is go to CloudHSM and create crypto user 1, and the role of crypto user 1 is to hold the keys that are specific to the Windows signing project. So we log in as crypto user 1, create the keys, and we're done with the key creation part. But how do you map these credentials to your signing service so that they can be accessed at runtime to sign the artifact? That's where we come up with fine-grained access policies, so that only the Windows signing project, in this case a Lambda, has access to the crypto user 1 credentials, which are stored in a secure location. And the same goes for Linux signing, because what we do is create a different crypto user, isolation, then create the keys that correspond to Linux signing and draft an access policy to map them to a different signing project. So we have technically isolated two different signing workflows. In a scenario where the crypto user 1 credentials are compromised by any means, we limit the blast radius to only the Windows signing project. The Linux signing doesn't get affected. That's the whole concept. But we still have one problem. What is the problem? The problem is, unfortunately, in the CloudHSM construct of a crypto user, the crypto user can do much more than access the keys to sign. What do I mean by much more? The crypto user has the ability to share the keys with other crypto users.
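A minimal sketch of the per-project credential mapping just described: each project's backend (a Lambda in this example) is only allowed to read its own crypto user credentials from a secrets store. The secret naming scheme and the choice of AWS Secrets Manager are assumptions for illustration.

```python
# Sketch: fetching only the calling project's crypto user credentials at runtime.
# IAM is assumed to grant each project's Lambda role secretsmanager:GetSecretValue
# on its own secret only, so a compromised Linux-signing backend cannot read the
# Windows-signing crypto user, and vice versa.
import json
import boto3

secrets = boto3.client("secretsmanager")

def get_crypto_user_credentials(project: str) -> dict:
    secret_name = f"3s/{project}/crypto-user"  # illustrative naming scheme
    response = secrets.get_secret_value(SecretId=secret_name)
    return json.loads(response["SecretString"])  # e.g. {"username": ..., "password": ...}

# A Windows signing Lambda would call this with its own project name only.
creds = get_crypto_user_credentials("windows-signing")
```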
And the crypto user, in some cases, can delete the keys. That is not optimal, because we are mounting or accessing these credentials from our signing service, and we do not want them used for anything other than signing. Unfortunately, CloudHSM does not have a construct that we can use to get rid of this problem outright. So instead, what we did is leverage a feature called shareKey in CloudHSM to resolve this scenario. What we essentially do is, let's go back to the same example, the Linux signing project. We first create a crypto user, let's say crypto user 2A in this case, who is the key owner. The key owner's responsibility is to create the keys. So we get into the system, create the 2A crypto user, and create the keys that are required for the Linux signing. Then we create an additional crypto user, 2B, who is the key user. And that is the distinction between this and the previous approach: we have separated the key owner from the key user. What we then do is share the key from crypto user 2A to 2B. So we create the user, share the key, and once we share the key, we lock out crypto user 2A. And why do we do that? We lock that user out so that even by accident or by insider threat the 2A user credentials can't be used, so no one can technically get into the system and delete the keys. Because crypto user 2B is a key user, not the key owner, that crypto user cannot share the key and cannot delete the key; the user can only do the signing operation. So we take crypto user 2B, come up with an access policy, and map it to the Linux signing project. Problem solved. This is how we make sure we prevent key loss in a bad case scenario. The next problem: how do we secure admin access? In any HSM there is the construct of an admin user; CloudHSM has one, and it's called the crypto officer. As the name implies, the crypto officer has far more capabilities than a crypto user. In addition to what crypto users can do, the crypto officer can create new users and delete users. Which means, in the case of an internal attack or a compromise, a crypto officer can do much wider damage to the system. How do we solve that? That is where we leverage CloudHSM's quorum authentication, which is M-of-N access control. I'll walk you through the exact sequence of how that happens; that's one of the ways we solve this issue. The second way is we make sure we lock down HSM access, so the only nodes that have HSM access are the signing services. No manual access is permitted, no other access is allowed. So we lock down the HSM and then we add quorum authentication on top of this lockdown to handle the rogue crypto officer scenario. Let me walk you through the workflow. Pretty simple and straightforward. The prerequisite here is, let's say we get a CloudHSM in AWS. We have to go ahead and create the crypto officers, because the whole process revolves around getting a quorum: as a crypto officer, if someone has to do something, it has to be approved by other crypto officers. That's the whole workflow. So after we get the CloudHSM, we create the crypto officers. The crypto officers in our case go through training and sign-offs, and then they register themselves with a username and password to CloudHSM. Once they've registered their credentials, we also want them to create a public-private key pair.
So once they create a key pair and register it with CloudHSM, we have met the prerequisites and can go ahead with the whole sequence. So what do we do? Let's take an example. Let's say crypto officer 1 has to reset a user password. A valid scenario, but a sensitive operation. What crypto officer 1 would do in this case is first request a quorum token from the CloudHSM. CloudHSM then responds with a quorum token, which crypto officer 1 takes and sends to the other crypto officers for review. (coughs) Excuse me. The other crypto officers now validate the use case of what crypto officer 1 is trying to do, and if they validate the operation and everything looks good, they approve the token. What do you mean by approve the token, you might ask. They simply sign the token with their private key; remember, the public key is already registered in the system. After approving the tokens, all the other crypto officers send the token back to crypto officer 1. Crypto officer 1 now collects all the tokens and applies them to the CloudHSM. CloudHSM then validates the tokens, basically making sure the tokens are signed with the right keys. Once it does that, it enables crypto officer 1 to do one quorum-controlled operation. In this case it's user management: crypto officer 1 can just go ahead and reset that user's password. So this is how, using a construct provided by CloudHSM, we make sure a crypto officer does not have unlimited privilege. The crypto officer has to go through an approval process to do sensitive operations in CloudHSM. Okay, great. Now we have solved isolation: projects' keys are isolated. We've also solved admin access. But what if someone decides to sign an artifact that is not production ready? What if there is an insider threat and someone takes malicious code and signs it with production keys? How do we prevent that? In this problem, the requirement is that we need to make sure the file is validated before we sign it. How do we do that? There are two primary requirements for this case. Signing absolutely cannot proceed until the approvals come through. So let's say Karthik decides to do the signing; Karthik has to wait on approvals in 3S from other users, other admins of 3S, to go ahead with the signing. The difference between this workflow and the previous workflow is that the previous workflow happens at the CloudHSM level, where we tap into quorum authentication and enable a quorum-controlled operation. But in this use case, unfortunately, CloudHSM doesn't offer a way to gate the signing until the approvals come through. So we did a workaround and solved it at the application level. What we did here is develop a multi-party approval system, essentially analogous to what CloudHSM does, but at the application level, where when someone submits an asset for signing, the request goes through multiple approvals before it gets to the signing phase. And another key issue that we solve here is the crypto user credentials. Remember we had the crypto user 2B credentials mapped to the Linux signing project? Keeping them sitting there is not in the spirit of least privilege and mitigating unauthorized usage. We don't want that. We want to make sure the credentials are available in the signing service only during signing.
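In both the CloudHSM quorum flow just described and 3S's application-level approvals that follow, "approving" ultimately means signing a service-issued token with the approver's registered private key. A minimal sketch, assuming RSA keys and PKCS#1 v1.5 padding purely for illustration:

```python
# Sketch: an approver signs a server-issued token with their private key; the
# service later verifies the signature against the approver's registered public
# key before unlocking the quorum-controlled operation (or, in 3S, before
# releasing the signing credentials). Algorithm and padding are assumptions.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import padding

def approve_token(token_bytes: bytes, approver_private_key_pem: bytes) -> bytes:
    private_key = serialization.load_pem_private_key(approver_private_key_pem,
                                                     password=None)
    return private_key.sign(token_bytes, padding.PKCS1v15(), hashes.SHA256())

def verify_approval(token_bytes: bytes, signature: bytes,
                    approver_public_key_pem: bytes) -> bool:
    public_key = serialization.load_pem_public_key(approver_public_key_pem)
    try:
        public_key.verify(signature, token_bytes, padding.PKCS1v15(), hashes.SHA256())
        return True
    except InvalidSignature:
        return False
```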
And in this case, they should be available only after the approvals come through. That is the second problem we're solving here. Here's an example. There are two prerequisites. First, registration: the users, in this case the users of our code signing service, the 3S admins, go into the system, create the signing workflow, and configure the approvers. It's very signing-workflow specific. So for example, let's say the Linux signing workflow is one that NVIDIA considers sensitive. We, the 3S admins, when we create the signing workflow in 3S, also configure the approvers, and the approvers go ahead and register their public keys. So now the approvers are all set. Excuse me. What we then do is credential sharding. Remember, we cannot expose those credentials; we cannot let the credentials sit in the Lambda or the container until the approvals come through. So what we do is run automated scripts to take those CU credentials, shard them, and encrypt those shards with the approver keys. So what we've essentially done is sharded and encrypted those credentials. And then the whole sequence starts. Let's say, as a user inside NVIDIA, I have to sign something, and I submit a signing request to 3S. If this signing workflow has multi-party approval enabled, then instead of signing the file, 3S notifies the approvers saying, "Hey approvers, someone has requested to sign a file. Please review the file and tell me whether or not to go ahead with it." The approvers then download the approval token from 3S and sign it with their private keys. Once they do that, they send back, or submit, the approval token to 3S. So let's take the good scenario where everything looks legit and all the approvers sign the token and send it back to 3S. Now 3S takes those tokens, verifies the signatures, and then it decrypts those shards and reassembles them, and thereby recreates the whole crypto user credential. Now the signing service is not gated anymore; it just goes through with the signing operation now that it has access to the CU credentials. And this is how we make sure multi-party approval is enabled for specific signing workflows. Let's talk about the security operations center. The SOC team is foundational to any organization's security posture, and it's the same for NVIDIA. Specific to 3S, what the SOC team does is analyze all our operational logs, CloudHSM logs, application logs, and they monitor for unauthorized access and flag all unintended events. And if the team thinks there is a security incident, they spin up a security incident, involve the 3S team, and work with us end to end to mitigate the risk and resolve the incident. And finally, they also work with us to define procedures for administrative actions to make sure we don't run into the same issue in the future. More of a preventive measure. To enable our SOC team, they need data. They need data from CloudHSM and from our signing service. And thankfully, since our application is cloud enabled, we can just use AWS-native logging tools like CloudTrail to enable this pipeline. So what happens here is we get the logs from CloudHSM and the signing service, we add CloudTrail, and it flows into our S3, slightly longer-term storage.
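Returning for a moment to the credential-sharding step: a minimal sketch of the idea, assuming the crypto user secret is split into simple shards and each shard is encrypted to one approver's RSA public key with OAEP. The split scheme and exactly who performs the decryption are simplifications; the talk does not specify those details.

```python
# Sketch of credential sharding: split the crypto user secret into shards,
# encrypt each shard to one approver's public key, and reassemble only after
# the approval tokens have been verified. Naive equal-size splitting is used
# here for illustration; a real design might use Shamir secret sharing instead.
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import padding

OAEP = padding.OAEP(mgf=padding.MGF1(hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

def shard_secret(secret: bytes, n: int) -> list:
    size = (len(secret) + n - 1) // n
    return [secret[i * size:(i + 1) * size] for i in range(n)]

def encrypt_shards(shards, approver_public_keys_pem):
    # One shard per approver; no single party holds the whole secret at rest.
    return [serialization.load_pem_public_key(pem).encrypt(shard, OAEP)
            for shard, pem in zip(shards, approver_public_keys_pem)]

def reassemble(decrypted_shards) -> bytes:
    # Runs only after every approval token's signature has been verified.
    return b"".join(decrypted_shards)
```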
And then we have log aggregators running, and the log aggregators' responsibility is to review the log information, annotate the data, and send it along to the next pipeline, which is our SOC pipeline. The three major things the SOC pipeline does are detect, alert, and resolve. Detection is looking for security threats and unauthorized events, and it's all automated, by the way. When something is detected, we get alerted: "Hey 3S team, there is a security incident that you might have. Please look into it." When an incident is spun up, we work with them to figure out mitigation strategies, resolve the incident, and drive it to a conclusion, and then they help us close the incident. And, you know, the usual things, RCAs and other follow-ups, go into this. But essentially we have enabled the SOC team to do incident management by looking at our logs. The key takeaway here is that we are able to leverage cloud-enabled AWS tools to empower our SOC team and improve our security posture. Let's take a use case before we move on to the next slide. Remember the whole quorum authentication scenario we talked about, where a crypto officer has to reset a user password and we went through the whole approval flow? Even that flows into the SOC pipeline. The SOC team analyzes those logs to make sure that crypto officer 1, who requested to reset the user password, only reset that user's password and didn't do anything else. So that is an example of how we use the SOC pipeline even with all the other secondary controls enabled. The last problem we encountered in our code signing service journey was large-volume use cases. Daniel did an excellent job describing how the pipeline works. To recap, builders are the ones who initiate a signing request; they get to the portal, the portal triggers a specific backend, let's say a Linux- or Windows-specific backend, which internally uses crypto user credentials to access the keys in CloudHSM and sign the asset, and everything looks good. But when we put this pipeline to use, there were two use cases that didn't work. The first use case is file sizes, which in some cases can go up to 15 to 20 gigabytes. So what's the problem? 3S could still support it, right? The problem is that in this workflow the file goes from the builder to our signing service, we sign it, we send it back, and the user has to download the file. So the latency, over a period of time, adds up. It's huge. And these are problems we encountered specifically for significantly lower-sensitivity signing workflows. Why make the user incur that cost if the signing workflow is not that sensitive? That's the first problem: huge file sizes. The second problem is too many files to sign. There were some workflows that were calling 3S at the rate of hundreds and thousands of signing requests. And again, to remind you, these are relatively low-sensitivity signing assets. So why do we need to put them through the pipeline? We thought about how to solve this. One way you can think of is: hey, you claim your system is cloud native, why don't you just scale it? Sure we could, and we did. But beyond a certain point, scaling the system doesn't give you that much return, given that these are lower-sensitivity assets.
The second thing is, for large file sizes, you could argue: why bother sending the whole file? Why not just send the digest and get it signed? Yeah, we tried that too. That worked for some use cases, but for others it doesn't, because when verifying the signed artifacts, some signing workflows have to verify the entire package. They cannot just trust the digest and confirm that everything is signed. So it didn't work for all workflows; it did work for some. So we took a step back, thought about this, and asked ourselves: why not just completely bypass 3S? And that's what we did. That's what we call Direct to HSM. As the name implies, the builders with high-volume, lower-sensitivity workflows talk directly to CloudHSM via a controlled path. That is Direct to HSM. And the implication is that the signing is done locally on the builders. They don't have to upload, they don't have to download, they don't have to go through the pipeline. Perfect. Great. But there is one caveat. What is the caveat? When we bypass 3S, we completely lose the logging and monitoring pipeline. Now the builders just talk to the HSM and sign the artifact, and we need metrics to troubleshoot what's going on with the system. That required us to come up with a dedicated telemetry pipeline for this use case. Here is the high-level architecture of Direct to HSM. Simple, straightforward. On the left, you see the data center. The data center has a core network, and we have specific builder subnets and specific builders that handle the lower-sensitivity assets. These builders have fine-grained firewall rules, and we connect them to CloudHSM. How do we do that? With AWS Direct Connect. We use Direct Connect to connect our data center with the cloud VPC that has the HSM in it, and we enable everything. So let's say a signing request comes in. The builder no longer talks to 3S; the builder just accesses the keys from CloudHSM. It runs SignTool in the case of Windows, and with the PKCS #11 libraries it talks directly to the HSM, gets the asset signed locally, and calls it done. So we resolved the large-volume use cases and the huge file sizes. Let's get to the telemetry part, because in this pipeline, if there is an issue, let's say Direct Connect is down, or the outbound traffic from NVIDIA has some issue, some IT issue, how do we know where the problem lies? The problem could also be inbound to the HSM. So we have to provide more metrics to our users to efficiently troubleshoot what's going on with the system. That is where we created an end-to-end telemetry pipeline. I'll walk you through it. It starts with the builder. The builder, when it gets an event for signing, creates a telemetry event and does an HTTP POST to our API Gateway. The API Gateway processes that event, it gets to a Lambda for processing, and then the data is sent onto a streaming pipeline powered by Kinesis. From Kinesis it gets to our S3 bucket, slightly longer-term storage, and we can spin off new workflows from S3; that's the idea. The transform phase begins with S3: we have Glue crawlers defined that crawl the data on an interval and transform it into a data catalog in AWS. And the final piece of the transform phase is Athena. Athena, for those who don't know, runs SQL queries on the data.
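A minimal sketch of the builder-side emit at the front of that telemetry pipeline. The endpoint URL and the event fields are illustrative assumptions; the real event schema isn't described in the talk.

```python
# Sketch: a builder posting a telemetry event for a Direct to HSM signing
# attempt. API Gateway hands the event to a Lambda, which pushes it onto
# Kinesis for the S3 / Glue / Athena / QuickSight stages described above.
import json
import time
import urllib.request

TELEMETRY_ENDPOINT = "https://example.execute-api.us-west-2.amazonaws.com/prod/events"  # illustrative

def emit_signing_event(project: str, hsm_cluster: str, succeeded: bool, error: str = "") -> None:
    event = {
        "timestamp": int(time.time()),
        "project": project,
        "hsm_cluster": hsm_cluster,
        "succeeded": succeeded,
        "error": error,
    }
    request = urllib.request.Request(
        TELEMETRY_ENDPOINT,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        response.read()  # a 2xx response means the event entered the pipeline

emit_signing_event("windows-driver", "cluster-1", succeeded=True)
```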
So, simple and straightforward. And the final, most important piece here is Amazon QuickSight. QuickSight is responsible for creating the reports that let users see actionable metrics, actionable data. So that's the end-to-end pipeline: it starts with the builder, goes through the extract phase, then the transform phase, and then we visualize the data. We enabled this workflow for the Direct to HSM use case to provide more actionable metrics. This is one of our sample dashboards that we have in QuickSight in production. If you look at the dashboard, it has a lot of metrics, like how many signing workflows came into the system on that specific day and how many of them passed. It also provides more detail on which HSM had more failures, because, as Daniel mentioned, we have a lot of HSM clusters and we need to pinpoint a failure to a specific HSM. It also has metrics like which branch in our source code fails the most, and what's wrong with it, and things like that. And beyond the failure visibility, it also gives logging information. The event data that the builder sent into this pipeline, the first block here, is available in the same dashboard, so the user not only has the ability to figure out how something has failed, but can also switch to another tab in the same dashboard and look at the log information on what exactly failed. The final piece of QuickSight is the alerting. The whole point is to provide actionable feedback, and let's say the failures breach a certain limit, someone needs to get paged, on call. QuickSight allows us to notify users via email: "Hey user, something is wrong, please go and fix it." So with this Direct to HSM plus QuickSight metrics, we not only enabled the high-volume signing but also provided metrics that users can act on. Before we wrap up, I would like to share the key takeaways. I hope by now you believe that securing the code signing keys is really, really important. Daniel talked about the bad day scenarios, remember? The hacks and the things that could go wrong. It's paramount to an organization's security posture, and it starts with building a code signing service. Building a code signing service, again, is a journey, a journey that involves two phases. Application requirements: you need to figure out who the users of your system are, how they access the system, and things related to the application. And then you need to worry about security, because this is a security application. You need to figure out: how do I expose the credentials to the client? How do I create those fine-grained access policies? So it's a journey, and the journey starts with choosing an HSM. In our case, we chose CloudHSM. So choose an HSM and then build your code signing service on top of it. And the code signing service has to be scalable. Trust me on this: you can start simple, but it scales really fast, because a lot of users will want to use your signing service to enable end-to-end signing and give your organization a better security posture. So you have to build your code signing service to be scalable from the get-go. Make sure it's scalable, available, durable, five nines, all that cloud native stuff. And finally, you have to think about how to secure your signing workflows. Remember, we discussed a scenario where crypto user credentials get leaked.
You need to come up with threat models and scenarios where your service might be compromised or your credentials might be compromised; how do you handle that? And also think about what you do to enable the lower-sensitivity, large-volume signing assets, and how you integrate a SOC pipeline into the code signing service. That's the final thing you have to think about before your users get to use the code signing service. And that brings us to the end. I hope this talk has given you a starting point for your code signing service journey. We'd be happy to hang out in the hallways if you have more questions regarding what we did, or CloudHSM in general. Thank you for your patience, and please don't forget to complete the session survey. Thank you. (audience claps)
Info
Channel: AWS Events
Views: 760
Keywords: AWS, AWS Cloud, AWS re:Inforce, Amazon Cloud, Amazon Web Services, Customer stories, Privacy, Resiliency, data protection
Id: Abr_ANiVh5E
Length: 50min 22sec (3022 seconds)
Published: Mon Jun 19 2023