- [Dmitry] Hello and welcome. How's everyone doing tonight? - [Audience] Good. - Wonderful. Well, if you're in this
room, then you must be really curious about code signing,
hardware security modules, or crypto PKI tooling
because in this session you will learn how
NVIDIA built one of their code signing services using AWS CloudHSM. My name is Dmitry Kovalev. I'm Account Manager with
AWS and I'm delighted to introduce this team of
NVIDIAns, Daniel Major, the distinguished Security Architect, and Karthik Jayaraman, the
Senior Software Engineer, who will tell you about
their journey on AWS, as well as about their CloudHSM
use cases that very well might be applicable to your
particular scenarios as well. So here is our plan for
tonight, here is our agenda. In the best traditions
of layered security, we will tell the story in layers. I'll first kick it off
with a brief introduction where I will talk high
level about the problem that NVIDIA was trying to
solve, as well as do a quick recap of CloudHSM just
to level set everyone. I'll then pass it over to
Daniel, who will go one step deeper into the problem
and he'll do a code signing security primer as well as talk
about NVIDIA's Simple Signing Service and how it
addressed their challenges. We'll then transition to Karthik, who will do a true 300-level deep
dive into the problem. In fact, he will talk
about five mini problems or use cases that NVIDIA
faced and how they leveraged AWS CloudHSM for their
code signing service. We'll wrap it up with
some closing thoughts and key learnings at the very end. And that will be our plan for tonight. So I hope you all are equally
excited as we are for this discussion, and let's go
ahead and get started. And I'd like to kick
it off with a question. What is the first thing
that comes to your mind when you hear the word NVIDIA? I'll just let you shout it out loud. GPUs, graphics. Yeah, all great starting
points, but I think many of you might be surprised that it's
actually way more than that. Because for the past 30
years, NVIDIA has been heavily investing into some serious
innovations revolving around artificial intelligence
and machine learning. They reinvented the modern
graphics with technologies such as path tracing and
deep learning super sampling. They created their own version of the industrial metaverse called Omniverse. They turbocharged science by creating a whole suite of
GPU-accelerated applications and libraries for genomics research. They pretty much became the
modern engine of artificial intelligence with technology such as CUDA and the advance of foundational
large language models. The last but not the
least to mention here, NVIDIA has been reshaping
the future of autonomous vehicles with their DRIVE platform. Now, why is all of that important in the context of our discussion? If you look into those
areas on the slides, all of these areas require
software capabilities. And that means that NVIDIA
is writing a lot of software. And when I say a lot, I really mean it. Today NVIDIA has a very
wide software portfolio that ranges from Windows
drivers and applications, to Linux kernel modules and
packages, to GPU firmware, to networking firmware and associated application systems and operating systems. NVIDIA writes the
autonomous vehicle software. They also do a lot of
attestation packages. And the list can go on and on and on. Today NVIDIA has more
than 400 various SDKs that they bring to this world. And all this software
needs to be protected. NVIDIA customers need to have
confidence that the software has indeed been developed
by NVIDIA and it hasn't been altered or tampered along the way. How do we do it? We do the code signing. We essentially digitally sign
or cryptographically sign the software to ensure that
it hasn't been modified. Now I have a question for you. If you were to build a
secure, highly available, highly performant code
signing service capable of scaling to more than 60,000 signing events per day, how would you do it? And how would you manage the cryptographic keys in this situation? One way of doing it is by
leveraging AWS CloudHSM. AWS CloudHSM is a hardware
security module that allows you to generate and use
cryptographic keys on AWS cloud. It essentially helps you to
meet your corporate contractual and regulatory compliance
requirements for data security by giving you complete
control over your keys. And there are multiple
reasons why CloudHSM might be a good choice
for your deployments. Maybe you're looking for
a solution that is highly performant, one that meets
the requirements of your applications from a performance standpoint, while at the same time addressing your reliability and
high availability goals. Maybe you're looking for
a solution that supports cloud elasticity by adding
and removing HSM instances, and also securely replicating
the keys between them and load balancing between
them to provide higher durability as well as improved capacity. Or maybe you're looking
for the way to demonstrate the compliance with the
key security regulations such as PCI, GDPR, HIPAA, or FedRAMP. Or maybe you're simply
looking for an open solution that supports the wide range
of cryptographic algorithms and standards such as PKCS 11,
Java Cryptography Extensions (JCE), OpenSSL, or CryptoAPI
CNG, just to name a few. At the end of the day, the main benefit that AWS CloudHSM gives
you is low-latency access
to a dedicated secure root of trust that is completely
under your control. So again, there are multiple
reasons why CloudHSM might be beneficial for your
deployment, and many of these reasons that you see on
this slide are the reasons why NVIDIA picked CloudHSM in their case. But I think nobody would tell the story better than NVIDIANs themselves. So Daniel, why don't
you take it over from here? - Thank you, Dmitry. Hello. My colleague Karthik and I are pleased to share with you today
the tremendous work of a team of very talented
engineers at NVIDIA. As Dmitry has mentioned, NVIDIA has a very large software portfolio. And as our industry
security maturity increases, much of the software needs to
be signed before it's released to customers or loaded
onto production devices. This means that organizations,
such as yours and NVIDIA, maintain a very large set of
software signing keys as well as internal systems to sign
software with those keys. A little bit later in the presentation Karthik and I will describe
one of those systems at NVIDIA, and some of
the problems we solved along the way that might be
relevant in your context. But enough about NVIDIA for the moment. Let's talk about you. How many of you here in
the audience think you know or can hazard to guess as
to where your organization stores your code signing keys? Hands up if you think
you can hazard a guess. Awesome, there's some knowledge
of people in the audience. Of those people, how many
think that your organization stores them in a hardware
security module or HSM? Cool. Good to see. How many store some of
your code signing keys in a secrets manager like
KMS, or a HashiCorp Vault, or CyberArk, or Secrets Manager? Even more. How many of your organizations
store your code signing keys in a VM somewhere, signing
server in plain text? And how many think that you have copies of code signing keys in
source control somewhere? Was hoping not to see that. But it can be difficult for organizations to perform code signing securely. But just to make sure
we're all on the same page, let's do a quick primer on
what we mean by code signing and what it means to do it securely. Code signing is when a file,
usually some binary code, which is why we call it code signing, is processed by a signing program. The signing program calculates
a digest over the file and then enciphers the digest
with a private signing key. This results in a signature for the file. And what use is the signature? Well, the recipient can use
a verifier program to compare the file and its signature
to see if they match. And it needs the public key corresponding to the private signing key to do this. If it's verified, that is
the verifier's calculation of the digest over the relevant portions of the file matches the signed
digest in the signature block, we know that the
file is three things. Authentic, that it was
published by somebody who knows the private signing key. Integral, that it hasn't been
subsequently modified. And if the publisher only signs some quote "good" unquote files, that it meets some quality bar as per
their release process.
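To make that concrete, here is a minimal sketch of the sign and verify flow in Python, using the cryptography package with a locally generated RSA key purely for illustration; in a production signer the private key would live in an HSM and never leave it, and the file paths and key here are hypothetical.

```python
# Minimal sketch of the sign/verify flow described above (illustration only;
# a real publisher keeps the private key in an HSM rather than in memory).
import hashlib

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa, utils

# Hypothetical key pair generated locally just for this example.
private_key = rsa.generate_private_key(public_exponent=65537, key_size=3072)
public_key = private_key.public_key()

def sign_file(path: str) -> bytes:
    """Signer: compute a digest over the file, then encipher the digest."""
    digest = hashlib.sha256(open(path, "rb").read()).digest()
    return private_key.sign(
        digest,
        padding.PKCS1v15(),
        utils.Prehashed(hashes.SHA256()),  # we hand over a precomputed digest
    )

def verify_file(path: str, signature: bytes) -> bool:
    """Verifier: recompute the digest and check it against the signature."""
    digest = hashlib.sha256(open(path, "rb").read()).digest()
    try:
        public_key.verify(
            signature, digest, padding.PKCS1v15(), utils.Prehashed(hashes.SHA256())
        )
        return True   # authentic and integral
    except InvalidSignature:
        return False  # the file or its signature was modified
```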
Because security maturity is improving, the demand for signed files, and hence code signing,
is increasing quickly. This is driven by hardware manufacturers enabling secure boot in their devices, and hence the need for signed firmware. Ecosystem evolution in operating systems like Windows, which is imposing ever more stringent requirements on
what software is signed. And interest in software
supply chain security by end customers who wanna validate things like packages and containers and so on before they ingest them
into their own environment. Which means we need to do
this at scale and securely. Sounds easy. But because you are at
a data protection talk at AWS re:Inforce, you
already know that cryptography merely changes what you have
to worry about from a data integrity problem to a
key management problem. And key management in the context of code signing can go very, very wrong. Here's the obligatory newspaper headline slide in a security talk. Not gonna go over this, but
it appears in these cases either the private signing
key was disclosed and leaked or the key was otherwise
used to sign a malicious or otherwise unauthorized binary. And these can lead to really bad days if you maintain signing keys. So in order to prevent bad days, if you'll permit me to suggest
a few informally stated best practices if you are
trying to protect your code signing keys by designing or
adopting a code signing system. First, we alluded to these before. Key protection comes in two flavors: disclosure protection and
destruction protection. What I mean by disclosure
protection is simply keep copies of the key from walking away,
the plain text of the private key being accessible to a
malicious actor, an attacker. Because if an attacker
obtains the plain text of your private key, they
can sign files that appear to be authentic and this
negates the security assurance of code signing in the first place. One way to assure yourself,
or up your assurance of private key protection in
terms of disclosure, is to use a hardware
security module or HSM. These are physical appliances
which provide assurance that the plain text of the
private key is not gonna leave the cryptographic boundary of the HSM. Other degraded alternatives might be, as we chatted about earlier,
keeping the private key in some rich computing
environment such as a VM or server somewhere, but then the
private key is exposed to the entire trusted computing
base of that environment. Another best practice, prevent destruction of the private key. Simply means keep backups. Ensuring that there's an
adequate backup strategy for all your keys helps
prevent some really awkward conversations when you can
no longer sign your software, especially if your keys can't be rotated. For hardware manufacturers,
this can be especially acute if they've enabled secure
boot on their devices because destruction of
their code signing keys can mean either no more software
updates or a very expensive recall to bring the
devices back and reprogram the key that's used to validate software. Even if you've protected the
plain text of the private key, you still need to implement
some usage control. Each key will have different
authorized signers or service accounts, and different
operations, sign, verify, encrypt, decrypt, whatever, that you
can perform on the signing key. Some use cases may require
multi-party approval. That is a group of people
who have to come to some kind of consensus about whether
a file is signed or not. This can be, multi-party
approval can be used to implement some quality checks. Another best practice is a signing
quality bar, putting some scans and checks in the
pipeline before signing. If your organization has a
release management process, which has some criteria for
what files can be released, this is a convenient enforcement point. 'Cause after signing the
files can walk out the door. Finally, ensuring you have
logging and monitoring. Having a durable log
of what file was signed at what time with which keys by whom can be useful for audit and compliance. And it can also be really
helpful in incident response when your organization's
name shows up on newspaper headlines like we saw
in the previous slide. Note that I'm not getting
into topics like key rotation, certificate lifetimes,
revocation, that kind of thing. But if you wanna come talk
to me about those topics after the talk, please do. So, building a system
like this is non-trivial. Usually outside the scope
of any one development team that has a different mission. Hence, NVIDIA's product
security division maintains several different code
signing services that address varying requirements of
varying sensitivities. One of these, the Simple
Signing Service is aimed at lower sensitivity keys where signing at cloud scale is
interesting and important. We wanted to call it, by the way, S3, the Simple Signing Service,
but apparently S3 was taken by some important large
organization or something, so we flipped it around and call it 3S. And since 3S was aimed at
lower sensitivity at scale and implemented in AWS, we explored using AWS CloudHSM for 3S's HSM backend. In the third part of this
talk, Karthik will describe some of the more advanced
requirements and problems we encountered along the
implementation journey. But first I'll set the stage
by describing 3S in general and some of the relevant
features of CloudHSM. So 3S is a simple signing service. It provides a simple signing pipeline. An authorized submitter logs in. They're authenticated by our IDP. They submit an asset in the project that they have rights to sign in. The file is dispatched to
that project's backend, and after some quality
checks and sanity checks, the signing program on the
backends gets the file's digest signed with the
keys held in CloudHSM. The asset is returned to the submitter and a copy is retained. I'll pause here and just note the multiple
per-project signing backends. Each of these backends is customizable per project or per file type. So you might have one that
knows how to sign Windows applications, one that knows
how to sign Mac applications, one that knows how to sign
Linux packages and so on. And they're all isolated from each other. As a turnkey service, 3S tries to cover the best practices that we
discussed a few slides ago. Keys are backed up in an HSM and not extracted during signing. Each project is isolated
from the others, as we saw on the previous slide
with the multiple backends. Usage control, authorization
is tied to NVIDIA's IDP. And there's fairly tight
rule-based authorization for both users and service
accounts that need to sign. Multi-party approval is also implemented, and Karthik will go over
its design in detail. It's fairly flexible. There's a diverse set
of signing environments. As you might appreciate, if
you need to sign a Windows binary, you're typically going
to use Microsoft SignTool and SignTool needs to run
in a Windows environment. Correspondingly, the Mac
signing tool wants to run in a Mac environment and typically GPG will run best in a Linux
environment, and so on. And so 3S has the flexibility to support different environments for backends. And it's scalable to large volumes. As Dmitry mentioned, we sign
over 60,000 files a day. We've signed over a million
assets over the last two years. Finally, it's monitored 24-7
by NVIDIA's Security Operations Center, and all files are
retained for subsequent analysis. We talked about how using
an HSM is a best practice, and 3S is implemented in AWS. So we looked at using AWS CloudHSM. Here are some of the features that were relevant for us for code signing. But first, just to reinforce that CloudHSM is a bare metal offering. You're effectively renting a dedicated HSM appliance in an AWS data center. There's no shared tenancy, there's no virtualization, nor
does AWS have access to the appliance during your tenancy. Instead, what's provided is a
series of management services around the HSM appliance,
which you rent exclusively. So some of the features that
might be relevant to you if you're considering
a code signing service. Scalability. Tens, hundreds, millions of
signing operations per day are supported and CloudHSM
operates in a cluster model. You can scale up the number of automatically duplicated
instances in the cluster. If an instance fails, it's replaced and its state is restored. We have dozens of HSMs
across various clusters.
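As a rough illustration of that elasticity, here is a hedged sketch of adding an HSM instance to an existing cluster through the CloudHSM v2 API with boto3; the cluster ID and Availability Zone are placeholders, and a real deployment would more likely drive this from automation than an ad hoc script.

```python
# Hedged sketch: add one more HSM instance to an existing CloudHSM cluster.
# The cluster ID and Availability Zone below are placeholders.
import boto3

cloudhsm = boto3.client("cloudhsmv2", region_name="us-west-2")

def scale_out(cluster_id: str, availability_zone: str) -> str:
    """Request an additional HSM in the cluster; keys replicate automatically."""
    resp = cloudhsm.create_hsm(ClusterId=cluster_id, AvailabilityZone=availability_zone)
    return resp["Hsm"]["HsmId"]

def hsm_states(cluster_id: str) -> list[str]:
    """List the state of each HSM so a failed instance can be spotted."""
    clusters = cloudhsm.describe_clusters(Filters={"clusterIds": [cluster_id]})
    return [h["State"] for c in clusters["Clusters"] for h in c["Hsms"]]

# Example call with placeholder identifiers:
# scale_out("cluster-abc123example", "us-west-2a")
```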
Second, and this is often overlooked when you're considering an HSM solution, is SDK support. We mentioned the various signing tools: SignTool, the Mac one, Linux, whatever. They need to send a
digest to a remote HSM. They don't have direct access
to the private signing key. And so an HSM vendor will provide you with a series of plugins, PKCS 11, KSP, OpenSSL engine, JCE, that
enable your signing tool to be able to interact with
the HSM across the wire.
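For example, a Linux backend can reach the cluster through the CloudHSM PKCS#11 library. The sketch below drives it with the third-party python-pkcs11 package; the library path, token lookup, key label, and crypto user credentials are assumptions for illustration, not 3S's actual values.

```python
# Hedged sketch: sign data through the CloudHSM PKCS#11 plugin from Linux.
# The library path, key label, and CU credentials are illustrative assumptions.
import pkcs11
from pkcs11 import KeyType, Mechanism, ObjectClass

LIB_PATH = "/opt/cloudhsm/lib/libcloudhsm_pkcs11.so"  # assumed install location

def sign_with_hsm(data: bytes) -> bytes:
    lib = pkcs11.lib(LIB_PATH)
    token = next(lib.get_tokens())                    # the CloudHSM cluster's token
    # CloudHSM PKCS#11 logins use "<CU user name>:<password>" as the PIN.
    with token.open(user_pin="signing_cu:example-password") as session:
        key = session.get_key(
            object_class=ObjectClass.PRIVATE_KEY,
            key_type=KeyType.RSA,
            label="windows-driver-signing-key",       # hypothetical key label
        )
        # The plugin ships the operation across the wire; the private key
        # never leaves the HSM's cryptographic boundary.
        return key.sign(data, mechanism=Mechanism.SHA256_RSA_PKCS)
```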
Third, the CloudHSM service provides automatic backups. So every 24 hours the
HSM state is backed up. It's encrypted with keys held
inside the HSM that are not available to AWS and there's
a forced retention period for seven days, and I'd
suggest you implement additional mitigations around retention.
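One such mitigation could be copying recent backups into a second region on a schedule. Here is a hedged sketch with the CloudHSM v2 API; the identifiers are placeholders and this is just one possible way to add margin around retention, not a prescription.

```python
# Hedged sketch: copy the newest READY cluster backup into another region as an
# extra retention mitigation. All identifiers below are placeholders.
import boto3

def copy_latest_backup(cluster_id: str, dest_region: str) -> str:
    cloudhsm = boto3.client("cloudhsmv2")
    backups = cloudhsm.describe_backups(
        Filters={"clusterIds": [cluster_id], "states": ["READY"]}
    )["Backups"]
    latest = max(backups, key=lambda b: b["CreateTimestamp"])
    # The backup stays encrypted under keys held inside the HSM during the copy.
    cloudhsm.copy_backup_to_region(
        DestinationRegion=dest_region, BackupId=latest["BackupId"]
    )
    return latest["BackupId"]

# Example call with placeholder values:
# copy_latest_backup("cluster-abc123example", "us-east-2")
```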
Finally, there's a reasonable set of features to address insider threats
and insider protection, which Karthik will go
over in a few moments. Karthik? - Thank you Daniel. Thanks Dmitry. Daniel and Dmitry gave
us a good walkthrough of what a CloudHSM is
and kind of how we built our code signing service, AKA 3S. I'm here to walk you through
some of the interesting problems that we encountered
while we went through the code signing service journey
and how we solved that for a subset of NVIDIA's
code signing needs. Here are some of the problems, and I'll walk you through
each of those and, you know, tell you how we resolved them, right? And these problems
are pretty relevant for any organization or anyone who has to go through the code
signing service, right? These are key problems that you might have to solve one way or the other. Let's get to the first
one, isolating projects. What do I mean by isolation? So as Daniel mentioned, our
code signing service supports 30 different, or 40 plus
signing workflows, right? Now we are at a point wherein we chose CloudHSM as our key storage. But the key problem is how
do we make sure we isolate the keys that correspond to
different signing projects, right? And you might ask, why
do we need isolation? If you think about the doomsday
scenario, sorry, you know, threat actors and bad
day scenarios, right? If an attacker gets access
to a specific signing key or a specific signing workflow, we wanna make sure we
contain the blast radius. And that is exactly why we
isolate the signing workflows. And how do we do that? Fortunately, CloudHSM offers
a feature called crypto users. What is a crypto user? A crypto user in a CloudHSM is the one that you can create to handle the keys. Let's take an example. In the Windows signing project, we want to sign Windows artifacts. And let's take another example where we wanna sign Linux artifacts, right? Now what we do is we get to the CloudHSM, and then we create the crypto user 1. And the role of the crypto
user 1 is to hold the keys that are specific to
the Windows signing project. So we log in as the crypto
user 1, create the keys, and then we are done with
the key creation part. But how do you map these
credentials to your signing service so that it can be accessed at
runtime to sign the artifact? That's where we come up with
fine-grained access policies. So that only the Windows
signing project, in this case it's a Lambda, has access
to the crypto user 1 credentials, which are
stored in a secure location. And same goes for Linux signing as well, because what we do is we
create a different crypto user, isolation, and then we create
the keys that correspond to Linux signing and draft
an access policy to map it to a different signing project, right? So we have technically isolated two different signing workflows. So in a scenario where
crypto user 1 credentials are compromised by any means, we limit the blast radius only
to the Windows signing project. The Linux signing doesn't get affected. That's the whole concept, right?
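As a sketch of what that fine-grained mapping could look like, the snippet below shows a per-project backend pulling only its own crypto user credential from a secret store at runtime. Secrets Manager as the storage choice, the secret naming scheme, and the per-role scoping are assumptions for illustration, not the exact 3S design.

```python
# Hedged sketch: each signing backend can only read its own crypto user secret.
# Secrets Manager, the naming scheme, and the per-role scoping are assumptions.
import json
import boto3

secrets = boto3.client("secretsmanager")

def get_project_cu_credentials(project: str) -> dict:
    """A backend (e.g. a Lambda) resolves only the secret for its own project;
    its execution role would be scoped to that one secret, limiting blast radius."""
    secret_id = f"signing/{project}/crypto-user"   # hypothetical naming scheme
    value = secrets.get_secret_value(SecretId=secret_id)["SecretString"]
    return json.loads(value)                       # e.g. {"username": "...", "password": "..."}

# The Windows backend can only reach crypto user 1; the Linux backend only its own CU.
# windows_cu = get_project_cu_credentials("windows-signing")
# linux_cu = get_project_cu_credentials("linux-signing")
```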
But we still have one problem. What is the problem? The problem is, unfortunately, in the CloudHSM construct of a crypto user, the crypto user can do much more than access the keys to sign. What do I mean by much more, right? The crypto user has the access to share the keys with other crypto users. And the crypto user, in some
cases, can delete the keys. That is non-optimal, right? Because we are mounting these
credentials or accessing these credentials from
our signing service, and we do not want it to do anything
other than signing, right? Unfortunately, CloudHSM
does not have any construct that we can use to get
around this problem. So instead, what we did is
we leveraged a feature called ShareKey in CloudHSM to
resolve the scenario. What we essentially do is, let's go back to the same example. Linux signing project. So we first create a
crypto user, let's say crypto user 2A in this
case, who's the key owner. And the key owner's responsibility
is to create the keys. So we get into the system,
create the 2A crypto user, and create the keys that are
required for the Linux signing. And then we create an additional crypto user, 2B, who's the key user. And that is the distinction between this and the previous approach,
where we have isolated the key owner from the key user. And what we then do is we share the key from that crypto user 2A to the 2B, right? So we create the user, share the key, and once we share the key,
we lock out crypto user 2A. And why do we do that? We lock that user out
so that even by accident or by insider threat
the 2A user credentials don't get leaked, so
no one can technically get into the system and delete the keys. Because crypto user 2B is a
key user, not the key owner, the crypto user cannot share
the key, cannot delete the key. The user can only do
the signing operation. So we take the crypto
user 2B and come up with access policy, map it to
the Linux signing project. Problem solved. This is how we make sure we prevent key loss in a, you
know, bad case scenario. The next problem, how do
we secure admin access? In any HSM there is a
construct of an admin user; CloudHSM has one, and
it's called crypto officer. As the name implies,
the crypto officer has far more features and actions
compared to a crypto user. The crypto officer, in addition to what the crypto users
can do, the crypto officer can create new users, delete users. Which means, if in the
case of internal attack or in case of a compromise,
the crypto officer can do much wider damage,
right, to the system. How do we solve that? That is where we leverage
CloudHSM's quorum authentication, which is M-of-N access control. I'll walk you through the exact sequence of how that happens. And that's one of the
ways we solve that issue. The second way is we make sure
we lock down the HSM access. So the only services
or nodes that have HSM access are the signing services. No manual access is permitted. No other access is allowed, right? So we lock down the HSM and
then we add the quorum authentication on top of this lockdown to kind of handle the rogue
crypto officer scenario. Let me walk you through the workflow. Pretty simple, straightforward. The prerequisite here is, let's say we get a CloudHSM in AWS. We have to go ahead and create the crypto officers
because the whole process revolves around getting a quorum, right? As a crypto officer, if
someone has to do something, it has to be approved by
other crypto officers. That's the whole, you know, workflow. So now, after we get the CloudHSM, we create the crypto officers. The crypto officers in our case go through training, sign offs. And then what they do is
they register themselves with the username and
password to CloudHSM. Once they registered their
credentials, we also would want them to create the
public-private key pair. So once you let them create
a key pair and register that with CloudHSM, we
have met the prerequisites. We can go ahead with the
whole, you know, sequence. So what do we do? So let's take an example. Let's say crypto officer 1 has to reset a user password, right? A valid scenario. But it's a sensitive operation. So what the crypto officer
1 would do in this case is, the crypto officer 1 would first request a quorum token from the CloudHSM. CloudHSM then responds
with a quorum token, which crypto officer
1 takes and sends to the other crypto officers
to get a review, right? (coughs) Excuse me. The other crypto officers now
would validate the use case of what this crypto
officer 1 was trying to do. And then the other crypto officers, if they validate the operation
and everything looks good, they would approve the token. What do you mean by approve
the token, you might ask. They would just sign the
token with their private key. Remember the public key is
already registered in the system. Now, after approving the
tokens, all the other crypto officers would send back the
token to the crypto officer 1. The crypto officer 1
now gets all the tokens and applies them onto the CloudHSM. CloudHSM would then validate the tokens. Basically it's gonna make sure the tokens are signed with the right keys. Once it does that, it's gonna enable crypto officer 1 to do one
quorum-controlled operation. In this case, it's user management: crypto officer 1 would just go ahead and reset a user's password. So this is how, using a construct provided by CloudHSM, we make sure the crypto officer does not have unlimited privilege, right? The crypto officer has to go through an approval process to do sensitive operations in CloudHSM.
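The token format and validation are internal to CloudHSM, but the shape of the M-of-N check is easy to picture. Below is a hedged, simplified sketch in Python of the idea: each officer signs the token with their registered key, and the operation is allowed only when enough distinct, valid approvals exist. It mirrors the concept only, not CloudHSM's actual token format or protocol.

```python
# Hedged, simplified picture of M-of-N quorum approval: each officer signs the
# opaque token with their registered key; the operation proceeds only if at
# least M distinct approvals verify. Not CloudHSM's real token format.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding

def approve(token: bytes, officer_private_key) -> bytes:
    """An approving crypto officer signs the quorum token with their private key."""
    return officer_private_key.sign(token, padding.PKCS1v15(), hashes.SHA256())

def quorum_met(token: bytes, approvals: dict, registered_public_keys: dict, m: int) -> bool:
    """approvals maps officer name -> signature; public keys were registered up front."""
    valid = 0
    for officer, signature in approvals.items():
        try:
            registered_public_keys[officer].verify(
                signature, token, padding.PKCS1v15(), hashes.SHA256()
            )
            valid += 1
        except (KeyError, InvalidSignature):
            continue  # unknown officer or a bad signature does not count
    return valid >= m
```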
Okay, great. Now we have solved isolation, right? Projects' keys are isolated. We've also solved admin access. But what if someone decides to sign an artifact that is not production ready, right? What if there is an insider
threat and someone takes a malicious code and signs
it with production keys? How do we prevent that? So in this problem, the
requirement is we need to make sure the file is
validated before we sign it. How do we do that? So there are two primary
requirements for the case. Signing absolutely cannot proceed until the approvals come through. So let's say if Karthik
decides to do the signing, Karthik has to wait on
approvals in 3S from other, you know, users, other admins of 3S to go ahead with the signing, right? The difference between this
workflow and the previous workflow is the previous
workflow happens at the CloudHSM level wherein we tap into
the quorum authentication and enable the quorum
control operation, right? But in this use case, unfortunately
CloudHSM doesn't offer a way to gate the signing until
the approvals come through. So what we did is we did a workaround and solved it at the application level. What do we do here is we
developed a multi-party approval system, essentially analogous
to what CloudHSM does, but this is at the application level, where when someone submits
an asset for signing, the signing goes through
multiple approvals before, you know, it gets to the signing phase. And another key issue that we solve here is the crypto user credentials. So remember we had the crypto user 2B credentials mounted to the Linux
signing project, right? But it's not in the
spirit of least privilege and, you know, mitigating
unauthorized usage. We don't want that. We wanna make sure the
credentials are available in the signing service
only during signing, right? And in this case it should be available only after the approvals
come through, right? That is the second problem
that we are solving here. This is an example, right? So two prerequisites here, right? First thing, registration. The users, in this case users
of our code signing service, the 3S admins, go into
the system and create the signing workflow, and
also configure the approvers. So it's very signing workflow specific. So for example, let's say
Linux signing service, it's let's say NVIDIA thinks
that's a sensitive workflow. We, as in the 3S admins,
when we create the signing workflow in 3S, we also
configure the approvers. And the approvers go ahead
and register the public key. So now approvers are all set. Excuse me. What we then do is
credential sharding, right? Remember, we cannot
expose those credentials. We cannot, you know, make the
credentials sit in the Lambda or the container, right, until
the approvals come through. So what we do is we run
the automated scripts to get those CU credentials,
shard them, and we encrypt those shards with the approver keys. So what we've essentially done is sharded and encrypted those credentials, right? And then the whole sequence starts. Let's say, as a user inside
NVIDIA, I have to sign something, and I submit
a signing request to 3S. If this signing workflow has
multi-party approval enabled, what then happens is
instead of signing the file, 3S would notify the approver
saying, "Hey approvers, someone has request to sign a file, right? Please review the file and tell me whether or not to go
ahead with it, right?" The approvers then download
the approval token from 3S and sign it with their
private key, right? So once they do that, they send back or submit the approval token to 3S. So let's take a good use case scenario where everything looks
legit and all the approvers sign the token and send it back to 3S. Now 3S would take those
tokens, verify the signatures, and then it would decrypt those shards, reassemble those shards,
and thereby recreates the whole crypto user credential, right? Now the signing service
is not gated anymore. It'll just go through
with the signing operation now that it has got access
to the CU credentials. And this is how we make sure
the multi-party approval is enabled for some specific signing workflows.
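To make the credential-sharding idea concrete, here is a hedged sketch of one simple scheme: the crypto user credential is split into XOR shards that are all required to rebuild it, and the backend reassembles them only after every approver's token checks out. The XOR construction and the helper names are illustrative assumptions, not the exact 3S mechanism, and the approval check can reuse the signature verification shown in the quorum sketch earlier.

```python
# Hedged sketch of the application-level multi-party approval gate.
# XOR sharding and these helpers are illustrative, not the exact 3S design;
# in 3S the shards are additionally encrypted to the registered approver keys.
import secrets

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def shard_credential(credential: bytes, n: int) -> list[bytes]:
    """Split the CU credential into n shards; every shard is needed to rebuild it."""
    shards = [secrets.token_bytes(len(credential)) for _ in range(n - 1)]
    last = credential
    for s in shards:
        last = xor_bytes(last, s)
    return shards + [last]

def reassemble_credential(shards: list[bytes]) -> bytes:
    out = shards[0]
    for s in shards[1:]:
        out = xor_bytes(out, s)
    return out

def release_credential(shards, approvals, verify_approval) -> bytes:
    """Hand the reassembled credential to the signing step only if every
    approver's signed token verifies; otherwise signing stays gated."""
    if not approvals or not all(verify_approval(a) for a in approvals):
        raise PermissionError("signing is gated: approvals incomplete or invalid")
    return reassemble_credential(shards)
```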
Let's talk about the security operations center. So the SOC team is foundational to any organization's security posture. And it's the same for NVIDIA. Specific to 3S, what the
SOC team does is they will analyze all our operations,
CloudHSM logs, right? Application logs. And what they do is they
monitor for unauthorized access. You know, they will flag all
unintended events, right? And if there is a security incident, if the team thinks there
is a security incident, they would spin up a security incident, involve the 3S team,
work with us end to end to spin up an incident, mitigate the risk, and resolve the incident. And then finally they also
work with us to define the procedures for administrative
actions to make sure we don't run into the
same issue in the future. More preventive measure, right? To enable our SOC team, they need data. They need data from CloudHSM, and you know, our signing service. And thankfully since our
application is cloud enabled, we can just use AWS native log tools like CloudTrail to enable this pipeline. So what happens here is we
get the logs from the CloudHSM and the signing service,
we add CloudTrail, it flows into our S3,
slightly longer term storage. And then we have log aggregators running. And the log aggregators' responsibility is to review the log
information, annotate the data, and send it along to the next pipeline, which is our SOC pipeline, right?
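As one small, hedged example of what feeds that pipeline, CloudHSM management API calls are recorded by CloudTrail and can be pulled programmatically; the event source string is an assumption based on the standard CloudTrail integration, and the flagged event names are just illustrative choices.

```python
# Hedged sketch: pull recent CloudHSM management events from CloudTrail so the
# SOC pipeline (or an analyst) can review who did what. The event source string
# and the flagged event names are assumptions for illustration.
from datetime import datetime, timedelta, timezone
import boto3

cloudtrail = boto3.client("cloudtrail")

def recent_cloudhsm_events(hours: int = 24) -> list[dict]:
    start = datetime.now(timezone.utc) - timedelta(hours=hours)
    events = []
    paginator = cloudtrail.get_paginator("lookup_events")
    for page in paginator.paginate(
        LookupAttributes=[
            {"AttributeKey": "EventSource", "AttributeValue": "cloudhsm.amazonaws.com"}
        ],
        StartTime=start,
    ):
        events.extend(page["Events"])
    return events

# A SOC-style check might flag anything that isn't routine, e.g. destructive calls:
for event in recent_cloudhsm_events():
    if event["EventName"] in {"DeleteCluster", "DeleteHsm", "DeleteBackup"}:
        print("review needed:", event["EventName"], event.get("Username"))
```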
And the three major things that the SOC pipeline does are detect, alert, and resolve. Detection is looking for security threats and unauthorized events. And once they do, it's all automated by the way. And when they do, we get
alerted saying, "Hey 3S team, there is a security incident
that you might have. Please look into it." So when an incident is spun up,
we will work with them to figure out the mitigation strategies, resolve the incident, and
drive it to a conclusion. And then they would help
us close the incident. And you know, the usual things, you know, RCAs and other things go into this. But technically we have
enabled the SOC team to parse, to do incident management by
looking at our logs, right? The key takeaway here is
that we are able to leverage the cloud-enabled AWS
tools to just empower our SOC team to improve
our security posture. And let's take a use case before we move on to the next slide. So remember the whole quorum
authentication that we talked about where a crypto officer
has to reset a user password. We went through the whole approval flow. Even that flows into the
SOC pipeline, right? The SOC team would analyze those logs to make sure the crypto
officer 1, who requested to reset the user password, only reset that user's password and didn't
do anything else, right? So that is an example of
how we use SOC pipeline, even with all the other
secondary controls enabled. The last problem that
we encountered in our code signing service journey
was large volume use cases. So Daniel did an excellent job describing how the pipeline works, right? To recap, builders are
the ones who initiate a signing request; it gets to the portal. The portal will trigger a specific backend, let's say a Linux or
Windows-specific backend, which internally would use crypto user credentials to access
the keys in CloudHSM and sign it, and everything looks good. But when we put
this pipeline to use, there were two use cases that didn't work. First use case is file
sizes, in some cases, can go up to 15 to 20 gigabytes. So what do we do? What is the problem, right? You know, 3S could
still support it, right? The problem is, in this workflow, the file gets from the builder
to our signing service, we sign it, we send it back, right? The user has to download the file. So the latency over a
period of time, it adds up. It's huge, right? And these are specific
problems that we encountered for significantly lower sensitivity
signing workflows, right? So why do we have to make
the user incur the cost if the signing workflow is
not that sensitive, right? That's the first problem,
which is huge file sizes. The second problem is
too many files to sign. There were some workflows that
were calling 3S at the rate of hundreds and thousands
of signing requests, right? And again, to remind
you, these are significantly lower sensitivity
signing assets, right? So why do we need to put them
through the pipeline, right? We thought about how we solve this, right? One way that you can think of is, hey, you claim that CloudHSM is, sorry... Your system is cloud native, right? Why don't you just scale it? Sure we could, we did. But again, beyond a certain point, scaling the system doesn't give
you that much return. Given the fact that this is
a low sensitive asset, right? The second thing is, for large file sizes, you can argue that why do you
bother sending the whole file? Why not just send the digest
and get it signed? Yeah, we tried that too. That works for some use cases. But for other use cases it
doesn't because in verification of the signed artifacts,
some signing workflows, they have to verify the entire package. They cannot just trust
the digest and you know, confirm that everything is signed. So it didn't work for all
workflows, it did work for some. So we took a step back and thought about this and questioned ourselves. Why not just completely
bypass 3S, right? And that's what we did. That's what we call Direct to HSM. As the name implies, the
builders that run high volume, low sensitivity workflows directly talk to the CloudHSM
via a controlled path. So that is the direct to HSM. And the implication here is the signing is done locally on the builders. They don't have to upload, they don't have to download, they don't have to go
through the pipeline, right? Perfect. Great. But there is one caveat. What is the caveat? When we bypass 3S, we completely get rid of the logging and monitoring pipeline, right? Now the builders just talk to the HSM and sign the artifact. And we need metrics, right, to troubleshoot what's
going on with the system. And that required us to
come up with a dedicated telemetry pipeline to
address this use case. Here is a high-level
architecture of Direct2HSM. Simple. Straightforward. On the left, you see
the data center, right? Data center has a core network and we have specific builder subnets
and specific builders that handle significantly
low sensitivity assets. So for these builders, we have
fine-grained firewall rules, and we connect these builders to CloudHSM. And how do we do that? With AWS Direct Connect. So we use Direct Connect
to connect our data center with the cloud VPC that
has the HSM in it, right? And we enable everything. So let's say a signing request comes in. The builder no longer talks to 3S. The builder would just access
the keys from the CloudHSM. They technically run
SignTool in the case of Windows. And with the PKCS#11 libraries,
it directly talks to the HSM and gets the asset signed
locally, call it done. So we resolved the large volume use cases and huge file sizes. Let's get to the telemetry
part because in this pipeline, if there was an issue, right, let's say DirectConnect is down, or the outbound traffic
from NVIDIA has some issues, there is some IT issue. How do we know where
the problem lies, right? And also the problem could
be inbound to the HSM. So we have to provide
more metrics to our users to efficiently troubleshoot
what's going on with the system. And that is where we created an end-to-end telemetry pipeline. I'll walk you through the pipeline. Starts with the builder. The builder, when it gets
an event for signing, it's gonna create another telemetry event and do HTTP POST to our API gateway. The API gateway would
then process that event, and it gets to the Lambda
post-processing, and then it sends the data onto the streaming
pipeline powered by Kinesis. So from the Kinesis, it
gets to our S3 bucket, slightly longer term storage
and we can, you know, spin off new workflows
from S3, so that's the idea.
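A hedged sketch of that extract side is below: the builder posts a small JSON telemetry event, and a Lambda behind the API Gateway forwards it into a Kinesis stream. The endpoint URL, stream name, and event fields are placeholders for illustration, not production values.

```python
# Hedged sketch of the extract side of the telemetry pipeline. The endpoint,
# stream name, and event fields are placeholders, not the production values.
import json
import urllib.request
import boto3

def post_signing_event(status: str, hsm_id: str, duration_ms: int) -> None:
    """Builder side: emit one telemetry event per signing attempt."""
    event = {"status": status, "hsm_id": hsm_id, "duration_ms": duration_ms}
    req = urllib.request.Request(
        "https://telemetry.example.com/signing-events",  # placeholder API Gateway URL
        data=json.dumps(event).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req)

def lambda_handler(event, context):
    """API Gateway -> Lambda: light post-processing, then onto the Kinesis stream."""
    kinesis = boto3.client("kinesis")
    record = json.loads(event["body"])
    kinesis.put_record(
        StreamName="signing-telemetry",                  # placeholder stream name
        Data=json.dumps(record).encode(),
        PartitionKey=record.get("hsm_id", "unknown"),
    )
    return {"statusCode": 202}
```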
So the transform phase begins with S3, and we have Glue crawlers defined that crawl
the data on an interval and transform the data
into a data catalog in AWS. And the final piece of the
transform phase is Athena. So, for the users that don't know, Athena is gonna run SQL
queries on the data. So simple, straightforward, right? And the final, the most important piece here, is Amazon QuickSight. So QuickSight is
responsible for creating the reports that would let the users see some actionable metrics,
actionable data, right? So that's the end-to-end
pipeline: it starts from the builder, gets
to the extract phase, gets to the transform phase,
and then we visualize the data. And this workflow we
enabled for the Direct to HSM use case to provide more actionable metrics. This is one of our sample dashboards that we have in QuickSight in production. So if you see the dashboard,
right, it has a lot of metrics, like how many signing
workflows came into the system on that specific day and
how many of them passed. And it also provides more details on which HSM had more failures. Because as Daniel mentioned,
we have a lot of HSM clusters. We need to pinpoint a
failure to a specific HSM. And also it has metrics like which branch in our source code fails the most. What's wrong with it, right? And things like that. And other than the failure visibility, it also gives logging information. So the event data that the
builder sent to this pipeline, the first block here, that
event data is available in the same dashboard, so the
user not only has the ability to figure out how something has failed, but also can switch
tabs, go to another tab in the same dashboard and look at the log information on what exactly failed. So the final piece of the
QuickSight is the alerting. So the whole point is to
provide actionable feedback. And let's say, if the failures
breach a certain limit, someone needs to get
paged, on call, right? QuickSight allows us to
notify users via email: you know, "Hey user,
something is wrong, so please go and fix it." So with this Direct to
HSM QuickSight metrics, we not only have enabled
the large volume signing, but also provided the
metrics that they can act on. So before we wrap up, I would like to share the key takeaways. I hope by now you believe
that code signing, securing the code signing keys
is really, really important. Daniel talked about the bad
day scenarios, remember? The hacks and things like
that that could go wrong. So it's paramount to an
organization's security posture. And it starts with building
a code signing service. Building a code signing
service, again, is a journey. The journey that involves two phases. Application requirements. You need to figure out who
are the users of your system, how do they access the system, right? And things related to the application. And then you need to worry about security because this is a security application. You need to figure out how do I expose the credentials to the client? How do I create those fine
grained access policies, right? So it's a journey. And the journey starts
with choosing an HSM. In our case, we chose CloudHSM. So choose an HSM and then build your code signing service on top of it, right? And the code signing
service has to be scalable. And trust me on this,
you can start simple, but it scales really fast. A lot of users will want
to use your signing service to enable end-to-end signing and to have a better security
posture for your organization. So you have to build
your code signing service to be scalable from the get go. Make sure it's scalable,
available, durable, five nines, all those, you
know, cloud native stuff. And finally you have to think about how do I secure
my signing workflows? Remember we discussed about a scenario where crypto user credentials get leaked. You need to come up with
threat models and scenarios where your service might be compromised, your credentials may be compromised. How do you handle that? And also think about what do
I do to enable significantly lower sensitivity signing
assets, large volume assets, and how do I enable, or
integrate, the SOC pipeline into the code signing service, right? So that's the final thing
that you have to think about before your users get to
use the code signing service. And that brings us to the end. I hope this talk has
given you a starting point to get started with your
code signing service journey. And you know, we'd be happy
to hang out in the hallways if you have more questions regarding what we did on CloudHSM in general. Thank you for your patience
and please don't forget to complete the session survey. Thank you.
(audience claps)