Design a Payment System - System Design Interview

Captions
In recent years, e-commerce has exploded in growth. E-commerce is basically the trading of goods and services on the internet in exchange for a monetary payment, and every such transaction is made possible by a payment system operating in the background. However, implementing a payment system is not an easy task. Reliability and correctness are critical. Furthermore, successful businesses generate a lot of payment requests as they scale. Since even a small amount of downtime could mean a lot of lost revenue, availability is also essential. In this video, we're going to demystify the process of implementing a reliable and scalable payment system.

Before diving into the interview questions, it's useful to have a high-level view of the payment systems used globally. Normally, a customer places an order on a merchant website. To complete it, the customer has to provide the payment information, so next the merchant sends the customer to the payment form page to enter the payment details. Normally, this form page is provided by the payment gateway. To operate legally, this service has to comply with many rules, including PCI DSS and GDPR. The gateway also has several other functions. For instance, it can forward the request to advanced verification services for risk and fraud prevention. We'll discuss this later in more detail. So the main function of the payment gateway is to validate financial credentials and transfer them to the merchant's bank account.

Now the cardholder information is transmitted to the acquiring bank. This is the bank that processes card payments on behalf of the merchant. Now we can define a payment service provider (PSP). This is a broader term for third-party companies that assist businesses in facilitating payments safely and securely. The PSP usually offers services such as risk management, reconciliation tools, and sometimes even services for order management. PSPs can also be the acquiring bank, but not necessarily. Next, the acquiring bank captures the transaction information,
performs basic validation and routes the request along the appropriate card network to the cardholder's issuing bank for approval. Finally, the customer's bank receives the transaction information and responds by approving or declining the transaction. It can check that the transaction information is valid, that the cardholder has sufficient balance to make the purchase, and that the account is in good standing. Then the transaction follows the same route back to the merchant. The merchant receives the status of the transaction, which is also displayed to the customer.

First, we need to establish the requirements of the system and define what we want to achieve. At the beginning of an interview, it's critical to ask clarifying questions about the functional needs, the scope of the system, and the non-functional requirements. The first question to ask is: what kind of payment system are we building? There are basically two ways to do it. The most common one is to build a payment system on top of a PSP like Stripe or PayPal. This is normally used for online stores or any other platform that requires payments. In simple terms, a PSP moves money from the buyer's account to the merchant's account. The other way is to connect directly to banks or card schemes such as Visa or Mastercard, without a PSP. However, these direct connections are uncommon and difficult to establish. To do that, we would have to comply with a number of standards and regulations. In general, these are required to make online transactions secure and to protect users against identity theft. Getting all the compliance certifications is not impossible, yet it's very complicated, and the process is a little different in every country. So most companies, even the large ones, make use of payment gateways.

Using a PSP also implies that we don't have to store card data in our system. It is stored and processed by the PSP. This spares us from implementing extremely high-security systems. Using a
PSP makes the payment system a little easier, but we still need to take care of the logic that processes transactions. First, a payment system interacts with a lot of internal and external services. Second, when one of these services fails, we may see inconsistent states. Therefore, we need to plan for transaction failures and perform reconciliation to fix any inconsistencies.

This leads us to the system requirements. A payment system is easy to understand at the functional level: it needs to move money from account A to account B. What's more difficult is making the system reliable, especially when unexpected situations arise. A small slip could cause significant revenue loss, and there are a lot of things to consider when building a reliable payment system. In this video, we'll focus more on the technical concepts, as they are applicable to almost every system. The business part will depend on each particular system. We'll also see how to handle a large throughput of payment requests.

Let's say we need to build a payment system for an online store. Then we should provide at least the following core features. First, when a user clicks the "Place order" button, a payment event is generated and sent to the payment service. This service coordinates the payment process. First, it stores the payment event in the database. Second, it calls an external PSP to process the card payment. When we call the PSP, we should provide at least the monetary amount and the currency. Normally, this is captured from the checkout page. At this point, the user sees the payment page. This is where the payment details are collected. There are usually two ways to provide this page: we can use the form page provided by the PSP, or we can build it in-house. The second option is very uncommon and doesn't justify the effort. If we build the payment page ourselves, we also have to store sensitive payment information. Again, this is a very tedious task. Normally
it doesn't make sense to go through the whole compliance process unless you're a large company that can justify such an investment. So the common way of collecting and forwarding payment data is via a form page provided by the PSP. The main function of the PSP is then to send the card details to banks or card schemes.

After the PSP has successfully processed the payment, the coordinator service updates the wallet. To keep track of the merchant's account balance, we use the concept of a wallet. The updated balance of the wallet is also stored in the database. After the wallet service is successfully updated, the payment service updates the ledger. This involves logging all financial transactions, record by record. We use this service in post-payment analysis, when calculating the total revenue of the e-commerce website, or to support auditing. Finally, the ledger service appends the new information to a database.

Internal and external services need to talk to each other to exchange information and get things done. There are two main communication patterns: synchronous and asynchronous. Communication is synchronous when one service sends a request to another service and waits for the response before proceeding. In asynchronous communication, a service doesn't wait for the response. Instead, it continues with its own execution and either polls for the response at regular intervals or is notified when the response is available. Which should we choose for a payment system? As always, there is no straight answer.

Before comparing the two patterns, we have to consider some facts. First, any of the involved services can fail at any time. Maybe one of the services stopped responding because of an intensive workload, or maybe some host servers have crashed. Second, let's consider the communication channel. Almost all communication between services happens over the network. This channel can be slow, mainly because of congestion, and it can also be unreliable. Requests
could be lost on the network simply because someone unplugged a network cable. In general, many things can go wrong on the network. Considering these facts, we might realize that synchronous communication is not a good fit for most cases. This is mainly because it doesn't tolerate failures or high latencies, and because it doesn't isolate failures, it reduces the availability of the whole system. So there's always a risk of cascading failure: if the PSP or any other service fails, the whole system is considered down and the client no longer receives a response. The same cascading effect applies when one of the services is slow. Here, the caller service is blocked until a response is received, so if any service in the chain doesn't perform well, the whole system is impacted. These drawbacks are mainly a result of tight coupling between the services. We almost always want to avoid strong dependencies between components. This applies in programming and in system design as well.

However, in some use cases we cannot proceed without a response from the upstream server. For instance, physical store payments require real-time authorization from the API: we should know immediately whether it's a success or a failure. But we should use synchronous communication only if there is no other way. In most cases, we should prefer asynchronous communication, because the services are loosely coupled. Particularly for a large-scale payment system with complex business logic and a large number of third-party dependencies, asynchronous communication is the better choice.

Furthermore, asynchronous communication makes it easier to deal with uneven traffic and spikes. In these systems, we can use persistent queues such as Kafka that act as buffers. If the service encounters a sudden increase in traffic, we store the requests and process them at a constant pace, without blasting the service with a lot of requests at once. We even have enough time to spin up some new servers in the background and then take the
requests from the queue and process them at a faster pace. Moreover, according to its official website, Kafka is used by 7 out of 10 banks and financial companies worldwide. In conclusion, asynchronous messaging is suitable in most cases, for example online payments, fraud detection, and analytics, because it tolerates failures and high latencies.

In a payment system, we can encounter at least the following kinds of issues: system failures, meaning the usual network and server failures; poison pill errors, when an inbound message cannot be processed or consumed; and functional bugs, where there are no technical errors but the results are invalid. So implementing a reliable payment system comes with a lot of challenges. The good news is that we have many tools at our disposal to deal with them.

In payment systems, many services need to communicate with each other to complete a transaction. But how can we guarantee that no request message gets lost on the network, or that a message is received even though some services may be unavailable? To guarantee transaction completion, we can use a message queue like Apache Kafka. For any order placed or paid, we also create an event in Kafka. This component helps us persist the communication messages so that they are not lost even when things don't go as planned. In this case, the payment operation doesn't complete successfully until the event is safely stored in the message queue. Now, we might object that Kafka can also fail. However, since its job at this point is so simple (it just stores messages), its availability is normally much higher than that of other business-related services. These messages can then be consumed individually by each interested service. It is also the responsibility of the consumer to mark a message as seen or consumed only when it has been successfully processed and stored in the database. In this way, once the sender successfully posts the
message, it is always stored on disk in at least one data store. In addition, it is most probably also replicated. So using this pattern, we can guarantee that messages will be delivered to the other services.

Sometimes, in order to complete a transaction, we may need to get some information from other services. For instance, we may need to get the search results of a query. In those cases, it doesn't make sense to use a message queue. Still, payment requests may fail because of network issues and other problems. So to make sure that the request travels safely over the network, we have to plan for failure. Here we have some useful tools.

For example, a customer may try to make a payment, but the request fails because the network connection is unstable. In those cases, it makes sense to retry the operation, because network problems are usually temporary, and on the second or third attempt the request might succeed. This is pretty straightforward to understand. However, we have to pay attention to the number of retries and the time intervals between them.

Let's discuss the time intervals. Here we may apply several strategies based on the failures we might encounter. The most basic implementation is to retry immediately after the failure. However, it's unlikely that the issue has been solved in such a short amount of time. Furthermore, it's important to give the called service a break to recover if it was down. Otherwise, we waste computing resources and also overload the system. So we can retry at fixed intervals or, better yet, at incremental intervals. Now the system has a little break to recover. A still more advanced retry strategy is exponential backoff. Here we double the waiting time after each retry. It is usually recommended to use something similar to the exponential backoff algorithm. This way, we can ensure that clients aren't hammering an overloaded server and contributing more to the
problem. Going the extra mile, when multiple services depend on one service, it's also a good idea to mix in an element of randomness. If a problem with one single service causes a large number of clients to fail at about the same time, then even with a backoff strategy the retry schedules could be aligned closely enough that the retried requests hammer the troubled server again. We can address this problem by adding some amount of randomness, or jitter, to each client's wait time. This spaces out the requests across all clients and gives the server some breathing room to recover.

A timeout is pretty straightforward to explain, but it might not be easy to get right. The goal of a timeout is to avoid unbounded waiting times for a response. When the time to respond is too high, the operation is aborted and the request is treated as failed. Timeouts are used in almost every application to avoid requests that get stuck forever. However, dealing with timeouts is not trivial. Imagine that an order placement failed in an online shop. The buyer sees that the order wasn't processed, but in the back end several things could have happened. We cannot be sure whether the payment was successful but the response timed out on the way back, whether the request is still in progress in the payment system, or whether the payment system was not even reached. All we know is that the request timed out. So what status should we give to the customer? If we mark the request as failed, the customer might think that the order didn't succeed. But maybe it did, and they actually got charged, yet since there was no response the request was marked as failed. What happens if the customer retries the operation? In this case, they might get charged twice. Later, we'll see how to avoid double payments using the concept of idempotency. We can also retry the request ourselves in the back end, using the strategies we saw previously. However, again we have to be careful not to charge the customer multiple times. Another question here is: how big
should we set the timeout? This will depend on each particular endpoint. However, we should set timeouts high enough to allow slower responses to arrive, but low enough to stop waiting for a response that is never going to arrive.

A fallback enables a service to continue its execution even if requests to a service it depends on are failing. For example, while making a payment, a request might be sent to the fraud check service, but let's say that this service returns an internal server error. Instead of aborting the whole computation because of a missing response, we could fill in a fallback value. Otherwise, if the service is down, no payment will go through. So to avoid losing customers, we can fall back to a simple business rule. For example, if the amount is reasonably small, we can simply let the transaction go through. This is a compromise between risk and keeping the customers happy. Next, we'll see what we can do if a fallback value is not acceptable.

Some failures can persist for a couple of minutes or even hours. What can we do in those cases? We can simply cancel the request if the failure is acceptable business-wise. However, there could be more to the story. In some systems, an error may persist because the information is not compatible between the sender and the receiver. In that case the error is not retryable, because no matter how many times we resend the information, it will always fail. These incompatible messages are also called poison pill errors. To isolate the problematic messages, we can save them for later debugging. For instance, we can move them to a dedicated queue so that the broken messages get out of the way. This pattern is also known as the dead letter queue. Later, these messages can be inspected to determine why they were not processed successfully. In other cases, an error may persist because one of the services is down, maybe for a couple of hours because of a serious problem. In that case the error is retryable, so we may still want to accept the requests
because we know we can process them later, when the failed service recovers. Basically, we store all the failed transactions in a persistent queue to be consumed later. When the crashed service is up again, we can pick the transactions from the queue and process them.

If a payment request fails due to a network error or any other reason, we should have a mechanism to safely retry the operation without charging the customer twice. To achieve this, we'll use the concept of idempotency. From an API perspective, an idempotent operation is one that has no additional effect if it's called more than once with the same input parameters. Let's walk through a typical scenario to understand how it actually works. Let's say that a customer makes a payment. The payment goes through our payment system and is successfully processed by the PSP. However, due to network errors, the response fails to reach our payment system. The user gets back an error, so they will probably click the pay button again or retry the payment. Now, how can we avoid double payments?

At this point, we'll make use of an idempotency key. This is usually a unique value that is generated by the client and expires after a certain period of time. To perform an idempotent payment request, an idempotency key is added to the HTTP header. UUIDs are commonly used as idempotency keys, and many tech companies use them, including Stripe and PayPal. Usually, this UUID is also the ID of the payment order. Then, when the receiving server gets the same payment details again, it can identify the request as a retry. To support idempotency, we can use the unique key constraint of any database. When the payment system receives a payment, it tries to insert a row into the database table. A successful insertion means we have not seen this payment request before. If the insertion fails, it means the key is a duplicate, so the second request will not be processed. However, as the key already exists,
we can return the latest status of the previous request. Moreover, if multiple concurrent requests are detected with the same idempotency key, only one request is processed and the others receive a 429 Too Many Requests status code. In conclusion, an idempotency key gives us an exactly-once guarantee.

Lastly, if our payment system stores and serves a lot of data but is limited by a single machine, we need to graduate to a distributed system. Here we can spin up more database instances and also more payment services. Then we can make use of the following benefits. The first is redundancy, which is achieved using replication. A distributed system can usually hold multiple copies of data and processes. This helps us improve reliability by providing a backup in case one component fails. Second, in distributed systems we can distribute the workload across multiple machines, which can improve reliability because we reduce the risk of a single component being overwhelmed. Third, distributed systems can be designed to tolerate failures, allowing them to continue functioning even if one or more components fail. And last, we can easily scale up or down by adding or removing components, which again helps us improve reliability by allowing the system to handle increased or decreased workloads. However, we should be aware that in a distributed system the communication between any two nodes can fail, causing data inconsistency. Replication lag can cause inconsistent data between the primary database and the replicas. So we should be aware of which consistency level we use when we read or write data.

One of the most effective ways to protect data is encryption. First, we should encrypt data at rest. This involves converting data into a secure format that cannot be read without a key. It can be done using software tools for disk encryption or database encryption. Then we should encrypt data while it's being transmitted over a network such as the internet. This can be done at
multiple levels. We can use a VPN, which secures and encrypts connections between a device and a network. At a higher level, we can use the TLS protocol. It provides confidentiality, data integrity, and authentication for the data transmitted between two parties, such as the client and the server. TLS should not be confused with SSL, its deprecated predecessor (SSL 2.0 was prohibited in 2011 and SSL 3.0 in 2015). Then, at a higher level still, we'll use HTTPS to transmit data. All these tools can be used to protect data transmitted over a network, but each has a different scope and use case.

Then we can implement access control. This involves restricting access to data to authorized users only, and using methods such as two-factor authentication to verify the identity of the users.

Then, to avoid vulnerabilities, we should regularly update software libraries and the operating system. This involves keeping all software up to date with the latest security patches.

Then we should back up data. If attackers manage to get to the encrypted data, they can encrypt it again and then ask for a payment in exchange for the encryption key. However, if we regularly back up the data, we can ensure that it can be recovered in case of loss or damage.

Finally, users should use uncommon passwords. With common passwords, like the word "password", it's very easy for an attacker to guess the actual password if they have the hashed version. They can use a rainbow table, which is basically a precomputed table for reversing password hashes. "Rainbow" refers to the different colors used in the table to show the various hashing and reduction functions. So a user should use long, complex passwords that are difficult to guess or crack.

Monitoring data integrity is a powerful technique to secure business data against both known and unknown threats. In this process, we check whether any changes have been made to vulnerable data. We assess the files of the databases and file systems and then generate a cryptographic checksum as a
baseline. Then we regularly recalculate the checksum of the same resources, compare it to the baseline, and generate a security alert if we detect changes. This way, we can even detect malware within the operating system or other applications. However, this process can be very resource-intensive, especially when dealing with large amounts of data. So it's crucial to focus on the data and files that are most vulnerable to cyberattacks, so that you invest your resources efficiently. This data can be user credentials, privileges and identities, encryption key stores, operating system files, configuration files, and application files.

For a payment system, reliability and fault tolerance are key requirements. To tackle these requirements, we discussed how to make use of the following tools, among others: redundancy, to enable resilience during internal system failures; patterns for payment guarantees that use Kafka's ability to persist messages, so that they are not lost even if the messaging system crashes; strategies for retries, timeouts, and fallbacks, to make the system robust and predictable; message queues, to avoid overloading the system; and idempotent message handling, to allow clients to retry requests as needed. Not doing so could leave data in an inconsistent state or, worse, result in double payments.
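
As an illustration of the retry strategy discussed in the captions (exponential backoff with some randomness added to each wait), here is a minimal Python sketch. The function name `call_with_retry`, its parameters, and the choice of "full jitter" (a random wait between zero and the current backoff ceiling) are assumptions made for this example; the video does not prescribe a specific variant.

```python
import random
import time


def call_with_retry(operation, max_attempts=4, base=0.5, cap=30.0,
                    sleep=time.sleep, rng=random.random):
    """Call `operation` until it succeeds or `max_attempts` is reached.

    Hypothetical helper: the wait ceiling doubles after every failed
    attempt (base, 2*base, 4*base, ...), is limited by `cap`, and the
    actual wait is a random fraction of that ceiling (jitter).
    Only ConnectionError is treated as a temporary, retryable failure.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            ceiling = min(cap, base * 2 ** attempt)
            sleep(rng() * ceiling)  # jitter spreads clients apart
```

The `sleep` and `rng` parameters exist only so the waiting behavior can be replaced in tests; in normal use the defaults apply. Note that retrying a payment call like this is only safe when the call itself is idempotent.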
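
The idempotency mechanism described in the captions (insert a row keyed by the idempotency key, and treat a unique-constraint violation as a retry) can be sketched as follows. This is a simplified illustration using an in-memory SQLite database. The table layout, the function names, and the `charge` callback standing in for the PSP call are all assumptions, not something shown in the video.

```python
import sqlite3
import uuid


def open_store():
    """Create an in-memory table whose primary key is the idempotency key."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE payments ("
        " idempotency_key TEXT PRIMARY KEY,"
        " amount_cents INTEGER NOT NULL,"
        " currency TEXT NOT NULL,"
        " status TEXT NOT NULL)"
    )
    return conn


def submit_payment(conn, idempotency_key, amount_cents, currency, charge):
    """Return (status, is_new). `charge` is a hypothetical PSP call."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO payments VALUES (?, ?, ?, ?)",
                (idempotency_key, amount_cents, currency, "PROCESSING"),
            )
    except sqlite3.IntegrityError:
        # Duplicate key: this is a retry, so do not charge again.
        row = conn.execute(
            "SELECT status FROM payments WHERE idempotency_key = ?",
            (idempotency_key,),
        ).fetchone()
        return row[0], False

    status = charge(amount_cents, currency)  # first time we see this key
    with conn:
        conn.execute(
            "UPDATE payments SET status = ? WHERE idempotency_key = ?",
            (status, idempotency_key),
        )
    return status, True


if __name__ == "__main__":
    store = open_store()
    key = str(uuid.uuid4())  # generated once by the client, reused on retry
    print(submit_payment(store, key, 1999, "USD", lambda a, c: "SUCCEEDED"))
    print(submit_payment(store, key, 1999, "USD", lambda a, c: "SUCCEEDED"))
```

The sketch leaves out key expiry, the handling of a `charge` call that raises, and the concurrent-request case mentioned in the captions; a real system would need all three.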
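
The integrity monitoring process described in the captions (record a cryptographic checksum as a baseline, recompute it regularly, alert on differences) might look roughly like this in Python. SHA-256 and the function names are assumptions chosen for the example; the video does not name a hash function or a tool.

```python
import hashlib
from pathlib import Path


def checksum(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_baseline(paths):
    """Map each monitored file path to its current checksum."""
    return {str(path): checksum(path) for path in paths}


def detect_changes(baseline):
    """Return the sorted paths that are missing or differ from the baseline."""
    changed = []
    for path, expected in baseline.items():
        if not Path(path).exists() or checksum(path) != expected:
            changed.append(path)
    return sorted(changed)
```

A scheduled job could call `detect_changes` on the stored baseline and raise a security alert whenever the returned list is not empty. To keep the cost down, the baseline would cover only the most sensitive files, as the captions suggest.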
Info
Channel: High-Performance Programming
Views: 272,122
Id: olfaBgJrUBI
Length: 31min 39sec (1899 seconds)
Published: Wed Jan 11 2023