Hardware and Software Architecture of The Machine

Captions
Good morning. Thanks, everybody, for coming to my presentation this morning. It's a good follow-on to Willy's presentation just before, where we talked about the low-level hardware that we're hoping to use in our machines eventually. I'm going to talk today about an ongoing research project at Hewlett Packard Enterprise: The Machine. How many of you have heard of The Machine? Awesome — I guess our marketing pitch is working.

I'm going to describe the hardware. A lot of the online literature you see about The Machine describes it in very vague terms, and in order to talk about software I do need to describe the hardware that we're building. I'm going to try to give you enough detail to actually go out and build your own — well, not quite, but enough to actually understand what we are building. Then I'm going to go into a lot more detail about the software. The awesome part about the software is that every time I get to present about it, it's completely different, because we've spent a bunch more time building it since the last time I presented. The last time I talked about this was in London in December, and in the last two months we've done a huge amount of work, so I've got a lot more content: the hardware part of my talk gets shorter and shorter, and the software part gets longer and longer. I've also been listening over the last four days to people asking me questions about The Machine, and to other people doing work in similar areas, so I've tried to add some content and update my slides — I added another slide since the last talk to give you more relevant information.

So what is The Machine? Well, we started out by noticing that a bunch of interesting hardware technologies were coming online at a very similar time. SoCs are getting more plentiful, with many more threads in them, and you can buy SoCs to do a lot of different things — and by SoC I just mean a computational unit: anything that does computation but doesn't have storage in it. We're starting to be able to build a photonic interconnect right to the silicon, and by that I mean actually taking a piece of glass fiber and gluing it to the top of your CPU. The awesome part about that is you get rid of the latencies of copper, the power consumption of copper, the diameter of copper, and the way signals degrade in copper, so photonics gives you a huge number of advantages. The other thing happening recently is a change in the way we're storing data. We've had DRAM for a long time — how many of you started out your careers programming on machines without DRAM? Yeah, see. We had persistent memory, as we called it back in the day: core memory. I used a machine with core memory. Willy was talking about total system persistence in his talk, and the dream of being able to turn a machine off and back on and have it keep going — I had that on my PDP-11; we did that in 1979. So we're getting back to the future, as is often the case.

We're talking about these three different technologies converging together, and at Hewlett Packard Enterprise we're trying to synthesize what a computer would look like if all these technologies were available. We have electrons driving our processing, we have photons transmitting our data, and now we're talking about using the memristor, which uses ions to store information persistently.
We're moving from this processor-centric world, where you have a network connecting all of your processors and each processor has a tiny little pool of data attached to it, and where doing computation on large amounts of data means spending an enormous amount of time transferring data around this crazy architecture — when the data is really the heart of your enterprise. Excuse me, I'm channeling my corporate namesake, I guess. The problem with this architecture is that if you have a problem which doesn't fit in any one of these individual processors' memory, and where the data needs to be shared among all the computational elements, you're completely bound up by the communication overhead. If we take the memory out of the edge of the computer and bring it into the center — a memory-driven computing architecture, where the memory is the center of the machine and the processors communicate with each other through that memory — all of a sudden the scale of the problem you can attack is dramatically increased.

Here we have the classic computing model, where you have applications, file systems, and networking stacks inside this enormous operating system. I mean, I love Linux, it's a great system, but the fewer CPU cycles I spend down there, the faster my application can run. When we talk about a memory-driven computing architecture, we're talking about the application having access to all of the storage in the machine through a load/store interface — I don't have to do I/O operations at all. There's this tradition we've built up in the database world where an application opens a file, which talks to a disk driver, which talks to a disk. And what does that application want to do? It wants to access that data with load/store instructions, so it memory-maps the file, and now you have this huge overhead: every time you want to persist that file you have to think about msync and all these huge I/O latencies. What we really want is the application talking directly to the storage — getting Linux out of the way, so you have an application talking directly to your fabric. That's what we really want to get to.

So how are we building this? Well, we're not building a giant pool of memory that is separate from your processing. What we're building today is the next-generation memory interconnect, which is a memory fabric. Instead of connecting our processors, network, and memory through a PCI bus or through NICs or anything like that, we're constructing an interconnect that connects processors and memory at the memory level — right at the fastest bus on your processor, we're connecting all of the processors and memory together. For our first instantiation of The Machine we're constructing individual nodes that have a combination of computation and storage, and they're all connected through the next-generation memory interconnect, which is a switch-based fabric. That means that instead of communicating through a NIC, processor to processor or memory to memory, I can communicate directly through the memory interface. So imagine getting rid of the PCI bus in your machine and having everything go through the fastest interconnect you have — that DRAM interface — except we're using a wide variety of different memories.
Each node in this architecture has three different pieces: a compute complex, a memory complex, and the switching fabric. The interesting thing is that the compute complex is not directly managing that memory at all; the storage complex on this system is part of the larger fabric, because the compute complex talks through the fabric even to the memory local to its own node. That means that when you have a large collection of nodes in The Machine, as shown in this picture, the SoCs have to communicate through the fabric to all of the memory. So when a node of The Machine — when an SoC on one node — wants to communicate with storage on another node (let me go forward a couple of slides here with the remote — oh, come on; thank you, OpenOffice, or LibreOffice as it is, of course), it goes through the fabric without ever talking to the SoC on that other node. All of a sudden the software part of your communication overhead is eliminated, and the SoCs get to communicate with all of the storage in the machine independent of the processing elements on those nodes.

One of the things we can do with this architecture is get rid of the SoC on any of those nodes and just put more storage there, or get rid of the storage on a node and put in more computation. We could have heterogeneous computation in this environment — we don't have to use all the same processors. When Willy was talking at the end of his talk about how we don't really have to worry about heterogeneous computing, because who would be crazy enough to put a big-endian CPU and a little-endian CPU on the same NVDIMM — well, we are those crazy people. We're planning on making a system where you can have any computational unit you want connected directly to the memory of the system, and so your persistent memory becomes your communication and collaboration fabric. We talk about a community of computing — well, here's your community of computing: you all share the same memory, and everybody gets to work on the same data all at once.

What does this look like in hardware? Here's the node we're building for the first instantiation of The Machine. It has a little SoC — a 64-bit ARM processor — and it has the FPGA we're using to implement the first version of our next-generation memory interconnect. You can see those are actually separate boards; there's a little gray space between the compute complex on the right and the memory complex on the left, and they communicate through the fabric, with a cable that goes between them.

OK, so what are the capabilities of this machine? The SoC has a little bit of local DRAM, and it's actually going to run a version of Linux right in that DRAM; we're going to treat the fabric-attached memory as a giant device. So how much memory do we have? The SoC has a small amount — only 256 gigabytes — of local RAM, and then, connected over the fabric, each node in The Machine will have four terabytes of RAM. So you have a four-terabyte device and 256 gigabytes of local DRAM. How many of these are we going to connect together?
Well, remember, I can interconnect as many of these as I want, so I can connect a lot of them. Here's an enclosure that holds eight systems — that's 32 terabytes of RAM — and I can stack those enclosures into a rack, ten enclosures high, for a total of 80 processors and their memory. So the first version of The Machine we're going to build has 320 terabytes of memory-accessible storage, which means I'll be able to support problems that today you can't even really contemplate putting in a single address space. What's one of the fundamental problems with 320 terabytes? It's bigger than 256 terabytes, which is the largest address space of any processor I can buy — except from IBM — so I have a fundamental problem there.

Now I can build a rack of those, and of course what I really want is a data center full of these, all connected at the address-space level — thousands of machines to take over and run all of your data. The address space we're constructing is actually 75 bits, so we're constructing a 32-zettabyte address space; in theory I can connect an awful lot of these computers. As I said, that 320 terabytes is just the prototype we can build today; we plan to build larger machines in the future. When Willy was talking about 6 terabytes as an interestingly large amount of storage — yeah, that's kind of a good start, although from HPE today you can actually purchase a machine with 24 terabytes of RAM, and that's what we're doing a lot of the prototyping on. I'm trying to remember the commercial name of that thing — it's not coming to me right now — but there's a machine you can purchase from HPE today that has 16 processors and 24 terabytes of RAM in a single system image, and we're doing a bunch of the software development for The Machine on it, which has been very useful.

So we have this fabric-attached memory, 320 terabytes of it. One problem is that our processor can't address all of it at once. Another problem is that you have 80 different computers talking to the same storage, so you want to be able to protect the memory from each of the processors. One of the fundamental principles of The Machine architecture is that we provide security at a hardware level, underneath the operating system. How many of you have had your Linux systems exploited and the operating system compromised over the network? Yeah, all of us. We are putting hardware protections underneath the operating system to protect the storage from your operating system, so that when the operating system is compromised, it can only compromise the memory it was granted access to, and not the entire storage of the machine. So we're going to construct a mapping between the physical addresses in the SoC and the bus addresses on the fabric — a dynamic physical-to-fabric mapping. Why do we have to do that? Because I only have at most a 48-bit physical address coming out of the CPU, and I need more than 48 bits on the fabric to talk to my 320 terabytes of memory. So I'm having to put another memory-management layer between the physical addresses of the processor and the memory it's sitting on, and that turns out to be really painful.
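Just to sanity-check those capacity and address-space numbers (this is my own arithmetic, not something from the slides):

    8 nodes per enclosure x 4 TB per node  =  32 TB per enclosure
    10 enclosures per rack x 32 TB         = 320 TB in the first rack
    2^48 bytes = 256 TB  -- the most the SoC's physical addresses can cover, already smaller than one rack
    2^75 bytes =  32 ZB  -- the Z-address space the fabric is designed around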
I'm going to show you some more slides here — let me go over and show you this one. Here's how memory is accessed in The Machine. You have the SoC core, the 64-bit ARM: it generates 48-bit virtual addresses, and it generates physical addresses somewhere between 44 and 48 bits. Those go out over the cache-coherent interconnect of the processor, and we plug into that: we take the physical addresses coming out of it and map them into logical Z addresses that go out on the next-generation memory interconnect. Those give us 53 bits to play with — the current FPGA implementation is limited to a mere 8 petabytes. That goes out through the firewall, which does our permission checks, and then we translate the 53-bit logical Z address before we transmit it, sending it out as a 75-bit Z address on the fabric.

Some of the interesting parts here: the PA-to-LZA mapping — the piece that does the translation — is separate from the notion of protection. The hardware protection is the firewall down there, and that firewall physically protects the memory from the SoC. Each of these firewalls has a table of permission bits for all of the memory within the fabric — right now we support up to eight petabytes — and we don't want to use very fine granularity, because every single firewall has to have a linear table of these entries. So the protection unit we're using is something we call a book, which is a collection of pages — what do you call a collection of pages? You call it a book. Books are small amounts of memory in our world, only eight gigabytes apiece, and every firewall has a table of which eight-gigabyte books each SoC can access. Those are programmed outside the realm of the operating system; I'll show you how the software does that in a minute. So we physically protect all of the memory in The Machine from the SoC and from the operating system itself with this firewall.

One of the things The Machine is very focused on — this is the slide I just added, so it may seem a little out of sequence — is security in this environment. We don't trust anybody in our world, and I'm sure you don't trust anybody in your world either. What we're currently doing is encrypting all of the data in the memory, so if you pull the memory out of The Machine, steal it away to another machine, and plug it back in, there's nothing you can do with it: the data in that device is encrypted. The keys are provisioned when the machine starts up, so when the machine is powered down or reset, the memory is completely unavailable to everybody, because it's encrypted. What we're hoping to do in the future is encrypt the data at the SoC level, so that every piece of memory an SoC talks to has a key specific to that SoC, or to the group of SoCs working with that memory. Then, if you want to forget what that memory contains, all you have to do is discard the key, and that memory is no longer readable by anybody on the planet. Those keys would of course be provisioned at boot time and be updatable. The goal is that if any piece of the machine is being monitored by another agent, either within the machine or outside it, the data we're transferring is encrypted at all times. We're trying to make it as secure as we know how.
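To give a feel for the translation and firewall granularity just described, here's a rough sketch of the book arithmetic — my own illustration with made-up names, not HPE's actual firewall code:

```c
#include <stdbool.h>
#include <stdint.h>

#define BOOK_SHIFT 33u                                 /* one book = 8 GB = 2^33 bytes        */
#define LZA_BITS   53u                                 /* logical Z addresses: up to 8 PB     */
#define MAX_BOOKS  (1ull << (LZA_BITS - BOOK_SHIFT))   /* 2^20 books -> a million-entry table */

/* 320 TB of fabric-attached memory / 8 GB per book = 40,960 books, which is
 * where the "only about 40,000 files" limit mentioned later comes from.      */

/* The book number is just the upper bits of a logical Z address. */
static inline uint64_t book_of(uint64_t lza)
{
    return lza >> BOOK_SHIFT;
}

/* Hypothetical shape of a firewall check: one permission bit per book for a
 * given SoC, kept in a flat table -- coarse granularity is what keeps the
 * table small enough to live in the hardware.                                */
static inline bool soc_may_access(const uint8_t *perm_bits, uint64_t lza)
{
    uint64_t b = book_of(lza);
    return (perm_bits[b >> 3] >> (b & 7)) & 1;
}
```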
OK, so I've told you about The Machine hardware and how it works; you understand that it's a giant network of computers that communicate through this fabric. We also wire them up with Ethernet, because we figure somebody's going to want to get data in and out of this device, and I'll show you one really interesting use of that in a while.

What I want to talk about now is what we're doing for an operating system. When we decided to do an operating system, it was pretty easy to know that we were going to have to support at least one, and that one, of course, was Linux. There are other people thinking about doing other operating systems — a company in Redmond might do their operating system on our hardware — and there are a bunch of researchers doing next-generation OS work; people have talked about replacing the memory system as a reason to do a new operating system. Well, we're replacing the memory and a bunch of other stuff, so we've got OS researchers off doing their own things; meanwhile, back in the real world, we're doing Linux on The Machine. All the operating system work we're doing is going to be free software — we're not doing anything which isn't free software, of course.

So we're modifying Linux to support our crazy hardware. It's actually not a huge modification, if you think about it the right way. Each of these nodes in The Machine has a processor, a little bit of local DRAM — 256 gigabytes — a NIC, and this weird device: 320 terabytes of available memory that it can map. If you think about it, it's just a computer with a strange device attached. OK: run Linux on the computer, then write a device driver for the crazy device — not a really big deal. So from a kernel perspective, getting the kernel running on this thing isn't a huge amount of work: we have a little device driver that talks to the crazy device, and we have some user-space code that knows how to talk to that. The kernel changes necessary to run on this are fairly modest. However, we need to build a larger system that manages the entire enclosure and manages allocation of that underlying hardware, and we have to have libraries that talk to that. Willy talked about the libpmem work; we're porting libpmem and a whole bunch of other libraries to support this.

In order to allocate and share that resource of 320 terabytes, we're creating a new file system. It is not a high-performance file system designed for rapid transactions: it allocates storage in eight-gigabyte units, and we expect you're not going to do that very often, so it is the simplest possible file system we could construct that would barely work — I'll show you how it works in a while. We have an interesting architecture for abstracting data communication between the nodes of The Machine: a lot of people use RoCE or RDMA for transferring memory in a cluster environment, and we're taking the RDMA APIs people have seen for applications like that and providing an abstraction on top of them — more on that in a while. And of course an 80-node prototype effectively looks like an 80-node cluster, so we have a bunch of cluster-management utilities that get OSes running on the various nodes, do data logging, all that stuff, and I'll talk about that as well.
Here is the current version of the system software for The Machine — what we've built today, what we're actually running tests on internally. We have a management server that does all of our shared management. Right now that shared management runs a daemon that supports shared allocation; we call it the librarian, because once you have books full of memory, of course you have to have a librarian that manages your books. It was kind of fun when our recent-college-graduate engineer came up with this whole library metaphor, and it was like, OK, I suppose we can use that, it'll be kind of fun. The thing is, "librarian" as a term isn't otherwise used in Linux, but we really don't want to talk about a "library" of books, because library is a heavily overloaded term in our world, so we avoid it: a collection of books in our world is called a shelf. We don't call it a library, because otherwise it would be very confusing.

Inside Linux user space on each node we've added a bunch of different little systems. We have a control interface to the librarian file system. We have a bunch of new libraries: the libpmem work from the pmem.io project that Intel is running, the regular POSIX APIs, the RVMA stuff we're building that I'm going to talk about, and of course an atomics library. You might ask why we need special support for atomics. Well, the big problem with an 80-node cluster of processors sharing 320 terabytes of RAM — and our goal is to be able to build a data center full of this — is that it's not cache coherent. From a hardware perspective that's awesome; the hardware architects are like, "no cache coherence, how awesome is this," and the software people are like, "oh my god, we're going back to the 80s, this is really hard." So the hardware actually does atomic support in the fabric itself — you can do atomic transactions in the fabric — and to expose that up to user space today we're providing a library. Of course we'd really love for the processors to talk to this stuff directly, and maybe they will someday.

The big piece we're building right now is the librarian; that's what we've focused a bunch of time on. It's our storage allocation system, and it's machine-wide, which is to say all 80 nodes in the current prototype share the same allocator for the global storage. It actually runs on a separate machine. When you're building a new computer, and you have to have a system that's going to manage all of the nodes in the machine, the thing we decided we didn't want to count on was each of those nodes being reliable from power-on. So we're just taking an existing DL380 server that HPE builds and sticking it in the same rack with all the rest of this stuff; it's called our top-of-rack management server — whether it actually fits in the same rack or not is unknown — and it's a separate machine sitting on the same network, but it doesn't have access to the fabric. So here we have a file system whose metadata lives on a separate machine on the network, while the data is only accessible to the nodes on the fabric — it's kind of a weird file system. The communication between the nodes within The Machine and the top-of-rack management server for the librarian file system is all over TLS — of course, everything is secure — and they are only metadata operations.
Of course, the top-of-rack management server, not being a participant in The Machine, has no access to the fabric, which means I can't store the metadata in the fabric itself; in fact, I'm storing the metadata for the file system in a SQL database. Talk about high-speed file systems — well, that is not this thing. The other thing the top-of-rack management server does, of course, is manage all those firewalls — remember, the permission blocks that control access from the nodes to the fabric. The librarian is actually responsible for programming those, and the way that happens is a little wiggy; I think I have a slide on that — let's see, where am I in my presentation, I can't even tell.

The goal here is to provide a kind of two-level allocation scheme, where the librarian is responsible for managing memory in larger units — the eight-gigabyte books — and then we provide systems underneath that manage allocation at a finer granularity. For instance, one thing you could obviously do is take one of these librarian collections of books that we call shelves and create an ext4 file system on it — a fairly straightforward plan — and then the ext4 file system would run within a particular node and talk to those books. Another thing you'd really like is a high-performance, fine-granularity, distributed file system that runs across all the nodes in The Machine. Anybody got one of those that runs in a shared-memory environment across 80 nodes? Nobody here. By the way, if you know how to build such a thing and are interested in a job, come talk to me — I want that, and I can't have it today, so we're hoping to be able to build it. We're also building a bunch of what we call retail memory brokers, or retail allocators — a lot of them are like the things you saw in the libpmem talk earlier — and an object store. We're hoping there's going to be a whole interesting line of research in how you do allocation across operating-system instances using shared persistent memory, so we're hoping to spark a bunch of work in that area.

OK, so here's how the fabric-attached memory is actually managed within The Machine. You have this top-of-rack management server that has the librarian and some other services that manage things like user IDs and passwords, and then within Linux user space on every node you have this weird collection of little processes — all actually written in Python. We have the librarian file system proxy, and we have the firewall proxy. And you're thinking, wait a minute: the operating system isn't supposed to be in charge of its own firewall, because we're supposed to be protecting the memory from the operating system, so what's the operating system doing in the middle of the communication between the librarian and the firewall? It turns out that, the way our hardware is built, we're going to stick the firewall controller down in the ARM TrustZone. The problem is that the TrustZone has no way of talking over the network yet — it's all very convoluted. So what we're actually going to do is pipe all the firewall communications through the firewall proxy, but they're all going to be encrypted, so the firewall proxy has no idea what they are: the OS is transmitting the data but not interpreting it. And we have some glorious plans about how, if the OS fails to actually send the firewall commands, we're eventually going to kill it.
The top-of-rack management server has a power switch that can turn the power off on the nodes, so if a node fails to cooperate, we'll just shoot it in the head.

For the librarian file system, we want to do secure communication with the top-of-rack management server, and that's really hard to do from the kernel — TLS is not currently available in the kernel. So what we're doing is implementing the librarian file system by forking FUSE. We have this weird FUSE module in the kernel that implements our librarian file system and forwards all the metadata operations up to user space, to the librarian file system proxy, which wraps them in TLS and ships them over the network. It's kind of a wiggy system, but it turns out to be a convenient development environment, because most of the file system work we're doing is actually written in Python, in user space, on both sides of the link. Then within the kernel we have the other pieces it communicates with: the file system itself, which is this FUSE fork, and of course if you want to build a local file system, you can use the loop block device driver to stick a block device on top of one of these librarian file system objects.

The librarian file system talks about shelves, and when you map that into POSIX space, a shelf turns into a file. So all of a sudden you have this global resource which can be treated in POSIX space as a convenient file. With that POSIX API you can open the shelves, or files; you can set their size with ftruncate or posix_fallocate; and you can map them into your own memory. Because we've written all of our own device support in the kernel, mmap just works — awesome. Unlike the work Willy was talking about, where you had to hack the file systems to support direct mapping, direct mapping is all our file system wants to do, so it supports it natively; we don't need any of the DAX work that the Linux kernel community is worrying about. Of course, the allocation unit you get with our file system is kind of big: even with 320 terabytes of memory there are only about 40,000 books available, so you can only have 40,000 files in our file system. That makes for a lot of simplifying assumptions in the file system design. There's no sparseness. The POSIX locking APIs are node-local only, so if you have multiple nodes trying to do file locking — sorry, we didn't do that. We cut out a bunch of the semantics, because we're building a research vehicle here, not a product, which is very convenient: it's like, oh yeah, our researchers promise to never need that — awesome, we won't do it. There may be some additional semantics required when we take this thing into production; maybe, who knows.

Here's a diagram you can't read, but the slides will be available online, so you can look at it later. It shows the kinds of interaction paths we worked through when we designed the librarian file system — between the application, the kernel, the librarian, the aperture, and the firewall; there are a bunch of interactions going on. We did probably half a dozen of these diagrams to make sure we could actually implement things, and it shows the data flows between the various pieces.
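To make that POSIX path concrete, here's a minimal sketch of what an application might do with a shelf; the /lfs mount point and the file name are assumptions for illustration, not necessarily the real paths:

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define BOOK_SIZE (8ULL << 30)   /* shelves grow in 8 GB books */

int main(void)
{
    /* A shelf shows up as an ordinary file under the librarian file system mount. */
    int fd = open("/lfs/my_shelf", O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("open"); return 1; }

    /* Size it; the librarian hands out whole books, so ask for one. */
    if (ftruncate(fd, BOOK_SIZE) != 0) { perror("ftruncate"); return 1; }

    /* Map it. On The Machine this is a direct map of fabric-attached memory,
     * so loads and stores go straight to the shared memory -- no I/O path
     * and no DAX hacks underneath.                                           */
    char *p = mmap(NULL, BOOK_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    strcpy(p, "hello, fabric-attached memory");

    munmap(p, BOOK_SIZE);
    close(fd);
    return 0;
}
```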
This was way more visible in a presentation I did earlier with a better projector, and I apologize for this projector's shortcomings, but that's the development methodology we used: we put together a bunch of short vignettes of how the file system might be used and went through the process of developing the data flows to make sure it actually worked.

So now that we've constructed this file system that spans all of the nodes, we have these other libraries emerging. What does an application see in The Machine? How does an application work? Well, we have a bunch of different libraries: the libpmem stuff that Intel is working on, an atomics library that I'm going to talk about in a minute, and the regular POSIX APIs, so an application can open a file in the librarian file system and mutate bytes, and it's all persistent — all stores are globally visible across the entire fabric. We have a new data-sharing library called RVMA, and of course we have a new little library for doing the manipulations that are peculiar to the librarian file system: it's a persistent shared file system, but it's not quite a POSIX file system, so there are a couple of operations we need to expose to applications, and we're doing that with a new library, because that's how we do things.

Of course, we have the physical address mapping problem — our tiny address space in the SoC. It's only 48 bits of virtual address space and, at most, 48 bits of physical address space, and the SoC caches are physically tagged. The problem is that our fabric address space is larger than the physical address space of the SoC, which means that whenever I want to change the physical address mapping to talk to something new — with that physical-address-to-LZA mapper — I have to tell the processor, "oh, by the way, your physical address space has changed." What does that mean from the processor's perspective? Well, the processor has no idea how to deal with this. You go up to a processor vendor and say, "hey, I want to be able to change the physical addresses underneath the processor," and the vendor says, "ah, you're going to have to flush the cache." So you ask, "OK, how do I flush the cache?" On the ARM64 architecture it turns out that the way you flush the cache is to flush the cache in every single core in the processor — and when your processor has dozens of cores, that takes a while. Then you ask, "OK, can I do this in parallel?" And the vendor says, "no, I'm sorry, you can't, because the ARM architecture allows the cores to transfer cache lines between themselves laterally without flushing them in the meantime." It's a performance improvement, right — that's what processor architects are all about. So as one core flushes its cache, a cache line from another core may get migrated over after the flush has already passed that part of the cache, and when the other core flushes its cache, that line won't be there anymore because it got migrated — so that cache line never gets flushed. So I actually get to bring the entire processor to a halt and carefully serialize the flushing of every single core. We're desperate to avoid that — it takes a long time. Eventually somebody's going to build us a processor with enough physical bits and we'll be able to get rid of this.
So for the initial prototype there's kind of going to be this wall: performance is going to be amazing until you run out of physical address space, and then all of a sudden there's going to be a cliff. It's like running out of physical memory, except now we're running out of physical address space. It may be that some applications aren't as fast on this hardware as we'd like, but you build the system you can, and I think it's still going to be an interesting result — and we're pretty close to the amount of physical address space that we need; 48 bits is not a long way from 320 terabytes.

OK, so why do we need atomics? I told you that the caches are not coherent between the nodes of this machine. Within a single processor, of course, all the memory within it — and its visibility into our shared memory — is coherent, but outside of that there is no coherence. So we actually put atomic operations into the fabric itself: there's an atomic swap, an atomic add, an atomic test-and-set — fairly straightforward atomic operations — and those live in the memory fabric itself. They have to be exposed up to user space, with a library, through a device driver, which is kind of painful, so we built a little library that exposes all of this. We added some more operations, too — the awesome part about using an FPGA for your memory controller is that you can say, "oh, you know, I'd really like to have this other operation in there," and the FPGA can just get reprogrammed, so we may change what is done in hardware and what is done in software over time. This makes it look like you have atomic operations at the CPU level, but unfortunately we can't use the CPU instructions, so we have a library. Here's kind of what the library looks like: you register a pile of memory and say, "I want to be able to do atomics in this space," and then the library does the appropriate system calls to talk to the device underneath.
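Here's roughly what using that atomics library might look like. The names are hypothetical — my own sketch of the model just described (register a region, then let the library push each operation down to the fabric), not the actual API — and the mock bodies just use a local CPU atomic so the sketch is self-contained; on the real hardware the library would instead hand the operation to the fabric through the device driver.

```c
#include <stdint.h>
#include <stdio.h>

/* --- hypothetical library interface -------------------------------------- */

/* Tell the library that this range of mapped fabric-attached memory will be
 * used for fabric atomics (so it can set up the device mappings).            */
static int fam_atomic_register(void *addr, uint64_t len)
{
    (void)addr; (void)len;
    return 0;                  /* mock: nothing to do away from the real hardware */
}

/* Fetch-and-add performed *in the fabric*, so it is atomic with respect to
 * every SoC in the machine even though their caches are not coherent.
 * Mocked here with a CPU atomic purely so the example runs anywhere.         */
static int64_t fam_atomic_64_fetch_add(int64_t *addr, int64_t value)
{
    return __atomic_fetch_add(addr, value, __ATOMIC_SEQ_CST);
}

/* --- usage ---------------------------------------------------------------- */

int main(void)
{
    static int64_t counter;    /* imagine this lives in a mapped shelf */

    fam_atomic_register(&counter, sizeof counter);
    int64_t old = fam_atomic_64_fetch_add(&counter, 1);
    printf("counter was %lld, now %lld\n", (long long)old, (long long)counter);
    return 0;
}
```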
We also need to do cache management, because we're not cache coherent: we'd like to be able to flush things out of the local SoC and onto the fabric so that we can share data with other, non-cache-coherent SoCs. For that we're using Intel's libpmem. libpmem has the notion of persisting memory — which is to say, getting it out of the processor and into the fabric — but it doesn't have the notion of invalidating your memory — which is to say, "whatever you've got in your local caches is not valid anymore." So we've added some simple extensions to libpmem to say that an address range needs to be invalidated if you're communicating with another SoC. We have this invalidate addition to the library, and I'm sure Intel will love to take that into their library sometime. The other thing we've done with libpmem, of course, is port it to ARM64, which was kind of an adventure.

The other problem we have is that there's a lot of memory — 320 terabytes in a system — and it turns out you're going to get errors in that. Who would have thought? The memory controller we're putting into these things has an amazing amount of ECC, but even so we expect memory errors on occasion. Read errors are fairly easy to manage from an application perspective: you try to read from memory, it says, "I'm sorry, I didn't remember what you told me to remember," you get a synchronous error back from the memory controller, and you can just kill the process with a SIGBUS or whatever you want to do — that's pretty easy. Write errors are harder: remember, the caches can hold data for days, so you have no idea when your write will actually get out to the fabric and hit the memory that's broken, but applications need to know about the failure. So what we're doing is using the libpmem pmem_drain call as a barrier — that's the function Intel puts their x86 PCOMMIT instruction into — and we're adding a bit more logic in our architecture so that when you call it, we capture all the write errors that may have happened in memory and send your application a SIGBUS.

One of the useful architectural distinctions in The Machine is that there are 256 gigabytes of local memory to run the operating system out of, and the 320 terabytes of shared memory is something only applications use — we're not putting any kernel data structures there. So when you get a memory error on the fabric, we don't have to worry about the kernel being corrupted; all we have to worry about is user space being corrupted, so we can kill individual applications and processes instead of taking down the entire machine. The goal is to have a smaller, kind of field-replaceable unit of software, so that when a memory error occurs, that software can be restarted without restarting the entire machine — that's the goal of these changes.

The other thing, of course, is that because we have a fabric, memory is located all over the place. You want to be able to say, "I want some of my memory stored at the bottom of the rack and other stuff stored at the top of the rack, so if this part of the rack dies I can recover my duplicate data from the top of the rack." So we have a way of telling the system where to put each allocation, using some extended attributes in the librarian file system — it's a pretty straightforward plan.
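Pulling the cache-management pieces above together, here's a minimal sketch of the producer/consumer pattern between two non-coherent SoCs sharing a mapped shelf. pmem_persist() is the standard libpmem call; pmem_invalidate() stands in for the invalidation extension described above — the name is my guess, not necessarily what the real extension is called.

```c
#include <libpmem.h>
#include <string.h>

#define MSG_LEN 64

/* Hypothetical extension: drop any locally cached copies of [addr, addr+len)
 * so the next load fetches current data from fabric-attached memory.          */
void pmem_invalidate(const void *addr, size_t len);

/* Runs on SoC A: write a message and push it out of the local caches onto the
 * fabric so every other node can see it.                                       */
void produce(char *shelf)
{
    strncpy(shelf, "message for the other SoC", MSG_LEN);
    pmem_persist(shelf, MSG_LEN);
}

/* Runs on SoC B: this SoC may hold stale cache lines for the region, so
 * invalidate them before reading what the producer wrote.                      */
const char *consume(char *shelf)
{
    pmem_invalidate(shelf, MSG_LEN);
    return shelf;
}
```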
And here's the remote virtual memory access library I talked about. We have all these APIs for data sharing across the network, and that data sharing is horribly inefficient when you have something as fast as The Machine. So we're constructing an API underneath all of them, called RVMA, that lets you use a network for prototyping an application in a regular cluster, but when you move it over to The Machine, we take advantage of the higher-performance memory interconnect to move that memory much more efficiently. We're providing an abstraction that lets you develop on your existing hardware environment and then get better performance when you move onto The Machine.

Here's our next-generation software architecture for the system software for The Machine. We're adding a bunch more stuff — this is the material I finished specifying just last Friday, and we're busy working on actually implementing bits of it now. One of the things we discovered is that the machine we're building right now doesn't have persistent memory, because I can't afford that many NVDIMMs, so we're just going to put DRAM in it. It turns out that 320 terabytes of DRAM fits in a relatively small space, but it also turns out that when you turn the power off, it forgets. Thanks. And we do need to be able to turn the power off — to move the machine, or change its power configuration — so we need to be able to take all the data in the machine and store it somewhere. Well, 320 terabytes of data takes a long time to pump over a single network connection, so I spent a bunch of time and developed a kind of complicated, crazy architecture for backing up all this data.

I actually put a picture of the hardware in here — I think I put these slides in the wrong order, sorry about that. Here's the hardware we've built: we have these external storage servers, we have this massively high-performance network switch, and then we have the nodes of The Machine. Each of the nodes has a network connection to the switch, and the switch has connections to each of the storage servers, and the plan is to take each of those SoCs and pump their data over the network to the storage servers. It turns out to be a huge amount of work. Here's a picture of the architecture for that: the librarian tells everybody where all the memory is located, and each node of The Machine busily takes its little slice of the data and pumps it over the network to the storage cluster. We don't know if this architecture is going to be useful going forward — we know we need it today — but it's kind of interesting: we're building a bunch of infrastructure just to satisfy the requirement that we have DRAM in the machine today, and it may be useful later for data ingest, who knows. So that's a bunch of stuff that we've added.

I want to talk about one final thing. The Machine doesn't exist today — we don't have any hardware that does this — but we'd really like to be able to do all of our software development now, so we built a couple of different simulation environments, one of which is free software and one of which is an internal architectural simulator for low-level hardware work. I wanted to give you a preview of this. We're doing an emulation — we call it the fabric-attached memory emulator — and it provides a synthetic machine for doing current development. Then we have a simulator that does register-level work, and then of course we'll have the actual machine.
The goal is to be able to take your software, develop it on the emulator, and then eventually run it on The Machine, because we want to do hardware/software co-development: while we're developing the hardware, we're getting the software ready. So here's the fabric-attached memory emulator. It runs on a regular computer — you can run it on, what's this DragonHawk called, Superdome X, thank you, my brain is not working today. We have this piece of hardware with 16 processors and 24 terabytes of memory, which is a pretty good simulation machine: it's a single system image, and it has cache coherence across all of it, but at least it's a reasonable amount of memory and a reasonable number of processors. What I do to simulate The Machine is take each one of those cores and a little piece of memory and create a little pretend computing node out of it using virtualization. I stack up a bunch of these virtual nodes, then I allocate a big pile of memory out of the underlying hypervisor — the Linux kernel running KVM — and share it among all those nodes. So now those nodes have shared access to this big pool of memory, which looks a lot like The Machine, except it's virtualized, and I can run all my simulation work in it and provide a kind of synthetic environment for The Machine. That's how we're doing our development. The fabric-attached memory emulator — actually a little piece of it that shows how this gets set up — is available on GitHub right now; there's a QR code with the link, and I think the actual link is visible on the slide, but just barely. That's a picture of how it works: we have a bunch of QEMU instances running our version of Linux for The Machine, and they have the shared memory backed by the host's file system, so you actually get persistence, which is kind of cool. All those instances of Linux are able to share that memory, and it looks like our underlying memory fabric. We're actually doing a bunch of interesting research on this hardware and discovering that there are a bunch of algorithms that run faster on it than on a cluster, so we're taking a bunch of the technologies we're building for The Machine and productizing them on the Superdome X hardware directly, which is pretty cool — kind of early access to some of the research.

Here's what we're working on in free software: the persistent memory library stuff, direct access, concurrent distributed file systems — we'd love to have those — and figuring out how to get non-cache-coherent systems working again; that's a challenge we haven't dealt with for a long time. Another big piece of the puzzle is RAS in large-memory systems: you've got a huge amount of memory, and all the RAS work we've done for persistence in the past has been based on block storage, whereas now I have byte-addressable storage, and RAS in a byte-addressable environment is very different from RAS in a block-storage environment, so we're working on that as well.

I know I'm running out of time, and I want to thank you all for participating this morning and for coming to LCA. As always, I love coming down to the southern hemisphere in the dead of winter, and even though the weather doesn't seem awesome to you, let me tell you, it seems awesome to us. Thank you very much for your time and attention.
Info
Channel: Linux.conf.au 2016 -- Geelong, Australia
Views: 25,820
Keywords: lca, lca2016, KeithPackard
Id: S--Kgseuy0Q
Length: 45min 28sec (2728 seconds)
Published: Fri Feb 05 2016