All Rust string types explained

Captions
Why is it that in C, strings are simply an array of characters, but in Rust, strings are represented by all these different types? Many of us beginners see this as unnecessary complexity, but the Rust team was very intentional about how strings are designed in Rust, and it has everything to do with safety, efficiency, and flexibility. In this video, I'll explain all the different string types in Rust and how each of them plays a significant role in building blazingly fast Rust applications. Towards the end, I'll also cover a couple of rare, specialized string types you can use to squeeze out maximum performance.

Now, to truly appreciate how Rust handles strings, it's critical to first understand what strings fundamentally are. In programming, data is fundamentally represented as binary: ones and zeros. For a program to convert this binary data into a human-readable string, it needs two key pieces of information: the character encoding being used and a way to determine the length of the string.

Let's talk about character encoding first. Binary data is typically processed as a sequence of bytes, each byte containing 8 bits, and a byte can be represented as an integer using the binary number system. Now, a character encoding is simply a standard for mapping bytes to characters. There are two very common encodings that you need to know when dealing with strings: ASCII and UTF-8.

ASCII stands for American Standard Code for Information Interchange. It's a very simple encoding where each character is represented by one byte. Using ASCII, we can map the first byte in this example to a capital H and the second byte to a lowercase e, and if we continue on, we can encode the entire "hello world" string in ASCII. This works well for simple English strings. However, ASCII is very limited: a single byte can only represent 256 distinct values, and ASCII itself uses just 128 of them. As a result, ASCII only supports the English alphabet, symbols, and control characters. It doesn't support other languages or complex characters like emojis. That's where UTF-8 comes in.
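The byte-to-character mapping described above can be sketched in a few lines of Rust. This is only an illustration: the helper name decode_ascii is hypothetical and not something shown in the video, and the byte values are the standard ASCII codes for the letters in "Hello".

```rust
/// Decode a byte slice as ASCII.
/// Returns None if any byte falls outside the ASCII range 0..=127.
/// (Hypothetical helper, written for illustration only.)
fn decode_ascii(bytes: &[u8]) -> Option<String> {
    if bytes.is_ascii() {
        // Every ASCII byte maps directly to one character.
        Some(bytes.iter().map(|&b| char::from(b)).collect())
    } else {
        None
    }
}

fn main() {
    // 72 = 'H', 101 = 'e', 108 = 'l', 111 = 'o' in ASCII.
    let bytes = [72u8, 101, 108, 108, 111];
    println!("{:?}", decode_ascii(&bytes));

    // Going the other way: view a string literal as its raw bytes.
    println!("{:?}", "hello world".as_bytes());

    // A byte above 127 is not ASCII, so decoding is refused.
    println!("{:?}", decode_ascii(&[0xFF]));
}
```

Returning an Option rather than silently converting makes the limitation of ASCII visible: anything beyond the 128 defined values simply has no mapping.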
UTF-8 is a variable-width encoding where characters are anywhere from one to four bytes. This means that it can encode over a million characters, including every language in the world and complex characters like emojis. It's also completely backwards compatible with ASCII. All these great features are why UTF-8 is the most widely adopted character encoding in the world.

Now that we know how to convert bytes into characters, let's discuss the other important piece of information programs need to represent a string: its length. A string is a sequence of bytes that lives within a larger block of memory. When you create a string variable in your program, it points to the first byte of the string, but how does the program determine where the string ends? There are two main approaches.

The first approach is to use a termination character, commonly a null byte, to mark the end of a string. This approach is simple and saves memory, but it has a runtime cost for certain operations. For example, to get the length of the string, you have to traverse the string byte by byte until you get to the termination character.

The second approach involves storing the string's length along with a pointer to the first byte of the string in a higher-level data structure. The benefit is that some runtime operations will be faster; for example, retrieving the string's length can be done in constant time. However, this approach does use some additional memory.

With this fundamental understanding of strings, we can now explore why strings in C are really simple but also really dangerous, and then talk about how Rust addresses these issues. In C, strings are simply represented as an array of characters or a pointer pointing to the first character in the string. A null terminator is automatically added by the compiler to mark the end of a string literal, and C does not enforce any particular encoding. This simplicity comes at a price: developers are responsible for making sure strings are valid and handled properly, which oftentimes
leads to disaster. For example, let's say your program took in some user input and you expected that input to be valid UTF-8. If you forget to validate this input, or your validation isn't done properly, it could lead to data corruption or, worse, security vulnerabilities.

Let's take a look at another example. Say you have a string and you want to copy that string, so you create a buffer with 16 characters, which is the number of characters your string has, and then use the strcpy function to populate the buffer. It turns out this code causes a buffer overflow, which can lead to data corruption, undefined behavior, security vulnerabilities, or system crashes, and that's because we forgot to account for the null byte the compiler automatically inserts at the end of our string. Now, we can avoid this mistake by using sizeof when creating the new buffer, but the point is that it's really easy to make catastrophic mistakes in C.

Now that we've seen how dangerous strings in C can be, let's turn our attention to Rust, a language that reimagines string handling with safety, efficiency, and flexibility in mind. Rust leverages its powerful type system to ensure string safety in three key ways. Firstly, string types in Rust store the string length as metadata instead of using a null terminator. This leads to more efficient runtime operations and prevents vulnerabilities like buffer overflows. Secondly, strings in Rust are guaranteed to be valid UTF-8. This ensures that strings are compatible across languages and systems while also preventing issues like data corruption, and it makes it easier to work with strings because developers don't need to think about the encoding. Thirdly, strings in Rust, or more generally variables in Rust, are immutable by default. This helps prevent issues where the contents of a string are changed unexpectedly.

Now, there are many ways to represent a string in Rust, and we'll cover all of them in this video, but at the core, Rust has two primary string types: strings and
string slices. Understanding these two types and their differences is critical because they cover 90% of use cases and you'll be working with them all the time. If you take away one thing from this video, it should be a deeper understanding of these two types, so let's go over the technical details of each type and their use cases.

The String type in Rust is a heap-allocated, growable, UTF-8 encoded string. This is called an owned type because it owns the underlying data and is responsible for cleaning it up: when the String variable goes out of scope, the underlying data is automatically deallocated. This type consists of a pointer to the string data on the heap, the string's length, and its capacity. This design allows it to be efficient for string manipulation.

A string slice, on the other hand, is a view into a string. It represents a contiguous sequence of UTF-8 encoded bytes, making it efficient for read-only operations. This is called a borrowed type because it doesn't own the underlying data; it simply has access to it. Unlike the String type, string slices in most cases do not own their data. They are essentially a reference to a segment of a string or another string slice, holding only a pointer to the start of the slice and the length of the slice. Unlike the String type, string slices don't contain capacity information because they're not growable. Another difference is that while the String type's data is always allocated on the heap, string slices can reference data on the heap, in the data section of the compiled binary (which is the case for string literals), or on the stack, which is rare but possible.

These two types have distinct use cases. The String type is useful when you want to create or modify string data dynamically at runtime, for example, when reading and altering file content or collecting user input. String slices, on the other hand, are useful when you want to read or analyze pre-existing string data without making changes to it, for example, parsing command line
arguments or searching for a substring within a larger string.

So far, we've discussed how Rust ensures string safety by storing length metadata instead of using a termination character, enforcing UTF-8 encoding, and making strings immutable by default. We also covered the two main string types in Rust: strings and string slices. Now it's time to buckle up, because we're going to cover all the other string types in Rust, which enable efficiency and flexibility. Being aware of these types is important so that you're not caught off guard when you come across them, and truly understanding these types will allow you to utilize the full power of Rust.

First, let's talk about the different variations of string slices. Here we have a string literal, whose type is a string slice, &str. This is actually syntactic sugar for &'static str, a reference with a static lifetime. A static lifetime indicates that the data being pointed to is guaranteed to be available for the entire duration of the program's execution. This makes sense for string literals because they're stored in the compiled binary. Now, most of the time you don't have to explicitly write out the static lifetime, but there are cases where you do have to write it out, for example, when storing string slices in structs or enums (in this example, the parse error variant stores a string literal) or when returning a string slice from a function that has no other borrowed parameters.

Now, you may have noticed that the string slice type is made up of two parts: the reference operator and the str type. The str type represents a dynamically sized sequence of UTF-8 encoded bytes. In other words, str describes a string slice, but we can't use it directly as a standalone type because its size is not known at compile time. Instead, we have to use str behind some type of pointer, like a reference. This is by far the most common string slice type you'll see, but it's not the only one. Let's explore three other pointer types we can use with the str type for
specialized cases. Instead of using a reference, we can wrap the str type in a Box smart pointer. This type represents an owned, non-growable, heap-allocated string slice. It's useful when you want to freeze a string to prevent further modifications or save memory by dropping the extra capacity information the String type stores. In this example, we create a String and then turn it into a Box<str> to indicate that we want to keep the string as is, without further modifications. This saves a small amount of memory by dropping the capacity information the String type stores. In real-world code, you might use the Box<str> type when you're working with APIs that need to return an owned string that will not be modified further, or when you want to aggressively optimize for memory usage and you know the string will not change.

You can also use the str type with the reference-counting smart pointer, Rc, which is useful when you want to share ownership of an immutable string slice across multiple parts of your program without cloning the actual string data. For example, let's say we have a large string representing some text, and multiple parts of our program want to hold references to a particular section of the text. To avoid copying that subsection, we can use an Rc<str> type. The actual string data is only stored once in memory, regardless of how many Rc instances we create. This could be beneficial when you're dealing with really large strings that would be expensive to clone.

The counterpart to the Rc smart pointer is the Arc smart pointer, which stands for atomically reference counted. Unlike Rc, Arc is thread safe. Wrapping the str type in this smart pointer is useful when you have an immutable string slice that you want to share across multiple threads without having to clone the string data. In this example, we create a regular string slice and wrap it with the Arc smart pointer. Then we can spawn three threads, which can all read the slice without having to
clone the string data itself.

Now that we have a better understanding of string slices, let's dive a little deeper into the String type. The String type is essentially a wrapper around a vector of bytes, the difference being that those bytes are guaranteed to be valid UTF-8. This allows the String type to provide methods that make it convenient to work with Unicode text, and it also enables safe manipulation of string data. However, representing a string as a vector of bytes or a slice of bytes can be useful when dealing with binary data, constructing strings byte by byte, or dealing with strings that use an encoding other than UTF-8. In this example, we're calling the read Latin-1 string function to simulate reading a string with the Latin-1 encoding from some external source, like a binary file or network packet. This function returns a vector of bytes. Then we call the Latin-1 to string function and pass in the vector of bytes as a slice of bytes. This function converts the Latin-1 encoded string to a UTF-8 encoded string. As you can see, binary string representations are useful when dealing with non-UTF-8 encoded strings.

Now let's switch course and talk about something that you'll likely see pretty often in Rust code, and that's different string literal representations, specifically raw string literals and byte strings. Here's a string literal in Rust. If we wanted to include special characters like double quotes or backslashes within the string, we would need to escape them with backslashes. This becomes tedious in certain cases, like writing regular expressions or defining JSON objects as string literals. In these cases, we can use a raw string literal by prefixing the string with a lowercase r and adding a hash symbol on either side of the string. Raw string literals allow you to write special characters like backslashes and quotes without needing to escape them. Here we can see a raw string literal being used to create a regular expression pattern.

Byte strings, on the other
hand, are created by prefixing a string literal with a lowercase b. This creates a slice of bytes, which is useful for dealing with network protocols that expect a byte sequence, like the HTTP protocol. You can also combine raw string literals with byte strings. In this example, we define a raw byte string containing the PNG file format signature, which we can use to identify PNG files.

We just covered string literals, which are straightforward and common in Rust, but what if you have more specialized needs or constraints? What if you want to squeeze out every ounce of performance? That's where some of Rust's lesser-known string types come into play. Let's dive into these hidden gems and see how you could take advantage of them.

String slices are most often represented like this: an immutable reference to a sequence of UTF-8 encoded bytes. However, it is possible to create a mutable reference. This allows you to directly modify the contents of a string slice while ensuring memory safety and UTF-8 compliance. Although rare, this is useful for in-place string transformations without needing to allocate new memory for a separate string. In this example, we have a function called anonymize emails, which takes a mutable string slice as input, uses a regular expression to find email addresses within the string, and then replaces them with asterisks. Note that we're using some unsafe code here, and that's because we're calling as_bytes_mut, which returns a mutable byte slice. It's our responsibility to ensure that those bytes are valid UTF-8 even after being modified. Mutable slices are generally avoided in idiomatic Rust code due to the complexities and potential pitfalls around ensuring that the data remains valid UTF-8. However, you may see this type used in low-level libraries or in code that needs to be aggressively optimized.

Another specialized string type you might come across is the Cow enum. Cow stands for copy on write. This type is useful when you have a function that sometimes modifies a
string and other times doesn't, and you want to avoid making a new allocation in cases where no modification is necessary. For example, let's say you have a function that takes a string and returns a sanitized version of it. If the input string doesn't contain any blacklisted words, then you can return it directly without allocating a new string. In this case, we're returning the Cow::Borrowed variant, which is essentially a zero-cost operation. Otherwise, we create a new sanitized string and return that; in this case, it's the Cow::Owned variant.

So far, we've discussed how Rust utilizes its robust type system to ensure string safety. We've also explored the two main string types in Rust and examined a variety of other string types designed for efficiency and flexibility. Now let's talk about a special group of string types that deal with interoperability. These types abstract away differences between operating systems and help connect your Rust code with other languages.

For example, the OsString and OsStr types in Rust are useful for handling strings in a way that is compatible with operating systems. Unlike strings and string slices, which are guaranteed to be UTF-8 encoded, OsString and OsStr can contain any sequence of bytes on Unix-like systems or any sequence of 16-bit values on Windows. This is useful when interacting with system calls that don't require strings to be UTF-8 encoded. For example, let's consider file operations. When reading the names of all files in a directory, you can't guarantee that every file name will be valid UTF-8. Using OsString, you can read these names even if they contain invalid UTF-8 sequences. This allows you to handle non-UTF-8 file names gracefully: if the conversion to a regular Rust UTF-8 encoded string fails, we can still handle the OsString value as needed. This is especially important for writing cross-platform code, as different operating systems have different requirements and conventions for strings.

Path and PathBuf are specialized strings in
Rust for dealing with file system paths. A Path is an immutable view of a path, similar to a string slice. It's used for reading or inspecting paths. A PathBuf is a mutable and owned version of a path, similar to the String type. It's used when you want to create or modify paths. These types are useful for interoperability because operating systems handle file paths differently. In this example, we use Path and PathBuf to read the contents of a file. First, we use the Path type to reference a directory, and then we pass it into the read file function, which uses the PathBuf type to construct the full path to the file within the directory. We then pass this full path to the File::open function.

Lastly, we have the CStr and CString types, which are useful when you're interfacing Rust code with C libraries that expect null-terminated strings. These types provide a safe way to handle C-compatible strings. For example, let's say we wanted to call the getenv function from the C standard library, which fetches the value of an environment variable. This function accepts a C string as input and returns a C string as output. To call it, we first create a null-terminated CString containing the environment variable name, PATH. Then we call the getenv C function using a pointer to the CString. This function returns a pointer to a null-terminated array of characters. We take this pointer and convert it to a CStr instance, and finally we convert the CStr to a regular Rust string slice, ensuring that the data is valid UTF-8. By using CString and CStr, you can safely pass string data back and forth between Rust and C functions, ensuring that the null terminator expectations of C are upheld.

We've covered a lot in this video, so let's do a quick summary. Rust ensures string safety in three key ways. Firstly, Rust string types do not use a null terminator; instead, the string's length is stored in the type. Secondly, strings in Rust are guaranteed to be valid UTF-8. And thirdly, strings
in Rust are immutable by default.

Rust has two main string types. String is a heap-allocated, growable, UTF-8 encoded string. It's an owned type, meaning that it's responsible for cleaning up the underlying string data, which is done automatically when the String variable goes out of scope. The String type is used to create or modify strings at runtime. Its counterpart is the string slice. A string slice is a view into a string or part of a string, which could be allocated on the heap, on the stack, or in the compiled binary. String slices are represented as a reference to the str type. The str type represents a sequence of UTF-8 encoded bytes of dynamic length. Because the size of str cannot be known at compile time, it must be used behind some type of pointer, which in this case is a reference. This is a borrowed type because it doesn't own the underlying string data; it simply has access to it. String slices are used to read and analyze strings.

String literals in Rust are string slices with a static lifetime. In most cases, you don't have to explicitly write out the static lifetime because the compiler will automatically infer it, but there are cases when you do need to specify it, for example, in struct or enum definitions.

99% of the time, the str type will be behind a reference, but you could also wrap it in other pointer types. For example, a Box<str> represents an owned, non-growable, heap-allocated string slice. This type is used to freeze a string to prevent further modifications or to save memory by dropping the extra capacity information the String type stores. Using the Rc smart pointer allows you to share ownership of an immutable string slice across multiple parts of your program without cloning the actual string data. And using the Arc smart pointer allows you to have an immutable string slice that you can share across multiple threads without having to clone the string data.

Strings in Rust can be represented as a vector of bytes or a slice of bytes, which is useful for non-UTF-8 encoded strings.
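The core types recapped above can be put side by side in a short Rust sketch. This is not the code shown in the video; the function names first_word, freeze, and total_len_across_threads are hypothetical and chosen only for illustration.

```rust
use std::rc::Rc;
use std::sync::Arc;
use std::thread;

/// Borrowing: returns a slice of the input without allocating.
fn first_word(text: &str) -> &str {
    text.split_whitespace().next().unwrap_or("")
}

/// Freezing: converts a growable String into a fixed-size Box<str>.
fn freeze(s: String) -> Box<str> {
    s.into_boxed_str()
}

/// Sharing across threads: every thread reads the same Arc<str>,
/// and the lengths they observe are summed.
fn total_len_across_threads(text: &str, threads: usize) -> usize {
    let shared: Arc<str> = Arc::from(text);
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            // Cloning the Arc bumps a counter; the text is not copied.
            let local = Arc::clone(&shared);
            thread::spawn(move || local.len())
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    // Owned, growable String: pointer + length + capacity.
    let mut owned = String::from("hello");
    owned.push_str(" world");

    // Borrowed view into the String: pointer + length only.
    let word: &str = first_word(&owned);
    println!("{word}");

    // A string literal is a &'static str stored in the binary.
    let literal: &'static str = "stay rusty";

    // Box<str>: owned but no longer growable.
    let frozen: Box<str> = freeze(owned);

    // Rc<str>: shared ownership within a single thread.
    let shared: Rc<str> = Rc::from(literal);
    let also_shared = Rc::clone(&shared);
    println!("{frozen} / {also_shared} / owners: {}", Rc::strong_count(&shared));

    println!("{}", total_len_across_threads(literal, 3));
}
```

Note that the borrow in first_word has to end before the String is moved into freeze; the compiler enforces this ordering, which is the ownership rule from the summary in action.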
String literals have a few special formats. Raw string literals allow you to include special characters like double quotes within a string without needing to escape them. This is useful when writing regular expressions or defining a JSON object as a string literal. Byte strings allow you to represent a string literal as a slice of bytes, which is useful when dealing with network protocols that expect byte sequences, like the HTTP protocol, and you can combine raw string literals and byte strings to create raw byte strings.

Rust also has a couple of specialized string types. Mutable string slices allow you to directly modify the contents of a string slice. This is useful for in-place string transformations without allocating new memory for a separate string. Another specialized string type you may come across is the Cow enum, which stands for copy on write. This type is useful when you have a function that sometimes needs to modify a string and you want to avoid making a new allocation in cases where no modification is needed.

Lastly, we have strings that facilitate interoperability. OsString and OsStr are useful for handling strings in a way that is compatible with the operating system. These types are used to interact with system calls that don't require strings to be UTF-8 encoded. Path and PathBuf are used to handle file system paths in an OS-agnostic way. Finally, CStr and CString are useful when you're interfacing Rust code with C libraries that expect null-terminated strings.

If you want to see more Rust content like this, make sure to hit the subscribe button. Hope you've enjoyed the video, and remember to stay Rusty.
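As a companion to the summary of literal formats, Cow, and the interoperability types, here is a minimal Rust sketch. It is an illustration under stated assumptions rather than the code from the video: the functions sanitize, full_path, and c_round_trip, the blocked word, and the directory and file names are all hypothetical.

```rust
use std::borrow::Cow;
use std::ffi::{CStr, CString, OsStr};
use std::path::{Path, PathBuf};

/// Hypothetical sanitizer: allocates only when the input needs changing.
fn sanitize(input: &str) -> Cow<'_, str> {
    const BLOCKED: &str = "darn"; // illustrative word, not from the video
    if input.contains(BLOCKED) {
        Cow::Owned(input.replace(BLOCKED, "****"))
    } else {
        Cow::Borrowed(input)
    }
}

/// Joins a directory and a file name using the platform's separator.
fn full_path(dir: &Path, file_name: &str) -> PathBuf {
    dir.join(file_name)
}

/// Round-trips a Rust string through a null-terminated C string.
/// Returns None if the text contains an interior null byte.
fn c_round_trip(text: &str) -> Option<String> {
    let owned = CString::new(text).ok()?;
    let borrowed: &CStr = owned.as_c_str();
    borrowed.to_str().ok().map(str::to_owned)
}

fn main() {
    // Raw string literal: backslashes and quotes need no escaping.
    let pattern = r#"\d+ "quoted""#;
    // Byte string literal: raw bytes, as a protocol line might need.
    let request: &[u8] = b"GET / HTTP/1.1\r\n";
    println!("{pattern} ({} request bytes)", request.len());

    println!("{}", sanitize("nothing to change")); // borrowed
    println!("{}", sanitize("well darn")); // owned

    let path = full_path(Path::new("some_dir"), "notes.txt");
    if let Some(name) = path.file_name() {
        // File names are OsStr and may not be valid UTF-8.
        let name: &OsStr = name;
        println!("{}", name.to_string_lossy());
    }

    println!("{:?}", c_round_trip("PATH"));
}
```

The sketch stays in safe Rust on purpose; calling a real C function such as getenv would additionally require an extern declaration and an unsafe block.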
Info
Channel: Let's Get Rusty
Views: 153,832
Id: CpvzeyzgQdw
Length: 22min 13sec (1333 seconds)
Published: Sun Sep 17 2023