Types of PDF - Computerphile

One of the phrases that's become common parlance, if you like, since the launch of PDF in the early 1990s is, "It's okay, I'll just send you a PDF." And a lot of people don't realize what a broad church that is. It covers a multitude of sins. Yes, you may get an ultra-high-quality thing that could be printed in Vogue magazine, in 23 different colors, on glossy paper. Or somebody may just say, "Oh, all you want is a copy of this of some sort? I'll just stick it in my scanner and send you what amounts to a JPEG scan of the page. But since you've asked for PDF, I'll use the fact that PDF seems to be able to do it." But it doesn't upgrade it; it just sends you a bitmap scan, only in PDF format. What should we be saying, in addition to "I'll just send you a PDF", to reassure people that the quality will be there somehow, if you want that quality to be part and parcel of your story?

I'm going to illustrate this in the end by just showing you an original bitmap PDF of Dennis Ritchie's thesis. When you look at it, the scan quality is very variable, and you think, "Yeah, typical JPEG." But it isn't; it's PDF. So it's a broad church: you can have many sorts of PDF. So let's just run through three very important types of PDF.

If you're a purist, you may be thinking, "Well, a PDF, of course, has presumably been done on a professional piece of software like, you know, Adobe InDesign, so it'll be sharp and perfect and everything." And if you have high expectations, and you want your PDFs to be infinitely magnifiable, up to barn-door size, and have beautifully rendered colors, well, then this is your aim in life. I think Adobe used to call it PDF Normal. My ex-grad student Matt Hardy, still there with Adobe (hello, Matt), said, "Dave, people don't understand what PDF Normal means." So I said, "Well, what do you call it, then?" He said, "Oh, well, we use something like PDF Full Text and Graphics." In other words, you've tried to choose the features in PDF that enable you to render that sort of material at the best possible resolution. If it's FTG, you've got
to use proper fonts for characters. If you draw line diagrams, you've got to use something like Adobe Illustrator to do them. You must draw a line as a line, not as a set of dots. You must draw a circle as a circular curve, not a set of dots. And you resolve down to, if you like, color JPEG only for things that really are like that: photos, scenery, people, whatever. That's absolutely fine.

But if people are in a hurry, or don't have all of this sophistication, the other end of the spectrum is that you can basically use a scanner to scan in a copy of your high-quality document rather than regenerating it from source. You might say, "Oh, I can't be bothered, I'll just scan it." Now, if you just scan it in, that's fine. It could be a good-quality scan or a bad-quality scan. And just like with JPEG, TIFF, PNG, all of these bitmap standards you may have heard of and used for your photography, PDF has its own bitmap graphic capability that subsumes all of those. So you could send somebody a PDF which basically is little better than a JPEG. One of the things I'm finding in fielding queries is that somehow, and it's great, PDF is a symbol of quality. You know: "I asked people to send me a PDF and all I got was this lousy bitmap", something like that. And the answer is yes, PDF will cheerfully render lousy stuff if you give it that. That's PDF I, for PDF Image Only.

You might think, "Well, those are the two extremes, then: either everything is thoroughly retyped using appropriate tools or, at the opposite extreme, everything's a photo, everything's a JPEG, everything's a bitmap; it's just that we're calling it a PDF bitmap." And somebody at Adobe said, "You know, could optical character recognition, OCR, be of any help to us?" In the late 80s or early 90s, Adobe bought a Boston-based graphics and OCR company whose name I've now forgotten, but it was an excellent purchase, because basically they said: if you have got high-quality bitmap scans and you don't know what made them, we will have this wonderful thing called Acrobat Capture, which will enable you
interactively to rebuild, to a very high standard, documents that were done in a professional app that you don't have. And it was a very good product. The idea was you had a room full of people, hundreds of them even, and you had a capture server, so you could share the task of recreating a 10,000-document Kennedy library, all brought back to life with excellent precision. It was too expensive. The number of libraries that will rebuild, in its entirety, a whole collection at super-high quality is very, very limited.

But somebody somewhere, I don't know who it was, but they deserve a medal, said, "Ah, but what this will be ideal for is in court." Judges love having Xeroxes, and they've been told not to call them Xeroxes, because there are other photocopier companies, so they say, "Give me a photocopy," and they love them. So what we want from PDF is to retain the roughness of the original but somehow, magically, to OCR it and make it at least slightly searchable. And somebody could say, "Yes, your honor, you're on page three. The word 'reprehensible' will appear on line three, so just type in 'rep'. Oh, it's found it, you see." Well, if you've got a page full of bitmap, how the heck do you find it? The answer is you have, in some limited way, to do OCR on it to see if there's anything there that corresponds to text.

So the middle way was, in many ways, I think, the saving of PDF for large-scale commercial applications: this wonderful compromise of saying, "We'll give you a bitmap overlay, done with our technology, but underneath that we will try our best, but we won't go completely mad. Maybe 70 to 80 percent okay from a good scan. We'll try and give you some good searchability on this, though not totally perfect." We're moving on now from saying, "Well, we can't do FTG, and Image Only is not good-enough quality; we want more value." To the rescue comes PDF Image plus Hidden Text. The idea is that you recognize as much as you can, you recognize where on the page that phrase is, you have a guess at what the font is, using your best quick-and-dirty
technology, and you say, "I can now make this searchable, insofar as my OCR makes sense. Ask me for 'hello world': as long as I've recognized it, I will highlight it for you, back on the bitmap. I won't just give you a listing saying 'memorable phrases I discovered in this bitmap'; I'll actually highlight them for you, because I know where they occur."

To cut a very long story short, it was a howling success. It really was. The idea of having a bitmap that clearly hadn't been messed with (except it had), but in which her honor could actually get to see the relevant phrase in context, highlighted on the bitmap, was an absolute winner, you know, in the legal community.

Let's go and find the workstation where I'm currently trying to rescue Dennis Ritchie's thesis, and see what versions of it I currently have available.

This office: this is my Linux box, running Debian Linux. So this is back to 1972's software, which I just happened to have hung on to for projects of this nature. What you're seeing here is the end result of me rescuing this page. This is my PDF FTG: I am using exactly the right tools for properly re-typesetting a version of this material. So you go there, but this material is mostly just rather complicated. Oh, look at that: hideously complex mathematical equations. You do not want to type that out character by character by hand, even if you know what it means. So what I can do, and it should scale up reasonably well, is say here on the preview that I'd like to magnify it. Can you see, I've asked it to be eight times magnified, but look, the quality is still there. In fact, I think I can go one more on this preview software. Innately, the PostScript/PDF imaging model is capable of scaling up stuff that has been properly typeset. It knows how to do it.

Now, just for contrast, let's take a look at what the original looked like. Now, you can tell straight away that's a photocopy. Also, old-timers out there will immediately seize on that and say, "That's a Teletype printout." And it is. For some reason, Dennis did the cover page
on a Teletype. But when you get to the first real page, as far as we know and can tell, that is what an IBM Selectric typewriter can do. If you magnify this, it's not very good, is it? Look at that. Can you see it's all breaking up? I mean, this is by no means the worst page. Of course, we say, clearly Image Only is not good enough for us: it's not searchable.

So next, people's exhibit number two, then. Let's, from here, kill off that particular PDF. "Ritchie dissertation ocr.pdf": let's see where we get to. It looks the same, and of course the page bitmap overlays are the same, but sneakily, and after several minutes of processing, it's hidden searchability underneath. So let us now go to page one, and let us say we want to just find the first "the" in this document. Oh, it found one even earlier. Of course, I forgot that the default mode of this is not case sensitive. It's found it there, but with a capital T. What I've actually managed to do is to get a searchable bitmap version of Dennis's thesis, and that's absolutely ideal to work from, because I, as a human, can read it, and if I say, "Hmm, I'm sure we came across the word 'submarine' on this page," I can try and find it, and I very likely will.

So there you are, then. You've seen totally reset material, which is arbitrarily scalable and doesn't crack up. This is clearly not proper typeset quality, even at its native size. Here you can see there are dots missing in the typeface, and it's just not of that crisp, high quality. Look at that one.

So, when you're outputting a PDF, you should be trying to output a PDF with full text and graphics, and treat everything the way it deserves to be treated. You should not be taking shortcuts, saying, "I don't have enough fonts; I'll just bitmap that text." The virtues of FTG include not only quality but also space. Typesetting text from within a font takes a lot less space than reproducing all the dotty bitmaps of your broken letter T's and all this kind of stuff. Equally, you don't want to subject your
graphics to being a photo, with lots of white pixels in there that have to be compressed out, causing grief all over. No: if it's a straight line, you just say "line", and that gives you the appropriate command in PDF to do a line. You should aim for the highest quality that your audience, or the application, demands. And just be aware that there is this intermediate stage, which I found very useful: to look at a moderately okay-quality bitmap as my guide to what I need to fix and retype, but at the same time to be able to find things inside it, just about.

We have to decide how much downsampling we think we can get away with. In general, it's very common to downsample the color by a factor of two in both directions, so essentially you have four times fewer color samples.

And they all end up in the neck, or the throat, or whatever it's called, and on a conventional typewriter...
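
The space and scaling argument for FTG made above can be put into rough numbers. What follows is only a sketch under stated assumptions, not how any particular PDF producer works: it takes a hypothetical 6-inch horizontal rule, writes the PDF content-stream operators that stroke it (`w` for line width, `m` for moveto, `l` for lineto and `S` for stroke are real PDF operators), and compares that with the uncompressed size of a 1-bit-per-pixel scan of the same strip. The helper names, the strip dimensions and the 300 dpi scan resolution are all made up for the illustration.

```python
POINTS_PER_INCH = 72  # default PDF user space: 72 units per inch


def vector_line_ops(x0, y0, x1, y1, width=1.0):
    """Return PDF content-stream operators that stroke one straight line."""
    return f"{width:g} w {x0:g} {y0:g} m {x1:g} {y1:g} l S"


def bitmap_bytes(width_in, height_in, dpi, bits_per_pixel=1):
    """Uncompressed size in bytes of a scan of a width_in x height_in inch strip."""
    cols = round(width_in * dpi)
    rows = round(height_in * dpi)
    row_bytes = (cols * bits_per_pixel + 7) // 8  # each row is padded to a whole byte
    return row_bytes * rows


# Hypothetical example: a 6-inch rule, scanned as a 6 x 0.1 inch strip.
ops = vector_line_ops(72, 400, 72 + 6 * POINTS_PER_INCH, 400)
vector_size = len(ops.encode("ascii"))

scan_300 = bitmap_bytes(6, 0.1, dpi=300)    # assumed scanner resolution
scan_2400 = bitmap_bytes(6, 0.1, dpi=2400)  # samples needed to stay sharp at 8x

print(ops)
print("vector bytes:", vector_size)
print("bitmap bytes at 300 dpi:", scan_300)
print("bitmap bytes at 2400 dpi:", scan_2400)
```

The operator string stays the same size whatever the magnification, whereas the bitmap needs 64 times as many bytes to hold up at eight times the size. A real scan would be compressed, so the actual gap is smaller than these raw figures suggest, but the direction of the comparison is the point.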
Channel: Computerphile
Keywords: computers, computerphile, computer, science, University of Nottingham, Professor Brailsford, PDF, Document Engineering, Text, DFB, DMR, Dennis Ritchie, Unix, Thesis
Published: Fri Jun 18 2021
