LF, CRLF, and Why They Are The Way That They Are

Long ago, in a time that exists only in the faintest wisps of memory[1], computers frequently printed (literally) output to teletype machines (sometimes abbreviated as TTYs, an abbreviation which has long outlived its original meaning). These machines would output text by mechanically printing it on a piece of paper, one character at a time. They would be sent a stream of data to print in a format like ASCII[2], which would mix literal characters (such as 'A' or ';') with "Control Characters" - instructions which would cause the printer to perform some specific action.

Some of these control characters, such as "horizontal tab" (ASCII 0x09), live on and flourish today - you may have opinions about whether to use spaces or tabs for program indentation, but you can't dispute that plenty of people use tabs. Other characters, such as "vertical tab" (ASCII 0x0B, which would advance the paper vertically by some nonstandard number of lines), are functionally extinct in the wild, and mostly only exist as weird gaps in the ASCII standard (whose 128 characters are encoded identically in UTF-8, the default text encoding used just about everywhere today). But there's one control character out there which persists in a weird half-life, neither fully alive nor fully dead, occupying space in countless trillions of files but adding meaningful information to almost none of them.

I'm talking, of course, about the carriage return (ASCII 0x0D), which you've seen represented as \r or CR.

In physical teletype machines, CR had a very important meaning - "return the carriage", which is to say, "move the print head back to the start of the line". This was obviously a necessary command to execute between each line of text, to avoid printing letters on top of each other, which would be bad. Therefore, carriage returns were normally paired with the "line feed" character (ASCII 0x0A), which you've seen represented as LF or \n and which advanced the paper by one line, so that the printer would start printing at the beginning of a fresh line. In fact, it was important to place the CR before the LF when breaking a line, because many teletypes could take a while to complete the "carriage return" action, and if you tried to print literal characters before it was completed they would smudge across the paper instead of printing where they were supposed to.

Because most computers before about the late 1970s[3] could be expected to print some or all of their output on teletypes, the character sequence CRLF became ubiquitous in text file encodings to mean "line break", as in "literally begin printing on the next line". Even after teletypes disappeared from the computer world, many operating systems kept this standard for various legacy reasons.

One of those operating systems was CP/M, which was a commercial success in the late 1970s. In the early 80s, Microsoft, the company run by a guy named Bill Gates[4], released a competing operating system called MS-DOS (built on 86-DOS, which Microsoft had bought from Seattle Computer Products), which was designed to be largely compatible with CP/M. This included line endings. MS-DOS was used as the foundation for an operating system named Windows, which you may have encountered a time or two. Windows continues to use CRLF as the standard for line endings even today.

But back in the 1960s, another operating system appeared with a different convention for line endings. Multics, an early and influential time-sharing OS, decided that encoding a line break as CRLF was a waste of a perfectly good byte, and decided to encode line breaks as LF only. To handle legacy teleprinters, Multics drivers were smart enough to insert a CR before every LF when they were sending text to a teletype output. Multics fell out of common use fairly quickly, but managed to inspire a new generation of OSes, particularly Unix, which kept the line break convention. Unix, of course, is the basis of Linux, BSD, and a whole host of other operating systems.
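
If you want to picture what that driver trick amounts to, here's a minimal Python sketch. To be clear, this is my own illustration and not anything resembling actual Multics code; the function name to_teletype and the sample text are made up.

```python
def to_teletype(stored: bytes) -> bytes:
    """Expand each stored LF into CR LF on the way out to the printer."""
    # Assumes the stored text uses LF only, as in the convention described above.
    return stored.replace(b"\n", b"\r\n")


stored = b"HELLO\nWORLD\n"   # hypothetical file contents, one byte per line break
wire = to_teletype(stored)   # what would actually be sent to the teletype

# The file on disk is one byte shorter per line than what goes over the wire.
print(len(stored), len(wire))
```

The point of the design is that the storage format and the device format don't have to match, as long as something in between does the translation.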

(As a side note, early Macs also decided to optimize the line break sequence to save bytes, but did so by keeping the CR instead of the LF. They switched to the Unix-standard LF with the release of Mac OS X in 2001.)

Fortunately, we live in a world where we mostly don't have to think about this. Just about every modern program to display textual data can interpret any line ending convention you give it correctly[5], and programs which may have to interoperate between operating systems, like Git, are pretty good about converting line endings behind the scenes so you don't have to worry about it.
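
As a small illustration of that tolerance, Python's built-in str.splitlines() treats LF, CRLF, and a lone CR as line breaks, so the same three lines come out no matter which convention produced the text. The sample strings below are made up for the example.

```python
# The same three lines of text under each of the conventions discussed above.
samples = {
    "unix": "alpha\nbeta\ngamma",
    "windows": "alpha\r\nbeta\r\ngamma",
    "classic_mac": "alpha\rbeta\rgamma",
}

# splitlines() recognizes \n, \r\n, and \r, and drops the line break itself.
parsed = {name: text.splitlines() for name, text in samples.items()}

for name, lines in parsed.items():
    print(name, lines)
```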

However! As with every abstraction, things can sometimes get a little leaky.

A few weeks ago, I found myself in a situation[6] where I had to add some text to the headers of a bunch of files in a large .NET solution. I used the CLI utility sed to accomplish this, which worked like a charm--or so I thought. Eventually, I went to run unit tests on my code locally, and a bunch of them started failing, mostly in places I hadn't touched. Even worse, the failures all looked somewhat like this: Assertion failed! Strings were not equal: expected 'some long string literal', received 'some long string literal'.

This took me an embarrassing amount of time (and a whole lot of swearing at my computer, which fortunately my wife is very patient with) to diagnose. As it turned out, what had happened was that when I had run sed against all these files on my Windows work machine, it had (as a Unix command-line utility) replaced every CRLF with an LF. These changes never showed up in my Git diffs, because Git is smart enough to ignore line-break changes. They hadn't shown up in my IDE, because Rider is happy to work with files with any line-break convention. But they had broken a handful of unit tests, because we had a few places where multiline string literals were being compared (as expected values) against multiline strings generated by some particular piece of our code, or other similar situations.
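
In hindsight, the quickest way to diagnose this kind of thing is to stop trusting how the strings look and inspect what's actually in them. Here's a rough Python sketch of the idea (my real code was C#, so this is an analogy rather than what I ran). The two sample strings and the helper count_line_endings are invented for the example, and I'm assuming the expected value came from a source file that now had LF endings while the generated value used CRLF.

```python
def count_line_endings(text: str) -> dict:
    """Count CRLF pairs, bare LFs, and bare CRs in a string."""
    crlf = text.count("\r\n")
    return {
        "CRLF": crlf,
        "LF": text.count("\n") - crlf,   # LFs that are not part of a CRLF
        "CR": text.count("\r") - crlf,   # CRs that are not part of a CRLF
    }


expected = "line one\nline two\n"       # literal from an LF-converted source file (assumed)
received = "line one\r\nline two\r\n"   # string generated with CRLF line breaks (assumed)

print(expected == received)             # the comparison the test was making
print(repr(expected))                   # repr() makes the control characters visible
print(repr(received))
print(count_line_endings(expected), count_line_endings(received))
```

Printing the repr() of each value, rather than the value itself, is the part that would have saved me the morning.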

Once I had figured this out, it was a simple matter of running unix2dos to convert the line-breaks back in place[7]. That solved the issue in a few seconds, if you don't count the morning I wasted trying to figure out why the {unprintable control character sequence} my unit tests were failing when the strings looked exactly the same.
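
For the curious, the core of that conversion is small enough to sketch in Python. This is not how unix2dos itself is implemented, just the same idea under my own assumptions: work on raw bytes, and only touch LFs that aren't already preceded by a CR, so that running it twice doesn't produce CRCRLF. The function names are made up.

```python
import re

# An LF that is not already preceded by a CR.
_BARE_LF = re.compile(rb"(?<!\r)\n")


def lf_to_crlf(data: bytes) -> bytes:
    """Convert bare LF line breaks to CRLF, leaving existing CRLFs alone."""
    return _BARE_LF.sub(b"\r\n", data)


def crlf_to_lf(data: bytes) -> bytes:
    """Convert CRLF line breaks to LF (the direction my sed run went)."""
    return data.replace(b"\r\n", b"\n")


mixed = b"first\nsecond\r\nthird\n"   # hypothetical file with mixed endings
fixed = lf_to_crlf(mixed)
print(fixed)
```

Working on bytes rather than decoded text matters here, since opening a file in text mode can silently translate line endings before you ever see them.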

So that's the story of how the long-lost existence of teletypewriters caused me some brief but intense problems in 2023. But hey, at least now I can add a few more obscure and arcane facts to my great mental library of such knowledge, some of which is useful from time to time.

Sources and further reading

  • I found out about the history of line-break conventions (among many many other fascinating historical tidbits) in this ESR post about Things Every Hacker Once Knew.
  • There's good further reading about character encodings in this post from 2003 -- fortunately most of the problems its author complains (justifiably) about have been solved since then with the widespread adoption of UTF-8.
  • A super-quick overview of the general ideas discussed here, if you want the condensed version.

  1. The mid-1900s. 

  2. Which rapidly became the dominant format, but was by no means the only one.

  3. Per Wikipedia.

  4. You may have heard of him; he's actually terrible for numerous and well-documented reasons, all of which are beyond the scope of this post.

  5. One of the last holdouts was Microsoft's Notepad, which finally added support for LF-only line endings in 2018.

  6. It's not really important to the story, but I was trying to add using statements (C# for import) to the tops of ~1000 files, and it turned out to be easiest to add them to the top of every file and then use the refactor utility in Rider to clean up all the ones that didn't need to be there. 

  7. And then it turned out we didn't need the changes anyway and I got to discard the whole branch and had nothing to show for it but lost time, but that's another story and less interesting. 
