Scripting Languages

Today in my continuing series on programming languages I'm going to talk about "scripting languages". "Scripting" is a rather fuzzy category, because unlike the kinds of languages we've discussed before, scripting languages are really distinguished by how they are used, and they're used in two very different ways. It's also confusing because most scripting languages are interpreted, and people tend to use "scripting" when they should be using "interpreted". In my opinion it's more correct to say that a language is being used as a scripting language, rather than to say that it is a scripting language. As we'll see, this is particularly true when the language is being used to customize some application.

But first, let's define scripts. A script is basically a sequence of commands that a user could type at a terminal[1] -- often called "the command line" -- that have been put together in a file so that they can be run automatically. The script then becomes a new command. In Linux, and before that Unix, the program that interprets user commands is called a "shell", possibly because it's the visible outer layer of the operating system. The quintessential script is a shell script. We'll dive into the details later.

[1] okay, a terminal emulator. Hardly anyone uses physical terminals anymore. Or remembers that the "tty" in /dev/tty stands for "teletype"'.

The second kind of scripting language is used to implement commands inside some interactive program that isn't a shell. (These languages are also called extension languages, because they're extending the capabilities of their host program, or sometimes configuration languages.) Extension languages generally look nothing at all like something you'd type into a shell -- they're really just programming languages, and often are just the programming language the application was written in. The commands of an interactive program like a text editor or graphics editor tend to be things like single keystrokes and mouse gestures, and in most cases you wouldn't want to -- or even be able to -- write programs with them. I'll use "extension languages" for languages used in this way. There's some overlap in between, and I'll talk about that later.

Shell scripting languages

Before there was Unix, there were mainframes. At first, you would punch out decks of Hollerith cards, hand them to the computer operator, and they would (eventually) put it in the reader and push the start button, and you would come back an hour or so later and pick up your deck with a pile of listings from the printer.

Computers were expensive in those days, so to save time the operator would pile a big batch of card decks on top of one another with a couple of "job control" cards in between to separate the jobs. Job control languages were really the first scripting languages. (And the old terminology lingers on, as such things do, in the ".bat" extension of MS/DOS (later Windows) "batch files". Which are shell scripts.)

By far the most sophisticated job control language ran on the Burroughs 5000 and 6000 series computers, which were designed to run Algol very efficiently. (So efficiently that they used Algol as what amounted to their assembly language! Programs in other languages, including Fortran and Cobol, were compiled by first translating them into Algol.) The job control language was a somewhat extended version of Algol in which some variables had files as their values, and programs were simply subroutines. Don't let anyone tell you that all scripting languages are interpreted.

Side note: the Burroughs machines' operating system was called MCP, which stands for Master Control Program. Movie fans may find that name familiar.

Even DOS batch files had control-flow statements (conditionals and loops) and the ability to substitute variables into commands. But these features were clumsy to use. In contrast, the Unix shell written by Stephen Bourne at Bell Labs was designed as a scripting language. The syntax of the control structures was, in fact, derived from Algol 68, which introduced the "if...fi" and "do...done" syntax.

Bourne's shell was called sh in Unix's characteristically terse style. The version of Unix developed at Berkeley, (BSD, for Berkeley System Distribution -- I'll talk about the history of operating systems some time) had a shell called the C shell, csh, which had a syntax derived from the C programming language. That immediately gave rise to the popular tongue-twister "she sells cshs by the cshore".

The GNU (GNU's Not Unix) project, started by Richard Stallman with the goal of producing a completely free replacement for Unix, naturally had its own rewrite of the Bourne Shell called bash -- the Bourne Again Shell. It's a considerable improvement over the original, pulling in features from csh and some other shell variants.

Let's look at shell scripting a little more closely. The basic statement is a command -- the name of a program followed by its arguments, just as you would type it on the command line. If the command isn't one of the few built-in ones, the shell then looks for a file that matches the name of the command, and runs it. The program eventually produces some output, and exits with a result code that indicates either success or failure.

There are a few really brilliant things going on here.

Each program gets run in a separate process. Unix was originally a time-sharing operating system, meaning that many people could use the computer at the same time, each typing at their own terminal, and the OS would run all their commands at once, a little at a time.
That means that you can pipe the output of one command into the input of another. That's called a "pipeline"; the commands are separated by vertical bars, like | this, so the '|' character is often called "pipe" in other contexts. It's a lot shorter than saying "vertical bar".
You can "redirect" the output of a command into a file. There's even a "pipe fitting" command called tee that does both: copies its input into a file, and also passes it along to the next command in the pipeline.
The shell uses the command's result code for control -- there's a program called true that does nothing but immediately returns success, and another called false that immediately fails. There's another one, test, which can perform various tests, for example to see whether two strings are equal, or a file is writable. There's an alias for it: [. Unix allows all sorts of characters in filenames. Anyway, you can say things like if [ -w $f ]; then...
You can also use a command's output as part of another command line, or put it into a variable. today=`date` takes the result of running the date program and puts it in a variable called today.

This is basically functional programming, with programs as functions and files as variables. (Of course, you can define variables and functions in the shell as well.) In case you were wondering whether Bash is a "real" programming language, take a look at nanoblogger and Abcde (A Better CD Encoder).

Sometime later in this series I'll devote a whole post to an introduction to shell scripting. For now, I'll just show you a couple of my favorite one-liners to give you a taste for it. These are tiny but useful scripts that you might type off the top of your head. Note that comments in shell -- almost all Unix scripting languages, as a matter of fact -- start with an octothorpe. (I'll talk about octothorpe/sharp/hash/pound later, too.)

# wait until nova (my household server) comes back up after a reboot
until ping -c1 nova; do sleep 10; done

# count my blog posts.  wc counts words, lines, and/or characters.
find $HOME/.ljarchive -type f -print | wc -l

# find all posts that were published in January.
# grep prints lines in its input that match a pattern.
find $HOME/.ljarchive -type f -print | grep /01/ | sort

Other scripting languages

As you can see, shell scripts tend to be a bit cryptic. That's partly because shells are also meant to have commands typed at them directly, so brevity is often favored over clarity. It's also because all of the operations that work on files are programs in their own right; they often have dozens of options and were written at different times by different people. The find program is often cited as a good (or bad) example of this -- it has a very different set of options from any other program, because you're trying to express a rather complicated combination of tests on its command line.

Some things are just too complicated to express on a single line, at least with anything resembling readability, so many other programs besides shells are designed to run scripts. Some of the first of these in Unix were sed, the "stream editor", which applies text editing operations to its input, and awk, which splits lines into "fields" and lets you do database-like operations on them. (Simpler programs that also split lines into fields include sort, uniq, and join.)

DOS and Windows look at the last three characters of a program's name (e.g., "exe" for "executable" machine language and "bat" for "batch" scripts) to determine what it contains and how to run it. Unix, on the other hand, looks at the first few characters of the file itself. In particular, if these are "#!" followed by the name of a program (I'm simplifying a little), the file is passed to that program to be run as a script. The "#!" combination is usually pronounced "shebang". This accounts for the popularity of "#" to mark comments -- lines that are meant to be ignored -- in most scripting languages.

The scripting programs we've seen so far -- sh, sed, awk, and some others -- are all designed to do one kind of thing. Shells mostly just run commands, assign variables, and substitute variables into commands, and rely on other programs like find and grep to do most other things. Wouldn't it be nice if one could combine all these functions into one program, and give it a better language to write programs in. The first of these that really took off was Larry Wall's Perl. Like the others it allows you to put simple commands on the command line -- with exactly the same syntax as grep and awk.

Perl's operations for searching and substituting text look just like the ones in sed and grep. It has associative arrays (basically lookup tables) just like the ones in awk. It can run programs and get their results exactly the way sh does, by enclosing them in backtick characters (`...` -- originally meant to be used as left single quotes), and it can easily read lines out of files, mess with them, and write them out. It has has objects, methods, and (more or less) first-class functions. And just like find and the Unix command line, it has a well-earned reputation for scripts that are obscure and hard to read.

You've probably heard Python mentioned. It was designed by Guido van Rossum in an attempt to be a better scripting language Perl, with an emphasis on making programs more readable, easier to write, and easier to maintain. He succeeded. At this point Python has mostly replaced Perl as the most popular scripting language, in addition to being a good first language for learning programming. (Which is the best language for learning is a subject guaranteed to invoke strong opinions and heated discussions; I'll avoid it for now.) I avoided Python for many years, but I'm finally learning it and finding it much better than I expected.

Extension languages

The other major kind of scripting is done to extend a program that isn't a shell. In most cases this will be an interactive program like an editor, but it doesn't have to be. Extensions of this sort may also be called "plugins".

Extension languages are usually small, simple, and interpreted, because nobody wants their text editor (for example) to include something as large and complex as a compiler when its main purpose is defining keyboard shortcuts. There's an exception to this -- sometimes when a program is written in a compiled language, the same language may be used for extensions. In that case the extensions have to be compiled in, which is usually inconvenient, but they can be particularly powerful. I've already written about one such case -- the Xmonad window manager, which is written and configured in Haskell.

Everyone these days has at least heard of JavaScript, which is the scripting language used in web pages. Like most scripting languages, JavaScript has escaped from its enclosure in the browser and run wild, to the point where text editors, whole web browsers, web servers, and so on are built in it.

Other popular extension languages include various kinds of Lisp, Tcl, and Lua. Lua and Tcl were explicitly designed to be embedded in programs. Lua is particularly popular in games, although it has recently turned up in other places, including the TeX typesetting system.

Lisp is an interesting case -- probably its earliest use as an extension language was in the Emacs text editor, which is almost entirely written in it. (To the point where many people say that it's a very good Lisp interpretor, but it needs a better text editor. I'm not one of them: I'm passionately fond of Emacs, and I'll write about it at greater length later on.) Because of its radically simple structure, Lisp is particularly easy to write an interpretor for. Emacs isn't the only example; there are Lisp variants in the Audacity audio workstation and the Autodesk CAD program. I used the one in Audacity for the sound effects in my computer/horror crossover song "Vampire Megabyte".

Emacs, Atom (a text editor written in JavaScript), and Xmonad are good examples of interactive programs where the same language is used for (most, if not all, of) the implementation as well as for the configuration files and the extensions. The boundaries can get very fuzzy in cases like that; as a Mandelbear I find that particularly appealing.