Monday, 9 November 2009


Corpus research requires computers. Text corpora are held in electronic form, and analysis is done by computer programs. Computer programs are written in programming languages, so if you want to create a program to perform a certain task, you need to know how to program.

However, learning to program is a time-consuming task, way beyond the reach of the average linguistics student or researcher, who wants to learn about human language by looking at patterns in texts. A carpenter does not usually make hammers and saws, and likewise a linguist cannot be expected to make their own analytical tools.

So all you can do is rely on software other people have written for you. That, however, is extremely limiting. Suddenly you can only do the things that those programmers decided you should be able to do. Other people are telling you what you can or cannot do. You've got a hammer, and all you can do with it is hit a nail. Nothing else. Well, perhaps you can do a little more, but any unforeseen use requires creativity, and there are usually better tools, such as screwdrivers, which would do the task you want much more efficiently.

There is a middle way. While you might not have the time to learn a programming language, there are a number of smallish modular tools which you can stick together to form new tools. It is like building a hammer out of Lego. When you need another tool, you take your Lego bricks and assemble them in a different way. The result might not be a very efficient tool (though it is usually good enough), but it is still better than not having a tool at all.

These modular programs are the Unix text tools, and this blog is about them. Here you will find out how to solve basic tasks of text processing. You will be in control, as you can change how things work, for yourself. No need to ask someone to write a program for you. You build your own software from a set of easy-to-use components.
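To give a flavour of what sticking such components together looks like, here is a minimal sketch of a word frequency list. It is only an illustration under some assumptions: the file name sample.txt and its contents are made up for the example, and a "word" is simply taken to be any run of the letters A to Z, which is far too crude for serious work.

```shell
# Hypothetical sample text; replace sample.txt with your own corpus file.
printf 'the cat sat on the mat\nthe dog sat\n' > sample.txt

# Each tool does one small job; the pipe (|) hands its output to the next one.
#   tr -cs 'A-Za-z' '\n'  turn every run of non-letters into a line break (one word per line)
#   tr 'A-Z' 'a-z'        convert upper case to lower case
#   sort                  bring identical words next to each other
#   uniq -c               collapse identical lines and prefix each with its count
#   sort -rn              order by count, highest first
#   head -n 2             keep only the top two lines
tr -cs 'A-Za-z' '\n' < sample.txt | tr 'A-Z' 'a-z' | sort | uniq -c | sort -rn | head -n 2 > top.txt

# Show the result: the most frequent word ("the", 3 occurrences) comes first.
cat top.txt
```

Swap one brick and you get a different tool: drop the second tr and the list becomes case-sensitive, or change the number given to head to see more of the list.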

They are called the Unix text tools because they originate from the Unix operating system. If your computer runs Linux or Mac OS X, then you have a Unix system, and you already have the text tools on your machine, ready to use. If you are using Windows, then you don't. But there are Windows versions of these tools available, so not all is lost! Alternatively, you can get a so-called live CD and run Linux on your PC without messing about with your Windows system.

This blog will grow, as more and more examples and exercises are added to it. Eventually we hope to make it into a book. If you have any feedback or comments, please leave a comment on this blog. Your help is much appreciated!

Oliver Mason and Nick Groom

awk script for testing Zipf's law

Ever wondered how to test Zipf's law on a corpus? Here's an awk script that does the job in just two lines: