A point of introduction to C in cLIeNUX

NAME

C, gcc, cc1 - the C programming language

DOC DATE

19990912 <-> 20000309

purpose of this document

This seedoc is intended to provide a point of introduction to using the C programming language on cLIeNUX. It is hoped that sufficient rudiments will be revealed here to write useful programs, and that concepts of using the C development tools will be clarified significantly for many non-programming tasks, such as installing existing programs consisting mainly of C sourcecode. I am not an expert C programmer, but with enough flailing about I can get it to do what I want, and an introduction is needed. I haven't done much checking of what I state here. Actual code examples have been checked. As soon as you can, get other references to fix the damage I do here.

I attempt to present what appear to me to be the basic concepts of C, and this document is not complete as a reference to operators and so on. The basic operators provided by awk, such as +, -, &, % and so on, are almost exactly the same as C's, except as pertains to data typing. See awk for that, and as a sort of a "practice C". Basic concepts of programming may be omitted in lieu of the programming seedoc, which should be studied before this document if you don't know programming at all. If you get lost here, refer back to that seedoc. Some details of this discussion may be specific to GNU C and the GNU development tools, or to cLIeNUX. Some other details are known to be less than perfectly correct in the interest of brevity. C has some extremely confusing aspects, but most aspects are fairly straightforward for a language that produces high-performance results. Once C is learned, the possibilities opened are proportional to the mass of existing C code.

Things for you to input at the terminal, C/cpp keywords, and commands, are usually emphasized in this html like this.

general description of C

The C programming language was developed around 1973 to make UNIX portable. C is a procedural language. That is a type of programming language that is not very abstract compared to what a typical CPU actually does. C is at the low-abstraction end of high-level languages. This is one reason the performance of code produced by C is usually quite good, and therefore hand-coding assembly or machine language in conjunction with C is typically only done in circumstances where it is unavoidable. Almost all of a typical Linux/GNU/etc. distribution is written in C. C++, also from those wacky boys at Bell Labs, is basically a superset of C. That is, conceptually, C++ is written in C. C++ is not included in cLIeNUX Core.

Knowing C, or being aware of C terminology, and the terminology of the assembler, linker and other utilities associated with C, is of great help to users of unix in activities other than programming, and also for programming most other languages popular in unix.

Documentation for unix that tells you how things actually work often assumes a knowledge of C. That's not good, since C isn't the definition of computing, but it is very good in that in many ways unix makes no distinction between user and programmer, which may be difficult at first but is ultimately very empowering.

The C compiler proper, cc1, which is what implements the actual C programming language, is one component in a suite of tools. At least four main components are usually implied by the phrase "written in C": the C pre-processor, the C compiler, the assembler, and the linker. There are other preprocessors and translators of various types for use with C. There are also two prevalent "front-ends" for the entire compilation process. "Written in C" is represented directly by the gcc command. The GNU gcc (or cc) command is a "driver" in the top-down sense of the term, an interface to and manager for the four main programs mentioned above, which in these examples will be the GNU cpp C preprocessor, the GNU cc1 C compiler, the GNU as or gas assembler, and the GNU ld linker.

Large C-based programs consisting of many files are invariably built under the control of the make utility. make, cpp, as and ld can all be used for tasks unrelated to the C compiler or libc, but are designed and heavily defaulted to work with them, and each other. cpp in particular is bothersome to use as a macro-processor for something other than C sourcecode. See m4 for general macro-processing. System-wide subroutine linking libraries in unix observe the C conventions for parameter passing, header (#include) files and so on.

background, from the bottom up, machine-wise and historically

Using C benefits from a basic grasp of the layers underneath C, since it resembles or works closely with them, and was written by the kind of assembly language programmers that "They don't make 'em like that any more". Common C terminology tends to assume a knowledge of assembly language practice and concepts. I hope to present some of that here.

C, and GNU gcc in particular, allows detailed control of what level of abstraction you operate at. gcc can in fact compile several versions of C, "K&R" or "traditional", ANSI, and "GNU", which allows a lot of syntactical constructs ANSI forbids or leaves undefined. Some of my own code depends on the GNU C extensions, since GNU C, like GNU software in general, is amazingly portable and widely used, to the point of being perhaps a de-facto standard for C.

Unfortunately, something that is capable of high abstraction, extreme flexibility and utter specificity takes a lot of explaining. Fortunately, we are presenting C right in its natural habitat, its home court, and everything works pretty much as expected, right at your fingertips. You are urged to investigate things you don't understand in this document by interacting with the C facilities cLIeNUX Core provides.

A typical CPU chip gets its instructions from RAM as simple binary codes for various operations. These codes are called opcodes. The full set of opcodes a particular CPU implements is called its machine language, or instruction set. All programming is a matter of arranging these opcodes, and usually arranging some initial data for them to act upon. At the dawn of the digital computer age these opcodes and data were entered by hand with a row of 2-position switches. Input devices were developed to allow these binary opcodes to be entered in bulk, from paper tape, punchcards and so on. Other devices arose to allow input to the computer as hexadecimal numbers instead of individual bits. ( Lost yet? see programming. ) These forms of controlling a computer utterly directly in its native machine language and without abstraction are called first generation languages.

The GNU objdump utility can display sections of an object file in hex. So can cLIeNUX binedit. "object" has several meanings, and more than one of them is used in this seedoc. In this case an "object file" is a file containing program code in runnable form. "object code" is binary data ready to give to a CPU as its program. An object file may or may not also be runnable as a stand-alone command. "object" is not used in this document in the sense of "object oriented programming", which is an abstraction layer built on top of what is presented here. Let's look at the object code for the basename command. Note that the code examples in this background section are for perspective, not for detailed understanding.

	:; cLIeNUX0 /dev/tty10 r 05:32:57   /subroutine/static
	:;objdump -s -j .text  /.bi/basename |page


	/.bi/basename:	   file format elf32-i386

	Contents of section .text:
	 8048d70 5989e389 e083e4f8 89ca01d2 01d201d0  Y...............
	 8048d80 83c00431 ed555555 89e55053 51b88800  ...1.UUU..PSQ...
	 8048d90 0000bb00 000000cd 808b4424 08a3c8ce  ..........D$....
	 8048da0 04080fb7 05dcd004 0850e859 ffffff83  .........P.Y....
	 8048db0 c404e851 feffff68 70ba0408 e8f7feff  ...Q...hp.......
	 8048dc0 ff83c404 e837fdff ffe80a02 000050e8  .....7........P.
	 8048dd0 24ffffff 5b8d7426 008db426 00000000  $...[.t&...&....
	 8048de0 b8010000 00cd80eb f78db426 00000000  ...........&....
	 8048df0 5589e553 bbfcce04 08833dfc ce040800  U..S......=.....
	 8048e00 740e89f6 8b03ffd0 83c30483 3b0075f4  t...........;.u.
 
Pretty cryptic, isn't it? The above format is affectionately known as a hex dump. The left column is the process virtual address of the first byte of the line, then there are 16 bytes of memory shown in hexadecimal, and in the right column bytes that have printable ASCII representations are shown as such. Bytes that aren't printable in ASCII are accounted for with periods. That's actually more user-friendly than what the CPU sees. That's the first 160 bytes of the .text section of the file /command/basename. ".text" is the ELF section of an executable file that contains the actual code of the program, but to be honest, there is sometimes non-code stuff in a .text section, and from this display I can't tell for sure if the above is actually machine code or some other data. Trying to talk to computers in a form like the above quickly gave rise to the second generation of programming languages, known as assembly languages.

Assembly languages are simple translations of opcodes to short text names for the opcodes, called mnemonics, and some other rudimentary conveniences. The C compiler itself produces assembly language to be processed into binary object code by an assembler. Machine language monitors also exist to convert assembly language to binary object code directly in RAM interactively, and other on-the-metal activities. "File-to-file" assemblers are more prevalent on multi-user operating systems than machine language monitors, which give total interactive control over the machine to an extent that is unsuitable for a running multi-user system, since frequently crashing the machine is normal when using a machine language monitor. The GNU debugger may have much of the functionality of a machine language monitor, I don't know. I manage to stay out of gdb most of the time. Here's an example of typing some C code at the cc1 C compiler directly. Again, this is just for perspective; don't be too worried about details just yet.

	:; cLIeNUX0 /dev/tty10 r 06:39:15   /subroutine/static 
	:;cc1
	blah(){
		.file	"stdin"
		.version	"01.01"
	gcc2_compiled.:
	 blah
	int zay;
	zay = zay + 3; } 
		.text
		.align 16
	.globl blah
		.type	 blah,@function
	blah:
		pushl %ebp
		movl %esp,%ebp
		subl $4,%esp
		addl $3,-4(%ebp)
	.L1:
		movl %ebp,%esp
		popl %ebp
		ret
	.Lfe1:
		.size	 blah,.Lfe1-blah

 (at this point I did ctrl&c to exit)  
OK, not very clear. The C code I input was
	blah(){ int zay; zay = zay + 3 ; }
The compiler produces valid assembly language for the GNU gas assembler. On this box the assembly is for the x86 family of chips. The output contains linker directives, interspersed with the following x86 assembly language instructions...
		pushl %ebp 
		movl %esp,%ebp 
		subl $4,%esp 
		addl $3,-4(%ebp)
		movl %ebp,%esp 
		popl %ebp
		ret
Admittedly, this probably doesn't make much more sense than a hex dump, but if you know x86 assembly it does. I don't know x86 assembly well myself, except for things like "ret" means "return from subroutine". Even if one knows neither C nor assembly, the above illustrates some things. The difference in size between the C code and the assembly language code vaguely supports the statement that C is at the low end of high-level languages. Also, one can clearly see that to get from C to the CPU one does in fact have to pass through the lower-level stages. The lower level stages aren't gone, and they aren't obsoleted, they are just subordinated, and usually occur in the background. Also, if you want to know what your C code actually results in, you can see that you have the tools to find out.

There's some other things that aren't actually self-evident from the above, but that one can imagine when looking at that example. If we did the same thing on a machine set up to assemble object code for some other CPU besides the x86, such as the PowerPC, the same C code input would produce different assembly language output. That is portability, and is the most important characteristic of third-generation languages.

There are a couple more things I want to point out about the assembly code that will explain a lot later. Our blah() in the C code is what is called a function in C. It, and its contents, which is what is between the curly-braces, are a named coherent functional unit of code. "blah()" in the C input became


	.globl blah
		.type	 blah,@function
	blah:
 
in the assembly language output. The ".globl blah" says that the following blah: label is a "global symbol". This is very important to the linker. This is how the entry points of routines are found in object files and libraries during linking. You'll hear about symbols all the time when dealing with linking, which you'll hear about a lot when building large C programs that aren't pre-packaged for your setup.

Unsuccessfully doing nothing in C

For an actual example of C resulting in a working program, one would like to begin with the simplest example possible. Given the fact that C is really all in parts, there are a couple of meanings to "simplest". The simplest way to get a working command out of C is not the simplest example of legitimate C code, in terms of facilities used. An object file that's not a command, such as a component in some other program or a library, might use fewer features of the build environment, but we want a command. gcc and the standard C library provide facilities for making a stand-alone program. Our first actual code example will use those facilities. If we use the name main() for the main routine of our program, the handlers will all be included by gcc to make a runnable command. OK, make a directory somewhere called /source/box/C_revelations or something, and edit a file called null.c in it to contain the following text...

	main(){}

Did that? OK, now do
gcc null.c
You should then have a new file, named a.out by ancient tradition. That's your new executable. The text "main(){}" is the absolute minimum that will compile into an executable. gcc did a lot of processing and assumed a lot of default choices to produce a unix/Linux ELF executable from null.c.

:;ls -l
-rwxr-xr-x 1 r root 3772 Sep 12 10:46 a.out
:;file a.out
a.out: ELF 32-bit LSB executable, Intel 80386, version 1, dynamically linked, not stripped
We see that main(){} became a 3772 byte dynamically linked executable command. "not stripped" means the symbols used to link it and various other info are still in it. You know what symbols are, right? Good. Even stripped, it would be much bigger than whatever few opcodes are required to express a main() that does nothing, because it has an ELF format data structure in it so the system can handle it properly. In the case of a dynamically linked executable, that structure also includes a reference to the dynamic linker, ld-linux.so.

What all did gcc do? null.c doesn't include any preprocessor directives, but gcc can't know that ahead of time, so GNU cpp was run on it anyway. The output of cpp, which in this case was the same as its input, our program text, was then passed to cc1, the actual GNU C compiler executable, and the C code was compiled into x86 assembly language. The assembly language code was passed to as, the GNU assembler, which assembled the assembly language text representation of the code into binary object code. as also does just enough adding of header information so that the object file it produces can be handled by the linker. ld took the object file produced by as, and linked it to a file gcc keeps handy named crt1.o that actually puts wrapper code around main(). The wrapper code precompiled in crt1.o endowed our program with unix commandline and environment variable processing, and some libc initialization. ld also linked our program to libc.so.(version#), so that if we had used any functions from the standard C library they would be available at runtime. You can get an account of all this processing with the -v and --save-temps switches to gcc, a la


	:;gcc -v --save-temps null.c

Our current a.out is a full-fledged unix executable. The OS gives it the full suite of facilities for a process when you run it, so you can time it, you can do pointless things with redirection operators, and you can give it commandline arguments and environment variables to ignore, but all that stuff is the OS at work. We have actually failed to produce a program that does absolutely nothing, however. The wrapper code that calls main(), that gcc linked in from crt1.o, uses the _exit() system call to end the process. That's the only way to end a process, and it always returns a byte to the calling process. In our case it always returns zero, but that's doing something. That is about the least a unix command can do, though.

One more aspect of doing nothing that bears mention is proper commenting. Comments should state *what* a routine does. The code is *how* it does it. Sometimes stating what things don't do is important also. For future reference to null.c, you may want to edit it something like this...


	/*  null.c minimum C program. Returns 0, because returning
		void from a Linux command isn't possible. */

	main(){}

Doing something

OK, you've picked up some assembly lingo, and you realize that there's massive work done under the hood when making an executable from C. Note also that everything gcc does can be specified individually. gcc is the manager for lots of parts, but all the parts are accessible individually. You can do whatever you want. The sticky parts are A: you have to know what you want, and B: you have to be able to express what you want. Now we can delve into expressing what you want in C itself like someone that wants some results.

C can be thought of as a core language, and a standard library. That is the overall format of the current ANSI C standard, that's how it's implemented as cpp/cc1 and libc, and it is rather analogous to the CPU and the peripherals of a computer. The core language defines the C virtual CPU, and the library routines provide things like files, sockets, high-level math functions like trigonometric functions, and so on. We'll emphasize the core language, but we'll need some other functionality to use it, just as you need some kind of input/output to a CPU to control it. We will continue to use the main() interface to provide us with a testable command, and we will use the commandline argument handling provided with it to give our program something to work on that isn't necessarily the same every time the program runs. We'll use the return value of main() for output, even though it gets truncated to just a byte, and is really intended to provide just a success/fail flag. Type the following into a file named plus5.c.


	/* plus5.c   add 5 to a commandline argument */

	int main(int argc, char *  argv[]){ return atoi(argv[1]) + 5; }
 
This probably looks real bad to a non-C programmer, but it's much better than a hex dump. Let's pick it apart. The first line,

	/* plus5.c   add 5 to a commandline argument */
is a comment. The C preprocessor comes first and replaces everything from the /* to the */ inclusive with a single blank. Right after the first line is another example of my sloppy coding style. There should be a line saying

	#include <stdlib.h> 
and there isn't. Later in the program we use the atoi() libc call, and we should #include the header file that declares it. We get away without it because traditional C assumes that an undeclared function returns an int, which atoi() does, and because gcc links libc by default. For a function that returns something other than an int, or with a stricter compiler, this kind of sloppiness will not work. #include is a facility of the preprocessor. It's kind of the opposite of a comment, in that comments get removed and #includes get inserted. The less-than/greater-than signs around stdlib.h are shorthand for "in the standard header files directory", which in cLIeNUX is /source/C/include by default and is /usr/include on most other unices. Next comes

	int main(int argc, char * argv[]){ 
Wow. A lot of the trickiness of C, and the main() interface, comes to a head right here. Well, in the interest of brevity, I'm going to ignore it. For the purpose of introduction, just note that this, exactly, is the magic incantation you use to use the commandline facilities associated with main(), and that argv is given to us as an array of pointers to chars.

The { and } in our program define the limits of the body of main(). main() contains one statement. Statements are terminated with a semicolon. Our statement,


	return atoi(argv[1]) + 5;  
says to return from main(), i.e. quit, and to provide the return value of the first commandline argument after the program name, converted from an ASCII string to an int, and added to 5. C evaluates things in an expression like that from innermost parenthesized group to outermost. First it gets argv[1], which is the first commandline argument. argv[0] is the program name, which we don't happen to use within our program. atoi() converts the string to an int, then 5 is added to that, then return has everything it needs and it does its thing, and the program is finished.

This is a very sloppy program. If it doesn't get an argument at all after the program name, it segfaults. If it gets an argument it can't convert to an int, it thinks it's 0 and it returns 5. If the argument + 5 is more than 255, it gets truncated to a byte by _exit. But, when you write your own code, you determine whether or not the program is appropriate for the task at hand.

Format of a C program

From the top, a C program has an overall form like this

						(not real C code)
	variable declaration/definition .  .  .

	function declaration/definition .  .  .

	main definition

That's for an executable. Code for routines to be linked with something else won't have main(). "main", by the way, isn't formally a C reserved keyword, but its use is ubiquitous. Variables can also be declared inside functions. If they are declared outside any function they are visible to any function in the file. "visible to" means "usable by". If they are declared within a function, they are only visible within that function. A function called from there cannot see them unless they are passed to it as arguments, or unless it is a GNU C nested function defined inside the same function. These issues are called scoping. A variable that is only visible within a function is "local" to that function. There are also "storage class" qualifiers for variables, but I'm not going to address that.

#includes of necessary header files are usually at the top, but the include mechanism works at any point in a file, except within comments, since cpp does comment removal first. The preprocessor can also cause sections of code to be included or omitted depending on variables in cpp's own variables namespace. For any program of any size, various mechanisms will be used to maintain the program in various parts, but at build time it will all resolve to the above general outline by the time cpp hands it to cc1. Routines linked from precompiled libraries are semantically like variable declarations and non-main() function declarations. The structure of a variable declaration is


						(not real C code)

	type [qualifier] name[= initializer] [, name, name...];
where [ ] encloses optional material. What you are doing when you declare a variable is initializing storage, which happens before the program starts, so an initializer must be something that can be determined at that point in time, such as a constant number, string, or expression that can be resolved unambiguously. There's a concise definition of "expression" in C, but I don't know exactly what it is offhand. Let's say for now that it's a description of a simple computation to be performed that produces one value. According to the ANSI lingo, if a variable declaration has an initializer it's a "definition".

All actual runtime code must be within functions. Declarations and definitions don't have to be within functions because a program can have initialized data in its memory image. The structure of a function definition, including main, is this...


						(not real C code)
	return_type name ( [argument...] )  
	{ 
	[statement ; 
	statement ;
	statement; ....]  
	}

Line spacing and indentation are just shown for illustration. Proper formatting is important for readability and maintainability, but sequences of blanks, tabs and newlines are all the same as a single blank to C. Statements may be variable declarations, assignment statements or calls of existing functions. All the statement lines are optional. A function that does nothing is sometimes useful as a stub for future code. All data C handles must have a data type, so a function definition must declare the type of data it produces, that is, the data it returns to its calling function. GNU C allows function definitions within a function. An assignment statement has the form
						(not real C code)
	ob_expression = expression ;
ob_expression means some code that can be resolved to an object, i.e. an entity that can store a value. This is a new meaning of the term "object" within this seedoc. The simplest example of an object in this sense is a variable. Objects are also called lvalues. In other words, the left side of an assignment must represent a storage location of some kind in memory. The = sign is the basic assignment operator. It means when this statement has been performed the object on the left (the lvalue) will contain the results of having performed the expression on the right.

controlling the order of execution

Execution of a C program begins at the first statement in main(). The default sequence of execution flow of the program is from top to bottom within a function, including main(), which can be altered with structured flow control constructs and labels/gotos. The time sequence of actions within a statement is determined by the precedence of the operators used, which operators are used, and parentheses. C has hairy precedence rules for its many operators, so use lots of parentheses. The action of parentheses in C expressions is fairly intuitive, as far as plain expressions are concerned. Unfortunately parentheses are also used as argument delimiters for functions and flow control constructs, and for the typecasting operation. More on these later. Meanwhile, parentheses are a welcome simplifier for expressions.

As pertains to labels and structured flow controls such as for-loops and while-loops, a statement is a unit of program flow. There are, however, two constructs to control and alter program flow within a statement.

A comma is an operator in some contexts. The comma operator creates a compound expression. For each comma operator, the left side is evaluated, and its results other than its value as an expression are asserted, which may include changing values in variables and so on. The right side of the comma operator expression is then evaluated, which may be affected by the side-effects of evaluating the left side, and the value of the comma expression as a whole is the value of the right side. This creates a sub-program within an expression, and is sometimes used for complex behavior where an expression is expected, such as in the loop control specifiers of a for-loop. The comma expression


	j = 2, j * 4

will evaluate to 8, with the type j has. The term "side-effects" in the above sense is usually applied to functions, and means changes to things besides their return value.

The conditional operator is an "if" construct within an expression. The format is

					(not real C code)

	expressionC ? expressionT : expressionF

The example represents one expression. Its value is the value of expressionT if expressionC has a value other than zero (true), and its value is the value of expressionF if expressionC is 0 (false). The side-effects of expressionC and the other expression evaluated are asserted. In other words, expressionC is the conditional, ? is the true/false test, expressionT is the part to be performed IF TRUE, and expressionF is the part to be performed IF FALSE.

When a function is called, it is entered and its sequence of statements and its flow controls are performed until such time as it returns to the caller. Functions nest arbitrarily deep. That is, a function may call a function which calls a function which calls a function etc. etc. A function may call itself; this is called recursion.

Within a single function, there are a variety of flow control constructs available in C to implement conditional execution of code sections and loops of various kinds. The rudimentary, un-structured flow modifier is a goto combined with a label. A label is specified like


	labelname:
and represents the statement following it. C code in the form of

						(not real C code)
	toploop(){ 
	statement a; 
target: statement b; 
	statement c; 
	goto target;		/* statement d */
	statement e;
	}
will do statements a, b, c and d, and then loop endlessly over statements b, c and d. Without other provision for changing the flow, statement e will never be executed and toploop() will never return to its caller. Endless loops are useful in some situations, such as the top user-interface loop of an interactive program. Statement e in this example is what is called "dead code". The C compiler might optimize it away in the final object code, if optimization is being used.

A goto can go to a label anywhere in the same function. In particular, it can cross the block boundaries of the other flow control constructs I'm about to describe. This has issues. See programming for comments on goto.

STRUCTURED PROGRAMMING
sections and jumps
Several statements enclosed in curly-braces { } are called a block. A block is syntactically equivalent to a single statement; a block or a single statement are interchangeable in the syntax of most flow control constructs. Blocks, like function definition bodies, do not have a trailing semicolon. The difference between blocks and the braces in a function definition is the braces are not optional for a function, but may be for other constructs.

The break statement will exit several types of flow-control blocks, and is necessary for normal use of the switch/case construct. The continue statement is used in loop constructs to end the current loop iteration without leaving the loop, i.e. start the next iteration of the loop immediately. The return statement leaves a function, and can pass a value to the calling function it is returning to. Falling through to the end } of a function returns to the caller like a return with no value, so the caller must not rely on the result; main() is a special case under the 1999 C standard, where falling off the end is equivalent to return 0; . Note that a return statement can be inside a flow-control construct like a while loop.

decisions

Conditional execution of statements may be caused by an if construct. The general format is

						(not real C code)
	if ( expression )
		statement or block
	else if ( expression )
		statement or block
	else if ( expression )
		statement or block
	else
		statement or block
The expressions are evaluated until one evaluates true (non-zero), or until the else is encountered, and the following block/statement is executed. Then flow resumes after the else part, outside the conditional. Each section but the "if" section is optional. That is, the simplest case is

						(not real C code)
		if ( expression )
			statement
This is an example of the generality of { } blocks. The statement following the if clause can be a single ;-terminated statement, or a braces-enclosed block of statements.

The usual case construct is called switch in C.


						(not real C code)
	switch ( expression ) {
		case constant :	statements 
		case constant :  statements
		case constant :	statements .  .  .  
		default : statements
	} 
The expression is evaluated, and the case with the matching value for its constant is jumped to. If no case matches then default: is jumped to. The constants in the above may be constant expressions. They must each evaluate to a unique integer within the set. This is in effect a multi-target goto with numbered labels, where the expression determines which case is the goto target label. Flow does not automatically exit the construct after a case is executed. That means break statements must be used to end atomic cases, or flow will fall through into the following cases. The order of the cases is not rigid, and how you order the cases may affect which case is tested for first, which may affect performance. That is, in the ones I've compiled anyway, a switch construct becomes several discrete tests and branches, and you may want the most frequent cases first.

tested loops
The while-loop construct tests its condition at the beginning of each iteration of the loop, and escapes the loop as soon as the expression evaluates false (zero). The do-while loop construct tests the condition after each iteration of the loop. do-while is often used when a loop is intended to always iterate at least once.


						(not real C code)
	/* while loop */

	while ( expression )
		statement or block


	/* do-while loop */

	do
		statement or block
	while ( expression ) ;

counted loops
A counted loop can be constructed from while or do-while. In fact, any flow control construct can be created with if and gotos/labels, but not having to do that is one reason third-generation languages were developed. C provides the for loop, which is very general, but is intended as a convenience for counted loops. Its format is

						(not real C code)

	for ( init_expr ; test_expr ; incr_expr )
			block/statement

Here's an example program using a for-loop...

	/* for.c    for-loop demo  */

	#include <stdio.h>	/* declare printf from libc   */

	int i;			/* we need a loop increment var. */

	int main(void){		/* program takes no arguments */

	for (i = 0 ; i < 30 ; i = i + 1)	/* for 0 thru 29,
							count by 1's  */
		{ printf ("%d	   ", i );	/* print the count as a
						decimal number, with
						some trailing blanks */
		}

	printf("\n");		/* print a newline when done looping */

	}			/* end program, use default return */
That's fairly plain-vanilla C. It differs from most code in that the comments are a bit verbose, (and crunched a bit for html,) since normally one could assume the reader knows C. Also, "i = i + 1" is usually expressed with the C increment operator ++, e.g. i++ . The curly-braces around the single-statement for-loop body are unnecessary, but typical for clarity. The indentation style is what I use. The documentation for all the libc calls including printf is not in cLIeNUX Core, but it is in a package. printf is almost a language unto itself, with lots of formatting and conversion options. Paste the above into a text file named for.c, compile it with gcc for.c, run the resulting a.out, change it, make it do something clever.

DATA TYPES

C is called a "typed language". When your code does something like + in C, the compiler figures out what kind of things you are adding, and then creates the appropriate assembly code. This is also called "operator overloading", because operators like +, - , %, /, << and so on have several possible meanings depending on the data types of the entities they are currently being invoked on. Really C is a "typed-data language". If you don't have typed data, then you usually wind up with typed operators, i.e. various operators for various datatypes.

The basic types in C are void, char, int, float, and double. A void object has no size, and is sometimes useful with "pointers", addresses of other objects. In other words, void isn't nothing, it's an address of something of un-specified size or type. char is, in practical terms, a byte. int is usually the natural word size of the machine; on 32-bit Linux x86 that is four bytes, the same size as a machine address. (On 64-bit systems an int is typically still four bytes, while an address is eight.) float is a floating-point number, and double is a larger, more precise one.

Possible modifiers of the above types include signed and unsigned (for chars and ints), and short and long (for ints; long can also be applied to double). The const qualifier states that the object is constant, and the volatile qualifier says that the object's value may be changed by something other than the program. volatile may be a necessary qualifier when an object represents an input/output port of some kind, for example.

Data types are part real and part abstraction. An actual storage location for a datum, an lvalue, has a certain size. That's a very real constraint. If your C code tells the compiler a variable is an int it allocates 4 bytes for it (on x86). If you declare it unsigned, then that's an abstraction, and affects how the data is handled, but it's still 4 bytes. Sizes of things are a matter of physical reality, but typing information more specific than that is entirely a service of, and internal to, the compiler.

Situations often come up where you want to add two integral types of different sizes. C will do a lot of different type conversions if situations arise where it seems OK to do so. Usually what is allowed is a "promotion", from char to int for example. If you add a char and an int, the value produced will be carried around in the compiler's idea of things as type int, which is the conversion, a promotion, that results in no loss of information. That is, an int holds all the bits of a char without losing any.

type casting

Type conversion, causing C to handle something as some particular type, can also be caused deliberately by the programmer with the C cast operator. In ANSI C the result of a cast is a value, not an lvalue, so you can't cast a memory-allocated object to some other type and then assign to it, since an object has some predetermined amount of actual memory storage. gcc of this vintage however does allow casting lvalues if it's physically possible to do so, i.e. for types of the same size, such as ints and pointers (usually); gcc 4.0 and later dropped that extension. The syntax of the cast operator is

	(type) expression
That means that declared types of things are their defaults, but you can do just about anything to them, as may be desirable. There are a lot of possible conversions though, and what happens when you cast e.g. a float to type unsigned int is something you had better check in your particular C implementation if you need such strange behavior.

pointers

An object whose purpose is to hold the memory address of other objects is called a "pointer". Most useful programs involve pointers in one way or another. C provides the unary & operator, and a unary expression like

		( & my_variable )
evaluates to, or returns, or is seen by the compiler as, the address and type of my_variable. Let's say you have an int variable called fake_pointer. If you do

	int fake_pointer;

	fake_pointer = (int) & my_variable;
that statement will result in the contents of fake_pointer being the address of my_variable, so you've created a pointer. (This works only where an int is as wide as an address, as on 32-bit x86.) You've lost some information though, or rather the compiler has lost some information. When you stored & my_variable in an int, (which you'll get compiler warnings about if you don't do the cast to type int,) the compiler lost track of the datatype of my_variable. All you stored was my_variable's address. You can keep track of types yourself and handle the necessary conversions with casts, or you can declare variables to be pointers to objects of some type.

Given an address in a variable, you need some means to obtain the object that address points to. This is called "dereferencing". The name of a similar operation in the parlance of the Forth programming language is "fetch", which I think is rather intuitive. A thing that points at another thing is also known as a "degree of indirection". The C fetch or dereference operator is unary *. That is, * not in an arithmetic expression, but rather preceding the name of a pointer. I suspect that perhaps one of the really confusing things in C is that & and * are not exactly symmetrical. This is because of data typing. You can't directly fetch something with an int, because that doesn't get you a datatype for the pointed-to object, which in most cases is useless, so C doesn't allow it. You can fetch something with an int though, with a cast. What you are doing with the cast is providing information C needs to keep track of types. Because of operator overloading, because e.g. + is various operations for various types, types have to be kept track of by you, with casts, or by C, based on declarations.

 

	int fake_pointer, my_variable, other;

	my_variable = 77 ;

	fake_pointer = (int) & my_variable;

	other = *(int *)fake_pointer;
"other" now contains the same value as my_variable, 77, but was passed that value using just addresses. It also happens to have the same type, int. Doing it that way means that you, the programmer, kept track of the datatype. Sometimes you may want to do that, usually you don't. I usually do, but I'm weird.

Casts can be pretty arbitrary, especially in gcc, and arbitrarily complex. Casts bind right-to-left, so the *(int *) is a cast to pointer to int, specified by the (int *), followed in time sequence by a fetch or dereference, specified by the *. That's the minimum you have to do to an object declared int to dereference it as a pointer.

More typically, and more conveniently, but maybe not as clearly, you can declare variables specifically for pointers as type "pointer to [type]". A declaration of a variable to hold the address of another object of type float, for example, would be


	float * my_float;

 
That creates an object the size of a machine address that is considered to be the address of an object of type float. The compiler then handles my_float and the object it points to various ways depending on context.

	/* pointer_demo.c, pointer values and pointer net values */

	#include <stdio.h>	/* declare printf from libc */

	int main(void){ 
	int a, b ;		/* declare a couple ints */
	int * p, * q;		/* declare a couple pointers to ints */
	a = 777;		/* give our int a value  */ 
	p = &a;			/* set what p is pointing at  */ 
	q = p ;		 	/* copy a pointer to a pointer.
					The address is copied. 
					*/
	printf("%d\n", * q );	/* print the object/net value the copy of
					the pointer points at  */

	b = 4;			/* initialize our other int */ 
	*q = b ;		/* change the value of what q is pointing at */ 
	printf("%d\n", *q );   /* print the contents of what q points at */ 
	printf("%d\n",a );	 /* print what we set to 777, and then reset
					to 4 indirectly via a pointer  */
	}
A pointer to void is an address, but the pointed to object has no size. Pointers to void are used at times to handle pointers that you want to point at different types of objects at different times in the program.

derived types

Compound objects or types with direct support in C are strings, arrays, structs, enums, and unions. I guess in a sense a pointer is a compound object also. Compound types are in a sense clusters of pointers of various types that C handles internally to the particular type. C/unix handles strings as pointers to type char. The fact that strings vary in length is handled by terminating strings with a zero byte. The zero-terminator convention is the only thing that makes a C string any different than a regular pointer to char.

	char *stringy_thingy;

	stringy_thingy = "My string\t\t\t for illustration\n" ;

 
That declares a pointer to type char, and then initializes it to the string literal shown. The actual value stringy_thingy will then contain is the address of the "M" in "My". \t is the C string literal escape to include an ASCII tab byte in a string, and \n represents a newline. This convention is reflected in most unix programming languages. When you declare a string literal C puts a zero on the end of it, and then the various library routines and so on can traverse the string from its pointer address up to the zero. This is the case with stringy_thingy, and with the printf format control construct %s. For example,

/*  string.c  demo of null-terminated string   */

	#include <stdio.h>	/* declare printf from libc */

	char * string = "blah blah woof woof   " ;

	int main(void){ printf("%s %s %s \n\n\n", string, string, string );
	}
A struct is an arbitrary grouping of data. The struct mechanism allows an arbitrary data grouping to be replicated. Structs also implement a hierarchical naming scheme for data somewhat like the pathnames of a filesystem. A union is an object that can be accessed as more than one type.

/* data_toy.c	   play with a struct within a union  */

	#include <stdio.h>

	union convrt {
		char b[20]; 
		struct clump {
			int	header ; 
			int	 body[4]; } 
		clmp ;
		} convert ;

	int main(void)
	{ int i ; 
		convert.clmp.header = 5555; 
		for (i = 0 ; i < 4 ; i++)
			convert.clmp.body[i] = i * 4444 ;
		for (i=0; i < 20 ; i++)
			printf("%d   ", convert.b[i] );
	printf("\n"); 

	}
I didn't comment this one, because I'm going to unravel it here. The mess above main() is a union definition. The union type is tagged convrt, its one instance is named convert, and it is a union of a struct and an array of chars. The definition of the struct is contained right in the definition of the union. Recall that runtime code can't be outside a function. The things within the struct and union definitions that look like statements, i.e. that are ended by a semicolon, are the individual components of the compound structure definition. Unions and structs have the format

				(not real C code, [] encloses options)
	struct classname {
		type	fieldname; 
		[ type  fieldname; 
		.  
		.  
		. ] }
		[ instance_name, ... ] ; 
	
The format for a union would say "union" where the above says struct. If there are no instance names then it's a declaration. If there are instance names, instances of that type are defined and allotted storage. For a struct, the system makes a data format to keep the fields organized. For a union, the fields actually overlap in memory. In other words, a union provides a physical location that you can manipulate as various types. Unions, like casts, are an escape-clause of sorts for C typing.

Looking back to data_toy.c, our array of chars overlaps our struct. That means we can fill the struct as the data types it's defined as, and look at it as consecutive chars, i.e. bytes. (Plain char is signed with gcc on x86, so bytes above 127 print as negative numbers; declare b as unsigned char to see 0 through 255.) This is a form of low-level data conversion. By running data_toy.c you can see how C types are actually stored as bytes.

What else did I introduce in data_toy.c? Too much, actually. Well, the struct/union equivalent of the / in the unix file namespace is a period. Fields in a struct are a compound name of the form struct.field, similar to dir/dir/file in a unix filesystem. Also, this time I gave the sizes of the arrays at declare-time. A union will be the size of its largest field, but the declaration has to know what size that is.

An enum is a sequence of names for sequential constant ints. I don't find enums very useful.

cpp THE PREPROCESSOR

The C preprocessor is used extensively in most large C programs. It responds to certain character sequences in its input by performing various modifications of its input, such as removing /* */ comments. It is something like a programming language, but quite unlike C. It has variables and so on, but in its own context. All cc1 sees of the cpp variables and so on is their effects on the C source sent to cc1. cpp is what is called a macro processor. Macros are text representing other text. cpp thinks in lines, unlike C; in cpp newlines aren't the same as other whitespace.

We can divide the directives to cpp into two classes by what text causes them; special and named. Comment removal and line joining are caused by specific simple character sequences in the input. You know about /* and */ around comments. Line joining is caused when a line ends with a \ and an immediately following newline. \ allows you to make multi-line cpp directives.

Named cpp directives all require a # prefix. #include is one. The prefix must be the first non-whitespace character on the line. There may be whitespace between the # and the name. Here are six legit named cpp directives...


	#include <stdlib.h>
		#     include "app_local.h"
	#define PI 3.1415926 
	#if 0
	   #   define  PI 3
	#endif 
If the above were in a C source file, when processed by cpp, the files ./app_local.h and the stdlib.h in the standard system header files would be inserted into the file, the macro PI would be converted to the string 3.1415926 anywhere it occurred in the file, and the conditional action of the next 3 lines shown wouldn't have any effect on the output, since 0 is false. cpp does almost no math. 3 and 3.1415926 are just text strings to cpp. The one exception is the integer expression after #if, which cpp does evaluate, so the 0 there really is tested as a number. Otherwise cpp is all about text-to-text. PI in the above is a macro, text. The prevailing convention, and a good one, is to use all-caps for cpp macros in .c files.

Macros can have a syntax of their own, and can take arguments. A macro #define has to use the aforementioned line-joining mechanism to be longer than one physical line. For a macro to take arguments, it has a syntax roughly like a C function definition, and the ( opening the argument list must follow the macro name immediately. The following is legit cpp. cpp also does some checking of whether it's legit C or not. Run cpp on this.


	/* twiddle.c  cpp demo, real but rather bogus C.   */

	int first, second, third;

	# define TWIDDLE(A, B, C) C = B + A + C ; \
			"continued line" ;

	TWIDDLE(first, second, third)

Note that the \ logical line continuation operator only works as such if the very next character in the file is a newline. If there is other whitespace between the \ and the newline it doesn't work. This is one of my pet peeves with unix tools; some things that are invisible have important meanings. "make" and sh-style shells have this mis-feature also.

exercises

Healthy stuff. Even if you are reading this for non-programming reasons, such as wanting to be better at importing apps to cLIeNUX, you should write some code. Pick one of cLIeNUX's small scripted commands, like add, write a C version, and add a feature or two.

bailing out

OK, I've spent too much time on this seedoc. Keep in mind that most things you might have questions about can be tried and proven, which is the best docs. If it's just one feature you are having trouble with a small test/experiment program can be written in a minute or two. You also have the world of open source unix for examples. Start with small stuff. Linux kernel code, as a counter-example, is a huge wad of distant cross-references, and uses a lot of GNU assembly linking tricks. Not for newbies. Instead, look at small utilities.

This file is about one tenth the size of "The C Programming Language" Second (ANSI) Edition, Kernighan and Ritchie, the standard book on C by the authors of the language, which I have had sitting in front of me for this. Trying to present C in this much space is of course absurd. I do think I've given a bit more bottom-up presentation than that work, and I do think there's enough info here to write small useful programs.

C is a third generation language. Various parties represent other languages as fourth-generation. The actual fourth-generation language is Forth, circa 1971. Do as I say, not as I did, and learn C before Forth. Then learn Forth.

RIGHTS
Copyright 1999 Richard Allen Hohensee
This file is released for redistribution only as part of an entire intact cLIeNUX Core.