I attempt to present what appear to me to be the basic concepts of C; this document is not complete as a reference to operators and so on. The basic operators provided by awk, such as +, -, &, % and so on, are almost exactly the same as C's, except as pertains to data typing. See awk for that, and as a sort of a "practice C". Basic concepts of programming may be omitted in favor of the programming seedoc, which should be studied before this document if you don't know programming at all. If you get lost here, refer back to that seedoc. Some details of this discussion may be specific to GNU C and the GNU development tools, or to cLIeNUX. Some other details are known to be less than perfectly correct in the interest of brevity. C has some extremely confusing aspects, but most aspects are fairly straightforward for a language that produces high-performance results. Once C is learned, the possibilities opened are proportional to the mass of existing C code.
Things for you to input at the terminal, C/cpp keywords, and commands, are usually emphasized in this html like this.
Knowing C, or being aware of C terminology, and the terminology of the assembler, linker and other utilities associated with C, is of great help to users of unix in activities other than programming, and also for programming most other languages popular in unix.
Documentation for unix that tells you how things actually work often assumes a knowledge of C. That's not good, since C isn't the definition of computing, but it is very good in that in many ways unix makes no distinction between user and programmer, which may be difficult at first but is ultimately very empowering.
The C compiler proper, cc1, which is what implements the actual C programming language, is one component in a suite of tools. At least four main components are usually implied by the phrase "written in C": the C pre-processor, the C compiler, the assembler, and the linker. There are other preprocessors and translators of various types for use with C. There are also two prevalent "front-ends" for the entire compilation process. "Written in C" is represented directly by the gcc command. The GNU gcc (or cc) command is a "driver" in the top-down sense of the term, an interface to and manager for the four main programs mentioned above, which in these examples will be the GNU cpp C preprocessor, the GNU cc1 C compiler, the GNU as or gas assembler, and the GNU ld linker.
Large C-based programs consisting of many files are invariably built under the control of the make utility. make, cpp, as and ld can all be used for tasks unrelated to the C compiler or libc, but are designed and heavily defaulted to work with them, and each other. cpp in particular is bothersome to use as a macro-processor for something other than C sourcecode. See m4 for general macro-processing. System-wide subroutine linking libraries in unix observe the C conventions for parameter passing, header (#include) files and so on.
C, and GNU gcc in particular, allows detailed control of what level of abstraction you operate at. gcc can in fact compile several versions of C, "K&R" or "traditional", ANSI, and "GNU", which allows a lot of syntactical constructs ANSI forbids or leaves undefined. Some of my own code depends on the GNU C extensions, since GNU C, like GNU software in general, is amazingly portable and widely used, to the point of being perhaps a de-facto standard for C.
Unfortunately, something that is capable of high abstraction, extreme flexibility and utter specificity takes a lot of explaining. Fortunately, we are presenting C right in its natural habitat, its home court, and everything works pretty much as expected, right at your fingertips. You are urged to investigate things you don't understand in this document by interacting with the C facilities cLIeNUX Core provides.
A typical CPU chip gets its instructions from RAM as simple binary codes for various operations. These codes are called opcodes. The full set of opcodes a particular CPU implements is called its machine language, or instruction set. All programming is a matter of arranging these opcodes, and usually arranging some initial data for them to act upon. At the dawn of the digital computer age these opcodes and data were entered by hand with a row of 2-position switches. Input devices were developed to allow these binary opcodes to be entered in bulk, from paper tape, punchcards and so on. Other devices arose to allow input to the computer as hexadecimal numbers instead of individual bits. ( Lost yet? see programming. ) These forms of controlling a computer utterly directly in its native machine language and without abstraction are called first generation languages.
The GNU objdump utility can display sections of an object file in hex. So can cLIeNUX binedit. "object" has several meanings, and more than one of them is used in this seedoc. In this case an "object file" is a file containing program code in runnable form. "object code" is binary data ready to give to a CPU as its program. An object file may or may not also be runnable as a stand-alone command. "object" is not used in this document in the sense of "object oriented programming", which is an abstraction layer built on top of what is presented here. Let's look at the object code for the basename command. Note that the code examples in this background section are for perspective, not for detailed understanding.
:; cLIeNUX0 /dev/tty10 r 05:32:57 /subroutine/static
:;objdump -s -j .text /.bi/basename |page

/.bi/basename:     file format elf32-i386

Contents of section .text:
 8048d70 5989e389 e083e4f8 89ca01d2 01d201d0  Y...............
 8048d80 83c00431 ed555555 89e55053 51b88800  ...1.UUU..PSQ...
 8048d90 0000bb00 000000cd 808b4424 08a3c8ce  ..........D$....
 8048da0 04080fb7 05dcd004 0850e859 ffffff83  .........P.Y....
 8048db0 c404e851 feffff68 70ba0408 e8f7feff  ...Q...hp.......
 8048dc0 ff83c404 e837fdff ffe80a02 000050e8  .....7........P.
 8048dd0 24ffffff 5b8d7426 008db426 00000000  $...[.t&...&....
 8048de0 b8010000 00cd80eb f78db426 00000000  ...........&....
 8048df0 5589e553 bbfcce04 08833dfc ce040800  U..S......=.....
 8048e00 740e89f6 8b03ffd0 83c30483 3b0075f4  t...........;.u.

Pretty cryptic, isn't it? The above format is affectionately known as a hex dump. The left column is the process virtual address of the first byte of the line, then there are 16 bytes of memory shown in hexadecimal, and in the right column bytes that have printable ASCII representations are shown as such. Bytes that aren't printable in ASCII are accounted for with periods. That's actually more user-friendly than what the CPU sees. That's the first 160 bytes of the .text section of the file /command/basename. ".text" is the ELF section of an executable file that contains the actual code of the program, but to be honest, there is sometimes non-code stuff in a .text section, and from this display I can't tell for sure if the above is actually machine code or some other data. Trying to talk to computers in a form like the above quickly gave rise to the second generation of programming languages, known as assembly languages.
Assembly languages are simple translations of opcodes to short text names for the opcodes, called mnemonics, and some other rudimentary conveniences. The C compiler itself produces assembly language to be processed into binary object code by an assembler. Machine language monitors also exist to convert assembly language to binary object code directly in RAM interactively, and for other on-the-metal activities. "File-to-file" assemblers are more prevalent on multi-user operating systems than machine language monitors, which give total interactive control over the machine to an extent that is unsuitable for a running multi-user system, since frequently crashing the machine is normal when using a machine language monitor. The GNU debugger may have much of the functionality of a machine language monitor, I don't know. I manage to stay out of gdb most of the time. Here's an example of typing some C code at the cc1 C compiler directly. Again, this is just for perspective; don't be too worried about details just yet.
:; cLIeNUX0 /dev/tty10 r 06:39:15 /subroutine/static
:;cc1
blah(){
        .file   "stdin"
        .version        "01.01"
gcc2_compiled.:
 blah
int zay;
zay = zay + 3;
}
.text
        .align 16
.globl blah
        .type    blah,@function
blah:
        pushl %ebp
        movl %esp,%ebp
        subl $4,%esp
        addl $3,-4(%ebp)
.L1:
        movl %ebp,%esp
        popl %ebp
        ret
.Lfe1:
        .size    blah,.Lfe1-blah
(at this point I did ctrl&c to exit)

OK, not very clear. The C code I input was
blah(){
int zay;
zay = zay + 3 ;
}

The compiler produces valid assembly language for the GNU gas assembler. On this box the assembly is for the x86 family of chips. The output contains linker directives, interspersed with the following x86 assembly language instructions...
        pushl %ebp
        movl %esp,%ebp
        subl $4,%esp
        addl $3,-4(%ebp)
        movl %ebp,%esp
        popl %ebp
        ret

Admittedly, this probably doesn't make much more sense than a hex dump, but if you know x86 assembly it does. I don't know x86 assembly well myself, except for things like "ret" means "return from subroutine". Even if one knows neither C nor assembly, the above illustrates some things. The difference in size between the C code and the assembly language code vaguely supports the statement that C is at the low end of high-level languages. Also, one can clearly see that to get from C to the CPU one does in fact have to pass through the lower-level stages. The lower level stages aren't gone, and they aren't obsoleted, they are just subordinated, and usually occur in the background. Also, if you want to know what your C code actually results in, you can see that you have the tools to find out.
There's some other things that aren't actually self-evident from the above, but that one can imagine when looking at that example. If we did the same thing on a machine set up to assemble object code for some other CPU besides the x86, such as the PowerPC, the same C code input would produce different assembly language output. That is portability, and is the most important characteristic of third-generation languages.
There are a couple more things I want to point out about the assembly code that will explain a lot later. Our blah() in the C code is what is called a function in C. It, and its contents, which are what is between the curly-braces, are a named coherent functional unit of code. "blah()" in the C input became
.globl blah
        .type    blah,@function
blah:

in the assembly language output. The ".globl blah" says that the following blah: label is a "global symbol". This is very important to the linker. This is how the entry points of routines are found in object files and libraries during linking. You'll hear about symbols all the time when dealing with linking, which you'll hear about a lot when building large C programs that aren't pre-packaged for your setup.
main(){}

Did that? OK, now do

:;ls -l
-rwxr-xr-x 1 r root 3772 Sep 12 10:46 a.out
:;file a.out
a.out: ELF 32-bit LSB executable, Intel 80386, version 1, dynamically linked, not stripped

We see that main(){} became a 3772 byte dynamically linked executable command. "not stripped" means the symbols used to link it and various other info are still in it. You know what symbols are, right? Good. Even stripped, it would be much bigger than whatever few opcodes are required to express a main() that does nothing, because it has an ELF format data structure in it so the system can handle it properly. In the case of a dynamically linked executable, that includes a call to the dynamic linker, ld-linux.so.
What all did gcc do? null.c doesn't include any preprocessor
directives, but gcc can't know that ahead of time, so GNU
cpp was run on it anyway. The output of cpp, which in
this case was the same as its input, our program text, was then passed to
cc1, the actual GNU C compiler executable, and the C code was
compiled into x86 assembly language. The assembly language code was passed
to as, the GNU assembler, which assembled the assembly language
text representation of the code into binary object code. as also
does just enough adding of header information so that the object file it
produces can be handled by the linker. ld took the object file
produced by as, and linked it to a file gcc keeps handy
named crt1.o that actually puts wrapper code around main(). The
wrapper code precompiled in crt1.o endowed our program with unix
commandline and environment variable processing, and some libc
initialization.
ld also linked our program to libc.so.(version#), so that if we
had used any functions from the standard C library they would be available
at runtime. You can get an account of all this processing with the
-v and --save-temps switches to gcc, ala
:;gcc -v --save-temps null.c
Our current a.out is a full-fledged unix executable. The OS gives it the full suite of facilities for a process when you run it, so you can time it, you can do pointless things with redirection operators, and you can give it commandline arguments and environment variables to ignore, but all that stuff is the OS at work. We have actually failed to produce a program that does absolutely nothing, however. The wrapper code that calls main(), that gcc linked in from crt1.o, uses the _exit() system call to end the process. That's the only way to end a process, and it always returns a byte to the calling process. In our case it always returns zero, but that's doing something. That is about the least a unix command can do, though.
One more aspect of doing nothing that bears mention is proper commenting. Comments should state *what* a routine does. The code is *how* it does it. Sometimes stating what things don't do is important also. For future reference to null.c, you may want to edit it something like this...
/* null.c       minimum C program. Returns 0, because
   returning void from a Linux command isn't possible.  */

main(){}
C can be thought of as a core language, and a standard library. That is the overall format of the current ANSI C standard, that's how it's implemented as cpp/cc1 and libc, and it is rather analogous to the CPU and the peripherals of a computer. The core language defines the C virtual CPU, and the library routines provide things like files, sockets, high-level math functions like trigonometric functions, and so on. We'll emphasize the core language, but we'll need some other functionality to use it, just as you need some kind of input/output to a CPU to control it. We will continue to use the main() interface to provide us with a testable command, and we will use the commandline argument handling provided with it to give our program something to work on that isn't necessarily the same every time the program runs. We'll use the return value of main() for output, even though it gets truncated to just a byte, and is really intended to provide just a success/fail flag. Type the following into a file named plus5.c.
/* plus5.c      add 5 to a commandline argument */
int main(int argc, char * argv[]){
        return atoi(argv[1]) + 5;
}

This probably looks real bad to a non-C programmer, but it's much better than a hex dump. Let's pick it apart. The first line,
/* plus5.c      add 5 to a commandline argument */

is a comment. The C preprocessor comes first and replaces everything from the /* to the */ inclusive with a single blank. Right after the first line is another example of my sloppy coding style. There should be a line saying
#include <stdlib.h>

and there isn't. We use the atoi() libc call later in the program, and we should #include the header file that declares it. The program builds anyway because older C dialects let you call an undeclared function on the assumption that it returns int, which atoi() happens to do, and because gcc links libc by default. For a function that doesn't return int, or with a stricter compiler, this kind of sloppiness will not work. #include is a facility of the preprocessor. It's kind of the opposite of a comment, in that comments get removed and #includes get inserted. The less-than/greater-than signs around stdlib.h are shorthand for "in the standard header files directory", which in cLIeNUX is /source/C/include by default and is /usr/include on most other unices. Next comes
int main(int argc, char * argv[]){

Wow. A lot of the trickiness of C, and the main() interface, comes to a head right here. Well, in the interest of brevity, I'm going to ignore it. For the purpose of introduction, just note that this, exactly, is the magic incantation you use to use the commandline facilities associated with main(), and that argv is given to us as an array of pointers to chars.
The { and } in our program define the limits of the body of main(). main() contains one statement. Statements are terminated with a semicolon. Our statement,
return atoi(argv[1]) + 5;

says to return from main(), i.e. quit, and to provide as the return value the first commandline argument after the program name, converted from an ASCII string to an int, and added to 5. C evaluates an expression like that from the innermost parenthesized group outward. First it gets argv[1], which is the first commandline argument. argv[0] is the program name, which we don't happen to use within our program. atoi() converts the string to an int, then 5 is added to that, then return has everything it needs and it does its thing, and the program is finished.
This is a very sloppy program. If it doesn't get an argument at all after the program name, it segfaults. If it gets an argument it can't convert to an int, it thinks it's 0 and it returns 5. If the argument + 5 is more than 255, it gets truncated to a byte by _exit. But, when you write your own code, you determine whether or not the program is appropriate for the task at hand.
(not real C code)
variable declaration/definition
. . .
function declaration/definition
. . .
main definition

That's for an executable. Code for routines to be linked with something else won't have main(). "main", by the way, isn't formally a C reserved keyword, but its use is ubiquitous. Variables can also be declared inside functions. If they are declared outside any function they are visible to any function in the file. "visible to" means "usable by". If they are declared within a function, they are only visible within that function. A function called from there cannot see them unless it is handed their value or address as an argument; GNU C nested functions, which are defined inside the enclosing function, are the exception. These issues are called scoping. A variable that is only visible within a function is "local" to that function. There are also "storage class" qualifiers for variables, but I'm not going to address that.
#includes of necessary header files are usually at the top, but the include mechanism works at any point in a file, except within comments, since cpp does comment removal first. The preprocessor can also cause sections of code to be included or omitted depending on variables in cpp's own variables namespace. For any program of any size, various mechanisms will be used to maintain the program in various parts, but at build time it will all resolve to the above general outline by the time cpp hands it to cc1. Routines linked from precompiled libraries are semantically like variable declarations and non-main() function declarations. The structure of a variable declaration is
(not real C code)
type [qualifier] name[= initializer] [, name, name...];

where [ ] encloses optional material. What you are doing when you declare a variable is initializing storage, which happens before the program starts, so an initializer must be something that can be determined at that point in time, such as a constant number, string, or expression that can be resolved unambiguously. There's a concise definition of "expression" in C, but I don't know exactly what it is offhand. Let's say for now that it's a description of a simple computation to be performed that produces one value. According to the ANSI lingo, if a variable declaration has an initializer it's a "definition".
All actual runtime code must be within functions. Declarations and definitions don't have to be within functions because a program can have initialized data in its memory image. The structure of a function definition, including main, is this...
(not real C code)
return_type name ( [argument...] )
{
        [statement ;
        statement ;
        statement;
        ....]
}

Line spacing and indentation are just shown for illustration. Proper formatting is important for readability and maintainability, but sequences of blanks, tabs and newlines are all the same as a single blank to C. Statements may be variable declarations, assignment statements or calls of existing functions. All the statement lines are optional. A function that does nothing is sometimes useful as a stub for future code. All data C handles must have a data type, so a function definition must declare the type of data it produces, that is, the data it returns to its calling function. GNU C also allows function definitions nested within a function. An assignment statement has the form
(not real C code)
ob_expression = expression ;

ob_expression means some code that can be resolved to an object, i.e. an entity that can store a value. This is a new meaning of the term "object" within this seedoc. The simplest example of an object in this sense is a variable. Objects are also called lvalues. In other words, the left side of an assignment must represent a storage location of some kind in memory. The = sign is the basic assignment operator. It means when this statement has been performed the object on the left (the lvalue) will contain the results of having performed the expression on the right.
As pertains to labels and structured flow controls such as for-loops and while-loops, a statement is a unit of program flow. There are, however, two constructs to control and alter program flow within a statement.
A comma is an operator in some contexts. The comma operator creates a compound expression. For each comma operator, the left side is evaluated, and its results other than its value as an expression are asserted, which may include changing values in variables and so on. The right side of the comma operator expression is then evaluated, which may be affected by the side-effects of evaluating the left side, and the value of the comma expression as a whole is the value of the right side. This creates a sub-program within an expression, and is sometimes used for complex behavior where an expression is expected, such as in the loop control specifiers of a for-loop. The comma expression
j = 2, j * 4

will evaluate to 8, with the type j has. The term "side-effects" in the above sense is usually applied to functions, and means changes to things besides their return value.
The conditional operator is an "if" construct within an expression. The format is
(not real C code)
expressionC ? expressionT : expressionF

The example represents one expression. Its value is the value of expressionT if expressionC has a value other than zero (zero being false), and its value is the value of expressionF if expressionC is 0. The side-effects of expressionC and the other expression evaluated are asserted. In other words, expressionC is the conditional, ? is the true/false test, expressionT is the part to be performed IF TRUE, and expressionF is the part to be performed IF FALSE.
When a function is called, it is entered and its sequence of statements and its flow controls are performed until such time as it returns to the caller. Functions nest arbitrarily deep. That is, a function may call a function which calls a function which calls a function etc. etc. A function may call itself; this is called recursion.
Within a single function, there are a variety of flow control constructs available in C to implement conditional execution of code sections and loops of various kinds. The rudimentary, un-structured flow modifier is a goto combined with a label. A label is specified like
labelname:

and represents the statement following it. C code in the form of
(not real C code)
toploop(){
        statement a;
target:
        statement b;
        statement c;
        goto target;    /* statement d */
        statement e;
}

will do statements a, b, c and d, and then loop endlessly over statements b, c and d. Without other provision for changing the flow, statement e will never be executed and toploop() will never return to its caller. Endless loops are useful in some situations, such as the top user-interface loop of an interactive program. Statement e in this example is what is called "dead code". The C compiler might optimize it away in the final object code, if optimization is being used.
A goto can go to a label anywhere in the same function. In particular, it can cross the block boundaries of the other flow control constructs I'm about to describe. This has issues. See programming for comments on goto.
STRUCTURED PROGRAMMING
sections and jumps
Several statements enclosed in curly-braces { } are called a
block. A block
is syntactically equivalent to a single statement; a block or a single
statement are interchangeable in the syntax of most flow control
constructs. Blocks, like function definition bodies, do not have a
trailing semicolon. The difference between blocks and the braces in
a function definition is the braces are not optional for a function,
but may be for other constructs.
The break statement will exit several types of flow-control blocks, and is necessary for normal use of the switch/case construct. The continue statement is used in loop constructs to end the current loop iteration without leaving the loop, i.e. start the next iteration of the loop immediately. The return statement leaves a function, and can pass a value to the calling function it is returning to. Falling through to the end } in a function is equivalent to a plain return; with no value. Only main(), under C99 and later, is guaranteed to return 0 in that case. Note that a return statement can be inside a flow-control construct like a while loop.
(not real C code)
if ( expression )
        statement or block
else if ( expression )
        statement or block
else if ( expression )
        statement or block
else
        statement or block

The expressions are evaluated until one evaluates true (non-zero), or until the else is encountered, and the following block/statement is executed. Then flow resumes after the else part, outside the conditional. Each section but the "if" section is optional. That is, the simplest case is
(not real C code)
if ( expression )
        statement

This is an example of the generality of { } blocks. The statement following the if clause can be a single ;-terminated statement, or a braces-enclosed block of statements.
The usual case construct is called switch in C.
(not real C code)
switch ( expression )
{
        case constant :
                statements
        case constant :
                statements
        case constant :
                statements
        . . .
        default :
                statements
}

The expression is evaluated, and the case with the matching value for its constant is jumped to. If no case matches then default: is jumped to. The constants in the above may be constant expressions. They must each evaluate to a unique integer within the set. This is in effect a multi-target goto with numbered labels, where the expression determines which case is the goto target label. Flow does not automatically exit the construct after a case is executed. That means break statements must be used to end atomic cases, or flow will fall through into the following cases. The order of the cases is not rigid, and how you order the cases may affect which case is tested for first, which may affect performance. That is, in the ones I've compiled anyway, a switch construct becomes several discrete tests and branches, and you may want the most frequent cases first.
tested loops
The while-loop construct tests an escape condition at the beginning of
each iteration of the loop. The do-while loop construct tests the escape
condition after each iteration of the loop. do-while is often used when
a loop is intended to always iterate at least once.
(not real C code)
/* while loop */
while ( expression )
        statement or block

/* do-while loop */
do
        statement or block
while ( expression ) ;

counted loops
(not real C code)
for ( init_expr ; test_expr ; incr_expr )
        block/statement

Here's an example program using a for-loop...
/* for.c                for-loop demo   */
#include <stdio.h>      /* declare printf from libc     */

int i;                  /* we need a loop increment var.        */

int main(void){         /* program takes no arguments   */

        for (i = 0 ; i < 30 ; i = i + 1)        /* for 0 thru 29, count by 1's */
        {
                printf ("%d ", i );     /* print the count as a decimal number, with a trailing blank */
        }
        printf("\n");   /* print a newline when done looping */
}                       /* end program, use default return */

That's fairly plain-vanilla C. It differs from most code in that the comments are a bit verbose, (and crunched a bit for html,) since normally one could assume the reader knows C. Also, "i = i + 1" is usually expressed with the C increment operator ++, e.g. i++ . The curly-braces around the single-statement for-loop body are unnecessary, but typical for clarity. The indentation style is what I use. The documentation for all the libc calls including printf is not in cLIeNUX Core, but it is in a package. printf is almost a language unto itself, with lots of formatting and conversion options. Paste the above into a text file named for.c, gcc for.c, run it, change it, make it do something clever.
The basic types in C are void, char, int, and float. A void object has no size, and is sometimes useful with "pointers", addresses of other objects. In other words, void isn't nothing, it's an address of something of un-specified size or type. char is, in practical terms, a byte. int is usually the same size as a machine address in a particular implementation, which on Linux x86 is four bytes. float is a floating-point number.
Possible qualifiers of the above types include unsigned, short, long (for ints), double (for floats), and signed (for chars). The const qualifier states that the object is constant, and the volatile qualifier says that the object's value may be changed by something other than the program. volatile may be a necessary qualifier when an object represents an input/output port of some kind, for example.
Data types are part real and part abstraction. An actual storage location for a datum, an lvalue, has a certain size. That's a very real constraint. If your C code tells the compiler a variable is an int it allocates 4 bytes for it (on x86). If you declare it unsigned, then that's an abstraction, and affects how the data is handled, but it's still 4 bytes. Sizes of things are a matter of physical reality, but typing information more specific than that is entirely a service of, and internal to, the compiler.
Situations often come up where you want to add two integral types of different sizes. C will do a lot of different type conversions if situations arise where it seems OK to do so. Usually what is allowed is a "promotion", from char to int for example. If you add a char and an int, the value produced will be carried around in the compiler's idea of things as type int, which is the conversion, a promotion, that results in no loss of information. That is, an int holds all the bits of a char without losing any.
(type) expression

That means that declared types of things are their defaults, but you can do just about anything to them, as may be desirable. There are a lot of possible conversions though, and what happens when you cast e.g. a float to type unsigned int is something you had better check in your particular C implementation if you need such strange behavior.
( & my_variable )

evaluates to, or returns, or is seen by the compiler as, the address and type of my_variable. Let's say you have an int variable called fake_pointer. If you do
int fake_pointer;
fake_pointer = (int) & my_variable;

that statement will result in the contents of fake_pointer being the address of my_variable, so you've created a pointer. You've lost some information though, or rather the compiler has lost some information. When you stored & my_variable in an int, (which you'll get compiler warnings about if you don't do the cast to type int,) the compiler lost track of the datatype of my_variable. All you stored was my_variable's address. You can keep track of types yourself and handle the necessary conversions with casts, or you can declare variables to be pointers to objects of some type.
Given an address in a variable, you need some means to obtain the object that address points to. This is called "dereferencing". The name of a similar operation in the parlance of the Forth programming language is "fetch", which I think is rather intuitive. A thing that points at another thing is also known as a "degree of indirection". The C fetch or dereference operator is unary *. That is, * not in an arithmetic expression, but rather preceding the name of a pointer. I suspect that perhaps one of the really confusing things in C is that & and * are not exactly symmetrical. This is because of data typing. You can't directly fetch something with an int, because that doesn't get you a datatype for the pointed-to object, which in most cases is useless, so C doesn't allow it. You can fetch something with an int though, with a cast. What you are doing with the cast is providing information C needs to keep track of types. Because of operator overloading, because e.g. + is various operations for various types, types have to be kept track of by you, with casts, or by C, based on declarations.
    int fake_pointer, my_variable, other;
    my_variable = 77 ;
    fake_pointer = (int) & my_variable;
    other = *(int *)fake_pointer;

"other" now contains the same value as my_variable, 77, but was passed that value using just addresses. It also happens to have the same type, int. Doing it that way means that you, the programmer, kept track of the datatype. Sometimes you may want to do that, usually you don't. I usually do, but I'm weird.
Casts can be pretty arbitrary, especially in gcc, and arbitrarily complex. Casts bind right-to-left, so the *(int *) is a cast to pointer to int, specified by the (int *), followed in time sequence by a fetch or dereference, specified by the *. That's the minimum you have to do to an object declared int to dereference it as a pointer.
More typically, and more conveniently, but maybe not as clearly, you can declare variables specifically for pointers as type "pointer to [type]". A declaration of a variable to hold the address of another object of type float, for example, would be
    float * my_float;

That creates an object the size of a machine address that is considered to be the address of an object of type float. The compiler then handles my_float and the object it points to various ways depending on context.
    /* pointer_demo.c, pointer values and pointer net values */
    #include <stdio.h>

    int main(void)
    {
        int a, b ;       /* declare a couple ints */
        int * p, * q;    /* declare a couple pointers to ints */

        a = 777;         /* give our int a value */
        p = &a;          /* set what p is pointing at */
        q = p ;          /* copy a pointer to a pointer. The address is copied. */
        printf("%d\n", * q );  /* print the object/net value the copy of the pointer points at */
        b = 4;           /* initialize our other int */
        *q = b ;         /* change the value of what q is pointing at */
        printf("%d\n", *q );   /* print the contents of what q points at */
        printf("%d\n", a );    /* print what we set to 777, and then reset to 4 indirectly via a pointer */
        return 0;
    }

A pointer to void is an address, but the pointed-to object has no size. Pointers to void are used at times to handle pointers that you want to point at different types of objects at different times in the program.
    char *stringy_thingy;
    stringy_thingy = "My string\t\t\t for illustration\n" ;

That declares a pointer to type char, and then sets it to point at the string literal shown. The actual value stringy_thingy will then contain is the address of the "M" in "My". \t is the C string literal escape to include an ASCII tab byte in a string, and \n represents a newline. This convention is reflected in most unix programming languages. When you write a string literal C puts a zero on the end of it, and then the various library routines and so on can traverse the string from its pointer address up to the zero. This is the case with stringy_thingy, and with the printf format control construct %s. For example,

    /* string.c   demo of null-terminated string */
    #include <stdio.h>

    char * string = "blah blah woof woof " ;

    int main(void)
    {
        printf("%s %s %s \n\n\n", string, string, string );
        return 0;
    }

A struct is an arbitrary grouping of data. The struct mechanism allows an arbitrary data grouping to be replicated. Structs also implement a hierarchical naming scheme for data somewhat like the pathnames of a filesystem. A union is an object that can be accessed as more than one type.
    /* data_toy.c  play with a struct within a union */
    #include <stdio.h>

    union convrt {
        char b[20];
        struct clump {
            int header ;
            int body[4];
        } clmp ;
    } convert ;

    int main(void)
    {
        int i ;

        convert.clmp.header = 5555;
        for (i = 0 ; i < 4 ; i++)
            convert.clmp.body[i] = i * 4444 ;
        for (i = 0 ; i < 20 ; i++)
            printf("%d ", convert.b[i] );
        printf("\n");
        return 0;
    }

I didn't comment this one, because I'm going to unravel it here. The mess above main() is a union definition. The union type is tagged convrt, its one instance is named convert, and it is a union of a struct and an array of chars. The definition of the struct is contained right in the definition of the union. Recall that runtime code can't be outside a function. The things within the struct and union definitions that look like statements, i.e. that are ended by a semicolon, are the individual components of the compound structure definition. Unions and structs have the format
    (not real C code, [] encloses options)

    struct classname {
        type fieldname;
        [ type fieldname;
          . . . ]
    } [ instance_name, ... ] ;

The format for a union would say "union" where the above says struct. If there are no instance names then it's a declaration. If there are instance names, instances of that type are defined and allotted storage. For a struct, the system makes a data format to keep the fields organized. For a union, the fields actually overlap in memory. In other words, a union provides a physical location that you can manipulate as various types. Unions, like casts, are an escape-clause of sorts for C typing.
Looking back to data_toy.c, our array of chars overlaps our struct. That means we can fill the struct as the data types it's defined as, and look at it as consecutive chars, that is, as single bytes. (With gcc on x86 a plain char is signed, so byte values of 128 and up print as negative numbers.) This is a form of low-level data conversion. By running data_toy.c you can see how C types are actually stored as bytes.
What else did I introduce in data_toy.c? Too much, actually. Well, the struct/union equivalent of the / in the unix file namespace is a period. A field in a struct has a compound name of the form struct.field, similar to dir/dir/file in a unix filesystem. Also, this time I gave the sizes of the arrays at declare-time. A union will be the size of its largest field, but the declaration has to know what size that is.
An enum is a sequence of names for sequential constant ints. I don't find enums very useful.
We can divide the directives to cpp into two classes by what text causes them: special and named. Comment removal and line joining are caused by specific simple character sequences in the input. You know about /* and */ around comments. Line joining is caused when a line ends with a \ and an immediately following newline. \ allows you to make multi-line cpp directives.
Named cpp directives all require a # prefix. #include is one. The prefix must be the first non-whitespace character on the line. There may be whitespace between the # and the name. Here are six legit named cpp directives...
    #include <stdlib.h>
    # include "app_local.h"
    #define PI 3.1415926
    #if 0
    # define PI 3
    #endif

If the above were in a C source file, when processed by cpp, the files ./app_local.h and the stdlib.h in the standard system header files would be inserted into the file, the macro PI would be converted to the string 3.1415926 anywhere it occurred in the file, and the conditional action of the next 3 lines shown wouldn't have any effect on the output, since 0 is false. cpp doesn't do any math on macro text. 3 and 3.1415926 are just text strings to cpp. The one exception is the expression after #if, which cpp does evaluate, as integer arithmetic, to decide true or false. Everywhere else cpp is all about text-to-text. PI in the above is a macro, text. The prevailing convention, and a good one, is to use all-caps for cpp macros in .c files.
Macros can have a syntax of their own, and can take arguments. A macro #define has to use the aforementioned line-joining mechanism to be longer than one physical line. For a macro to take arguments, it has a syntax roughly like a C function definition, and the ( opening the argument list must follow the macro name immediately. The following is legit cpp. cpp also does some checking of whether it's legit C or not. Run cpp on this.
    /* twiddle.c  cpp demo, real but rather bogus C. */
    int first, second, third;
    # define TWIDDLE(A, B, C) C = B + A + C ; \
    "continued line" ;
    TWIDDLE(first, second, third)

Note that the \ logical line continuation operator only works as such if the very next character in the file is a newline. If there is other whitespace between the \ and the newline it doesn't work. This is one of my pet peeves with unix tools; some things that are invisible have important meanings. "make" and sh-style shells have this mis-feature also.
This file is about one tenth the size of "The C Programming Language" Second (ANSI) Edition, Kernighan and Ritchie, the standard book on C by the authors of the language, which I have had sitting in front of me for this. Trying to present C in this much space is of course absurd. I do think I've given a bit more bottom-up presentation than that work, and I do think there's enough info here to write small useful programs.
C is a third generation language. Various parties represent other languages as fourth-generation. The actual fourth-generation language is Forth, circa 1971. Do as I say, not as I did, and learn C before Forth. Then learn Forth.
RIGHTS
Copyright 1999 Richard Allen Hohensee
This file is released for redistribution only as part of an entire intact
cLIeNUX Core.