Programming Guide

This page describes the coding style you should use when writing your programming assignments. It is based on Linus Torvalds's style guide distributed with the Linux kernel source code (an interesting read, although our support code definitely does not adhere to everything in it). Style will be part of your grade on all projects in this class. We try not to enforce strict guidelines, but without some standards it is inevitable that students will hand in unintelligible code. If this is the first time you are writing large amounts of C code, it is important that you practice good style right from the beginning, before you pick up bad habits.

If you don't understand the reasoning behind something in this guide, or feel that there is some special case in which following these guidelines actually makes your code less readable, feel free to ask the TAs about it. This guide is by no means perfect.

If you have a question about something not addressed by this guide or want to read more about C then we refer you to K&R. There should always be a copy in the back of the Sunlab.

Use of Library Functions

Functions such as printf(), malloc(), and strlen() are examples of external library functions often called in programs. Each CS167 and CS169 assignment has specific rules for which library functions may be used. Do not use library functions that are not specifically allowed for a given assignment. (This will be most important in the Shell assignment, which has a whole section on which calls are allowed.) Once you get to Weenix (VFS for CS167 and kernel for CS169) you should notice that calls to library functions no longer work, because there are no longer any libraries for you to call. Weenix is your entire operating system.

Indentation

Indentation is an extremely important part of good style. We cannot emphasize enough how much consistent indentation will help the readability of your code. If you have programmed in any language before (if you haven't, you should consider a different course) you will probably have some experience with using indentation to make your code more legible. In C there are just a few rules to follow:

  1. Any time you start a new block of code, you indent one level more than the containing block. (A "block" of code starts with a { and ends with the matching }; for example, the body of a function definition is a block of code, and so is the code in an if statement.)
      int main(int num)             /* Outer most code is *NOT* indented */
      {
          num++;                    /* Code in the function's block is indented *ONCE* */
          if (num < 0) {
              return 0;             /* Code in 'if' statement block is indented once more than the 'if' statement itself */
          } else                    /* Curly bracket to end block is back to only *ONE* indentation */
              return -1;            /* This line of code is technically also in a block even though it has no brackets */
      }
  2. Continuations (a single statement that is too long to fit on one line) should be indented. You might want to indent continuations more than you would indent a block of code in order to make them stand out more; this is fine. What is important is that all continuations have the same sized indentation.
      int main(int num)
      {
          printf("This is an extremely long line of code and "
                  "it would be ugly to let it line wrap.");
          /* The previous line is indented twice as much as a normal indent because it is a continuation */
      }
  3. Any code which is in the same block should be indented to the same level. This should be implied, but just to be sure here are some examples of what is wrong.
      int main(int num)
      {
          int magic_number;
     
          /* Example 1: Indentation sizes should be consistent. */
         num++;                     /* This line is indented 3 spaces */
          num++;                    /* This line is indented 4 spaces */
     
          /* Example 2: Closing brackets should have one indentation less than the block they contain */
          if (num > 0) {
              return -1;
              }                     /* This bracket should be indented only once */
                                    /* This is a rather easy mistake to make if your editor does not
                                     * automatically indent for you, but it makes it very hard to match
                                     * brackets and therefore makes your code hard to read */
     
          /* Example 3: Some people might be tempted to do something like this */
          magic_number = magic_function_start(num);
              magic_number = magic_number - num;
              magic_manipulator(magic_number);
          magic_function_end();
          /* The rationale behind this might be that magic_function_start() and magic_function_end() are
           * somehow the beginning and end of a "block" because magic_function_start() starts a process,
           * then you perform some operations and then magic_function_end() ends the process. While this
           * might make sense to the person who writes the program, others who do not know what 
           * magic_function_start() and magic_function_end() do will just be confused by the 
           * extra indentation. */
      }

On a final note, you will notice that we never referred to "indentation" as "tabs" or "X spaces". This is because programmers will argue forever about what sized indentation is "correct", but we don't really care as long as you pick a size and stick to it. We would like to note, however, that most good text editors these days let you specify how large tabs should be. Therefore, if you use tabs to indent your program, people who want indentation to be 8 spaces can set tabs to equal 8 spaces while people who want 2 spaces can set their tabs to be 2 spaces, and both will be happy when they view your code.

Emacs Notes

emacs by default indents by 2 spaces (and uses spaces, not tabs) when you press tab. This can make code very dense and hard to read and it is not immediately obvious how to fix this. You can force emacs to use tabs for all C code indentation by placing the following lines in your .emacs file in your home directory:

(global-set-key (kbd "TAB") 'self-insert-command)
(setq tab-width 4)
(setq my-build-tab-stop-list tab-width)
(setq default-tab-width tab-width)
(setq c-indent-level tab-width)
(setq c-basic-offset tab-width)
(setq indent-tabs-mode t)

You can change the number 4 to whatever size you want tabs to be.

You can also try out the other default styles which emacs offers using the following line:

(setq c-default-style '((java-mode . "java") (c-mode . "linux") (other . "gnu"))) 

This line sets the default style for c-mode (which emacs should automatically enter when you open a *.c or *.h file) to be the coding style used in the Linux kernel (including much larger indentation). You can use similar syntax to change the defaults for other modes. Another style you might want to try is "k&r" which uses the same coding style used in K&R. The default settings for emacs are to use the "java" style for java-mode and the "gnu" style for everything else.

If you mess up the indentation, you can auto-indent a region of C code to clean it up by highlighting the region and running M-x indent-region. If you made the changes above to your .emacs file, everything will be indented properly with tabs.

Placing Curly Brackets

Another style guideline which can affect a large portion of your code is where you place curly brackets ({}).

  1. You should place the opening curly bracket on the same line as the statement that starts the block (if, while, do, etc.).
      if (x) {                      /* Curly bracket on this line */                       
          return 0;
      }
  2. The closing bracket should be on a line by itself unless it is followed by an else statement or the while in a do-while loop, in which case it should be the first thing on the new line, followed by the statement.
      if (x) {
          return 0;
      } else {                      /* Curly bracket is first on the same line as 'else' */
          return -1;
      }
     
      do {
          count++;
      } while (condition);          /* Curly bracket is first on the same line as 'while' */
     
      if (x) {
          return 0;
      }                             /* Curly bracket is on a line by itself because there is no 'else' */
      while (condition) {
          count++;
      }
  3. The style used in K&R has one exception to this rule: the opening curly bracket of a function appears by itself on the line following the function header. We won't hold you to this, as we think it is too picky. If you are curious about the rationale, you can read the copy of K&R in the back of the Sunlab. (It is somewhere towards the beginning; basically, in an older revision of C there used to be things on that line and now there are not, but it is still left blank.)
      int main(int num)
      {                             /* Curly bracket is on a line by itself */
          return 0;
      }

Naming Functions and Variables

The first thing you need to realize about C (especially if you are coming from something like a Java background) is that "global" in C really means global. If you have a global function initialize() then that is the only function in your entire program which can have that name (unless you use the static keyword). If you try to define two functions with the same name in the same file you will get a compiler error; if they are in different source files you will get a linker error. Either way your code will not build. For this reason we frown on global functions and variables with vague names like initialize. The simple solution (adopted in the Weenix source code) is to prefix all functions in a file with some descriptive string so that they will not conflict with similar functions in other files. When working on Shell you should not need to worry about this since the program will be small and simple, but starting in DB you will find this practice in use in the support code (e.g. every data type in DB has an initialization function: for node_t it is called node_initialize, for client_t it is called client_initialize, etc.). A similar standard can be applied to fields in a struct: it is a good idea to prefix the names of the fields with something that identifies them as belonging to that type of struct, so that when you use code navigation tools like Cscope to find references to them the name will be unique.

If you have a function which does something very specific to the current file (e.g. a helper function for a more complex global function) you can use the static keyword. This marks the function as local to the current file and it cannot be referenced from other files (this is significantly different from what the static keyword does in Java). Even if functions in other files (also static) have the same name there will be no problems. Good use of the static keyword can create "private" functions which help clean up your code without creating too many "global" functions. Remember that it is still important to give static functions meaningful names.
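
As a rough sketch of these naming conventions, consider a hypothetical source file counter.c. The names counter_t, counter_initialize, counter_increment, and clamp are invented for illustration and do not come from any support code.

```c
/* counter.c: hypothetical file; every name here is illustrative only */

typedef struct counter {
    int counter_value;   /* fields carry the struct's prefix too */
    int counter_limit;
} counter_t;

/* static: visible only in this file, so a short generic name is safe */
static int clamp(int value, int limit)
{
    return value > limit ? limit : value;
}

/* Global functions are prefixed with "counter_" to avoid link-time clashes */
void counter_initialize(counter_t *c, int limit)
{
    c->counter_value = 0;
    c->counter_limit = limit;
}

int counter_increment(counter_t *c)
{
    c->counter_value = clamp(c->counter_value + 1, c->counter_limit);
    return c->counter_value;
}
```

Another file could define its own static clamp() without any conflict, while a second global counter_initialize() anywhere in the program would fail at link time.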

The other type of variable to worry about is the local variable. Local variables are defined inside a function and exist only within the scope of that function. Since these variables are only visible within the function, you should use short and simple names (the opposite of global variables). For example, a temporary variable can be called tmp, an integer iterated during a for loop can be called i, and a pointer to a database can be called db. You should be able to use single-word names or abbreviations and they should be perfectly understandable. If it seems like longer names are necessary in order to make your code readable, then you have another problem and we refer you to the Functions section.

Functions

This section is quoted directly from Linus Torvalds's original document because he said it perfectly:

Functions should be short and sweet, and do just one thing. They should fit on one or two screen fulls of text, and do one thing and do that well.

The maximum length of a function is inversely proportional to the complexity and indentation level of that function. So, if you have a conceptually simple function that is just one long (but simple) case-statement, where you have to do lots of small things for a lot of different cases, it's okay to have a longer function.

However, if you have a complex function, and you suspect that a less-than-gifted first-year high-school student might not even understand what the function is all about, you should adhere to the maximum limits all the more closely. Use helper functions with descriptive names (you can ask the compiler to in-line them if you think it's performance-critical, and it will probably do a better job of it than you would have done).

Another measure of the function is the number of local variables. They shouldn't exceed 5-10, or you're doing something wrong. Re-think the function and split it into smaller pieces. A human brain can generally easily keep track of about 7 different things, anything more and it gets confused. You know you're brilliant, but maybe you'd like to understand what you did 2 weeks from now.

Commenting Standards

While we don't want to lay down stringent guidelines for commenting, we'd like you to keep the following in mind:

  1. The goal of commenting is to make your code more readable and comprehensible to other people, as well as yourself. The better we understand and appreciate the intricacies of your program, the better your chances of getting a good grade.
  2. Particularly in CS169, you will be reusing in later assignments a lot of code written in the earlier assignments. Good commenting at this point will undoubtedly make your code easier to maintain.
  3. On the flip side of the coin, we (your TAs) are (reasonably) intelligent people. Comments like:
    current = current->next;    /* Advance current */

    are superfluous and aren't really going to help us in any way. (If you believe that kind of comment is necessary for us to understand your code then there is something wrong with your code.)

With that in mind, we'd like to lay down the following loose and simple guidelines:

  1. Write and hand in with your source code a detailed README file which describes which features you have implemented, which features you have not implemented, known bugs in your code, and an overview of the design of your program. The README file is the primary documentation the TAs will refer to when grading your program.
  2. Document complicated code fragments (complicated by necessity, for example you used a complicated algorithm, not because you didn't take the time to design it properly and it doesn't make much sense), describing your algorithm if it is especially obfuscated.

Use of Literals

You probably learned in previous CS courses that using literals (e.g. 123, "A string") throughout your code can be a bad thing. For example, the following code is rather error prone:

    int *array;
    array = malloc(sizeof (*array) * 123);
    memset(array, 0, sizeof (*array) * 123);

Because the literal 123 appears in two locations, it must be changed in both if it is changed in either. There is no compile-time warning about this, and it can lead to bugs that are very hard to track down if the two numbers are not equal. Also, avoiding literals can improve the readability of your code by giving meaningful names to numbers. In Java you would define a constant like this:

    public static final int ARRAY_SIZE = 123;

In C you would use a macro to get a similar effect.

    #define ARRAY_SIZE 123

Note the lack of a type name or equals sign. This is what is known as a compile-time (preprocessor) macro. This line tells the compiler, "any time I use the word ARRAY_SIZE in this program, replace it with the text 123 before compiling". Any time you reuse the same literal in more than one location within your file you should use a #define for it.

It is important to remember that the compiler is effectively using string replacement to do this. As an example of why this is important consider the following example:

    #define WIDTH 3
    #define HEIGHT 4
    #define SUM WIDTH + HEIGHT
 
    int my_function()
    {
        return 5 * SUM;
    }

In this example you might expect my_function to return 5 * SUM = 5 * 7 = 35, but since the compiler is using string replacement the expression evaluates as 5 * SUM = 5 * WIDTH + HEIGHT = 5 * 3 + 4 = 15 + 4 = 19.
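
A common way to avoid this surprise is to wrap the body of any expression macro in parentheses. The sketch below compares the two forms; SUM_UNSAFE, scaled_unsafe, and scaled are hypothetical names introduced only for this comparison.

```c
#define WIDTH  3
#define HEIGHT 4
#define SUM_UNSAFE WIDTH + HEIGHT     /* expands without any grouping */
#define SUM        (WIDTH + HEIGHT)   /* parentheses keep the sum together */

int scaled_unsafe(void)
{
    return 5 * SUM_UNSAFE;   /* becomes 5 * 3 + 4 */
}

int scaled(void)
{
    return 5 * SUM;          /* becomes 5 * (3 + 4) */
}
```

With the parenthesized definition the multiplication applies to the whole sum, which is what the name SUM suggests.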

If you have many constants which are logically grouped together you might also look into using C enums to declare them. C enums are much less powerful than Java enums. They are just integers which have special names associated with certain values.
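
A minimal sketch of a C enum follows. The job states and the job_state_name helper are hypothetical and are not taken from any assignment.

```c
/* Hypothetical states for a job; the names are illustrative only */
typedef enum job_state {
    JOB_RUNNING,    /* 0: enumerators count up from zero by default */
    JOB_STOPPED,    /* 1 */
    JOB_DONE        /* 2 */
} job_state_t;

/* Maps a state to a printable name; unknown values fall through */
const char *job_state_name(job_state_t state)
{
    switch (state) {
    case JOB_RUNNING: return "running";
    case JOB_STOPPED: return "stopped";
    case JOB_DONE:    return "done";
    }
    return "unknown";
}
```

Unlike a group of separate #define lines, the enum keeps related constants in one declaration and gives them a common type name.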

Memory Management

Memory management can be one of the trickiest parts of C.

Introduction

To help motivate the discussion in this section, let us start with the differences in memory allocation between Java and C.

  • In both languages local variables are stored on the stack.
  • In C the malloc(3) function allows you to allocate (reserve) a certain number of bytes of memory on the heap. The closest to this in Java is the new operator, but there are some very important differences:
    • The new operator can only allocate space for a specific object type (the type determines the size of the allocated memory). The malloc(3) function takes the size of the region to be allocated as an argument, and once the region is allocated anything can be written to it (not just data for a specific object).
    • The new operator throws exceptions when something goes wrong; the malloc(3) function returns NULL.
  • All memory allocated using the new operator is freed automatically by the Java virtual machine once it is no longer in use. In C, memory allocated using the malloc(3) function must be freed explicitly using a call to free(3) before that portion of memory can be reused. This means that if you keep allocating memory without calling free(3) you will eventually run out of memory (this might be caused by a memory leak). The free(3) function takes a single pointer as an argument. This pointer must have been returned by a call to malloc(3) at some point, and you cannot free the same pointer twice. You can safely pass a NULL pointer to free(3); it will do nothing.
  • In Java when you use the new operator the fields of your new object instance are all initialized to some well defined values. In C, there is no guarantee of what data will be in the newly allocated memory. In general you will find that if this is the first time a certain region of memory has ever been allocated it will be filled with 0's. If the region has been used before (but was freed and then allocated again) it will either contain the data from the previous time it was allocated or 0's. Therefore, it is very important that you initialize any memory you receive from malloc(3) before having any expectations about what values are stored in that memory. This comes up often when allocating arrays or strings - you cannot assume they are filled with 0's!
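
The points above can be sketched as a small allocation helper. The function name make_zeroed_ints is hypothetical; the sketch only shows checking for NULL and initializing the memory before use.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a zeroed array of count ints, or NULL if
 * malloc fails. The caller is responsible for calling free() on the result. */
int *make_zeroed_ints(size_t count)
{
    int *array = malloc(sizeof(*array) * count);
    if (array == NULL) {
        return NULL;                            /* malloc reports failure with NULL */
    }
    memset(array, 0, sizeof(*array) * count);   /* do not assume the memory is zeroed */
    return array;
}
```

The standard calloc(3) function combines the allocation and zeroing steps in a single call, if the assignment allows it.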

Common Issues

There are several issues which can arise when managing memory using malloc(3) and free(3):

  1. For every call your program makes to malloc(3) you MUST make sure that a corresponding call to free(3) will be made to free the memory which was allocated. We cannot stress enough how important this is. If your program does not do this it will eventually run out of memory. Note the important property that the number of calls to free(3) must equal the number of calls to malloc(3): there is no way to free the memory allocated by several calls to malloc(3) with a single free(3) call.
  2. It is possible for you to "lose" a pointer to allocated memory. If you do this it becomes impossible to free the allocated memory (or to use it, so it is effectively just wasting memory, and is referred to as a memory leak). The following piece of code is an example of this:
        char *str;
        str = malloc(sizeof(*str) * STR_SIZE);
        str = NULL;     /* str was the only record we had of what address malloc allocated
                         * memory for us at, by overwriting this address we have forever lost
                         * the information about where the allocated space is, so we can never
                         * access it. By extension we can never call free(3) to free this memory
                         * because free requires a pointer to the memory region as input. */
  3. As mentioned in the previous section, C makes no guarantees about what data will already be in memory at the location returned by malloc(3). If you wish to do any initialization of the data you must do so manually after you call malloc(3). If you wish to set all of the memory to 0 you may use the following pattern:
        char *str;
        str = malloc(sizeof(*str) * STR_SIZE);
        memset(str, 0, sizeof(*str) * STR_SIZE);

    Failure to pay attention to this problem can lead to very obscure problems. A function which uses malloc(3) may work most of the time if malloc(3) often allocates memory regions which are filled with zeros, but the function could non-deterministically fail whenever malloc(3) allocates some memory which is not filled with 0's.

  4. After a call to free(3) you may still be able to access the memory region you just freed. In fact it probably still contains the data that it had before you freed it (and you can still alter it as you wish), so it can be hard to detect errors where you continue using memory which is no longer allocated. Your program might seem to be working fine for quite some time before that memory is finally overwritten by something else. See this example:
        /* allocate an array of integers */
        int *status, *buffer;   /* both are pointers; "int* status, buffer;" would make buffer a plain int */
        status = malloc(sizeof (*status) * ARRAY_SIZE);
     
        /* fill the array with data and do our calculations... */
        /* we think we are done with the array so lets free it */
        free(status);
     
        /* do some more calculations with status, BUT WAIT! we already
         * freed status so we shouldn't be doing this! Unfortunately C
         * will not warn you about this, so right now our code is playing
         * with memory that it has no business touching anymore. If our
         * program were to exit here we might not even notice the problem,
         * which is actually worse than if the program were to continue
         * executing until it reaches the error state we are about to
         * describe: exiting before we notice the error does not mean the
         * error does not exist. It is only by luck (or bad luck, depending
         * on how you look at it) that our program could keep running in
         * this state without error; any number of tiny changes to our
         * code could break the program */
     
        /* now we need another array */
        buffer = malloc(sizeof (*buffer) * BUFFER_SIZE);
        /* since we freed the memory used by status earlier, malloc decides
         * to reuse part of that memory for this buffer */
        /* perform some more operations on status, everything still seems okay... */
        /* perform some operations on buffer, OH NO! we just overwrote the data
         * we were storing in the status array! but we still have not noticed... */
        /* now we try to use the status array again, but wait! what is this? the data is
         * totally inconsistent, how did this happen? how will we ever debug this? */
  5. In C, local variables go out of scope once you return from the function they are declared in. However, similarly to the problem we saw with freed memory regions, the value of a local variable remains unchanged in memory until it is overwritten (usually by some function call which pushes more information onto the stack and causes the local variable's old data to be overwritten). This means that if we have a pointer to some data stored on the stack, we might be able to keep using it for quite some time before noticing the problem. Here is an example:
        char* get_input()
        {
            char buffer[BUFFER_SIZE];
            read(0, buffer, BUFFER_SIZE);
            return &buffer[0]; /* NOOOOO! BAD! NEVER RETURN POINTERS TO DATA ON THE STACK! */
        }
     
        int afunction()
        {
            char* input;
            input = get_input();
            /* input is now pointing to the address of the buffer array from the function call,
             * but buffer is a local variable which went out of scope when the call to get_input()
             * returned, so it is not memory that we should be accessing, as in the previous example,
             * if our program were to exit here we might not notice the error, but not noticing the
             * error does not mean it does not exist */
     
            /* now we make a bunch of function calls which are pushed onto the stack, overwriting the
             * data pointed to by input */
            /* now we use the data pointed to by input again, but wait! it's full of seemingly random data!
             * where did we go wrong?! */
        }
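
One way to repair the last example is to have the caller supply the buffer, so that no pointer to get_input's own stack frame escapes. This is only a sketch: the fd and size parameters and the return convention are assumptions added here and are not part of the original get_input().

```c
#include <unistd.h>

/* Sketch: the caller owns the buffer, so it outlives this call.
 * Reads at most size - 1 bytes from fd and null-terminates the result.
 * Returns the number of bytes read, or -1 on error. */
ssize_t get_input(int fd, char *buffer, size_t size)
{
    ssize_t count;

    if (size == 0) {
        return -1;                       /* no room even for the '\0' */
    }
    count = read(fd, buffer, size - 1);  /* leave one byte for the terminator */
    if (count < 0) {
        return -1;
    }
    buffer[count] = '\0';                /* read(2) does not terminate strings */
    return count;
}
```

A caller would declare char input[BUFFER_SIZE]; in its own frame and pass input and sizeof(input), so the data remains valid for as long as the caller itself is running.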

Defensive Coding

If you read the list of common problems above and thought, "wait a minute, this makes it sound like memory management problems in C are easy to make and impossible to track down," then you got the right impression. Having control over all memory and memory allocation in your program can be extremely useful (and necessary in some cases) but it is very error prone if you are not used to managing memory properly. While tools such as Valgrind can go a long way towards tracking down some of these problems, the only sure-fire way to keep things working is adhering to some strict coding guidelines. Here we list some very common techniques used to keep things working correctly. In general you will not lose points if you do not follow these guidelines exactly, however you will lose points for any improper memory handling.

  1. The best way to avoid memory leaks is to have a good way of assigning "responsibility" for a section of allocated memory to a certain piece of your code and making sure that the "responsible" section of code always frees that memory. We suggest you do this by making sure that any function which contains a call to malloc(3) does one of two things:
    1. Also contains the corresponding call to free(3). In this case each function is "responsible" for all memory it allocates.
    2. Only functions as a "constructor" for some data, by allocating and initializing space for it, and there is some corresponding "destructor" function which contains the corresponding call to free(3). The two functions should have names which emphasize this relationship (such as obj_init(...) and obj_destroy(...)), and the procedure for proper cleanup should be well documented. The DB assignment uses this method to allocate most structs. This method works best for initializing complicated structures (or maybe even multiple structures) which may require multiple calls to malloc(3) and complicated cleanup.

    Note that adhering to one of these rules is pretty much required to avoid memory leaks in larger programs. Most standard C libraries avoid calling malloc(3), so you only need to worry about freeing memory which you directly allocate.

  2. If you have a function which needs to return a lot of data (e.g. a string, an array, a large struct) you should require that the caller provides the memory in which you can write the result. Many system calls provide good examples of this. The read(2) system call requires the caller to provide a buf argument which should point to a region of memory where read(2) can write the data it wants to give to the caller. The getdents(2) system call requires that the caller provides a pointer to a region of memory where it can store a struct dirent. Using this method leaves the job of memory management up to the caller and it is unambiguous whether or not the caller needs to free any memory after calling your function.
  3. After every call to free(3), set the pointer which you just freed to NULL:
        int* array;
        array = malloc(sizeof (*array) * ARRAY_SIZE);
        /* use array for some stuff */
        free(array);
        array = NULL;

    This will stop you from accidentally using the pointer again and messing around with memory you have already freed. This is not fool-proof (you could have another pointer sitting around), but it is helpful.

  4. If you have a complicated struct, provide some function which initializes it (it is always good to have a function which uninitializes it as well). Whenever you allocate memory for such a struct you can call the initializer to set any fields which need a certain initial value. By centralizing the initialization in a single function you also make it easier and less error prone to add new fields to your struct. If you add a new field which needs a certain initial value you can set the value in the initialization function. Conversely, if you have a field which requires some work to clean up (such as a pthread_t you want to cancel) you can do so in the uninitializer function.
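
The constructor/destructor pairing and the set-to-NULL habit described above can be sketched together. The record_t struct and the record_init and record_destroy functions are hypothetical and are not taken from the DB support code.

```c
#include <stdlib.h>

/* Hypothetical struct; names are illustrative only */
typedef struct record {
    int *record_values;
    size_t record_count;
} record_t;

/* "Constructor": allocates and initializes a record.
 * Every successful call must be paired with record_destroy(). */
record_t *record_init(size_t count)
{
    record_t *rec = malloc(sizeof(*rec));
    if (rec == NULL) {
        return NULL;
    }
    rec->record_values = calloc(count, sizeof(*rec->record_values));
    if (rec->record_values == NULL) {
        free(rec);                 /* undo the partial allocation */
        return NULL;
    }
    rec->record_count = count;
    return rec;
}

/* "Destructor": frees everything record_init() allocated and sets the
 * caller's pointer to NULL so it cannot be reused by accident. */
void record_destroy(record_t **recp)
{
    if (recp == NULL || *recp == NULL) {
        return;
    }
    free((*recp)->record_values);
    free(*recp);
    *recp = NULL;
}
```

Taking a pointer to the caller's pointer is one design choice among several; it lets the destructor clear the pointer itself, at the cost of a slightly less obvious call, record_destroy(&rec).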

C Strings

Strings in C are very different from their Java counterparts. You will notice there is no actual "string" type in C. This is because strings are represented by the char* type. This pointer points to the first character in the string. All characters following the initial character in memory are also assumed to be part of the string until a null character ('\0') is reached. This leads to several important consequences:

  • If you wish to create a buffer large enough to hold a string with n characters, you will need a buffer of size n + 1, so that the extra element can be used to hold '\0'.
  • If you use functions such as read(2) or memcpy(3) which are not aware of string conventions, you will need to make sure to handle the '\0' correctly yourself. When calling read(2) to read a string you will need to add the '\0' to the end of the string; when calling memcpy(3) to copy a string you will need to make sure that you copy the '\0'. Note that the standard definition of "string length" does not include this character, so make sure you are taking it into account.
  • Any function which takes a "string" as input (e.g. strlen(3), strncpy(3), printf(3)) will assume that the string is terminated with a '\0'. If the string is not null terminated then the behavior of the function is undefined and could range from memory corruption to segmentation faults.
  • A zero-length string is a pointer to a '\0' character; the pointer itself cannot be NULL. Passing null pointers to the standard string functions will cause segfaults. As an example this may segfault:
        char* buf = NULL;
        int len = strlen(buf);

    While this will correctly set len to 0:

        char buf[128];
        buf[0] = '\0';
        int len = strlen(buf);
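
The n + 1 rule and the need to copy the terminator can be sketched with a small string-duplicating helper. The name copy_string is hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a heap copy of src that the caller must
 * free(), or NULL if allocation fails. src must be null terminated. */
char *copy_string(const char *src)
{
    size_t len = strlen(src);        /* length does not count the '\0' */
    char *dst = malloc(len + 1);     /* so reserve one extra byte for it */
    if (dst == NULL) {
        return NULL;
    }
    memcpy(dst, src, len + 1);       /* len + 1 copies the terminator too */
    return dst;
}
```

Copying only len bytes would leave dst without a terminator, and any later strlen(3) or printf(3) call on it would read past the end of the allocation.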

Handing in Assignments

Hand in your (working) source and README electronically. For this, you should run the following commands:

 $ make clean
 $ /course/cs167/bin/cs167_handin (asgn)

where (asgn) is the name of the assignment you are handing in. Look for details in each assignment hand-out. CS169 students should hand in their assignments the same way, using cs169_handin instead.
