Last update Jul 23, 2002
C's primary disadvantage has been its complexity; optimizing C compilers are often forced to avoid code improvements that FORTRAN compilers can generate with confidence. However, Digital Mars' optimization technology combined with C++'s improved type-checking enable generating numerical applications that rival the performance of equivalent FORTRAN programs.
Digital Mars has implemented nearly all of the numerics additions to C defined in C99. These additions are also present as extensions to C++98.
This chapter is a guide to understanding and using the C and C++ numerics support by Digital Mars.
Note: The char type is signed by default. The -J compiler switch, however, changes char so that it is unsigned. When using 8-bit integral quantities, it is best to explicitly define values as signed char or unsigned char, depending upon your needs. The char type should be used primarily for working with characters and text strings. char, regardless of whether it is implicitly signed or unsigned, is a type distinct from signed char and unsigned char. Type compatibilities are covered in more detail in the section Using Mathematics.
15 decimal 15 015 octal 15 = decimal 13 0x15 hexadecimal 15 = decimal 21 0b1011 binary 1011, decimal 11Octal constants may contain only the digits 0 through 7, decimal constants contain the digits 0 through 9, and hexadecimal constants contain the digits 0 through 9 and A through F. Letter digits in hexadecimal values can be in either uppercase or lowercase.
Literal constants have a type. Unless your program specifies otherwise, a literal constant has an assumed type based on its size. A decimal constant is given the first of these types into which its value can be stored: int, unsigned int, long, unsigned long. An octal or hexadecimal constant is typed int, unsigned int, long, or unsigned long, depending on its size. Some examples of constants and their types are:
1701 int 42000 unsigned int 3117426591 unsigned long 0x7F int 0157152 unsigned int 0x49310 longThe type of a literal integral constant can be affected by a suffix. If a u or U is used as a suffix, the constant is unsigned. If a l or L suffix is applied, the constant is long. If both suffixes appear, the constant is an unsigned long. Some examples of how suffixes can affect the default typing of the constants given above are:
1701u unsigned int 42000l long 0x7Ful unsigned long 0157152l long 0x49310u unsigned long
The float and double types implement the single-and double-precision floating-point formats defined by the Institute of Electrical and Electronic Engineers (IEEE) standard 754-1985. A float is a 32-bit value, a double is 64-bits, and a long double is 80 bits. These bits in a floating- point value are divided into three components: a sign bit, an exponent, and a significand. Figure 28-1 shows the internal format of the types. The s indicates the sign bit.
Figure 28-1 Floating point data type formats float: s exponent significand 31 30----23 22--------0 double: s exponent significand 63 62----52 51--------0 long double: (Win32 only) s exponent 1 significand 79 78----64 63 62--------0The highest-order bit in a floating-point value is the sign bit. If the sign bit is 1, the value is negative; if the sign bit is zero, the value is positive.
In a float, the exponent occupies 8 bits and the significand uses the remaining 23 bits. A double has a 52-bit significand and an 11-bit exponent. In addition, the significand of float and double values has an implicit high-order bit of 1.
The significand holds a binary fraction greater than or equal to 1 (because the implied high bit is 1) and less than 2. The number of bits in the significand affects the accuracy of the floating-point value. A float has 6 decimal digits of accuracy, and a double (with its longer significand) is accurate to 15 decimal digits. Since the significand is a binary fraction, it cannot always exactly reflect a decimal value stored in it. For example, there is no binary fraction that can exactly represent the values 0.6 or 1/ 3. Floating point numbers represent an approximation of a decimal value, from which rounding errors come. The section Using Mathematics covers precision, significant digits, and rounding in detail.
The exponent is a binary number that represents the number of binary digits the significand is shifted left for a positive actual exponent or right for a negative actual exponent. The exponent is a biased value; you calculate the actual exponent value by subtracting a bias value from the exponent stored in the value. The bias for a float is 127; the bias for a double is 1023. Thus, a float value with an exponent of 150 would represent a number with an exponent of 23.
The following table shows the approximate minimum and maximum values that can be represented by the double and float data types.
Data type Lowest value Highest value float 1.175494*10-38 3.402823*10+38 double 2.225074*10-308 1.797693*10+308
0.0 zero 3.141592656 pi 1000.0 one thousand 1000. one thousand 1.E+3 one thousand 5.e-3 five one-thousandths 0.005 five one-thousandthsThe value of the exponent indicates the number of digits that the decimal point should be shifted. A positive exponent shifts the decimal point to the right; a negative exponent shifts it to the left. This is the same as the scientific notation we learned in high school; the exponent is the power of 10 applied to the significand.
Unless otherwise specified, C assumes that all floating-point literals are doubles. A suffix of f or F makes a constant a float. For example, 0.0F is a float-type zero.
You can define literal floating-point constants using hexadecimal notation. A hexadecimal floating-point constant consists of an optional sign, 0x prefix, a hexadecimal significand, a p to indicate the beginning of the exponent, a required exponent value, and an optional type suffix. Examples of hexadecimal constants are:
0x1.FFFFFEp127f 0x1p-23 -0x1.2ACp+10Hexadecimal floating-point constants are literal bitmaps of a floating-point value. In a hexadecimal constant, the exponent value is the power of 2 to which the significand is raised.
The result of a division operation, such as 1.0/ 0.0, is INFINITY. An FE_OVERFLOW exception is raised whenever the result of a calculation is INFINITY.
Signaling NaNs are never generated by function calls or calculations. The only way they appear in your programs is if you explicitly use the NANS macro (defined in fltpnt.h) in your programs.
Subnormal values involved in a calculation reduce the accuracy of the result. You may want to use the fpclassify macro to check for subnormal values when working with very tiny numbers.
signed char c; unsigned char uc; short i1, i2; unsigned short ui; long l; float f; double d;Conversions take place implicitly in calculations or explicitly when a cast is performed. The term source type refers to the type of value being converted; the term destination type refers to the type of the conversion's result.
c = -23; i1 = (int) c;In this code fragment, an 8-bit signed value, c, has been assigned to a 16-bit signed value, i1. A cast tells the compiler that the conversion is legitimate. Whenever converting between different data types, it is good programming practice to use a cast to declare explicitly that a conversion is expected.
First, the value of c is copied directly into the lower 8 bits of i1. Then, the high (sign) bit of c is copied into each of i1's upper 8 bits. Because negative numbers are stored in twos complement form, this preserves both the value and the sign of c when it is copied into i1. The table below provides a schematic view of the process of promotion.
Table 28-2 Conversion of char to int Step Binary value of i1 Comment 1 00000000 11101001 copy c into lower 8 bits 2 11111111 11101001 extend sign in upper 8 bitsAs long as the destination type is signed and larger than the source type, the destination value equals the source value. This also holds true if both the source and destination types are unsigned. However, when the destination type is unsigned and the source type is signed, the destination cannot correctly hold the source value. This assignment demonstrates the problem:
ui = (unsigned int) c;The compiler assigns the same binary pattern to ui that it does to i1. However, since ui is unsigned, those bits are construed differently from how they were when they were assigned to i1. Note: the only difference between signed and unsigned types of the same size is the way in which they decipher their bits. A signed value takes a high-order bit of 1 to mean that it is holding a negative value in twos complement form. An unsigned type is a positive binary integer. For i1, the bits 11111111 11101001 are interpreted as -23, the original value of c. For ui, the same bits result in the value 65513.
When converting from a larger source type to a smaller destination type, the compiler simply assigns the low-order bits of the source value to the destination. Here are several examples.
ui = 64; /* lower 8 bits are 01000000 */ c = (char) ui;/* c equals 64 */ uc = (unsigned char) ui;/* uc equals 64 */In this case, the lower 8 bits of ui are assigned to c and uc. Since the high-order bit is not set (that is, c is positive), ui and c represent the same number.
ui = 30113; /* lower 8 bits are 10100001 */ c = (char) ui; /* c equals -95 */ uc = (unsigned char) ui;/* uc equals 161 */In this case, c is negative since the high-order bit of the value copied from ui is 1.
i1 = -12345; /* lower 8 bits are 11000111 */ c = (char) i1; /* c equals -57 */ uc = (unsigned char) i1;/* uc equals 199 */Again, the high-order bit is 1, so c has a negative value. As mentioned in the section "Integral data types", char, signed char, and unsigned char are distinct types. The Digital Mars compiler generates warnings if you mix the different character types. Furthermore, even if short and int are the same size, they are not the same type. If your programs assume that short and int are always identical, problems will occur. In 32-bit compilations, int is defined as 32 bits and short as 16 bits. Different types are different types, and you should never assume otherwise.
Converting from a float to a double, the value is almost unchanged. Slight differences in value occur because of the differing number of significand bits in the types. The following program converts between float and double values, illustrating how the conversions alter the values being exchanged. For now, you can ignore the specifics of floating-point I/O, as well as the FLT_DIG and DBL_DIG macros. These subjects are covered in later sections. What is important is to see the results of converting between float and double values.
#include <stdio.h> #include <float.h> #define pi 3.14159265358979 int main() { float f; double d; d = 0.0; printf(" d <= 0.0,\td => %.* g\n", DBL_DIG, d); f = 3.606F; printf(" f <= 3.606,\tf => %.* g\n", FLT_DIG, f); d = (double) f; printf(" d <= (double) f,\td => %.* g\n", DBL_DIG, d); f = 2.0F / 3.0F; printf(" f <= 2.0/ 3.0,\tf =>%.* g\n", FLT_DIG, f); d = (double) f; printf(" d <= (double) f,\td => %.* g\n", DBL_DIG, d); d = d * 2.0; printf(" d <= d * 2.0,\td =>%.* g\n", DBL_DIG, d); d = 2.0 / 3.0; printf(" d <= 2.0/ 3.0,\td =>%.* g\n", DBL_DIG, d); d = d * 2.0; printf(" d <= d * 2.0,\td =>%.* g\n", DBL_DIG, d); d = pi; printf(" d <= pi,\td => %.* g\n", DBL_DIG, d); f = (float) d; printf(" f<=(float) d,\tf => %.* g\n", FLT_DIG, f); d = (double) f; printf(" d<=(double) f,\td => %.* g\n", DBL_DIG, d); return 0; }In a perfect world, where numbers would be stored as exact decimal digits, the output of the above program would look like this:
d <= 0.0, d => 0 f <= 3.606, f => 3.606 d <= (double) f, d => 3.606 f <= 2.0/ 3.0,f => 0.666667 d <= (double) f, d => 0.66666666666667 d <= d * 2.0,d => 1.33333333333333 d <= 2.0/ 3.0,d => 0.66666666666667 d <= d * 2.0,d => 1.33333333333333 d <= pi, d => 3.14159265358979 f <= (float) d, f => 3.14159 d <= (double) f, d => 3.14159However, the actual output looks like this:
d <= 0.0, d => 0 f <= 3.606, f => 3.606 d <= (double) f, d => 3.60599994659424 f <= 2.0/ 3.0,f => 0.666667 d <= (double) f, d => 0.666666686534882 d <= d * 2.0,d => 1.33333337306976 d <= 2.0/ 3.0,d => 0.666666666666667 d <= d * 2.0,d => 1.33333333333333 d <= pi, d => 3.14159265358979 f <= (float) d, f => 3.14159 d <= (double) f, d => 3.14159274101257The difference between the ideal output and the real output is due to the binary nature of floating-point numbers. The significand is only an approximation of a decimal fraction; the number of decimal digits that can be accurately approximated by a floating-point data types is determined by the number of binary digits in the significand.
A float has 23 bits in its significand, and a double has 52 bits. If you hand-calculate the largest possible value that can be stored in a float, you'll find that it can actually represent as many as nine decimal digits. However, since this is a binary fraction and not a decimal fraction, those nine digits may not be correct. Usually, the last two decimal digits represented by a floating-point type ensure that the other decimal digits are accurate. So, a float accurately represents only six decimal digits of a value, while a double represents up to 15 digits.
Note: The macros FLT_DIG and DBL_DIG, defined in float.h as 6 and 15 (and fully discussed in the section "Floating point characteristic constants" later in this chapter), represent the number of significant decimal digits for the float and double types. Use these macros with the %.* g format specifier when displaying floating-point values to their full precision with the printf family of functions.
You may be surprised when displaying all 15 digits of a double that have been assigned a float value. When a floats significand is copied into the significand of a double, the extra digits are filled with zeros. The "extra" precision bits of the float, which are not part of its six decimal digit accuracy, are part of the doubles 15- digit decimal accuracy. If you are interested in accurate math, do not mix floats and doubles in calculations. Once you have assigned a float value to a double, you can only assume that the double has six digits of decimal accuracy as opposed to 15.
When assigning an integral value to a floating-point value, the least significant digits may be lost if there are more digits in the integer than can be represented by the floating-point type. For example:
f = (float) 2123456789UL; printf("f <= (float) 2123456789UL,\tf => %.* g\n", FLT_DIG, f);The output of the printf statement will be:
f <= (float) 2123456789UL, f => 2.12346E+009A float has only six valid digits, whereas the original unsigned long value had 10 digits. So, the unsigned long value is rounded to the six digits when it is converted to a float.
Table 28-3 Conversion order Order Types 0 all char and short types 1 int 2 unsigned int 3 long 4 unsigned long 5 float 6 doubleThe type of the result obtained from a binary operation is the same as the type used for the calculation. For example:
i1 + c /* c converted to int; result is int */ l / f /* l converted to float; result is float*/Based on Table 28-3, i1 (an int) has a higher order than c (a char) so the compiler converts c to an int before performing the addition. The result of the addition is an int. Using the same rule, l is converted to a float before it is divided by f. The result is a float.
Watch for implicit conversions! They can turn a simple math statement into a mysterious program bug. For example, assuming the following assignments:
i1 = 20000; i2 = 30000;these two statements do not assign the same value to l:
l = i1 * i2 * 2.0L; /* formula 1 */ l = 2.0L * i1 * i2; /* formula 2 */Operations of the same precedence take place from left to right. In the first formula, i1 is multiplied by i2; these are both ints, making the result an int. The result of the multiplication is 600000000; an int cannot hold a value this large, so an overflow occurs, giving the result an actual value of 17920. That result is then converted to a long so that it can be multiplied by 2.0L. The end result stored in l is 35840, an answer that is far from what is expected.
The second formula rearranges the operations. The first calculation multiplies 2.0L by i1; i1 is converted to a long, and a resulting value (with type long) of 40000 is generated. This is multiplied by i2, which is converted to a long. The end result is 1200000000, the expected value.
The first formula can return the same value as the second formula, if you use casts or parentheses:
l = i1 * (i2 * 2.0L); l = (long) i1 * (long) i2 * 2.0L;Many a program has died because the programmer did not watch for implicit conversions. A careful programmer always casts the operands of a formula if they are different from the type of the result. A small amount of extra typing can save hours of debugging.
One final caution: comparisons are a form of calculation. When two values with different types are compared, the smaller type is converted into the larger type. Watch for this, particularly when comparing signed and unsigned values or when the comparands are different sized types.
For example, if you ask Digital Mars C++ to perform these calculations:
float a = 1.5; float b = 2.01; float c, d; c = a * b;the compiler dutifully assigns to c the result of multiplying 1.5 by 2.01, which is 3.015. However, the accuracy of that result must be compared against the accuracy of the numbers used to calculate it. In this case, while b has four decimal digits, we know the accuracy of a only to two decimal places. Unless we know that a is exactly equal to 1.5, we cannot assume that c is exactly equal to 3. 015. It is a case of being sure of our results. If we are not certain of the value of a beyond the second digit, then we cannot be certain beyond the second decimal place about the value of any result that is calculated using a.
It is common in science to know a value to a specific number of decimal places without certainty that the value is exact. As long as there is doubt about the absolute accuracy of a value, all calculations involving that number are limited to the number of digits that we know are right. In the example above, the correct value for c is 3.0, since the least accurate of operands used in its calculation, a, has only two digits of accuracy.
Of course, the compiler knows nothing about this. It goes about its business of assigning 3.015 to c. What happens later in the program when another calculation involving c is made?
d = c * 250.; // assume 250. is exactThe compiler assigns d the value 3.015 * 250, or 753.75. The error continues to mount! The correct result should be 750, since c is known reliably only to two digits of accuracy, making it 3.0. The problem only grows worse as calculations continue.
The solution is to use the rounding features of Digital Mars C++ to fix the number of significant digits in a value. Here's an example of a function named setsigdig, which rounds a number to a specific number of significant digits.
#include <float.h> #include <fltenv.h> #include <fltpnt.h> #include <math.h> double setsigdig(double x, unsigned int n) { double shift, result; int oldmode; if ((n == 0U) || (n > DBL_DIG)) result = x; else { // save the current rounding mode oldmode = fegetround(); // set rounding to truncate // the fractional part fesetround(FE_TOWARDZERO); // adjust the number of significant // digits --n; // calculate the number of digits // to be shifted shift = pow(10.0, double n) * rint(log10(fabs(x))); // restore the original rounding mode fesetround(oldmode); // scale x up, round it, // and scale it back result = rint(x * shift) / shift; } return result; }The setsigdig function works by scaling a double value to an integer with the requested number of significant digits, rounding that value, and then scaling the number back to its original magnitude. The fegetround, fesetround, and rint functions are documented in the section "Rounding functions" later in this chapter.
Table 28-4 describes the floating-point comparison operators, showing the conditions that generate a true (nonzero) or false (zero) result for a comparison. The four relations columns show the results of a comparison for each of four possible relationships between two operands: greater than (>), less than (<), equals (=), and unordered (!).
A comparison is unordered when one (or both) of its operands is a NaN.
Cross-reference the relationship of the operands with the operator; if the table is marked T, the relationship is true (nonzero). If the table is marked F, the relationship is false (a value of zero). Also, if the comparison operator has "yes" listed in the invalid column, the FE_INVALID exception is raised (see "Exceptions") for unordered operands.
Table 28-4 Floating point comparison operators Operator Relations Invalid? Description > < = ? < F T F F yes less than > T F F F yes greater than <= F T T F yes less than or equal to >= T F T F yes greater than or equal to == F F T F no equal to != T T F T no unordered, less than, or greater than !<>= F F F T no unordered <> T T F F yes less than or greater than <>= T T T F yes less than, equal to, or greater than !<= T F F T no unordered or greater than !< T F T T no unordered, greater than, or equal to !>= F T F T no unordered or less than !> F T T T no unordered, less than, or equal to !<> F F T T no unordered or equal toHere are some example comparisons:
NAN != 0.0 True NAN == 0.0 False NAN == NAN False NAN != NAN True NAN !<>= NAN True 1.0 !< 2.0 True 2.0 <> 2.0 False 2.0 != 2.0 FalseThe operators that contain an exclamation point, also called a "bang," are true when the relationship is unordered. All of the floating-point operators can be overloaded in C++ programs.
Be sure to understand the result of negating comparisons. The statement !(a op b) reverses the true and false results given for op in Table 28-4. For example, !(a < b) is not the same thing as (a >= b).
Avoid using an equality condition to end a calculation loop, such as:
float x, float y = 3.5; do { // calculate x } while (x != y);The calculation may vary the last few bits of x slightly, so that it never quite exactly matches y. That, of course, generates an infinite loop. Comparing the difference between two floating-point values is a better solution:
float x, float y = 3.5; do { // calculate x } while (fabs(x -y) > FLT_EPSILON);FLT_EPSILON is a constant defined in float.h; it expands to the smallest possible difference between two floating-point values. The loop may be slightly slower, but minor bit fluctuations in x are less likely to make the loop run forever.
INFINITY is generated by the programmer, when division by zero occurs, or when there is an overflow (see the "Exceptions" section later in this chapter). An FE_INVALID exception is raised when INFINITY is used in the following calculations:
For one thing, NaNs breed new NaNs. Any binary operation involving a NaN generates a quiet NaN. With a few minor exceptions, all the floating-point library functions listed later return a quiet NaN if one or more of their arguments is a NaN. Once a quiet NaN is introduced into a series of calculations, it propagates itself. This can be useful when working with variables that may not have defined values. For example, this program has a subtle bug:
float a, b, c; // lots of program code b = a * 2.0f; // b equals almost anything c = 4.5 / b; // and so c could equal anythinga was never assigned a value, and it may contain any random value. The problem is, the calculations of b and c proceed without any warning that a was never initialized. In fact, a could hold a semivalid random value, and b and c would subsequently be calculated as having numbers that may not look wrong but are! There's no way to know from looking at b or c that an invalid value was introduced into their calculation.
Never leave anything to chance. Always initialize variables immediately after they are declared. If you do not have a valid value yet, initialize a variable to NAN. Subsequent calculations result in the propagation of the NaN; if your final result turns out to be a NaN, you know that an important piece of data was not given a value.
float a = NAN, b = NAN, c = NAN; // imagine lots of program code b = a * 2.0f; // b is a NaN because a is a NaN c = 4.5 / b; // c is a NaN because b is a NaNThe common practice is to initialize undefined variables with a quiet NaN. Then, as the program proceeds, the quiet NaNs propagate themselves, showing where your program is lacking information. This is particularly useful when performing calculations involving matrices that contain unknown values. The NaNs propagate across calculations, and you can tell from the NaNs in the results how much vital information is missing. Or, you can use signaling NaNs instead of quiet ones, so that the first calculation involving an unknown quantity results in an exception.
The sign of a NaN is unimportant, although the state of sign bit can be determined using the signbit macro.
The sum of two equal operands with opposite signs (or the difference of two equal operands with like signs) results in a +0.0, unless the rounding mode is set to FE_DOWNWARD, in which case the result is -0.0. Positive and negative zero are considered equal for comparisons.
double randomd() { return (double) rand() / INT_MAX; }
float fa[10];has 10 elements, numbered from 0 to 9. A simple technique allows the indexing of the array's elements from 1 to 10 by exploiting the close relationship between pointers and arrays. Create a pointer to the array, set it to point to the elements just before the beginning of the array, and then index through the array via the pointer:
float fa[10], *fap; fap = fa; --fap;Use caution to avoid indexing outside an array's defined range. C and C++ do not perform range checks; if you change the 23rd element of a 10 element array, you will overwrite portions of memory outside the array's bounds.
Digital Mars C++ can generate programs that work only with the coprocessor. These are the fastest floating-point programs because they use the hardware most efficiently. However, this added speed comes at a cost: programs compiled specifically for the coprocessor do not run on machines that lack coprocessors.
If you do not have a coprocessor or do not want to rely upon one, Digital Mars C++ can generate a program that uses a software emulator to perform floating-point calculations. The emulator is many times slower than a coprocessor, but your programs can run on computers that do not have coprocessors. The emulator will use the coprocessor if one is available, but it will not use the coprocessor as efficiently as would a program compiled specifically for a coprocessor.
The difference in speed between a coprocessor-based program and an emulator-based program can be significant. For example, a program compiled with Digital Mars C++ ran for 43 seconds when using a coprocessor; run with the emulator, the same program took nearly 15 minutes to execute! Coprocessor-only programs are also slightly smaller than emulator-based programs. For any intensive mathematical work, a coprocessor is a requirement. If you are serious about floating-point performance, a coprocessor is well worth its purchase price.
When a program begins, it assumes that you want to round to the nearest value. A program can change the rounding mode at any time by calling the fesetround function, using one of the macro constants to select a rounding mode. For example, the statement
fesetround(FE_TOWARDZERO);changes the rounding toward zero.
When changing the rounding mode, the previous rounding mode is lost. The fegetround function retrieves the current rounding mode as an int value. A well-written function saves the current rounding mode before changing it; when the function is done, it restores the original rounding mode. For example:
double math_function(double x) { double result; int save_rounding; save_rounding = fegetround(); // save rounding mode fesetround(FE_UPWARD); // set rounding mode // calculate result fesetround(save_rounding); // restore rounding mode return result; }You can estimate the amount of rounding errors in a calculation by performing the calculation twice with two rounding modes. First, set the rounding mode to FE_UPWARD, perform the calculation, and save the result. Then, set the rounding mode to FE_DOWNWARD and perform the calculation again. The difference between the first and second calculations is the rounding error.
The complete specifications of the rounding macros and functions are provided in the section "Rounding functions" later in this chapter.
If a program is compiled for use with the coprocessor, all float and double calculations are carried out to a higher level of precision. The coprocessor uses 80-bit temporary floating-point values for its computations. This provides a very high level of accuracy without affecting the speed of calculations. Because the coprocessor uses more bits for its computations (increasing its accuracy), an emulator-based program may produce slightly different results when running on a machine equipped with a coprocessor.
A program compiled for the coprocessor can change the number of bits used by the coprocessor for calculations. At program startup, the coprocessor is set to use a full 80 bits of precision. The fesetprec function can tell the coprocessor to use fewer bits. While reducing the number of bits does not make the coprocessor perform more quickly, using fewer bits generates results that match those produced by the emulator. These are examples of calls to fesetprec:
fesetprec(FE_FLTPREC); // use float (32-bit) precision fesetprec(FE_DBLPREC); // use double (64-bit) precision fesetprec(FE_LDBLPREC); // use long double (80-bit) precisionThe fegetprec function returns the current precision mode; use it to save the rounding mode when a function changes it, just as the fegetround function you use to save the rounding mode. The precision functions and macro constants are fully documented in the section "The Floating Point Library" later in this chapter.
Digital Mars C++ supports five exceptions that indicate the problems that occur when floating-point calculations take place. These exceptions are:
Exceptions are raised via the feraiseexcept function. A well-written program uses these exceptions to indicate error conditions. For example, most floating-point functions should follow this general framework when handling NaN parameters:
double float_func(double x) { double result; if (isnan(x)) { if (fpclassify(x) == FP_NANS) feraiseexcept(FE_INVALID); return x; } // calculate result return result; }If a function parameter exceeds some limit, an FE_INVALID exception should be raised, and NAN should be the return value.
The fetestexcept function checks to see if one or more exceptions have been raised:
// is the overflow exception raised? if (fetestexcept(FE_OVERFLOW)) { // print error message or // otherwise handle exception }Exceptions are "sticky"; once an exception is raised, it remains raised until the program explicitly clears the exception with the feclearexcept function. A calculation can proceed until its conclusion, at which point the program can check to see which exceptions, if any, have been raised. However, the program must also clear the exceptions before any calculations, so that leftover exceptions from previous calculations do not obfuscate newly raised exceptions. To clear all the exceptions, this function call can be used:
feclearexcept(FE_ALL_EXCEPT);The arguments for feclearexcept, fetestexcept, and feraiseexcept can be one of the macro constants defining exceptions. You can combine several exception macros using the bitwise OR operator, so that more than one exception can be tested/ cleared/raised in one function call:
feraiseexcept(FE_OVERFLOW | FE_UNDERFLOW);FE_ALL_EXCEPT is a macro that evaluates to a bitwise OR of all the exception macro's constants.
Function prototypes and macro definitions for handling exceptions can be found in the file fltenv.h.
EDOM domain error ERANGE range errorA domain error occurs when a parameter passed to a floating-point function is outside the range of that function's calculation range. For example, the sqrt function sets errno to EDOM if it is asked to calculate the square root of a negative value. Range errors occur upon underflow and overflow.
In general, do not use errno for tracking floating-point errors. The facilities provided by exceptions provide more granularity in determining the type of error and are more flexible than errno.
#include <string.h> #include <stdio.h> #include <fltenv.h> #include <setjmp.h> // prototype void _far _cdecl FloatError(int sig); // global variable to hold // setjmp/ longjmp information volatile jmp_buf ErrJump; // main function int main() { // assign floating-point exception handler signal(SIGFPE, FloatError); // set-up return address for errors if (setjmp(ErrJump)) return 1; //********************************* // do something with floating-point //********************************* // outta here return 0; } // floating-point error handler void _far _cdecl FloatError(int sig) { char temp[256]; strcpy(temp," Floating point error: "); if (fetestexcept(FE_INEXACT)) strcat(temp,"[Inexact] "); if (fetestexcept(FE_DIVBYZERO)) strcat(temp,"[Divide by 0] "); if fetestexcept(FE_UNDERFLOW)) strcat(temp,"[Underflow] "); if fetestexcept(FE_OVERFLOW)) strcat(temp,"[Overflow] "); if fetestexcept(FE_INVALID)) strcat(temp,"[Invalid] "); printf("\a%s\n", temp); longjmp(ErrJump, 1); }The detailed workings of the setjmp, longjmp, and signal functions are outside the scope of this manual. Suffice it to say that the call to signal tells the program to call FloatError whenever a floating-point error occurs. The setjmp function marks a program location to which FloatError returns (via longjmp) after processing the error.
To simply preserve the floating-point environment, the fesetenv and fegetenv functions can handle the task. fegetenv stores the current state of the floating-point in a fenv_t object; fesetenv sets the state of the floating-point environment to the value of a fenv_t object. So, a function that wishes to preserve the floating-point environment can perform these function calls:
double num_func(double x) { double result; fenv_t saved_env; // save the floating-point environment fegetenv(&saved_env); [...] // change the environment and calculate a result // restore the original environment fesetenv(&saved_env); return result; }The feprocentry and feprocexit functions do more than preserve the floating-point environment; they also reset the environment to the startup conditions: round to nearest, FE_LDBLPREC precision, and no exceptions set. To use the default settings, thus avoiding any unpredictable changes made since the program started, use feprocentry and feprocexit instead of fegetenv and fesetenv:
double num_func(double x) { double result; fenv_t saved_env; // save floating-point environment // & restore default feprocentry(&saved_env); [...] // calculate a result // restore the program's previous // environment feprocexit(&saved_env); return result; }
The function descriptions often have an associated Special Results table that describes how the function handles certain kinds of arguments. The column headings mean:
Before using any of the functions, macros, data types, or constants defined in these headers, #include the appropriate header file into your program.
To ease the development of efficient and portable applications, NCEG created two compiler-defined floating-point types, float_t and double_t. These types are defined in fltpnt.h as representing the most efficient data types for single-and double-precision computations in the host environment. In the case of Digital Mars C++, float_t is defined as float, and double_t is defined as double.
Macro Name | Definition |
---|---|
FLT_DIG | The number of decimal digits that can be represented accurately by a float. |
FLT_MANT_DIG | The number of baseFLT_RADIX digits in the significand of a float. In the case of binary floating-point implementations (like on Intel x86 processors), this number represents the number of binary digits in the significand. |
FLT_MAX_10_EXP | Maximum positive integer n such that 10n within the range of normalized floating-point numbers |
FLT_MAX_EXP | Maximum positive integer n such that FLT_RADIXn-1 is representable |
FLT_MIN_10_EXP | Minimum negative integer n such that 10n is within the range of normalized floating-point numbers |
FLT_MIN_EXP | Minimum negative integer n such that FLT_RADIXn-1 is a normalized floating-point number |
FLT_RADIX | Radix of exponent representation |
FLT_ROUNDS | Direction of rounding |
FLT_MAX | Maximum representable positive floating-point number |
FLT_MIN | Minimum representable normalized positive floating-point number |
FLT_EPSILON | Minimum positive number x such that 1.0+x does not equal 1.0 |
DBL_xxx | double versions |
LDBL_xxx | long double versions |
The two data types defined in fltenv.h are:
FE_INEXACT FE_DIVBYZERO FE_UNDERFLOW FE_OVERFLOW FE_INVALIDThis macro represents the bitwise-OR (|) of all five of the above-mentioned exception flags:
FE_ALL_EXCEPTThese macros are used with the feraiseexcept, fetestexcept, feclearexcept, fegetexcept, and fesetexcept functions.
FE_TONEAREST Round to the nearest value FE_UPWARD Round toward positive infinity FE_DOWNWARD Round toward negative infinity FE_TOWARDZERO Round toward zero (drop fractional part)Each macro expands to a unique, positive constant of type int. The fesetround and fegetround functions use these macros.
FE_FLTPREC Use float (32-bit) precision FE_DBLPREC Use double (64-bit) precision FE_LDBLPREC Use long double (80-bit) precisionEach of these macros expands to a unique constant of type int. The fesetprec and fegetprec functions use these macros. These macros have no effect unless a coprocessor is being used.
FP_NANS Signaling NaN FP_NANQ Quiet NaN FP_INFINITE Infinity FP_NORMAL Any number not covered by other classifications FP_SUBNORMAL Subnormal or denormal FP_ZERO ZeroThese macros expand to unique constant expressions of type int; they are used to classify floating-point values.
These macros can be compared to the result from the macro fpclassify, which is discussed later in this section.
NAN Expands to represent the value of a quiet NaN NANS Expands to represent the value of signaling NaN INFINITY Expands to represent the value of positive infinityUnary + and -operators can set the sign of these values; for example, -INFINITY represents negative infinity.
Note: The precision mode has meaning only when inline coprocessor instructions are used. The software floating-point emulator ignores the precision mode.
Note: Parameter values representing angles larger than p/4 may result in long calculation times or FE_INVALID exceptions, depending upon the type of coprocessor installed in a given computer.
Convert text to double Function atof Header stdlib.h Prototype double atof(const char *nptr); Purpose Converts the characters in the string (pointed to by nptr) into a double. Returns A double based on the text pointed to by nptr. Comments atof is implemented as if it were a call to strtod. Convert string to float Function strtof, strtod Header stdlib.h Prototype float strtof(const char *nptr, char ** endptr); double strtod(const char *nptr, char ** endptr); Purpose Converts the character representation of a floating-point value to a float or double. The function assumes that the input string (pointed to by nptr) is in the same format as a literal floating-point constant. The first character encountered that cannot be a part of a literal floating-point constant terminates the translation process. The address of the terminating character is assigned to the pointer variable pointed to by endptr. +-NAN and +-INF are valid input values representing NaN and infinity. Returns A float or double value based on the string pointed to by nptr.
Examine sign (macro) Function signbit Header fltpnt.h Purpose Examines the sign bit of a floating-point value, including NaNs, infinities, and zero. Returns Nonzero if the sign bit is 1; zero if the sign bit is 0.
The coprocessor's control word changes the behavior of the coprocessor. Its important bits are:
00 --round toward nearest or even
01 --round toward negative infinity
10 --round toward positive infinity
11 --round toward zero (truncate)
if (mask != 0) cw = ((cw & ~ mask) | (new & mask));where cw is the current value of the control word. Defini-tions of the bits in the control and status words are in float.h.
Calling the coprocessor-specific functions is not advised if the functions listed in the section "The Floating Point Library" (earlier in this chapter) are also being used to change the floating-point environment. The two sets of functions are unaware of each other, and conflicts may arise if one set of functions makes a change that the other set of functions is not aware of. The coprocessor control functions are provided for compatibility with other MS-DOS C and C++ compilers.
Assembly language functions that call the coprocessor can use all of the coprocessor's stack registers, provided that those registers are left empty when the function returns. When calling floating-point library functions from assembly language routines, assume that the function will use the entire coprocessor stack. It is a calling function's responsibility to save and restore coprocessor registers.
main() { extern int _8087; _8087 = 0; /* never use 80x87 */ ... ... }
SC file -fCode compiled with the -f option crashes if a math coprocessor is not present. We recommend you insert the following code when using the -f option:
void main() { extern int _8087; if (_8087 == 0) /* if no 80x87 present */ { printf("This program requires an 80x87\n"); exit(1); /* halt program */ } ... }Experienced programmers using option 4 can modify and recompile the libraries to remove all the non-coprocessor floating-point code. This makes the program smaller and slightly faster.
All four options use the same Digital Mars libraries.
Note: The above options do not change call/return conventions between functions; therefore, you can link object modules compiled using any option.
For example, an application that does a matrix inversion can have two versions:
SC -c -f matrix -DINLINE8087 -omatrix87. obj SC -c matrix SC main matrix. obj matrix87. objThe file matrix.c contains:
#ifdef INLINE8087 void matrix87() /* different name for 87 version*/ #else void matrix() #endif { ... }The file main.c contains:
extern int _8087; ... if (_8087) /* if there is an 80x87 */ matrix87(); /* use 80x87 version */ else matrix(); /* use software floating */ /* point version */When you write an executable program this way, it can exploit all the advantages of the coprocessor but still work correctly if the target machine does not have one. You can find similar examples in the math.c library source code.
-fd Generate IEEE 754 inline 8087 code with workaround for FDIV bug
-ff Generate fast inline 8087 code
The relevant header file is complex.h. The old C++ complex class conflicts with this, and was renamed to oldcomplex.h. There is no reason to use the complex class in new code, as the native types supercede all the functionality in the class.
A whole range of new library functions are added to deal with complex numbers. They're documented in Complex Numbers.