Just knowledge it: characters and strings

While the attention of the first computer designers focused mainly on numeric calculations, it was clear that much of the data that business people and others would want to manipulate with the new machines would be textual in nature. Billing records, for example, would have to include customer names and addresses, not just balance totals.

The “natural” representation of data in a computer is as a series of two-state (binary) values, interpreted as binary numbers. The solution for representing text (letters of the alphabet, punctuation marks, and other special symbols) is to assign a numeric value to each text symbol. The result is a character code, such as ASCII (American Standard Code for Information Interchange), which became the most widely used scheme. (Another system, EBCDIC (Extended Binary-Coded Decimal Interchange Code), was used during the heyday of IBM mainframes but is seldom used today.)

The seven-bit ASCII system is compact (using one byte of memory to store each character), and was quite suitable for early microcomputers that required only the basic English alphabet, punctuation, and a few control characters (such as carriage return). In an attempt to use characters to provide simple graphics capabilities, an “extended ASCII” was developed for use on IBM-compatible PCs. This used eight bits, increasing the number of characters available from 128 to 256. However, the use of bitmapped graphics in Windows and other operating systems made this version of ASCII unnecessary. Instead, the ANSI (American National Standards Institute) eight-bit character set used the additional character positions to store a variety of special symbols (such as fractions and the copyright symbol) and various accent marks used in European languages.

Table of 7-Bit ASCII Character Codes

The following are control (nonprinting) characters:

0 Null (nothing)
7 Bell (rings on an old teletype; beeps on most PCs)
8 Backspace
9 Tab
10 Line feed (goes to next line without changing column position)
13 Carriage return (returns to the beginning of the line)
26 End of file (Ctrl-Z, on CP/M and MS-DOS systems)
27 [Esc] (Escape key)

The codes from 32 to 126 produce printable characters; code 127 is the nonprinting delete character.

32 [space]   64 @   96 `
33 !         65 A   97 a
34 "         66 B   98 b
35 #         67 C   99 c
36 $         68 D   100 d
37 %         69 E   101 e
38 &         70 F   102 f
39 '         71 G   103 g
40 (         72 H   104 h
41 )         73 I   105 i
42 *         74 J   106 j
43 +         75 K   107 k
44 ,         76 L   108 l
45 -         77 M   109 m
46 .         78 N   110 n
47 /         79 O   111 o
48 0         80 P   112 p
49 1         81 Q   113 q
50 2         82 R   114 r
51 3         83 S   115 s
52 4         84 T   116 t
53 5         85 U   117 u
54 6         86 V   118 v
55 7         87 W   119 w
56 8         88 X   120 x
57 9         89 Y   121 y
58 :         90 Z   122 z
59 ;         91 [   123 {
60 <         92 \   124 |
61 =         93 ]   125 }
62 >         94 ^   126 ~
63 ?         95 _   127 [delete]

As computer use became more widespread internationally, even 256 characters proved to be inadequate. A new standard called Unicode can accommodate all of the world’s alphabetic languages, including Arabic, Hebrew, and Japanese (Kana). Unicode schemes can also be used to encode ideographic languages (such as Chinese) and languages such as Korean that use syllabic components. At present each ideograph has its own character code, but Unicode 3.0 includes a scheme for describing ideographs through their component parts (radicals). Most modern operating systems use Unicode exclusively for character representation. However, support in software such as Web browsers is far from complete, though steadily improving. Unicode also includes many sets of internationally used symbols such as those used in mathematics and science. In order to accommodate this wealth of characters, Unicode was designed around 16 bits per character, allowing for 65,536 different characters at the expense of requiring twice the memory storage; later versions extend the code space beyond 16 bits.

Programming with Strings

Before considering how characters are actually manipulated in the computer, it is important to realize that what a binary value such as 1000001 (decimal 65) stored in a byte of memory actually represents depends on the context given to it by the program accessing that location. If the program declares an integer variable, then the data is numeric. If the program declares a character (char) value, then the data will be interpreted as an uppercase “A” (in the ASCII system).

Most character data used by programs actually represents words, sentences, or longer pieces of text. Multiple characters are represented as a string. For example, in traditional BASIC the statement:

NAME$ = "Homer Simpson"

declares a string variable called NAME$ (the $ is a suffix indicating a string) and sets its value to the character string “Homer Simpson”. (The quotation marks are not actually stored with the characters.)

Some languages (such as BASIC) store a string in memory by first storing the number of characters in the string, followed by the characters, with one in each byte of memory. In the family of languages that includes C, however, there is no string type as such. Instead, a string is stored as an array of char. Thus, in C the preceding example might look like this:

char Name[20] = "Homer Simpson";

This declares Name as an array of 20 characters (enough for up to 19 characters of text plus the terminating null character), and initializes it to the string literal “Homer Simpson”.

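A similar (though not strictly equivalent) form is:

const char *Name = "Homer Simpson";

Here Name is a pointer that holds the memory location where the data begins. The string of characters “Homer Simpson” is stored starting at that location. Because a string literal must not be modified, the pointer is declared const; the array form, by contrast, holds its own copy of the characters, which the program may change.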

Unlike the case with BASIC, in the C family of languages the number of characters is not stored at the beginning of the data. Rather, a special “null” character (with the value 0) is stored to mark the end of the string.

Programs can test strings for equality or even for greater than or less than. However, programmers must be careful to understand the collating sequence, or the order given to characters in a character set such as ASCII. For example, the test

If State = "CA"

will fail if the current value of State is “ca”. The lowercase characters have different numeric values than their uppercase counterparts (and indeed must, if the two are to be distinguished). Similarly, the expression:

"Zebra" < "aardvark"

is true because uppercase “Z” (code 90) comes before lowercase “a” (code 97) in the collating sequence.

Programming languages differ considerably in their facilities for manipulating strings. BASIC includes built-in functions for determining the length of a string (LEN) and for extracting portions of a string (substrings). For example, given the string Test$ consisting of the text “Test Data”, the expression RIGHT$(Test$, 4) would return “Data”.

Following their generally minimalist philosophy, the C and C++ languages contain no built-in string facilities. Rather, these are provided as part of the standard library, which can be included in programs as needed. Consider the following little program:

#include <iostream>
#include <cstring>
using namespace std;

int main()
{
    char String1[20];
    char String2[20];

    // Copy the initial text into each array
    strcpy(String1, "Homer");
    strcpy(String2, "Simpson");

    // Concatenate String2 to the end of String1
    strcat(String1, String2);

    cout << String1 << endl;
    return 0;
}

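Here the strcpy function is used to initialize the two strings, and then the strcat (string concatenate) function is used to combine the two strings and store the result back in String1, which is then sent to the output. Note that String1 must be large enough to hold the combined text plus the terminating null character, since strcat does not check for this.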

As an alternative, one can take advantage of the object orientation of C++ and define a string class. The addition operator (+) can then be extended, or “overloaded,” so that it will concatenate strings. Then, the preceding program, instead of using the strcat function, can use the more natural syntax:

cout << String1 + String2

to display the combined strings.

String-Oriented Languages

Sophisticated string processing (such as parsing and pattern matching) tends to be awkward to express in traditional number-oriented programming languages. Several languages have been designed especially for manipulating textual data. Snobol, designed in the early 1960s, is best known for its sophisticated pattern-matching and pattern processing capabilities. A similar language, Icon, is widely used for specialized string-processing tasks today. Many programmers working with textual data in the UNIX environment have found that the awk and Perl languages are easier to use than C for extracting and manipulating data fields. (See awk and Perl.)

Tuesday, 22 October 2013