Tuesday, 4 November 2014

Information retrieval

While much attention is paid by system designers to the representation, storage and manipulation of information in the computer, the ultimate value of information processing software is determined by how well it provides for the effective retrieval of that information. The quality of retrieval is dependent on several factors: hardware, data organization, search algorithms, and user interface.

At the hardware level, retrieval can be affected by the inherent seek time of the device upon which the data is stored (such as a hard disk), the speed of the central processor, and the use of temporary memory to store data that is likely to be requested (see cache). Generally, the larger the database and the amount of data that must be retrieved to satisfy a request, the greater the relative importance of hardware and related system considerations.
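
The caching idea can be sketched in a few lines of Python. This is only an illustration of the principle; the RecordCache class and the fetch_from_disk callable are hypothetical stand-ins, not the interface of any actual system:

    from collections import OrderedDict

    class RecordCache:
        """Keep recently fetched records in memory to skip slow device reads."""
        def __init__(self, capacity=128):
            self.capacity = capacity
            self.entries = OrderedDict()             # record id -> record data

        def get(self, record_id, fetch_from_disk):
            if record_id in self.entries:
                self.entries.move_to_end(record_id)  # mark as recently used
                return self.entries[record_id]       # fast path: cache hit
            record = fetch_from_disk(record_id)      # slow path: read the device
            self.entries[record_id] = record
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)     # evict least recently used
            return record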

Data organization includes the size of data records and the use of indexes on one or more fields. An index is a separate file that contains field values (usually sorted alphabetically) and the numbers of the corresponding records. With indexing, a fast binary search can be used to match the user’s request to a particular field value and then the appropriate record can be read (see hashing).
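
As a concrete (if toy) illustration, the following Python sketch holds sorted (field value, record number) pairs, much as an index file might, and uses a binary search to find the record number for a requested value; the names and data are invented:

    import bisect

    # Sorted (field value, record number) pairs, as an index file would hold.
    index = [("adams", 17), ("baker", 4), ("chen", 22), ("davis", 9)]
    keys = [value for value, _ in index]

    def lookup(value):
        i = bisect.bisect_left(keys, value)      # binary search: O(log n)
        if i < len(keys) and keys[i] == value:
            return index[i][1]                   # record number to go read
        return None

    print(lookup("chen"))                        # -> 22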

There is a tradeoff between storage space and ease of retrieval. If all data records are the same length, random access can be used; that is, the location of any record can be calculated essentially by multiplying the record’s sequence number by the fixed record length. However, having a fixed record size means that records with shorter data fields must be “padded,” wasting disk space. Given the low cost of disk storage today, space is generally less of a consideration.
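
The offset arithmetic for fixed-length records can be shown in a short sketch; the record size and file name here are assumptions made for the example:

    RECORD_SIZE = 64   # every record padded to the same fixed length

    def read_record(f, sequence_number):
        f.seek(sequence_number * RECORD_SIZE)    # one seek reaches any record
        return f.read(RECORD_SIZE)

    # with open("records.dat", "rb") as f:       # hypothetical data file
    #     print(read_record(f, 10))              # fetch the 11th record directly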

The search algorithms used by the program can also have a major impact on retrieval speed (see sorting and searching). As noted, if a binary search can be done against a sorted list of fields or records, the desired record can be found in only a few comparisons. At the opposite extreme, if a program has to move sequentially through a whole database to find a matching record, the average number of comparisons needed will be half the number of records in the file. (Compare looking up something in a book’s index to reading through the book until you find it.)
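
A little arithmetic makes the gap vivid. The sketch below assumes the average case for a sequential scan (n/2 comparisons) and the worst case for a binary search (about log2 n comparisons):

    import math

    def sequential_comparisons(n):
        return n / 2                     # average for an unsorted file

    def binary_comparisons(n):
        return math.ceil(math.log2(n))   # worst case against a sorted index

    for n in (1_000, 1_000_000):
        print(n, sequential_comparisons(n), binary_comparisons(n))
    # 1,000 records: 500 vs. 10;  1,000,000 records: 500,000 vs. 20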

Real-world searching is considerably more complex, since search requests can often specify conditions such as “find e-commerce but not amazon.com” (see Boolean operators). Searches can also use wildcards to find a word stem that might have several different possible endings, proximity requirements (find a given word within so many words of another), and other criteria. Providing a robust set of search options enables skilled searchers to more precisely focus their searches, bringing the number of results down to a more manageable level. The drawback is that complex search languages result in more processing (often several intermediate result sets must be built and internally compared to one another). There is also more likelihood that searchers will either make syntax errors in their requests or create requests that do not have the intended effect.
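
One common way (though not the only one) to evaluate such Boolean requests is with an inverted index that maps each term to the set of documents containing it, so that AND, OR, and NOT become set operations. The tiny index below is invented for illustration:

    # Inverted index: each term maps to the set of documents containing it.
    inverted = {
        "e-commerce": {1, 2, 5, 8},
        "amazon.com": {2, 8},
    }

    # "find e-commerce but not amazon.com" becomes a set difference.
    result = inverted["e-commerce"] - inverted["amazon.com"]
    print(sorted(result))   # -> [1, 5]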

While database systems can control the organization of data, the pathways for retrieval and the command set or interface, the World Wide Web is a different matter. It amounts to the world’s largest database—or perhaps a “metabase” that includes not only text pages but file resources and links to many traditional database systems. While the flexibility of linkage is one of the Web’s strengths, it makes the construction of search engines difficult. With millions of new pages being created each week, the “web-crawler” software that automatically traverses links and records and indexes site information is hard pressed to capture more than a diminishing fraction of the available content. Even so, the number of “hits” is often unwieldy (see search engine).
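
In bare outline, the traversal idea can be sketched as a breadth-first walk over links. This is a generic illustration, not any particular engine's code, and it omits the politeness delays, robots.txt handling, and robust parsing a real crawler requires; the seed URL would be supplied by the caller:

    from collections import deque
    from urllib.parse import urljoin
    from urllib.request import urlopen
    import re

    def crawl(seed, limit=10):
        seen, queue = set(), deque([seed])
        while queue and len(seen) < limit:
            url = queue.popleft()
            if url in seen:
                continue
            seen.add(url)                         # record the page
            try:
                html = urlopen(url).read().decode("utf-8", errors="ignore")
            except (OSError, ValueError):
                continue
            for link in re.findall(r'href="([^"]+)"', html):
                queue.append(urljoin(url, link))  # traverse its links
        return seen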

A number of strategies can be used to provide more focused search results. The title or full text of a given page can be checked for synonyms or other ideas often associated with the keyword or phrase used in the search. The more such matches are found, the higher the degree of relevance assigned to the document. Results can then be presented in declining order of relevance score. The user can also be asked to indicate a result document that he or she believes to be particularly relevant. The contents of this document can then be compared to the other result documents to find the most similar ones, which are presented as likely to be of interest to the researcher.
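
A toy version of such relevance scoring might count matches between each document and the query terms (including associated terms), then sort the results by score; the documents and term list here are invented:

    def score(doc_text, terms):
        words = doc_text.lower().split()
        return sum(words.count(t) for t in terms)   # count keyword matches

    query_terms = ["retrieval", "search", "index"]  # keyword plus associations

    docs = {
        "A": "search engines index pages for fast retrieval",
        "B": "a history of the printing press",
    }

    # Present results in declining order of relevance score.
    ranked = sorted(docs, key=lambda d: score(docs[d], query_terms), reverse=True)
    print(ranked)   # -> ['A', 'B']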

Information retrieval from either stand-alone databases or the Web can also be improved by making it unnecessary for users to employ structured query languages (see SQL) or even carefully selected keywords. Users can simply type in their request in the form of a question, using ordinary language: For example, “What country in Europe has the largest population?” The search engine can then translate the question into the structured queries most likely to elicit documents containing the answer. Ask Jeeves (retired as of 2006) and similar search services have thus far been only modestly successful with this approach.
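
In its crudest form, such translation might begin by simply discarding common “stop words” to leave candidate keywords, as in the sketch below; real question-answering services apply far richer linguistic analysis, and the stop list shown is a small invented sample:

    STOP_WORDS = {"what", "in", "the", "has", "a", "is", "of", "which"}

    def to_query(question):
        words = question.lower().rstrip("?").split()
        return [w for w in words if w not in STOP_WORDS]

    print(to_query("What country in Europe has the largest population?"))
    # -> ['country', 'europe', 'largest', 'population']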

On a large scale, systematic information retrieval and analysis (see data mining) has become increasingly sophisticated, with applications ranging from e-commerce and scientific data analysis to counterterrorism. Artificial intelligence techniques (see pattern recognition) play an important role in cutting-edge systems.

Finally, encoding more information about content and structure within the document itself can provide more accurate and useful retrieval. The use of XML and work toward a “semantic Web” offer hope in that direction (see Berners-Lee, Tim; Semantic Web; and XML).
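
As a small illustration of the point (with an invented document and invented tag names), explicitly tagged fields let a program retrieve by structure rather than by scanning undifferentiated text:

    import xml.etree.ElementTree as ET

    doc = ET.fromstring("""
    <article>
      <title>Information Retrieval</title>
      <author>A. Writer</author>
      <topic>search engines</topic>
    </article>
    """)

    # Retrieve by tagged field rather than scanning raw text.
    print(doc.findtext("title"))   # -> Information Retrieval
    print(doc.findtext("topic"))   # -> search engines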
