Companion Website to the
Third Edition (2001) and Fourth Edition (forthcoming) of
Reference and Information Services
by Richard E. Bopp and Linda C. Smith
Excerpts from CHAPTER 6: UNDERSTANDING ELECTRONIC INFORMATION SYSTEMS FOR REFERENCE
Chapter Author: Kathleen M. Kluegel
INTERNET SOLUTIONS
People seeking to understand the Internet and use it effectively have a variety of options available. Several print guides aid in the discovery of Internet and Web resources. These include the column "Internet Resources," published in each issue of College & Research Libraries News. Each of the columns focuses on a particular subject area and identifies authoritative resources that deal with the topic. There are columns in newspapers and magazines that identify sites that address consumer, health, travel, entertainment, and other information needs. However, the vast majority of such aids are contained within the Internet itself. It may seem paradoxical to turn to the Internet for help in solving Internet problems, but it can be a very effective strategy. In this section, some of the most widely available systems for navigating the Internet and Web are described.
Web search engines, such as Google, Windows Live Search (MSN), and Yahoo! Search, are the primary ways that people discover resources on the Web. Yet there is a nearly universal feeling of uncertainty in using them. Some of the questions that arise include: Which one works best for this request? Is the coverage complete? How can the search be made more effective? There are no hard and fast answers to any of these questions, but learning the principal mechanisms behind the search engines can help determine approximate answers for any given question on any given day. It can also help formulate more effective search strategies. What is commonly called a Web search engine is really a Web search system consisting of three major components: the crawler (or spider), the index, and the search engine itself. The crawler is a robot program that is sent out on the Web to discover new Web pages and explore the other Web pages at each site. The crawler brings this information back to the base. The information for each Web page (the URL, the title, words and phrases from the page, etc.) is added to the index. The search engine operates on the information contained in the index. When a search is conducted, the search engine looks through its index and identifies all the pages that "match" the search query.
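The three components described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the crawler's output is simulated by a small hypothetical dictionary (CRAWLED_PAGES), and the function names build_index and search are invented for the example rather than taken from any real search engine.

```python
from collections import defaultdict

# Hypothetical pages standing in for what a crawler might bring back.
CRAWLED_PAGES = {
    "http://example.edu/": "University library home page",
    "http://example.edu/hours": "Library hours and locations",
    "http://example.org/spiders": "Spiders spin silk webs",
}

def build_index(pages):
    """Map each lowercased word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Return the URLs that contain every word of the query (a simple AND match)."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

INDEX = build_index(CRAWLED_PAGES)
print(sorted(search(INDEX, "library")))
```

Note that the search function never looks at the pages themselves, only at the index, which is why a page the crawler never visited cannot be retrieved.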
If all search engines are based on these same three components, how does one explain their differences in action? This chapter discusses the differences in general. For a more thorough examination, one can go to the Web site Search Engine Showdown: The Users' Guide to Web Searching, created by Greg R. Notess, which provides very useful descriptions, comparisons, and reviews of search engines. It is important to understand, however, that none of the search engines separately, and not even all of them together, index the Web in its entirety. There is a site, www.worldwidewebsize.com, that tries to measure the size of the indexed Web, that is, the part of the Web that the search crawlers have retrieved and put into their respective databases.
One difference in search engines lies in the number of pages at each Web site that are selected for indexing. Some Web crawlers identify only the top-level page or top- and second-level pages at each site; others follow all the URL extensions for each site. For example, while all Web crawlers would index a page with the hypothetical URL, http://theuniversityofwonderful.edu, some would omit the third-level URL, http://theuniversityofwonderful.edu/library/information.html. This leads to some difficult choices for Web site developers. On the one hand, a single, main URL gives all the pages a unifying identity and clearly marks all the pages, regardless of level, as belonging to the overarching organization. On the other hand, it can make most of the pages invisible to at least some search engines. If each sub-unit of the main organization has its own top-level URL, it makes more Web pages retrievable, but makes it more difficult for users of the page to recognize the internal relationships of the pages with one another and to explore the organizational hierarchy. This alternative approach would yield URLs which followed this pattern: http://home.universityofwonderful.edu, http://library.universityofwonderful.edu, http://englishdepartment.universityofwonderful.edu, and so forth.
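To make the idea of URL levels concrete, here is a minimal Python sketch. It assumes, purely for illustration, that a page's level is one more than the number of path segments in its URL and that a crawler has a fixed depth limit; the names url_level and would_index are hypothetical, and real crawlers apply their own rules.

```python
from urllib.parse import urlparse

def url_level(url):
    """Return 1 for a site's top page, plus 1 for each path segment below it."""
    path = urlparse(url).path
    segments = [segment for segment in path.split("/") if segment]
    return 1 + len(segments)

def would_index(url, max_level):
    """Assumed crawler rule: index a page only if it is within max_level."""
    return url_level(url) <= max_level

top = "http://theuniversityofwonderful.edu"
deep = "http://theuniversityofwonderful.edu/library/information.html"
print(url_level(top), url_level(deep))
```

Under this rule a crawler limited to two levels would index the top page but skip the library's information page, while a separate host name such as http://library.universityofwonderful.edu would count as a top-level page in its own right.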
Another difference lies in how each search engine indexes a page that the crawler identifies. Some search engines select every word from the entire Web page for their indexes. Other search engines use formulas that determine what words from which parts of the page are included in the index. For example, a search engine might select all the words from the title and the first twenty lines of the text, supplemented by the hundred most frequently used words in the entire document.
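The selection formula given as an example above (title words, the first twenty lines, and the hundred most frequent words) could be sketched as follows. The function name select_index_terms and its parameters are assumptions made for this illustration; splitting on whitespace is a simplification of how a real indexer would tokenize a page.

```python
from collections import Counter

def select_index_terms(title, body, first_lines=20, top_words=100):
    """Pick index terms from the title, the opening lines, and the most frequent words."""
    terms = set(title.lower().split())
    # Every word from the first few lines of the body text.
    for line in body.splitlines()[:first_lines]:
        terms.update(line.lower().split())
    # The most frequently used words anywhere in the document.
    counts = Counter(body.lower().split())
    terms.update(word for word, _ in counts.most_common(top_words))
    return terms

sample_body = "alpha beta\ngamma gamma gamma\ndelta"
print(sorted(select_index_terms("Spider Silk", sample_body, first_lines=1, top_words=1)))
```

With the small limits used in the sample call, the word "delta" is left out of the index entirely, which shows how a page can contain a term and still not be retrievable by it.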
The relative position of terms and phrases within the document is an important element in determining content. A document is likely to put its primary subject in the title and the first paragraph of the document rather than the last paragraph. Terms one encounters for the first time in the last paragraph are not very likely to have received a comprehensive treatment.
Another difference among search engines is the way they determine what a Web page is about. As discussed in Chapter 5, it is a reasonable operating assumption that a document that uses a term many times is more "about" that term than one that uses it only once or twice. Some search engines count the words and phrases contained in the page and use ratio formulas to determine the most important terms. As an additional strategy to try to identify key concepts, a search engine might use statistical frequency to analyze clusters of terms in relation to one another to see if meaningful patterns emerge. Other search engines look at the titles and keywords of pages that are linked to the original page to further reinforce or refine the subject of the page. A page that has the keyword spider, for example, and also has links to pages which have the words spider, silk, and arachnids is more likely to be about the eight-legged creatures than about Web search crawlers.
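One simple form a "ratio formula" might take is the share of a page's words accounted for by a given term. The sketch below assumes that definition for illustration only; term_ratio is a hypothetical name, and actual search engines use more elaborate, unpublished formulas.

```python
def term_ratio(text, term):
    """Return the fraction of the words in text that are exactly the given term."""
    words = text.lower().split()
    if not words:
        return 0.0
    return words.count(term.lower()) / len(words)

print(term_ratio("spider silk spider web", "spider"))
```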
Some search engines provide another layer of structure for their indexes by identifying the field for selected terms. Typical fields that some search engines include are title, named person, URL elements, and dates associated with the page. These search engines allow the searcher to specify the kind of information needed to satisfy the search request, and in this respect they function a bit like the database search systems discussed in Chapter 5. Like those database search engines, the Web search engines require that field limits be constructed according to specific patterns to work correctly. As Randolph Hock notes, some search field elements are interpreted differently by different systems. The date field in particular is one which can mean any one of several dates associated with a page: date updated, date created, date indexed, and so forth. Each search engine has a link to a help screen or a guide for advanced searching that will clarify how to format a field search and provide examples of its correct use. Each one also provides a menu that assists infrequent users of that search engine with field searching.
Search engines strive to achieve the goals of recall and precision while dealing with the millions of pages and hundreds of millions of indexing terms contained within the Web. In the context of searches, recall refers to the retrieval of all the pages that are relevant to the query, while precision refers to retrieving only the pages that are relevant to the query. The challenges to be faced by search engines in achieving these incompatible goals are substantial. One often hears the phrase "comparing apples and oranges" when describing the difficulty of comparing the relative merits of disparate objects. On the Web, this challenge can be described as "comparing apples, oranges, eagles, mountains, elephants, and petunias." Pages differ in their size, audience, design, content, and structure. They are created by earnest third-graders and eminent professors. They vary in accuracy, reliability, availability, and readability. The language each uses in its content differs from another. Like snowflakes, no two Web pages are alike. However, Web pages, unlike snowflakes, are made up of a variety of elements. A Web page may be all text or mostly images; full of sound or completely silent. A search engine combing the indexes of the Web has to develop strategies for dealing with this overwhelming diversity in ways that allow the users to take advantage of the rich resources available. All the decisions on depth of indexing, patterns, field labels, and other indexing algorithms have an impact on how each search engine achieves this goal.
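Recall and precision are usually expressed as ratios: recall is the share of all relevant pages that were retrieved, and precision is the share of retrieved pages that are relevant. The short Python sketch below computes both for a hypothetical search; the function name and the sample page labels are invented for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved pages and a set of relevant pages."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical search: four pages retrieved, three pages actually relevant.
precision, recall = precision_recall({"a", "b", "c", "d"}, {"a", "b", "e"})
print(precision, recall)
```

Retrieving more pages can only keep recall the same or raise it, but it tends to lower precision, which is one way to see why the two goals pull against each other.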
To achieve recall, Web search engines use a variety of techniques. One of the principal methods is truncation, or stemming. The search words that are input are stripped of their endings, and all the words in the index that match the stem are added to the retrieval results. For example, if the search query includes the term "swimmers", the search engine would stem the word and search for those terms in the index which begin with the root "swim-". Thus "swims," "swimming," "swim," and "swimmer" would all be counted as a match for the search term and retrieved. This is just one matching rule that each search engine defines for itself. Another rule concerns capitalization of search terms. Some search engines disregard the case of the search query, so that both capitalized and lowercase terms are considered a match. Other search engines are case-sensitive and include the case of the search term as an element in determining a match. In concept-based search engines, terms from the search query are processed much as one might consult a thesaurus: a set of synonyms for each term is identified and searched in the index. In these systems, the list of terms to be searched is developed through statistical analysis of Web pages and thus rests on imperfect assumptions about the exact relationship between any one term and a concept. All of these search rules and algorithms have as their aim retrieving a comprehensive set of Web pages for each search query.
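The "swimmers" example can be reproduced with a deliberately naive stemmer. This sketch assumes a tiny, hypothetical suffix list and a rule that collapses a doubled final consonant; it is not the algorithm of any particular search engine, and production stemmers handle many more cases.

```python
# Assumed suffix list for illustration; longer suffixes are tried first.
SUFFIXES = ("ers", "er", "ing", "s")

def stem(word):
    """Strip one common ending, then collapse a doubled final consonant."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word

def truncation_match(query_term, index_terms):
    """Return the index terms that begin with the stem of the query term."""
    root = stem(query_term)
    return [term for term in index_terms if term.lower().startswith(root)]

print(truncation_match("swimmers", ["swims", "swimming", "swim", "swimmer", "swan"]))
```

Because the match is on the opening letters only, a term such as "swimsuit" would also be retrieved, which illustrates how a rule that improves recall can cost precision.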
To achieve precision, each Web search engine tries to assure that the documents retrieved are focused on the search query. Relevancy scores and ranking are the two primary ways to achieve this precision. Frequency of occurrence of a term is one way of calculating relevancy. Another factor that Web engines take into account in determining relevancy is the degree to which other Web pages with the same search terms link to one another. In addition, in general, Web pages with more links to them are likely to have been found valuable by their users.
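The idea that pages with more links pointing to them are likely to be more valuable can be illustrated by simply counting inbound links. The sketch below assumes a hypothetical link map (page to the pages it links to) and ignores self-links and duplicate links; real ranking formulas weigh many additional factors.

```python
from collections import Counter

def inbound_link_counts(links):
    """Count, for each page, how many distinct other pages link to it."""
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):
            if target != source:
                counts[target] += 1
    return counts

# Hypothetical link map: each key links to the pages in its list.
LINKS = {"a": ["c"], "b": ["c", "c", "a"], "c": ["c"]}
counts = inbound_link_counts(LINKS)
ranked = [page for page, _ in counts.most_common()]
print(ranked)
```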
Web search engines that provide searches limited to particular fields can be effective in improving precision on search terms that might otherwise be too generic to be useful. For example, a search for the radio show, "This American Life," can be done several ways. If entered as plain text in a basic search on Google, This American Life produces over 475 million hits. When restricted to a particular field search with the title in quotation marks, allintitle:"This American Life", it produces over 400,000 hits, but the main homepage for the show is the first item retrieved. All the items on the first several screens refer specifically to the radio program, with links to podcasts and particular shows. However, even this strategy is not always effective. In spite of the simple-looking interface with its little search box, each Web search engine is supremely sensitive to the formatting of search queries. Seemingly minor differences in search string construction, such as the presence or absence of quotation marks, plus signs, and spaces, can produce disproportionate differences in results. Experimentation with these and many other formulations in several search engines produces results that are sometimes difficult to understand in the context of how the search engine is expected to handle these searches. Experience and review are recommended as strategies for maximizing the effectiveness of any search engine for particular kinds of questions.
The above example is focused on what might be called the "known item search" in which the searcher is seeking one particular page. This type of search might also be called a closed-end search. It is a search for an item that will be recognized if retrieved. The hoped-for page may or may not exist on the Web, but the search statement is constructed to find it if it exists. This type of search is common when trying to find a corporation or an organization home page, for example. When one is searching for the Folger Shakespeare Library's home page, it will be clear if one has found it or not. By contrast, the more open-ended search is one in which the user is looking for information about a topic. The searcher may have to retrieve and examine many pages and make comparisons among them to determine which best serves the information need at this time. The search engines' indexing and retrieval decisions can have a profound effect on these searches as well. Stemming or synonym searching is likely a more helpful feature when trying to find information on a topic than when searching for a single known page. The open-ended search needs a different formulation of its search query than a closed-end search. In contrast to the tightly focused query for the known-item search, the open-ended search is likely to be more broadly constructed. It might include fairly general terms describing the topic and perhaps some synonyms for the key concept. As an example, a search for resources that discuss privacy issues on the Web will look different from a search for the privacy watchdog group, The Center for Democracy & Technology.
It is useful to realize that search engines are not working in isolation as they develop and refine their indexing and retrieval rules. With the increased importance of the Web for businesses and organizations, there is a dynamic interplay between search engines and the Web page designers. The search engines want to serve their users by providing the best, most relevant sites for each search. Web page designers want to assure their clients that their pages will be retrieved and displayed as many times as possible. Understanding that Web search engines look for frequency and density of keywords as part of their measures of relevancy, some Web page designers will repeat keywords and phrases many times throughout a document to assure their pages get indexed with these terms. To counteract this ploy, some Web search engines apply reduction formulas in their relevancy measures if search terms are repeated excessively. Although some search results may be affected somewhat by these competing strategies, the overall impact is relatively small.
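A "reduction formula" of the kind mentioned above might look like the following sketch. It assumes, for illustration only, that a term's score equals its density on the page up to a threshold, and that the score shrinks once the density exceeds the threshold; the threshold value and the name damped_score are hypothetical.

```python
def damped_score(text, term, max_density=0.1):
    """Score a term by its density, reducing the score when the term is overused."""
    words = text.lower().split()
    if not words:
        return 0.0
    density = words.count(term.lower()) / len(words)
    if density > max_density:
        # Assumed penalty: the further past the threshold, the lower the score.
        return max_density * (max_density / density)
    return density

honest = damped_score("guided tours of paris and rome for families and students", "tours")
stuffed = damped_score("tours tours tours tours cheap tours", "tours")
print(honest, stuffed)
```

Under these assumptions the page that repeats the keyword in nearly every position ends up with a lower score than the page that uses it once in ordinary prose.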
It is important to remember that new Web search engines are introduced, and new features are added to current search engines, nearly every other week. In addition, there is a growing set of metasearch sites, which simultaneously search on several search engines.
In the face of all this simultaneous choice and change, reference librarians seeking to use search engines more proficiently may need to pursue their goal through the parallel strategies of intensive and extensive searches. The intensive part of the strategy is to learn one or two search engines very well, so that the underlying logic of the indexing and retrieval system becomes clear. This quest can be facilitated by using some of the search engine tutorials available over the Web. The extensive part of the strategy is to selectively explore a new search engine or metasearch site regularly. For example, one could conduct the same search on a familiar search engine and a completely new search engine. Comparing the results and the facility with which each one identified useful resources can expand and sharpen one's search skills. With this approach, one gains the ease and skills with searches that will produce efficient and effective search techniques.
Box 6.6. Flaw or Feature?
Discussions of the Web and its search engines usually end up revolving around a question: Is it a flaw or is it a feature? Frequently, an aspect of a search engine, automatic stemming for example, will be a flaw for one search and a feature for another.
Whatever the search parameters of search engines, they are largely limited to searches on the "surface Web" or the "Publicly Indexable Web," that is, the Web of static URLs. Search engines cannot mine the riches of the "deep Web," which consists of structured databases that have query interfaces that generate dynamic links. In the deep or hidden or invisible Web, these databases respond with lists of information in response to a direct search query. The structured databases of the deep Web include such widely used sites as Amazon.com, bn.com, or Edmunds.com and the other subject or content specific databases that must be queried directly by users to create specific responses that meet the criteria. Some parts of the deep Web are the proprietary databases that form the central resources of libraries, such as Lexis-Nexis and JSTOR. Measuring the invisible Web presents difficulties that are discussed in a paper by Yanbo Ru and Ellis Horowitz, "Indexing the Invisible Web: A Survey." Chris Sherman and Gary Price identify the multiple ways in which the invisible Web remains invisible to Web crawlers and name four kinds of invisibility: the "Opaque" Web, the Private Web, the Proprietary Web, and the Truly Invisible Web.
Given the mix of proprietary and subscription materials that form the invisible Web, finding a single approach to making this material available to their users is a challenge for reference librarians. Most users are not familiar with the multiple databases libraries offer to provide this access. And the databases are frequently focused on a single discipline or a few interrelated disciplines. One strategy that can be considered is incorporating specialized Web search engines into the reference mix. One such program is Google Scholar. Windows Live Search: Academic is another program under development in 2007. Google Scholar is a program created by Google to index part of the deep Web and to make the results available to scholars and researchers everywhere. Google has arranged with publishers of scholarly material to enable the identification of their materials through Google Scholar. Automated processing of journal articles and book chapters creates an index to the material included in these arrangements. The number of publishers who have joined Google Scholar is growing as they realize the wider audiences their publications can reach through this service. Of course, Google Scholar cannot offer the full text of the material it indexes. Access to the full content of the books and journals depends on the individual library of each user. Google Scholar offers libraries an opportunity to make the links to the full text of the articles through their "Library Links Program."
Box 6.7. The Deep Web "By the Numbers"
The deep Web is a subject of a great deal of research. Some of that research is being conducted at the University of Illinois at Urbana-Champaign. An especially useful article is by Kevin Chen-Chuan Chang, Bin He, Chengkai Li, Mitesh Patel, and Zhen Zhang, entitled "Structured Databases on the Web: Observations and Implications," ACM SIGMOD Record 33 (September 2004): 61-70. It provides statistical measurements of the deep and open Web.
Standards
The Internet and the Web provide reference librarians with the connectivity to a wide variety of electronic bibliographical resources. Many of these resources, like online catalogs and indexing and abstracting services, are supported through proprietary systems developed by vendors and suppliers. These proprietary systems require that the resources be searched using the disparate search interfaces provided by each system developer. The multiplicity of search interfaces is a barrier to efficient use of the resources. The solution to the problem is a third-party system which can interoperate with disparate systems and provide a single user interface. This is an information retrieval standard, known in the United States as the ANSI/NISO Z39.50-2003, Information Retrieval: Application Service Definition and Protocol Specification, internationally as the ISO 23950, and informally as Z39.50.
Z39.50
The Z39.50 standard provides a client-server electronic information system design "blueprint." Information system and database developers are using this blueprint to design their systems in ways that make it possible for users of one system to access databases in another system using one set of commands, without regard to the hardware or software of the host system. Each system or database can be designed with its own unique features and search functions and search interface, but each of these is mapped to a corresponding Z39.50 element. Because Z39.50 includes the information needed to program an encoder/decoder for translating commands from one system to another, it provides a translating function that allows one system to correctly interoperate with another. When two OPACs each use Z39.50, users of each can take full advantage of the search options supported by the other, which creates the largest possible common ground. It also provides for the translation of the incoming record(s) to a standard format that will be readily understood by the user. While Z39.50 maximizes the common ground between dissimilar systems, it cannot supply the missing pieces from either. An option has to exist in both systems in order to be available through the Z39.50 interface. For example, if one OPAC supports adjacency searching, and another does not, the option to do adjacency searching will not be offered through the Z39.50 interface.
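The "common ground" principle can be pictured as a set intersection. The sketch below is only an analogy, not an implementation of the Z39.50 protocol: the two OPACs and their lists of search options are hypothetical, and the function simply keeps the options that both systems support.

```python
def common_search_options(system_a, system_b):
    """Return the search options that both systems support."""
    return set(system_a) & set(system_b)

# Hypothetical capability lists for two online catalogs.
OPAC_A = {"author", "title", "subject", "adjacency"}
OPAC_B = {"author", "title", "subject", "keyword"}

shared = common_search_options(OPAC_A, OPAC_B)
print(sorted(shared))
```

Adjacency searching drops out because only one of the two catalogs supports it, mirroring the example in the paragraph above.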
Z39.50 shows the promise and the limitations of interoperability. Z39.50 is a big advance in the move toward mutual intelligibility of different catalog systems. However, as with any translating device, Z39.50 cannot be truly idiomatic. Each system has variations in the search engine and in the specific search choices in system design and implementation. It is important that reference librarians understand that variations in indexing and retrieval decisions will have real impacts on the search results. This is a particularly subtle and important area of user education as well. Since Z39.50 can provide a familiar "look and feel" to resources, highlighting this distinction for the users will be a formidable challenge indeed.
Metasearch
With the growing number of databases and other electronic resources at every library Web site, libraries are looking for technical solutions to aid less-experienced users in finding the information they need with the least amount of frustration and difficulty. One approach that is being tried is implementing a "metasearch" engine, or federated search engine, or cross-database search engine such as MetaLib and WebFeat. Metasearch engines search across disparate electronic information sources simultaneously and provide results matching the inquiry from them. Metasearch engines use a variety of protocols and standards in order to conduct their searches. They typically conduct the search in one of two major ways. One approach, exemplified by WebFeat, transfers the search terms into the different databases' search boxes, to be operated on according to each database's native interface. The other approach, implemented by MetaLib, interrogates the databases using Z39.50 or XML protocols. In addition, each database in the metasearch pool must have implemented OpenURL and Digital Object Identifier (DOI) technologies which provide durable and consistent links to items in a database in order for the search engine to work across the database boundaries. A fuller discussion of metasearching and the standards and protocols associated with it is in a two-part article called "NISO Metasearch Initiative Targets Next Generation of Standards and Best Practices" in Against the Grain.
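The basic shape of a metasearch can be sketched as one query sent to several sources, with the results kept separate by source. Everything in this example is an assumption made for illustration: the sources are small in-memory lists rather than real databases, the names metasearch, make_source, and hit_counts are hypothetical, and the sources are queried one after another rather than simultaneously. Products such as MetaLib and WebFeat work through database interfaces and protocols that are not modeled here.

```python
def make_source(records):
    """Build a stand-in search function over a list of record titles."""
    def search(query):
        needle = query.lower()
        return [record for record in records if needle in record.lower()]
    return search

def metasearch(query, sources):
    """Send the same query to every source and group the results by source name."""
    return {name: search_fn(query) for name, search_fn in sources.items()}

def hit_counts(results):
    """Summarize how many records each source returned."""
    return {name: len(hits) for name, hits in results.items()}

SOURCES = {
    "Online catalog": make_source(["Shakespeare's sonnets", "Hamlet: a critical edition"]),
    "Article database": make_source(["Staging Hamlet in the twentieth century", "Privacy on the Web"]),
}

results = metasearch("hamlet", SOURCES)
print(hit_counts(results))
```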
The primary goal of a federated search engine is to reduce the information load on its user. The users need not know that a search on their topic requires looking at the online catalog and three different databases with their multiple interfaces in order to find useful information. Typically a library will set up its federated search engine page to provide the user with the option of searching a cluster of information resources for a single discipline or subject area. The user will be able to click on a subject button, for example, and type in a term or phrase and then review the results of the federated search.
There are limits to the success of metasearching because of the innate differences across different databases. One database included in a subject area may be a multidisciplinary database, such as EBSCO's Academic Search or Infotrac's Expanded Academic, while another one may be more focused on the particular subject area, such as MLA International Bibliography. Another database likely to be included in any metasearch is the library's online catalog. A subject encyclopedia may yet be another type of information resource in the mix. When a single search term or phrase is executed across these sources, issues of granularity and scope arise. The subject-focused database will likely have more specialized terms of a finer granularity than the multidisciplinary database. And the online catalog has a mix of terminology. For example, Shakespeare gets a multitude of specific subject headings in addition to many appearances as a keyword. Depending on how the original search term was entered and interpreted by the federated search engine, the results across databases may underrepresent or overrepresent the amount of useful information in each. In any case, once the results are presented to the user, they have to be reviewed in some kind of sequence. The results are displayed next to each database's name. The user then has to select among the databases on the list to see the material. The same database names that are completely opaque to the user in the primary library Web pages are displayed in the metasearch results as well. There is little guidance provided to the user on why to select one source rather than another. The order of the databases or the number of "hits" or the wild guess about the "best" database may all factor into the user's decision making. This can result in a fairly random selection of materials. 
The hope on the part of the library is that, because these databases were selected by the library for subscription, the quality of the materials will be substantially higher on average than what the user would find independently in a regular open Web search. In addition, the databases available through a federated search are also likely to have links to the full text of the articles incorporated into the results, so that the user can access the full content with relative ease.
Future of Standardization
The Web has provided both the means and the necessity for this standardization. These standards and associated protocols will foster the development of compatible information resources and the integration of information from disparate sources. The reference librarian will be able to master these formats and retrieval mechanisms and be in a position to navigate the networks with expectations of success.
ADDITIONAL READINGS [ADDED IN 4th EDITION]
- Digital Preservation: The National Digital Information Infrastructure and Preservation Program. Available: http://www.digitalpreservation.gov/.
This site presents information about the National Digital Information Infrastructure and Preservation Program, sponsored by the Library of Congress. Created in December 2000 under Public Law 106-554, the program will provide, through a collaborative effort with federal agencies and other institutions, a national focus on important policy, standards, and technical components necessary to preserve digital content. The program's mission statement is to "develop a national strategy to collect, archive, and preserve the growing amounts of digital content, especially materials that are created only in digital formats, for current and future generations."
- Hock, Randolph. The Extreme Searcher's Guide to Web Search Engines: A Handbook for the Serious Searcher. 2d ed. Medford, N.J.: CyberAge Books, 2007. 326p.
This book presents a description of the capabilities and limitations of Web search engines as of 2007. It offers a general discussion of common search engine functions and then provides a chapter for each major search engine. The chapters reveal some of the less-obvious search and ranking algorithms at play. The systematic analysis gives a clear understanding of how things really work on the Web.
- McDermott, Irene E., and Barbara Quint, ed. The Librarian's Internet Survival Guide: Strategies for the High-Tech Reference Desk. 2d ed. Medford, N.J.: Information Today, Inc., 2006. 298p.
This is an updated edition of the popular 2002 title. The first nine chapters cover "Ready Reference on the Web: Resources for Patrons," while the six chapters in Part 2 offer advice for reference librarians on managing email, teaching the Internet, making and maintaining Web pages, making the Web accessible to the disabled, computer troubleshooting, and keeping up with changes on the Web.
- Notess, Greg R. Teaching Web Search Skills: Techniques and Strategies of Top Trainers. Medford, N.J.: Information Today, Inc., 2006. 344p.
This book is written by the foremost library expert on Web search engines. It will be most useful to the information professional who is training Internet users to search the Web.
- Woodyard-Robinson, Deborah, ed. "Digital Preservation: Finding Balance." Library Trends: Special Issue. 54 (Summer 2005). 172p.
This special issue of Library Trends addresses many of the challenges and issues in digital preservation: technical issues, identification of materials, metadata, as well as access. It also has articles from libraries that have begun implementing some of the solutions.
