Class Lecture Notes: H. P. Luhn and Automatic Indexing --
References to the Early Years of Automatic Indexing and Information Retrieval

Organizing and Providing Access to Information -- LIS 391D.2 -- Spring 1998

I. Introduction

  1. Where we are starting
  2. The basic and simplest concept of automatic indexing developed in the 1950s was the KWIC or Keyword in Context index based on permutations of significant words in titles, abstracts or full text -- manipulated by machine. The first major report on the application of this indexing concept occurred at the International Conference on Scientific Information (ICSI) held in Washington, D. C. in November of 1958. The paper was not the sensational product; the actual demonstration of the method was the sensation of the conference.

    Hans Peter Luhn and Phyllis Baxendale have been deservedly credited as the pioneers in this area of automatic indexing. Luhn developed the concept with suggestions for auto-abstracting, auto-encoding, and auto-indexing. Baxendale developed auto-indexing techniques that identified topic sentences, and she developed methods of automatic phrase selection and syntactical deletion. But many others of the day were involved, and much work was going on simultaneously across the nation.

    Among those others is Herbert Marvin Ohlman, the inventor of Permuterm (or permuted indexing), whose work is reviewed at the Information Science Pioneers of North America Web site. According to Robert V. Williams, Professor and Director of the Office of Research, College of Mass Communication and Information Studies at the University of South Carolina, "Ohlman also presented a paper and had prepared in advance of, and distributed at, the 1958 Washington D.C. meeting, a complete permuterm index to the proceedings of the conference. Permuterm indexing is pretty much the same as KWIC indexing; Luhn's term for the approach happened to be more 'catchy' and stuck but Ohlman's work was just as important, and possibly, preceded Luhn's work on KWIC."

    In the 1958-1959 time period, many minds were conjuring up the same ideas within months of one another.
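
    The keyword-in-context idea described above can be illustrated with a minimal sketch. It is a simplified illustration, not a reconstruction of Luhn's or Ohlman's programs: the stop list, the function name `kwic_index`, and the column width are all assumptions made for the example. Each significant word in a title produces one index line, with the keyword aligned in a fixed column and the lines sorted by keyword.

```python
# Hypothetical stop list of non-significant words; a real index would use a longer one.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


def kwic_index(titles, width=30):
    """Return KWIC lines sorted by keyword, with each keyword starting in the same column."""
    entries = []
    for title in titles:
        words = title.split()
        for i, word in enumerate(words):
            keyword = word.strip(".,;:()").lower()
            if not keyword or keyword in STOP_WORDS:
                continue  # skip non-significant words
            left = " ".join(words[:i])   # context before the keyword
            right = " ".join(words[i:])  # keyword plus context after it
            entries.append((keyword, left, right))
    # Sort alphabetically by keyword, then by the text that follows it.
    entries.sort(key=lambda e: (e[0], e[2].lower()))
    # Right-align the left context so every keyword begins at the same position.
    return [f"{left[-width:]:>{width}}  {right[:width]}" for _, left, right in entries]


titles = [
    "Automatic Creation of Literature Abstracts",
    "A Business Intelligence System",
]
for line in kwic_index(titles):
    print(line)
```

    A reader scans down the aligned keyword column to find a term and then reads the surrounding words for context, which is what made the printed index usable without any human indexing.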

  3. Where we end up
  4. We end up with a wide range of automated and semi-automated indexing techniques being developed, studied, debated, and evaluated including:

  5. In Between
  6. In between, I would like to describe the personality of H. P. Luhn and the exciting times of the 1950s and 1960s which led to the establishment of new fields of inquiry and terminology that we use today. Luhn helped set the stage to foster new ideas and debates on the evaluation of methods of access.

II. The Setting in the Early Years, the 1950s

  1. War and Threats of War
  2. As Dr. Miksa noted last week, the war years provided tremendous impetus for ingenious work and put many minds together to provide for the practical defense of the country. After the war, the threat of attack remained fresh in the minds of those who had served, and work continued on military efforts to devise machines and techniques that would protect the country and make us strong. Men and women were both involved in the effort. SAGE was one of these efforts which fostered networking, teamwork principles, software development, and the systems approach to information systems.

  3. Supply of Machine-Readable Text
  4. Machine-readable text was not in ready supply. Key punching was the method used to convert text into machine-readable form, and IBM and Remington Rand were leaders in this area. Just getting documents into machine-readable form was a feat. Getting authors to use machine-compatible punctuation symbols was a feat. Work was primarily done manually.

  5. Nearly 100 computers in the U.S.
  6. It is hard to imagine such a time, but in the 1950s some 100 computers existed in the United States. They were quite a sensation.

  7. Influential organization names:
  8. Influential personal names:
  9. Explosion of Information
  10. -- necessitating mechanized routes to information retrieval

    Most importantly, there was at this time an information explosion. More and more documents were being produced, and the manual techniques used for indexing, classifying, and organizing data were being overtaken. It was no longer possible for the human being, the information specialist, to keep up.

    This problem became one of interest to H. P. Luhn in the last eighteen years of his life, 1948-1964. His articles normally begin with a reference to the problem of the information explosion as one for which practical, cost-effective solutions are needed. His thinking and inventiveness led to using machines to solve this problem.

    There was a tremendous explosion of dollars and of output in the form of reports and papers, and everyone wanted the latest report to read.

III. H. P. Luhn (1896-1964)

  1. His background and inventions
  2. Luhn came here from Europe, where he had been the assistant manager of a textile mill. He was an inventor, well known for hard work and inventiveness. He held 80 patents, 10 of which apply directly to the textile industry. Inventions for which Luhn is responsible include the computing gasoline pump; an inexpensive foldable raincoat; the early development of the American Airlines Sabre passenger reservation system; a cocktail recipe organizer from Prohibition times (based on the optical coincidence principle of searching, or peekaboo system); and the Luhn Scanner (first applied to chemistry), an electronic searching selector which led to his interest in literary data processing.

    His UT connection is through his second wife who was a singer and a music teacher at UT.

  3. Work with IBM
  4. In 1941, he was asked by Thomas J. Watson, Sr. to join IBM as a development engineer. He created many machines for IBM, became interested in electrical engineering, and ended his career developing literary data processing techniques in the IBM Research Division.

    He held 20 patents at the time IBM hired him in 1941.

  5. Interest in Literary Data Processing

  6. In January of 1953, at age 57, he published the first article of his life in American Documentation. By this time he had begun attending meetings of librarians, literature chemists, and documentalists because he was excited about the possibilities of applying machines to literary data processing. From 1953 to 1956 he continued to invent machines and components for machines, and Documentation (now called Information Retrieval) had pretty much become his full-time interest and area of research.

    Although he may have gotten a late start in contributing to the literature, he made up for it quickly. In 1963, when Carlos Cuadra studied the problems involved in the identification of key contributions to information science, he sought expert advice and analyzed references in both textbooks and bibliographies of the field. Luhn's name led all the rest on three of the four listings of major contributions as estimated by experts. He ranked fifth among the twenty-five highest scoring authors in terms of "publication densities," and he ranked in the top ten on the four lists showing the most frequently cited authors in the field.

    He was, however, far more interested in practical contributions to the field than in contributions to the literature alone.

  7. 1958 was a milestone year in the life of H. P. Luhn
  8. Retired from IBM
  9. -- in 1961 and died in 1964 during his term as President of the American Documentation Institute

IV. Issues and Terminology of the day:

  1. Literary data processing
  2. Information Retrieval
  3. Trained intermediaries--specialists
  4. Automatic indexing
  5. KWIC or keyword-in-context indexing
  6. Indexing, Language and Meaning
  7. -- "native," that is, derived by statistical analyses from the collection itself. By statistical methods you can choose, enlarge, or reduce your classes so that the classes are likely to be about equally populated over the collection as a whole.

    Luhn alluded to meaning or semantics, language or use of words, and indexing.

  8. Auto-Abstracting
  9. Luhn felt that there was the unintentional danger of misinterpretations and distortion by the human abstractors. He felt that since their backgrounds and training vary, no two writers produce the same abstract. Machine abstracting, done in a matter of minutes, would release valuable talents of scientific writers, allowing them to work in more creative fields; the possibilities for errors would be eliminated and a single standard for abstracting would be established. This method was based on word frequency.

    Statistical information derived from word frequency and distribution was used by the machine to create a measure of significance first for the words, then for the sentences. Sentences scoring high are extracted and printed to become the auto-abstract.

    This was the start of more diversified paths having to do with sentence structures.
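
    The word-frequency approach described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not Luhn's exact procedure: the stop list, the `min_count` threshold, and the scoring formula (significant words squared, divided by sentence length) are choices made for the example. Words are scored first, then sentences, and the highest-scoring sentences are extracted in their original order.

```python
import re
from collections import Counter

# Hypothetical stop list of common words that carry no topical significance.
STOP_WORDS = {"a", "an", "and", "by", "in", "is", "of", "the", "to", "was"}


def tokens(sentence):
    """Lowercase the sentence and keep only non-stop words."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOP_WORDS]


def auto_abstract(text, num_sentences=2, min_count=2):
    """Extract the highest-scoring sentences, returned in document order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Step 1: measure word significance by frequency across the whole text.
    counts = Counter(w for s in sentences for w in tokens(s))
    significant = {w for w, c in counts.items() if c >= min_count}

    # Step 2: measure sentence significance from the significant words it contains.
    def score(sentence):
        toks = tokens(sentence)
        if not toks:
            return 0.0
        hits = sum(1 for w in toks if w in significant)
        return hits * hits / len(toks)

    # Step 3: pick the top sentences, then restore their original order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:num_sentences])]


text = (
    "Machines index documents. Indexing by machines saves time. "
    "The weather was pleasant. Machines count words in documents."
)
print(auto_abstract(text))
```

    Note that the result is an extract of existing sentences rather than newly written prose, which is why the same input always yields the same abstract.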

  10. Automatic Creation of Literature Abstracts -- An Experiment in Auto-Abstracting (1958)
  11. The objective was quick and accurate identification of the topic of a published paper and automation of the intellectual effort. The experimental work to eliminate human bias in abstracting was done using articles of 500 to 5,000 words.

  12. Auto-encoding
  13. Auto-Encoding of Documents for Information Retrieval Systems (1959)
  14. Based on statistical procedures. One key limitation was that the document must be in machine-readable form. One dimensional patterns and multi-dimensional patterns are described along with creating the thesaurus.

    Word pairs were discussed. For example, "information retrieval" held a different meaning from "information" and "retrieval" alone.
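
    As a rough illustration of why word pairs matter, the sketch below counts adjacent word pairs in a text. It is an assumption-laden example (the function name `word_pair_counts` and the simple tokenizer are hypothetical), not a description of Luhn's encoding procedure.

```python
import re
from collections import Counter


def word_pair_counts(text):
    """Count adjacent word pairs; order matters, and sentence boundaries are ignored."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(zip(words, words[1:]))


text = (
    "Information retrieval systems support information retrieval. "
    "Retrieval of information is hard."
)
pairs = word_pair_counts(text)
print(pairs[("information", "retrieval")])
print(pairs[("retrieval", "information")])
```

    The pair ("information", "retrieval") is counted as a unit of its own, separate from the counts of either word alone, so a system can treat the phrase as a distinct notion.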

  15. Normalization
  16. -- in the sense of synonym reduction, lookup of preferred entries, and other operations designed to simplify or standardize usage of indexing and documentary languages.

  17. Families of notions
  18. Concept of compiling families of terms; keywords are stored in machine-usable form and used for normalization of both the indexing language and the search language.
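
    Normalization through families of terms can be sketched as a lookup of preferred entries. The table below is entirely hypothetical and exists only for illustration; the point is that the same lookup is applied to index terms and to search terms, so that variants meet at one preferred entry.

```python
# Hypothetical family table: each variant term maps to its preferred entry.
PREFERRED = {
    "auto-indexing": "automatic indexing",
    "machine indexing": "automatic indexing",
    "documentation": "information retrieval",
}


def normalize(term):
    """Return the preferred entry for a term, or the cleaned term if none is listed."""
    key = " ".join(term.lower().split())  # lowercase and collapse whitespace
    return PREFERRED.get(key, key)


def normalize_all(terms):
    """Normalize a list of terms, dropping duplicates while keeping first-seen order."""
    seen = []
    for term in terms:
        preferred = normalize(term)
        if preferred not in seen:
            seen.append(preferred)
    return seen


index_terms = normalize_all(["Machine indexing", "auto-indexing", "KWIC"])
query_term = normalize("Auto-Indexing")
print(index_terms)
print(query_term in index_terms)
```

    Because the indexer's "Machine indexing" and the searcher's "Auto-Indexing" both reduce to the same preferred entry, the search succeeds even though the two people chose different words.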

  19. Selective dissemination of information
  20. IBM called it business intelligence, and Luhn wrote about it as Automated Intelligence Systems.

  21. A Business Intelligence System (1958)
  22. -- To promote efficient communication -- the key to progress. Based on auto-encoding, auto-abstracting, and automatic creation of profiles for action points.

V. Bibliography

--Items to help gain a broader understanding of Automatic Indexing

VI. Conclusion

Mary Elizabeth Stevens deserves our utmost respect for having compiled a most comprehensive bibliography on the topic of Automatic Indexing. Her knowledge of the field must have been the most complete of all. She ends her state-of-the-art report with these questions, posed in 1964. They convey the flavor of the times.

Mary Elizabeth Stevens's questions for her state-of-the-art report:

  1. Can indexing be done by machine at all?
  2. Is what can be done by machine properly termed abstracting, indexing, or classifying?
  3. Is whatever is done by machine good enough, acceptable, as good as, or better than the product of human operations?
  4. How can we evaluate acceptability or comparability for any indexing process whatsoever, whether carried out by man or by machine or by machine-aided manual operations?
  5. If an indexing product is to be achieved by machine, can it be done by statistical means alone, or must syntactic, semantic and pragmatic considerations be brought to bear in the machine decision-making processes?

If you are interested in more information on this topic, please refer to the Proceedings of the Conference on the History and Heritage of Science Information Systems (1998) available in full text at: http://www.libsci.sc.edu/bob/confprog/confprog.htm

This page is created and maintained by Sue Soy ssoy@.ischool.utexas.edu
Last Updated 02/24/2003
© Copyright 1996 Susan K. Soy
Please feel free to copy and distribute freely for academic purposes with this notice and attribution.
All other rights reserved