(Printed in: Read Me! Filtered by Nettime. Brooklyn 1998, pp. 29-36. It first appeared in German in Telepolis-Online (http://www.heise.de/tp/te/1135/findex.html). Printed in: W. H.: Suchmaschinen. Metamedien im Internet? In: Becker, Barbara; Paetau, Michael (eds.): Virtualisierung des Sozialen. Frankfurt/NY 1997, pp. 185-202. The text has been revised for translation.) - website creation date 22. 12. 98, update: 26. 04. 01, expiration date 26. 04. 2004, 31 KB, url: www.uni-paderborn.de/~winkler/suchm_e.html, language: Engl. © H. Winkler 1998
( keys:media, theory, computers, internet, www, Alta Vista, Yahoo, language, Camillo )

Hartmut Winkler

Search Engines

Metamedia on the Internet?

Translation: Tom Morrison

We use them daily, and don’t know what we’re doing. We don’t know who operates them or why, don’t know how they’re structured, and know little about the way they function. It’s a classic case of the black box - and all the same, we’re abjectly grateful for their existence.
Where, after all, would we be without them? Now that the expanse of Web offerings has proliferated into the immeasurable, isn’t anything that facilitates access useful? After all, instantly available information is one of the fundamental Utopias of the data universe.
Nevertheless, I think the engines are worth some consideration, and propose that research should concentrate on the following points. First, the specific element of blindness that determines our handling of these engines. Second, the conspicuously central, even ‘powerful’ position the engines now occupy on the Net - a question that is relevant if one wants to forecast the medium’s development trends. Third, the structural assumptions on which the various search engines are based. Fourth, and finally, a reference to language and linguistic theory which places the engines in a new perspective and a different line of tradition.

1

The main reason search engines occupy a central position on the Net is that they are called up extraordinarily often; AltaVista, for instance, is accessed 32 million times per workday, if the published statistics can be trusted. (2) Individual users see the entry of a search command as nothing more than a launching pad to get to something else, but to have attracted so many users to a single address signifies a great success. The direct economic consequence is that these contacts can be sold, making the search engines eminently suitable for the placement of advertising and therefore among the few Net businesses which are in fact profitable. With remarkable openness, Yahoo writes: "Yahoo! also announced that its registered user base grew to more than 18 million members [...] reflecting the number of people who have submitted personal data for Yahoo!'s universal registration process. [...] 'We continued to build on the strong distribution platform we deliver to advertisers, merchants, and content providers.'" (3)
Secondly, and even more importantly, the frequency of access means the overall Net architecture has undergone considerable rearrangement. 32 million accesses per day signify a thrust in the direction of centralization. This should put on the alert all those who recently emphasized the decentralized, anti-hierarchic character of the Net, and linked its universal accessibility with far-reaching hopes for grassroots democracy.

All the same - and that brings me to my second point - this centralization is not experienced as such. The search engines can occupy such a central position only because they are assumed to be neutral in a certain way. Offering a service as opposed to content, they appear as neutral mediators. Is the mediator in fact neutral?

2

The question must be addressed first of all to the design of the search engines. Steve Steinberg, (4) my main source for the factual information in the following text, described in an article for Wired the things normal users don’t know about the search engines and, even more importantly, what they think they don’t need to know in order to use them expediently. Steinberg’s first finding is that providers keep secret the exact algorithm on which their functioning is based.(5) Since the companies in question are private enterprises and the algorithms are part of their productive assets, the competition has, above all, to be kept at a distance; only very general information is disclosed to the public, the details remain in the dark of the black-box. So if we operate the search engines with relative blindness, there are good economic reasons for this.

Three basic types of search engine can be distinguished. The first type is based on a system of pre-defined and hierarchically ordered keywords. Yahoo, for instance, employs human coders to assign new Websites to the categories; the network addresses are delivered by e-mail messages or hunted down by a search program known as a spider. In 1996, the company registered 200,000 Web documents in this way. (6)
The above figure alone indicates that coding by human experts quickly reaches its quantitative limits. Of the estimated total volume of 30-50 million documents available on the Net in 1996 (7), Yahoo was offering some 0.4 to 0.7%; current estimates suggest that the total volume has since grown to 320 million Websites. (8)

However, the problems of the classification system itself are even more serious. The 20,000 keywords chosen by Yahoo are known in-house (with restrained self-irony?) as ‘the ontology’. But what or who would be in a position to guarantee the uniformity and inner coherence of such a hierarchy of terms? If pollution, for example, is listed under 'Society and Culture'/'Environment and Nature'/'Pollution', then the logic can be accepted to some degree, but every complicated case will lead to classificatory conflicts that can no longer be solved even by supplementary cross-references.
The construction of the hierarchy appears as a rather hubristic project, since its aim is, after all, to harness to a uniform system of categories millions of completely heterogeneous contributions from virtually every area of human knowledge, without regard to their perspectivity, their contradictions and rivalries.
Yahoo’s 'ontology' is thus the encumbered heir of those real ontologies whose recurrent failure can be traced throughout the history of philosophy. And the utilitarian context alone explains why the philosophical problem in new guise failed to be identified, and has been re-installed yet again with supreme naivety. If the worst comes to the worst, you don’t find what you’re looking for - that the damage is limited is what separates Yahoo from problems of philosophy.

The second type of search engine manages without a pre-defined classification system and, even more importantly, without human coders. Systems like 'AltaVista', 'Inktomi' or 'Lycos' generate an 'inverted index' by analysing the texts located. The search method employed is the full-text variant - word for word, meaning that in the end every single term used in the original text is contained in the index and available as a search word. This is less technically demanding than it might appear. For every text analysed, a row is created in a huge cross-connected table, while the columns represent the general vocabulary; if a word is used in the text, a bit is set to ‘yes’, or the number of usages is noted. An abstract copy of the text is made in this way, condensed to c. 4% of its original size. The search inquiries now only make use of the table.
Since the system is fully automatic, the AltaVista spider can evaluate 6 million Net documents every day. At present, some 125 million texts are represented in the system.(9)
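
The table described above can be illustrated with a minimal sketch in Python. It is not AltaVista's actual implementation, whose algorithm is unpublished; the function names (build_table, build_inverted_index, search), the sample texts and the ranking by raw word count are all assumptions made purely for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Split a text into lower-case word strings (a deliberately crude rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_table(documents):
    """One row per text: each word of the vocabulary mapped to its number of usages."""
    return {doc_id: Counter(tokenize(text)) for doc_id, text in documents.items()}

def build_inverted_index(table):
    """Invert the table: for every word, the set of texts in which it occurs."""
    index = {}
    for doc_id, counts in table.items():
        for word in counts:
            index.setdefault(word, set()).add(doc_id)
    return index

def search(index, table, query):
    """Return the texts containing every query word, most usages first.

    Ranking by raw usage count is an assumption of this sketch; the real
    engines keep their priority rules secret.
    """
    words = tokenize(query)
    if not words:
        return []
    hits = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(hits, key=lambda d: (-sum(table[d][w] for w in words), d))

# Hypothetical sample texts, not taken from any real index.
docs = {
    "a": "Oil film on water. The film is thin.",
    "b": "A film review: the movie was long.",
    "c": "Crude oil prices",
}
table = build_table(docs)
index = build_inverted_index(table)

print(search(index, table, "film"))      # texts a and b, a first (two usages)
print(search(index, table, "oil film"))  # only text a contains both words
print(search(index, table, "cinema"))    # no text contains this string
```

The query never touches the texts themselves, only the table and its inversion. The last query also shows the mechanical character of the procedure: a text about the cinema that happens to use the word 'movie' is not found under 'cinema'.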

The results of a search are, in fact, impressive. AltaVista delivers extremely useful hit lists, ordered according to an internal priority system. And those who found what they were looking for are unlikely to be offended by the fact that AltaVista too keeps its algorithm under wraps.
There are some problems nevertheless. It is conspicuous that even slight variations in the query produce wholly different feedback; if you try out various queries for a document you already know, you will notice that one and the same document is sometimes displayed with high priority, sometimes with lower priority, and sometimes not at all. This is irritating, to say the least.
The consequence, in general terms, is that one often does not know how to judge the result of a search objectively - it remains unclear which documents the system does not supply, either because the spider has failed to locate them or because the evaluation algorithm does indeed work otherwise than presumed. Even if the program boastfully claims to be ‘searching the web’, the singular form of the noun is illusory, of course, if you consider the fact that even 125 million texts are only a specific section of the overall expanse. Furthermore, users for their part can register only the first 10, 50 or, at most, 100 entries. They too scarcely have the possibility of estimating how this section relates to the rest of the expanse in terms of content.

The second, and main, problem is however present already in the basic assumption. A mechanical keyword search presupposes that only such questions will be posed as are able to be clearly formulated in words, and differentiated and substantiated through further keywords. Similarly, nobody will expect that the system is able to include concepts of similar meaning alongside the query, or can exclude homonyms. Search engines of this type are wholly insensible to questions of semantics or, to put it more clearly: their very point is to exclude semantic problems of the type evident with Yahoo. Yet that is not to say that the problems themselves are eradicated. They are imposed on the users through the burden of having to reduce their questions to unambiguous strings of signifiers, of having to be satisfied with the mechanically selected result. All questions unable to be reduced to keywords fall through the screen of the feasible. Technical and scientific terms are relatively suitable for such a search, humanistic subjects are less suitable, and once again this emerges as that ‘soft’ - all too soft - sphere which should be circumvented from the outset, if one is unwilling to fall into the abyss.

But the problem of semantics has not been ignored, and efforts in this direction have led to the third type of search engine. Systems like ‘Excite’ by Architext, or ‘Smart’, claim to search no longer mechanically with strings of signifiers, but on the basis of a factual semantic model. In order to be able to discriminate between articles on oil films and ones on cinema films, such programs examine the context in which the respective concepts figure.
"The idea is to take the inverted index of the Web, with its rows of documents and columns of keywords, and compress it so that documents with roughly similar profiles are clustered together - even if one uses the word 'movie' and one uses 'film' - because they have many other words in common."(10) The result is a matrix where the columns now represent concepts instead of mechanical keywords.
The exciting thing about this type of engine is that it progresses from mechanical keywords to content-related concepts; and also that it obtains its categories solely on the basis of the entered texts, of a statistical evaluation of the documents.
"[The engine] learns about subject categories from the bottom up, instead of imposing an order from the top down. It is a self-organizing system. [...] To come up with subject categories, Architext makes only one assumption: words that frequently occur together are somehow related. As the corpus changes - as new connections emerge between, say O. J. Simpson and murder - the classification scheme automatically adjusts. The subject categories reflect the text itself"; "this eliminates two of the biggest criticisms of library classification: that every scheme has a point of view, and that every scheme will be constantly struggling against obsolescence." (11)
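
The single assumption quoted above, that words which frequently occur together are somehow related, can be made tangible with a small Python sketch. It is not Architext's method, which is not disclosed in the sources cited here; the cosine measure, the threshold of 0.5, the greedy grouping against the first member of each group, and the three sample texts are all assumptions chosen for illustration.

```python
import re
from collections import Counter
from math import sqrt

def profile(text):
    """Word-usage profile of one text: its row in the document/keyword table."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(p, q):
    """Similarity of two profiles: 1.0 for identical usage, 0.0 for no shared word."""
    dot = sum(p[w] * q[w] for w in p if w in q)
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

def cluster(documents, threshold=0.5):
    """Greedy grouping: a text joins the first group whose first member it resembles.

    A deliberate simplification; the threshold value is an arbitrary assumption.
    """
    profiles = {doc_id: profile(text) for doc_id, text in documents.items()}
    groups = []
    for doc_id in documents:
        for group in groups:
            if cosine(profiles[doc_id], profiles[group[0]]) >= threshold:
                group.append(doc_id)
                break
        else:
            groups.append([doc_id])
    return groups

# Hypothetical sample texts.
docs = {
    "cinema_film": "director actors hollywood premiere film",
    "cinema_movie": "director actors hollywood premiere movie",
    "oil_film": "oil film water surface chemistry",
}
groups = cluster(docs)
print(groups)
```

With these sample texts, the two cinema documents fall into one group although one says 'film' and the other 'movie', because four of their five words coincide; the text on oil films shares the word 'film' with the first document but nothing else, and stays apart. The grouping is read off the texts alone, with no category imposed from outside, which is precisely the claim made for this type of engine.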

Other designs, such as the 'Context' system by Oracle, attempt to incorporate analyses of the syntax, and by doing so find themselves in the minefield of how to model natural language - a problem that has been worked upon in the field of AI since the 1960s, without convincing results having been produced so far. The evaluation of such systems is more than difficult; and it is even more difficult to make forecasts about their prospects for further development.
For that reason, I would like to shift the focus of the question from the presented systems’ mode of function and their implications and limitations to the sociocultural question of what their meaning is, what their actual project is in the concurrence of discourses and media.

3

The path from the hierarchic ontologies via the keyword search and on to the semantic systems shows, in fact, that a very fundamental question is at stake beyond the pragmatic usage processes. The search engines are not a random ‘tool’ that supplements the presented texts and facilitates their handling. On the contrary, they appear as a systematic counterpart on which the texts are reliant in the sense of a reciprocal and systematic interrelation.
My assertion is that the search engines occupy exactly that position which -- in the case of non-machine-mediated communication -- can be claimed by the system of language. (And that is the main reason why search engines interest me.)

Language, as Saussure clearly showed, breaks down into two modes of being, two aggregate states. Opposite the linear, materialized texts in the external world -- utterances, speech events, written matter -- exists the semantic system that, as a knowledge, as a language competence, has its spatially distributed seat in the minds of the language users. Minds and texts are therefore always opposite each other.
If access to the data network is now organized via systems based on vocabulary, and if these systems are being advanced in the direction of semantically qualifying machines, then this means that language itself, the semantic system, the lexicon, is to be liberated from the minds and technically implemented in the external world. In other words: not just the texts are to be filed in the computerized networks, but the entire linguistic system. The search engines, with all their flaws and contradictions, are a kind of advance payment on this project.
Search engines, then, represent language in the network. And this has completely changed the emphasis. The engines face the texts not as additional tools but as the ‘actual’ structure which the texts merely serve; a machine for opening up, but at the same time a condensate that represents the body of texts as a whole.

4

The conjecture that language itself is at issue opens up a new perspective on the internal organization of search engines. And it becomes clear that engines have prominent predecessors in the history of knowledge and historical notions of language.
It is difficult not to see in the hierarchically composed structure of the Yahoo pyramid of concepts those medieval models of the world described for us by writers such as Bolzoni in her history of mnemonics. (12) A large 14th-century panel shows the figure of Jesus in the center of the tree of life, whose branches and leaves all contain stations in his earthly existence, his path to the Cross and his transfiguration. A second picture, this time from the 13th century, shows a horse-mounted knight who is riding, sword drawn, toward the Seven Deadly Sins, which are divided up into a scheme of fields branching off step-by-step into the infinite diversity of the individual sins.(13) Bolzoni explains that such schemes initially served didactic mnemonic purposes; order and visualization made it easier to memorize the complex connections. But their actual meaning goes further. The implicit ambition of these systems was to bring the things of the world into a consistent scheme, namely into a necessarily hierarchic scheme that no less necessarily culminated in the concept of God. Only the concept of God was capable of including all other concepts and furnishing a stable center for the pyramidic order. The linguistic structure (the cathedral of concepts) and the architecture of knowledge were superimposed over each other in this ‘order of things’. This metaphysical notion of language has become largely alien to us today. But is it really alien?

As far as Yahoo’s surface is concerned, if you will permit the abrupt return to my subject, it manages without an organizing center. The user faces 14, not one, central categories from which the sub-categories branch off. Thus, the pyramid has lost its tip. Or would it be more appropriate to ask what has taken God’s place?
In a model of the world created by Robert Fludd, (14) an English encyclopedist of the Renaissance, God had already abandoned the center position. What has been retained is a system of strictly concentric rings that contains the things of the world, encompassing a range from minerals to the plants and animals of nature up to the human arts and finally the planetary spheres. The center is occupied by a schematic diagram of the earth, a forerunner of that blue ball the astronauts radio-relayed to Earth. The representation looks like a mandala in which viewers can absorb themselves in order to make contact with a cosmic whole. The new, secularized solution becomes even more distinct in the memory theater of the Italian Camillo which, frequently discussed in the meantime, itself belongs to the history of technical media. At the beginning of the 16th century, Camillo built a wooden construction resembling a small, round theater.(15) Those who ventured inside were confronted by a panel of 7x7 pictures Camillo had commissioned from highly respected painters of the period. The horizontal division corresponded to the seven planetary spheres, the vertical division to seven stages of development from the first principles up to the elements, to the natural world, to the human being, to the arts and, finally, the sciences. In this way, every field in the matrix represented a certain aspect of the cosmos. The images were merely there to convey the general picture, whereas behind them were compartments with the texts written by the great writers and philosophers. It was in these compartments, then, that the user looked for sources, concepts and further information. To this extent, the whole thing was a system of access, and the analogy with search engines becomes evident in the clear separation between the access to the texts and the texts themselves.
Camillo’s theater has finally brought the human being, the viewer, into the center of the construction. The surface of the images is oriented to his view, and solely the beholder’s perspective joins up the 49 fields in the matrix. Exactly that appears to me to be the logic on which Yahoo is based. The very lack of the pinnacle in the pyramid of concepts defines the position taken by the user. As in the optical system of the central perspective, the ‘royal overlooking position’ is reserved for the user/beholder.
Yahoo is indeed an ‘ontology’; but not because Yahoo and likewise ontologies are arbitrary. It is more because they keep things in their place, and define for the user a position relative to this place. Its ontology offers an ordered world. And anything threatening to be lost in the chaotic variety of available texts can take one final respite in the order of the search engine.

The solution, however, is historically outdated, and has been abandoned in the history of philosophy. Because any positively defined hierarchy of concepts is perspectival and arbitrary, it soon reveals those points of friction which represent the beginning of its end. Does this make the solution of the keyword- or semantics-based engines more modern?
It must indeed appear to be so at first glance. The strategy of making the search words dependent on the empirically collected content of the network documents - the texts - imitates the mechanism of language itself. Or the mechanism, to be more precise, by which language arrives at its concepts.
Linguistic theory tells us that the synchronous system of language is created through the accumulation and condensation of an infinite multitude of concrete utterances. The place where condensation takes place is the language user’s memory, where the concrete utterances are submerged; linear texts are forgotten into the structure of our language capability; on the basis of concrete texts, this structure is subject to constant modification and differentiation. Our faculty of language is an abstract copy of speaking - speech and language (discourse and system) are systematically cross-linked. (16)
What this means for the isolated concept is that it accumulates whatever the tangible contexts provide as meaning. It isn’t a one-time act of definition that assigns it a place in the semantic system, but the disorderly chain of its usages; concepts stand for and typify contexts, concepts encapsulate past contexts.(17)

The semantic search engines imitate this accumulation by typifying contexts in order to arrive at concepts - in this case the search concepts. As outlined above, the table of search words is created as a condensed, cumulated copy of the texts. A statistical algorithm draws together comparable contexts, typifies them, and assigns them to the search concepts as the equivalent of their meaning.
A system imbued with such dynamism is superior to the rigidly pre-defined systems, even if the statistical algorithm only imperfectly models the mechanisms of natural language. More complex, closer to intuition, it is bound to offer fewer points of friction. So, once again, what’s the objection?

5

It’s important to remember that, despite all the advances made, the actual fundamental order has remained constant. Just as in Camillo’s wooden theater, we are dealing not only with two instances - a set of reading/writing/searching subjects approaching a second set of written texts - but also with a third instance, namely a system of access that has placed itself between the first two like a grid, or raster.
And if the access system in Camillo’s media machine served to break down the infinite expanse of texts into a manageable number of categories from which the position - from a strictly central perspective - was defined for the observing subject, then this fundamental order remains intact also.
This image makes it clear that it is not necessarily better if the raster cannot be felt. It’s almost the other way round: the less resistance offered by the access system, the more neutral, transparent and weightless it seems, and the more plausible appears the suspicion that it cannot be a question of the nature of things, but of a naturalization strategy.
The raster of categories must purport to be transparent if it does not want to rouse the problems that Yahoo rouses. To avoid the reproach of being arbitrary and exercising a structuring influence on the contents accessed, the raster must instill in the users the impression of being purely a ‘tool’ subject only to utility - the key in the customers’ hand that opens any Sesame, a compliant genie with no ambitions of its own.
This puts the veil of secrecy cast over the algorithms in a somewhat different light. Far more important than the rivalry between different product suppliers is the wish to actually have at one’s disposal a neutral, transparent access machine - and this wish is something the makers share with their customers, and probably with us all. At the basis of the constellation emerges an illusion which organizes the discourse.
Since there is no such thing as algorithms without their own weight, the meta-discourse has to help them out and salvage transparency by means of mere assertions. In the usage of the salutary singular (‘searching the Web’), in the way the algorithms are kept under wraps, in the emphasis on the performance as opposed to the limitations that might be more defining, and in the routine promises that, thanks to Artificial Intelligence, new and even more powerful systems are in the pipeline.(18) In the unawareness and unwillingness to know on the part of the customers, and in the primacy of a practice which mostly, in any case, doesn’t know what it’s doing.

Data processing - and one feels almost cynical in bringing up this point - was propagated with the ideal of creating a very different type of transparency. The promise was to create only structures which were in principle able to be understood - the opposite, in fact, of natural language; to confine itself to the structural side of things, but to describe this in a way that would not only admit analysis, but apparently include the latter from the outset. If programs have now, as Kittler correctly notes, begun to proliferate like natural-language texts, then this is not because the programs (and already even the search engines) have been infected by the natural-language texts. It is because of our need for both: for unlimited complexity and the narcissistic pleasure of having an overview, the variety of speaking and the transparency with regard to the objects, a language without metaphysical hierarchic centering that still maintains its unquestionable coherence.

That our wish is once again doomed to failure is clear from the fact that any number of search engines of different design are competing with each other in the meantime, and that meta-search engines are now said to be able to search through search engines. So there we sit on God’s deserted throne, opposite us the infinite universes of texts, in our hands a few glittering, but deficient, machines. And we feel uneasy.





Notes:

(2) Information acc. to AltaVista, http://www.doubleclick.net, 28 Aug. 98.

(3) http://www.yahoo.com/docs/pr/release192.html, 28 Aug. 98.

(4) Steinberg, Steve G.: Seek and Ye Shall Find (Maybe). In: Wired, No. 4.05. May 1996, pp. 108-114, 174-182, as well as in the on-line issue: http://www.wired.com/wired/4.05/features/indexweb.html.
Tilman Baumgärtel has presented a second investigation: B., T.: Reisen ohne Karte. Wie funktionieren Suchmaschinen? Schriftenreihe des Wissenschaftszentrums Berlin für Sozialforschung, 1998.

(5) Steinberg, op. cit., p. 175.

(6) Interestingly, the self-presentation on the company’s Website offers no current information.

(7) Steinberg, op. cit., p. 113.

(8) Figures acc. to @-online today, No. 3/98.

(9) Information acc. to AltaVista:
http://www.altavista.digital.com/av/content/about_our_technology.html, 28 Aug. 98.

(10) Steinberg, op. cit., p. 175.

(11) Ibid. (additions by H.W.).

(12) Bolzoni, Lina: The Play of Images. The Art of Memory from Its Origins to the Seventeenth Century. In: Corsi, Pietro (ed.): The Enchanted Loom. Chapters in the History of Neuroscience. New York/Oxford: Oxford University Press 1991, pp. 16-65.

(13) Ibid., pp. 27-29.

(14) 'Integrae Naturae speculum artisque imago' (1617), British Library, London.

(15) See, for example, Yates, Frances A.: Gedächtnis und Erinnern. Mnemonik von Aristoteles bis Shakespeare. Weinheim 1991, pp. 123ff (Engl. OE 1966).

(16) The deliberations upon language merely touched upon here are derived from my book Docuverse. Zur Medientheorie der Computer. München: Boer 1997. The language model is outlined in the first chapter, the ideas on cumulation and condensation in the fourth chapter (http://www.rz.uni-frankfurt.de/~winkler).

(17) "It is indeed a characteristic of language - and another aspect of the 'problem of the word' - that it has this constant but never fully realised tendency to encapsulate a kind of complete (but concentrated, compressed) 'argument' in every word: a tendency which is also intrinsically condensatory. Even the most ordinary word, lamp for instance, is the meeting-point for several 'ideas' [...] each of which, if it were unravelled, or decondensed, would require a whole sentence". "Past condensations meet in each word of the language [...] this is to define the lexicon itself as the product of an enormous condensation". (Metz, Christian: The Imaginary Signifier. Bloomington 1982, pp. 225, 239 (French OE 1973-76, first published in book form in 1977)).

(18) The present state of debate is represented by systems such as PointCast - an agent program that searches through the Net on behalf of individual users, and is equipped with their priorities (www.pointcast.com), or NetSum, a program made by the British Telecom Natural Language Labs, which automatically generates abstracts on the basis of language statistics.