(printed in: Read Me! Filtered by Nettime. Brooklyn 1998, pp. 29-36. It
first appeared in German in Telepolis-Online (http://www.heise.de/tp/te/1135/findex.html).
Printed in: W. H.: Suchmaschinen. Metamedien im Internet? In: Becker, Barbara; Paetau,
Michael (eds.): Virtualisierung des Sozialen. Frankfurt/NY 1997, pp. 185-202. The text has
been revised for translation.). - website creation date 22. 12. 98, update: 26. 04. 01,
expiration date 26. 04. 2004, 31 KB, url: www.uni-paderborn.de/~winkler/suchm_e.html,
language: Engl. © H. Winkler 1998
( keys:media, theory, computers, internet, www, Alta Vista, Yahoo, language, Camillo )
Translation: Tom Morrison
We use them daily, and don't know what we're doing. We don't know who
operates them or why, don't know how they're structured, and know little about the
way they function. It's a classic case of the black box - and all the same,
we're abjectly grateful for their existence.
Where, after all, would we be without them? Now that the expanse of Web offerings has
proliferated into the immeasurable, isn't anything that facilitates access useful?
After all, instantly available information is one of the fundamental Utopias of the data
universe.
Nevertheless, I think the engines are worth some consideration, and propose research
should concentrate on the following points. First, the specific impetus of blindness that
determines our handling of these engines. Second, the conspicuously central, even
powerful position the engines now occupy on the Net - and this question
is relevant if one wants to forecast the medium's development trends. Third, I am
interested in the structural assumptions on which the various search engines are based.
Fourth, and finally, a reference to language and linguistic theory which shifts the
engines into a new perspective and a different line of tradition.
1
The main reason search engines occupy a central position on the Net is that they are
launched extremely often; AltaVista, for example, is accessed 32 million times per workday,
if the published statistics can be trusted. (2)
Individual users see the entry of a search command as nothing more than a launching pad to
get something else, but to have attracted so many users to a single address signifies a
great success. The direct economic consequence is that these contacts can be sold, making
the search engines eminently suitable for the placement of advertising and therefore among
the few Net businesses which are in fact profitable. With remarkable openness, Yahoo
writes: "Yahoo! also announced that its registered user base grew to more
than 18 million members [...] reflecting the number of people who have submitted personal
data for Yahoo!'s universal registration process. [...] 'We continued to build on the
strong distribution platform we deliver to advertisers, merchants, and content
providers.'" (3)
Secondly, and even more importantly, the frequency of access means the overall Net
architecture has undergone considerable rearrangement. 32 million accesses per day signify a
thrust in the direction of centralization. This should put on the alert all those who
recently emphasized the decentralized, anti-hierarchical character of the Net, and linked its
universal accessibility with far-reaching hopes for grassroots democracy.
All the same - and that brings me to my second point - this centralization is not
experienced as such. The search engines can occupy such a central position only because
they are assumed to be neutral in a certain way. Offering a service as opposed to content,
they appear as neutral mediators. Is the mediator in fact neutral?
2
The question must be addressed first of all to the design of the search engines. Steve
Steinberg, (4) my main source for the factual
information in the following text, described in an article for Wired the things
normal users don't know about the search engines and, even more importantly, what
they think they don't need to know in order to use them expediently. Steinberg's
first finding is that providers keep secret the exact algorithm on which their functioning
is based.(5) Since the companies in question are
private enterprises and the algorithms are part of their productive assets, the
competition has, above all, to be kept at a distance; only very general information is
disclosed to the public, while the details remain in the dark of the black box. So if we operate
the search engines with relative blindness, there are good economic reasons for this.
Three basic types of search engine can be distinguished. The first type is based on a
system of pre-defined and hierarchically ordered keywords. Yahoo, for instance, employs
human coders to assign new Websites to the categories; the network addresses are delivered
by e-mail messages or hunted down by a search program known as a spider. In 1996, the
company registered 200,000 Web documents in this way. (6)
The above figure alone indicates that coding by human experts quickly reaches its
quantitative limits. Of the estimated total volume of 30-50 million documents
available on the Net in 1996 (7), Yahoo was
offering some 0.4%; current estimates suggest that the total volume has meanwhile grown to
320 million Websites. (8)
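The proportion cited above can be checked with a few lines of arithmetic. The following is only an illustrative sketch; the function name is hypothetical, and the figures are those given in the text (200,000 registered documents against an estimated 30-50 million). Under these assumptions, 0.4% corresponds to the upper estimate of 50 million documents, while the lower estimate gives a somewhat larger share.

```python
def coverage_percent(indexed, total):
    """Share of the estimated total that the index covers, in percent."""
    return 100.0 * indexed / total

# Figures taken from the text (1996); the variable names are illustrative.
yahoo_documents = 200_000
estimate_low = 30_000_000
estimate_high = 50_000_000

share_of_high_estimate = coverage_percent(yahoo_documents, estimate_high)
share_of_low_estimate = coverage_percent(yahoo_documents, estimate_low)

print(f"{share_of_high_estimate:.2f}% of the upper estimate")
print(f"{share_of_low_estimate:.2f}% of the lower estimate")
```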
However, the problems of the classification system itself are even more serious. The
20,000 keywords chosen by Yahoo are known in-house (with restrained self-irony?) as
'the ontology'. But what or who would be in a position to guarantee the
uniformity and inner coherence of such a hierarchy of terms? If pollution, for example, is
listed under 'Society and Culture'/'Environment and Nature'/'Pollution', then the logic
can be accepted to some degree, but every complicated case will lead to classificatory
conflicts that can no longer be solved even by supplementary cross-references.
The construction of the hierarchy appears as a rather hybrid project, but its aim is to
harness to a uniform system of categories millions of completely heterogeneous
contributions from virtually every area of human knowledge, without regard to their
perspectivity, their contradictions and rivalries.
Yahoo's 'ontology' is thus the encumbered heir of those real ontologies whose
recurrent failure can be traced throughout the history of philosophy. And the utilitarian
context alone explains why the philosophical problem in new guise failed to be identified,
and has been re-installed yet again with supreme naivety. If the worst comes to the worst,
you don't find what you're looking for - that the damage is limited is what
separates Yahoo from problems of philosophy.
The second type of search engine manages without a pre-defined classification system and,
even more importantly, without human coders. Systems like 'AltaVista', 'Inktomi' or
'Lycos' generate an 'inverted index' by analysing the texts located. The search method
employed is the full-text variant - word for word, meaning that in the end every single
term used in the original text is contained in the index and available as a search word.
This is less technically demanding than it might appear. For every text analysed, a row is
created in a huge cross-connected table, while the columns represent the general
vocabulary; if a word is used in the text, a bit is set to 'yes', or the number
of usages is noted. An abstract copy of the text is made in this way, condensed to c. 4%
of its original size. The search inquiries now only make use of the table.
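The table described above can be sketched in a few lines. This is a minimal illustration only, not the algorithm of AltaVista or any other engine (those are, as noted, undisclosed); the function names and the sample documents are hypothetical. Each row stands for a text, each column for a word of the overall vocabulary, and a query is answered from the table alone.

```python
import re
from collections import Counter

def tokenize(text):
    """Split a text into lower-case words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_table(documents):
    """Build one row per document and one column per vocabulary word.

    Each cell holds the number of times the word occurs in the document.
    """
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in documents.items()}
    vocabulary = sorted({word for c in counts.values() for word in c})
    table = {doc_id: [c[word] for word in vocabulary] for doc_id, c in counts.items()}
    return vocabulary, table

def search(vocabulary, table, query):
    """Return the ids of documents containing every query word.

    Only the table is consulted; the original texts are not needed.
    """
    terms = tokenize(query)
    if not terms:
        return []
    columns = []
    for word in terms:
        if word not in vocabulary:
            return []
        columns.append(vocabulary.index(word))
    return sorted(d for d, row in table.items() if all(row[c] > 0 for c in columns))

# Hypothetical sample texts.
documents = {
    "doc1": "The oil film on the water",
    "doc2": "A film about the cinema",
    "doc3": "Water pollution and oil",
}
vocabulary, table = build_table(documents)
print(search(vocabulary, table, "oil film"))
print(search(vocabulary, table, "film"))
```

Note that the query 'film' returns both the text on oil films and the one on cinema films: the table records words, not meanings.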
Since the system is fully automatic, the AltaVista spider can evaluate 6 million Net
documents every day. At present, some 125 million texts are represented in the system.(9)
The results of a search are, in fact, impressive. AltaVista delivers extremely useful hit
lists, ordered according to an internal priority system. And those who found what they
were looking for are unlikely to be offended by the fact that AltaVista too keeps its
algorithm under wraps.
There are some problems nevertheless. It is conspicuous that even slight variations in the
query produce wholly different feedback; if you try out various queries for a document you
already know, you will notice that one and the same document is sometimes displayed with
high priority, sometimes with lower priority, and sometimes not at all. This is
irritating, to say the least.
The consequence, in general terms, is that one often does not know how to judge the result
of a search objectively - it remains unclear which documents the system does not supply,
either because the spider has failed to locate them or because the evaluation algorithm
does indeed work otherwise than presumed. Even if the program boastfully claims to be
'searching the Web', the singular form of the noun is illusory, of
course, if you consider the fact that even 125 million texts are only a specific section
of the overall expanse. Furthermore, users for their part can take in only the first 10,
50 or, at most, 100 entries. They too scarcely have the possibility of estimating how this
section relates to the rest of the expanse in terms of content.
The second, and main, problem is however present already in the basic assumption. A
mechanical keyword search presupposes that only such questions will be posed as can
be clearly formulated in words, and differentiated and substantiated through further
keywords. Likewise, nobody will expect the system to be able to include concepts of
similar meaning alongside the query, or to exclude homonyms. Search engines of this type
are wholly insensitive to questions of semantics or, to put it more clearly: their very
point is to exclude semantic problems of the type evident with Yahoo. Yet that is not to
say that the problems themselves are eradicated. They are imposed on the users, who bear the
burden of having to reduce their questions to unambiguous strings of signifiers, and of
having to be satisfied with the mechanically selected result. All questions that cannot be
reduced to keywords fall through the screen of the feasible. Technical and scientific
terms are relatively suitable for such a search, the humanities are less suitable,
and once again they emerge as that soft - all too soft - sphere which should
be circumvented from the outset, if one is unwilling to fall into the abyss.
But the problem of semantics has not been ignored, and efforts in this direction have led
to the third type of search engine. Systems like 'Excite' by Architext, or
'Smart', claim to search no longer mechanically with strings of signifiers,
but on the basis of a factual semantic model. In order to be able to discriminate between
articles on oil films and ones on cinema films, such programs examine the context in which
the respective concepts figure.
"The idea is to take the inverted index of the Web, with its rows of documents and
columns of keywords, and compress it so that documents with roughly similar profiles are
clustered together - even if one uses the word 'movie' and one uses 'film' - because they
have many other words in common."(10)
The result is a matrix where the columns now represent concepts instead of mechanical
keywords.
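How documents with 'roughly similar profiles' might be recognized as belonging together can be illustrated with a simple similarity measure. The sketch below uses cosine similarity between word-count profiles; this is a stand-in chosen for illustration, not the compression technique Steinberg describes, whose details are not given here. The function names and sample sentences are hypothetical.

```python
import math
from collections import Counter

def profile(text):
    """Word-count profile of a text (one row of the table)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count profiles (0.0 to 1.0)."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical sample texts: two on cinema, one on oil films.
movie_doc = profile("the movie director cast actors for the cinema screen")
film_doc = profile("the film director cast actors for the cinema screen")
oil_doc = profile("the oil film spread across the harbour water surface")

same_topic = cosine(movie_doc, film_doc)
shared_word_only = cosine(film_doc, oil_doc)
print(same_topic > shared_word_only)
```

The two cinema texts score as close to each other although one says 'movie' and the other 'film', because they have many other words in common; the oil text shares the word 'film' but little else.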
The exciting thing about this type of engine is that it progresses from mechanical
keywords to content-related concepts; and also that it obtains its categories solely on
the basis of the entered texts, of a statistical evaluation of the documents.
"[The engine] learns about subject categories from the bottom up, instead of
imposing an order from the top down. It is a self-organizing system. [...] To come up with
subject categories, Architext makes only one assumption: words that frequently occur
together are somehow related. As the corpus changes - as new connections emerge between,
say O. J. Simpson and murder - the classification scheme automatically adjusts. The
subject categories reflect the text itself"; "this eliminates two of the
biggest criticisms of library classification: that every scheme has a point of view, and
that every scheme will be constantly struggling against obsolescence." (11)
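The single assumption quoted above, that words which frequently occur together are somehow related, can be sketched as a co-occurrence count. This is an illustration under stated assumptions, not Architext's actual method; the function names and the miniature corpus are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(documents):
    """Count, for every pair of words, the number of documents containing both."""
    pairs = Counter()
    for text in documents:
        words = sorted(set(text.lower().split()))
        pairs.update(combinations(words, 2))
    return pairs

def related(pairs, word, top=3):
    """Return the words that most often share a document with `word`."""
    scores = Counter()
    for (a, b), n in pairs.items():
        if a == word:
            scores[b] += n
        elif b == word:
            scores[a] += n
    return [w for w, _ in scores.most_common(top)]

# Hypothetical miniature corpus.
corpus = [
    "simpson trial verdict",
    "simpson trial jury",
    "harbour oil spill",
]
pairs = cooccurrence(corpus)
print(related(pairs, "simpson", top=1))
```

Because the counts are derived from the corpus alone, adding new texts changes the scores without any manual re-classification, which is the sense in which the scheme 'automatically adjusts'.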
Other designs, such as the 'Context' system by Oracle, attempt to incorporate analyses of
the syntax, and by doing so find themselves in the minefield of how to model natural
language - a problem that has been worked upon in the field of AI since the 1960s, without
convincing results having been produced so far. Evaluating such systems is more
than difficult; and it is even more difficult to forecast their prospects for further
development.
For that reason, I would like to shift the focus of the question from the presented
systems' mode of functioning and their implications and limitations to the sociocultural
question of what their meaning is, what their actual project is in the concurrence of
discourses and media.
3
The path from the hierarchic ontologies via the keyword search and on to the semantic
systems shows, in fact, that it is a matter of a very fundamental question beyond the
pragmatic usage processes. The search engines are not a random tool that
supplements the presented texts and facilitates their handling. On the contrary, they
appear as a systematic counterpart on which the texts are reliant in the sense of a
reciprocal and systematic interrelation.
My assertion is that the search engines occupy exactly that position which -- in the case
of non-machine-mediated communication -- can be claimed by the system of language.
(And that is the main reason why search engines interest me.)
Language, as Saussure clearly showed, breaks down into two modes of being, two aggregate
states. Opposite the linear, materialized texts in the external world -- utterances,
speech events, written matter -- exists the semantic system that, as a knowledge, as a
language competence, has its spatially distributed seat in the minds of the language
users. Minds and texts are therefore always opposite each other.
If access to the data network is now organized via systems based on vocabulary, and if
these systems are being advanced in the direction of semantically qualifying machines,
then this means that language itself, the semantic system, the lexicon, is to be liberated
from the minds and technically implemented in the external world. In other words: not just
the texts are to be filed in the computerized networks, but the entire linguistic system.
The search engines, with all their flaws and contradictions, are a kind of advance payment
on this project.
Search engines, then, represent language in the network. And this has completely
changed the emphasis. The engines face the texts not as additional tools but as the
actual structure which the texts merely serve; a machine for opening up, but
at the same time a condensate that represents the body of texts as a whole.
4
The conjecture that what is at stake is language itself opens up a new perspective on the
internal organization of search engines. And it becomes clear that the engines have prominent
predecessors in the history of knowledge and historical notions of language.
It is difficult not to see in the hierarchically composed structure of the Yahoo pyramid
of concepts those medieval models of the world described for us by writers such as Bolzoni
in her history of mnemonics. (12) A large
14th-century panel shows the figure of Jesus in the center of the tree of life, whose
branches and leaves all contain stations in his earthly existence, his path to the Cross
and his transfiguration. A second picture, this time from the 13th century, shows a
horse-mounted knight who is riding, sword drawn, toward the Seven Deadly Sins, which are
divided up into a scheme of fields branching off step-by-step into the infinite diversity
of the individual sins.(13) Bolzoni explains that
such schemes initially served didactic mnemonic purposes; order and visualization made it
easier to note the complex connections. But their actual meaning goes further. The
implicit ambition of these systems was to bring the things of the world into a consistent
scheme, namely into a necessarily hierarchic scheme that no less necessarily culminated in
the concept of God. Only the concept of God was capable of including all other concepts
and furnishing a stable center for the pyramidal order. The linguistic structure (the
cathedral of concepts) and the architecture of knowledge were superimposed over each other
in this order of things. This metaphysical notion of language has become
largely alien to us today. But is it really alien?
As far as Yahoo's surface is concerned, if you will permit the abrupt return to my
subject, it manages without an organizing center. The user faces 14, not one, central
categories from which the sub-categories branch off. Thus, the pyramid has lost its tip.
Or would it be more appropriate to ask what has taken God's place?
In a model of the world created by Robert Fludd, (14)
an English encyclopedist of the Renaissance, God had already abandoned the center
position. What has been retained is a system of strictly concentric rings that contains the things
of the world, encompassing a range from minerals to the plants and animals of nature up to
the human arts and finally the planetary spheres. The center is occupied by a schematic
diagram of the earth, a forerunner of that blue ball the astronauts radio-relayed to
Earth. The representation looks like a mandala in which viewers can absorb themselves in
order to make contact with a cosmic whole. The new, secularized solution becomes even
more distinct in the memory theater of the Italian Camillo which, frequently discussed in
the meantime, itself belongs to the history of technical media. At the beginning of the
16th century, Camillo built a wooden construction resembling a small, round theater.(15)
Those who ventured inside were confronted by a
panel of 7x7 pictures Camillo had commissioned from highly respected painters of the
period. The horizontal division corresponded to the seven planetary spheres, the vertical
division to seven stages of development from the first principles up to the elements, to
the natural world, to the human being, to the arts and, finally, the sciences. In this
way, every field in the matrix represented a certain aspect of the cosmos. The images were
merely there to convey the general picture, whereas behind them were compartments with the
texts written by the great writers and philosophers. It was in these compartments, then,
that the user looked for sources, concepts and further information. To this extent, the
whole thing was a system of access, and the analogy with search engines becomes evident in
the clear separation between the access to the texts and the texts themselves.
Camillo's theatre has finally brought the human being, the viewer, into the center of
the construction. The surface of the images is oriented to his view, and solely the
beholder's perspective joins up the 49 fields in the matrix. Exactly that appears to
me to be the logic on which Yahoo is based. The very lack of the pinnacle in the pyramid
of concepts defines the position taken by the user. As in the optical system of
central perspective, the royal overlooking position is reserved for the
user/beholder.
Yahoo is indeed an ontology; but not because Yahoo and likewise ontologies are
arbitrary. It is more because they keep things in their place, and define for the user a
position relative to this place. Its ontology offers an ordered world. And anything
threatening to be lost in the chaotic variety of available texts can take one final
respite in the order of the search engine.
The solution, however, is historically outdated, and has been abandoned in the history of
philosophy. Because any positively defined hierarchy of concepts is perspectival and
arbitrary, it soon reveals those points of friction which represent the beginning of its
end. Does this make the solution of the keyword- or semantics-based engines more modern?
It must indeed appear to be so at first glance. The strategy of making the search words
dependent on the empirically collected content of the network documents - the texts -
imitates the mechanism of language itself. Or the mechanism, to be more precise, by which
language arrives at its concepts.
Linguistic theory tells us that the synchronic system of language is created through the
accumulation and condensation of an infinite multitude of concrete utterances. The place
where condensation takes place is the language user's memory, where the concrete
utterances are submerged; linear texts are forgotten into the structure of our language
capability; on the basis of concrete texts, this structure is subject to constant
modification and differentiation. Our faculty of language is an abstract copy of speaking
- speech and language (discourse and system) are systematically cross-linked. (16)
What this means for the isolated concept is that it accumulates whatever the tangible
contexts provide as meaning. It isn't a one-time act of definition that assigns it a
place in the semantic system, but the disorderly chain of its usages; concepts stand for
and typify contexts, concepts encapsulate past contexts.(17)
The semantic search engines imitate this accumulation by typifying contexts in order to
arrive at concepts - in this case the search concepts. As outlined above, the table of
search words is created as a condensed, cumulated copy of the texts. A statistical
algorithm draws together comparable contexts, typifies them, and assigns them to the
search concepts as the equivalent of their meaning.
A system imbued with such dynamism is superior to the rigidly pre-defined systems, even if
the statistical algorithm only imperfectly models the mechanisms of natural language. More
complex, closer to intuition, it is bound to offer fewer points of friction. So, once
again, what's the objection?
5
It's important to remember that, despite all the advances made, the actual
fundamental order has remained constant. Just as in Camillo's wooden theater, we are
dealing not only with two instances - a set of reading/writing/searching subjects
approaching a second set of written texts - but also with a third instance, namely a
system of access that has placed itself between the first two like a grid, or raster.
And if the access system in Camillo's media machine served to break down the infinite
expanse of texts into a manageable number of categories from which the position - from a
strictly central perspective - was defined for the observing subject, then this
fundamental order remains intact also.
This image makes it clear that it is not necessarily better if the raster cannot be felt.
It's almost the other way round: the less resistance offered by the access system,
the more neutral, transparent and weightless it seems, and the more plausible appears the
suspicion that it cannot be a question of the nature of things, but of a naturalization
strategy.
The raster of categories must purport to be transparent if it does not want to
rouse the problems that Yahoo rouses. To avoid the reproach of being arbitrary and
exercising a structuring influence on the contents accessed, the raster must instill in
the users the impression of being purely a tool subject only to utility - the
key in the customer's hand that opens any Sesame, a compliant genie with no ambitions
of its own.
This puts the veil of secrecy cast over the algorithms in a somewhat different light. Far
more important than the rivalry between different product suppliers is the wish to
actually have at one's disposal a neutral, transparent access machine - and this wish is something
the makers share with their customers, and probably with us all. At the basis of the
constellation emerges an illusion which organizes the discourse.
Since there is no such thing as algorithms without their own weight, the meta-discourse
has to help them out and salvage transparency by means of mere assertions: in the usage of
the salutary singular ('searching the Web'), in the way the algorithms are kept
under wraps, in the emphasis on the performance as opposed to the limitations that might
be more defining, and in the routine promises that, thanks to Artificial Intelligence, new
and even more powerful systems are in the pipeline.(18)
It is salvaged, no less, in the unawareness and unwillingness to know on the part of the
customers, and in the primacy of a practice which mostly, in any case, doesn't know what
it's doing.
Data processing - and one feels almost cynical in bringing up this point - was propagated
with the ideal of creating a very different type of transparency. The promise was to
create only structures which were in principle able to be understood - the opposite, in
fact, of natural language; to confine itself to the structural side of things, but to
describe this in a way that would not only admit analysis, but apparently include the
latter from the outset. If programs have now, as Kittler correctly notes, begun to
proliferate like natural-language texts, then this is not because the programs (and
already even the search engines) have been infected by the natural-language texts. It is
because of our need for both: for unlimited complexity and the narcissistic pleasure of
having an overview, the variety of speaking and the transparency with regard to the
objects, a language without metaphysical hierarchic centering that still maintains its
unquestionable coherence.
That our wish is once again doomed to failure is clear from the fact that any number of
search engines of different design are competing with each other in the meantime, and that
meta-search engines are now said to be able to search through search engines. So there we
sit on God's deserted throne, opposite us the infinite universes of texts, in our
hands a few glittering, but deficient, machines. And we feel uneasy.
Notes:
(2) Information acc. to AltaVista, http://www.doubleclick.net, 28 Aug. 98.
(3) http://www.yahoo.com/docs/pr/release192.html, 28 Aug. 98.
(4) Steinberg, Steve G.: Seek and Ye Shall Find (Maybe). In: Wired, No.
4.05. May 1996, pp. 108-114, 174-182, as well as in the on-line issue:
http://www.wired.com/wired/4.05/features/indexweb.html.
Tilman Baumgärtel has presented a second investigation: B., T.: Reisen ohne Karte. Wie
funktionieren Suchmaschinen? Schriftenreihe des Wissenschaftszentrums Berlin für
Sozialforschung, 1998.
(5) Steinberg, op. cit., p. 175.
(6) Interestingly, the self-presentation on the company's Website
offers no current information.
(7) Steinberg, op. cit., p. 113.
(8) Figures acc. to @-online today, No. 3/98.
(9) Information acc. to AltaVista:
http://www.altavista.digital.com/av/ content/about_our_technology.html, 28 Aug. 98.
(10) Steinberg, op. cit., p. 175.
(11) Ibid. (additions by H.W.).
(12) Bolzoni, Lina: The Play of Images. The Art of Memory from Its
Origins to the Seventeenth Century. In: Corsi, Pietro (ed.): The Enchanted Loom. Chapters
in the History of Neuroscience. New York/Oxford: Oxford University Press 1991, pp.
16-65.
(13) Ibid., pp. 27-29.
(14) 'Integrae Naturae speculum artisque imago' (1617), British
Library, London.
(15) See, for example, Yates, Frances A.: Gedächtnis und Erinnern.
Mnemonik von Aristoteles bis Shakespeare. Weinheim 1991, pp. 123ff (Engl. OE 1966).
(16) The deliberations upon language merely touched upon here are
derived from my book Docuverse. Zur Medientheorie der Computer. München: Boer
1997. The language model is outlined in the first chapter, the ideas on cumulation and
condensation in the fourth chapter (http://www.rz.uni-frankfurt. de/~winkler).
(17) "It is indeed a characteristic of language - and another
aspect of the 'problem of the word' - that it has this constant but never fully realised
tendency to encapsulate a kind of complete (but concentrated, compressed) 'argument' in
every word: a tendency which is also intrinsically condensatory. Even the most ordinary
word, lamp for instance, is the meeting-point for several 'ideas' [...] each of which, if
it were unravelled, or decondensed, would require a whole sentence". "Past
condensations meet in each word of the language [...] this is to define the lexicon itself
as the product of an enormous condensation". (Metz, Christian: The Imaginary
Signifier. Bloomington 1982, pp. 225, 239 (French OE 1973-76, first published in book form
in 1977)).
(18) The present state of debate is represented by systems such as
PointCast - an agent program that searches through the Net on behalf of individual users,
and is equipped with their priorities (www.pointcast.com), or NetSum, a program made by
the British Telecom Natural Language Labs, which automatically generates abstracts on the
basis of language statistics.