Skip to Main Content (Press Enter)

Logo CNR
  • ×
  • Home
  • Persone
  • Pubblicazioni
  • Strutture
  • Competenze

UNI-FIND
Logo CNR

|

UNI-FIND

cnr.it
  • ×
  • Home
  • Persone
  • Pubblicazioni
  • Strutture
  • Competenze
  1. Pubblicazioni

Discovery of ads web hosts through traffic data analysis

Contributo in Atti di convegno
Data di Pubblicazione:
2004
Abstract:
One of the most actual problems on web crawling -- the most expensive task of any search engine, in terms of time and bandwidth consumption -- is the detection of useless segments of Internet. In some cases such segments are purposely created to deceive the crawling engine while, in others, they simply do not contain any useful information. Currently, the typical approach to the problem consists in using a human-compiled blacklist of sites to avoid (e.g., advertising sites and web counters), but, due to the strongly dynamical nature of Internet, keeping them manually up-to-date is quite unfeasible. In this work we present a web usage statistics-based solution to the problem, aimed at automatically -- and, therefore, dynamically -- building blacklists of sites that the users of a monitored web-community consider (or appear to consider) useless or uninteresting. Our method performs a linear time complexity analysis on the traffic information which yields an abstraction of the linked web which can be incrementally up- dated, therefore allowing a streaming computation. The crawler can use the list produced in this way to prune out such sites or to give them a low priority before the (re-)spidering activity starts and, therefore, without analysing the content of crawled documents.
Tipologia CRIS:
04.01 Contributo in Atti di convegno
Keywords:
Data Mining
Elenco autori:
Pedreschi, Dino; Giannotti, Fosca; Nanni, Mirco
Autori di Ateneo:
NANNI MIRCO
Link alla scheda completa:
https://iris.cnr.it/handle/20.500.14243/57527
  • Dati Generali

Dati Generali

URL

http://doi.acm.org/10.1145/1008707
  • Utilizzo dei cookie

Realizzato con VIVO | Designed by Cineca | 26.5.0.0 | Sorgente dati: PREPROD (Ribaltamento disabilitato)