UNI-FIND

Compressed representations of sequences and full-text indexes

Academic Article
Publication Date:
2007
abstract:
Given a sequence S = s_1 s_2 ... s_n of integers smaller than r = O(polylog(n)), we show how S can be represented using nH_0(S) + o(n) bits, so that we can retrieve any s_q, as well as answer rank and select queries on S, in constant time. Here H_0(S) is the zero-order empirical entropy of S, and nH_0(S) is an information-theoretic lower bound on the number of bits needed to store S using any fixed encoding of its symbols. This extends previous results on binary sequences and improves previous results on general sequences, where those queries are answered in O(log r) time. For larger r, we can still represent S in nH_0(S) + o(n log r) bits and answer queries in O(log r / log log n) time. Another contribution of this article is to show how to combine our compressed representation of integer sequences with a compression boosting technique to design compressed full-text indexes that scale well with the size of the input alphabet. Specifically, we design a variant of the FM-index that indexes a string T[1, n] within nH_k(T) + o(n) bits of storage, where H_k(T) is the kth-order empirical entropy of T. Compared to all previous works, our index is the first that removes the alphabet-size dependence from all query times; in particular, counting time is linear in the pattern length. Still, our index uses essentially the same space as the kth-order entropy of the text T, which is the best space obtained in previous work.
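To make the terms in the abstract concrete, the following Python sketch spells out what the zero-order entropy, the access/rank/select queries, and FM-index style counting compute. It is a naive illustration only, not the compressed representation described in the article: every query here scans the sequence in linear time, whereas the article achieves constant time within nH_0(S) + o(n) bits. All function names, the 1-based indexing convention, and the NUL terminator used for the Burrows-Wheeler transform are assumptions made for this example.

```python
import math
from collections import Counter


def zero_order_entropy(seq):
    """H_0(S) in bits per symbol: sum over symbols c of (n_c/n) * log2(n/n_c)."""
    n = len(seq)
    return sum((cnt / n) * math.log2(n / cnt) for cnt in Counter(seq).values())


def access(seq, q):
    """Return s_q, using 1-based position q (assumed convention)."""
    return seq[q - 1]


def rank(seq, c, i):
    """Number of occurrences of c in the prefix s_1 .. s_i (naive linear scan)."""
    return sum(1 for x in seq[:i] if x == c)


def select(seq, c, j):
    """1-based position of the j-th occurrence of c, or None if there is none."""
    seen = 0
    for pos, x in enumerate(seq, start=1):
        if x == c:
            seen += 1
            if seen == j:
                return pos
    return None


def bwt(text):
    """Burrows-Wheeler transform via sorted rotations.

    A NUL character is appended as an assumed unique terminator.
    """
    t = text + "\0"
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(r[-1] for r in rotations)


def count_occurrences(text, pattern):
    """Count occurrences of pattern in text by backward search on the BWT.

    Each pattern symbol costs two rank queries, so with constant-time rank
    the counting time would be linear in the pattern length.
    """
    last = bwt(text)
    # first_row[c] = number of symbols in the BWT strictly smaller than c
    counts = Counter(last)
    first_row, total = {}, 0
    for c in sorted(counts):
        first_row[c] = total
        total += counts[c]
    lo, hi = 0, len(last)  # half-open range of matching sorted rotations
    for c in reversed(pattern):
        if c not in first_row:
            return 0
        lo = first_row[c] + rank(last, c, lo)
        hi = first_row[c] + rank(last, c, hi)
        if lo >= hi:
            return 0
    return hi - lo
```

For instance, with S = [2, 0, 1, 0, 2, 0], `rank(S, 0, 4)` counts the two zeros among the first four symbols, and `select(S, 0, 3)` returns position 6. Rank and select are inverse in the sense that `rank(S, c, select(S, c, j)) == j` whenever the j-th occurrence exists.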
Iris type:
01.01 Journal article
Keywords:
Burrows-Wheeler transform; Text indexing; entropy; text compression; wavelet tree
List of contributors:
Manzini, Giovanni
Handle:
https://iris.cnr.it/handle/20.500.14243/24794
Published in:
ACM TRANSACTIONS ON ALGORITHMS
Journal