Perlin is a library that aims to provide information retrieval functionality in a performant and understandable manner.
This blog post is part three of a series about improving indexing speed.
The last blog post identified memory access patterns as the bottleneck for indexing speed. It concluded that chunking allocations and compressing listings during indexing might result in better memory locality and thus solve the problem of LL-cache misses.
In this post, the implementation details of these ideas are described and their impact on performance is evaluated.
Abstract Idea
Let’s start by looking at the adjusted algorithm in an abstract way. As in the last post, there are three basic tasks to be done during indexing:
- Matching term to term_id
- Sorting and grouping (term_id, doc_id, position)-triplets
- And appending the results to the listing of each term_id
This does not change at all. What changes is the way (document_id, <term_position>) is appended to the inverted index. That task is done by a new struct called `IndexingChunk`.
IndexingChunk
An `IndexingChunk` is an element in an unrolled linked list of compressed indexing data. Its task is to compress listings and append the resulting bytes to itself. When full, it shall report how much of the listing fit into it.
To remove one level of indirection, the payload of this chunk is not modelled as a `Vec` but rather as a fixed-size array:
pub struct IndexingChunk {
    // Some fields omitted
    previous_chunk: u32,
    capacity: u16,
    last_doc_id: u64,
    data: [u8; SIZE],
}
The field `last_doc_id` stores the document_id of the last appended posting and thus makes it possible to delta-encode doc_ids. Also, `capacity` stores the remaining capacity of the chunk, and `previous_chunk` links a chunk to its predecessor; it is filled in once the next chunk is created (the new chunk then points back to the full one).
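Purely for illustration, here is a minimal sketch of what delta and vbyte encoding of doc_ids could look like. The helpers are made up for this post and are not Perlin’s actual encoder, which may use a different byte layout:

// Minimal sketch of delta + vbyte encoding, not Perlin's actual encoder
fn vbyte_encode(mut n: u64, out: &mut Vec<u8>) {
    // Emit 7 bits per byte, least significant group first;
    // the high bit marks the last byte of the number
    loop {
        let byte = (n & 0x7f) as u8;
        n >>= 7;
        if n == 0 {
            out.push(byte | 0x80); // stop bit on the final byte
            return;
        }
        out.push(byte);
    }
}

fn encode_doc_ids(doc_ids: &[u64]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut last_doc_id = 0u64;
    for &doc_id in doc_ids {
        // Store only the difference to the previous doc_id
        vbyte_encode(doc_id - last_doc_id, &mut out);
        last_doc_id = doc_id;
    }
    out
}

fn main() {
    // The doc_ids 1000, 1003, 1010 become the deltas 1000, 3 and 7,
    // which vbyte-encode to 2 + 1 + 1 = 4 bytes instead of 3 * 8 bytes
    let bytes = encode_doc_ids(&[1000, 1003, 1010]);
    assert_eq!(bytes.len(), 4);
}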
The central (and single) method of `IndexingChunk` is `append`, which takes a listing to be appended and returns either `Ok` or the number of postings that fit into the chunk:
pub fn append(&mut self, listing: &[(u64, Vec<u32>)]) -> Result<(), usize>
Simplified and translated into pseudo-code, it does the following:
for every (doc_id, positions) in listing
    delta- and vbyte-encode doc_id
    try to write the resulting bytes to the payload
    for every position in positions
        delta- and vbyte-encode position
        try to write the resulting bytes to the payload
    if any of these writes did not fit
        reset capacity
        return the current iteration over the listing
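A hedged sketch of how this could look in actual Rust follows. The `write_vbyte` helper, the SIZE value and the rollback details are assumptions for this post, not Perlin’s exact code:

// Sketch only: SIZE is a placeholder value and the struct is repeated
// from above so the snippet is self-contained
const SIZE: usize = 104;

pub struct IndexingChunk {
    previous_chunk: u32,
    capacity: u16,
    last_doc_id: u64,
    data: [u8; SIZE],
}

impl IndexingChunk {
    /// Appends postings until the payload runs full.
    /// Returns Err(i) if only the first `i` postings fit.
    pub fn append(&mut self, listing: &[(u64, Vec<u32>)]) -> Result<(), usize> {
        for (i, &(doc_id, ref positions)) in listing.iter().enumerate() {
            // Remember the state so a partially written posting can be rolled back
            let (old_capacity, old_last_doc_id) = (self.capacity, self.last_doc_id);

            // Delta- and vbyte-encode the doc_id (doc_ids arrive in ascending order),
            // then every position
            let mut fits = self.write_vbyte(doc_id - self.last_doc_id);
            self.last_doc_id = doc_id;
            let mut last_position = 0u32;
            for &position in positions {
                fits = fits && self.write_vbyte((position - last_position) as u64);
                last_position = position;
            }

            if !fits {
                // The posting did not fit completely: undo it and report
                // how many postings made it into this chunk
                self.capacity = old_capacity;
                self.last_doc_id = old_last_doc_id;
                return Err(i);
            }
        }
        Ok(())
    }

    // Writes one vbyte-encoded number into the payload;
    // returns false once the payload is full
    fn write_vbyte(&mut self, mut n: u64) -> bool {
        loop {
            if self.capacity == 0 {
                return false;
            }
            let offset = SIZE - self.capacity as usize;
            let byte = (n & 0x7f) as u8;
            n >>= 7;
            self.data[offset] = if n == 0 { byte | 0x80 } else { byte };
            self.capacity -= 1;
            if n == 0 {
                return true;
            }
        }
    }
}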
So far so good. These chunks are organised by a struct called `ChunkedStorage` (this name seems bad and will probably be changed).
ChunkedStorage
`ChunkedStorage` has the task of organising `IndexingChunk`s: it keeps hot chunks (the ones which are currently written to) close together and archives full chunks so that they can be accessed later for querying. In its current state it is not fully functional in the sense that it does not yet provide functionality for querying. But more on that later.
pub struct ChunkedStorage {
    hot_chunks: Vec<IndexingChunk>, // Has the size of the vocabulary
    archived_chunks: Vec<IndexingChunk>,
}
In later iterations, archived_chunks will be replaced by something more general that allows writing chunks to different storage types (e.g. RAM, disk, maybe network).
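Nothing like that exists yet, but purely as an illustration of the idea, such an archive could hide behind a small trait. The names `ChunkArchive` and `RamArchive` are made up for this post and are not part of Perlin:

// Speculative sketch: a backend-agnostic archive for full chunks.
// `IndexingChunk` is the struct defined above.
trait ChunkArchive {
    /// Stores a full chunk and returns the id it can later be retrieved by.
    fn archive(&mut self, chunk: IndexingChunk) -> u32;
    /// Retrieves an archived chunk, e.g. for querying.
    fn get(&self, id: u32) -> &IndexingChunk;
}

// The simplest backend keeps everything in RAM, just like archived_chunks does today
struct RamArchive {
    chunks: Vec<IndexingChunk>,
}

impl ChunkArchive for RamArchive {
    fn archive(&mut self, chunk: IndexingChunk) -> u32 {
        self.chunks.push(chunk);
        (self.chunks.len() - 1) as u32
    }

    fn get(&self, id: u32) -> &IndexingChunk {
        &self.chunks[id as usize]
    }
}

A disk- or network-backed version would then only need to implement the same two methods.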
`ChunkedStorage` has three relevant methods; they are all straightforward and fun:
impl ChunkedStorage {
    pub fn new_chunk(&mut self, id: u64) -> &mut IndexingChunk {
        // Assumes term_ids are handed out contiguously,
        // so `id` is exactly the index of the freshly pushed chunk
        self.hot_chunks.push(IndexingChunk {
            previous_chunk: 0,
            last_doc_id: 0,
            capacity: SIZE as u16,
            data: unsafe { mem::uninitialized() },
        });
        &mut self.hot_chunks[id as usize]
    }

    pub fn next_chunk(&mut self, id: u64) -> &mut IndexingChunk {
        let next = IndexingChunk {
            previous_chunk: self.archived_chunks.len() as u32,
            last_doc_id: self.hot_chunks[id as usize].last_doc_id,
            capacity: SIZE as u16,
            data: unsafe { mem::uninitialized() },
        };
        // Move the full chunk from the hot to the cold
        // That's more fun than I thought
        self.archived_chunks.push(mem::replace(&mut self.hot_chunks[id as usize], next));
        &mut self.hot_chunks[id as usize]
    }

    #[inline]
    pub fn get_current_mut(&mut self, id: u64) -> &mut IndexingChunk {
        &mut self.hot_chunks[id as usize]
    }
}
By keeping the hot chunks in a separate vector, it is ensured that they are close together in memory, which should result in fewer cache misses and thus improve performance.
Adjusted Implementation for Task_3
Task 3 now has to use `ChunkedStorage` and `IndexingChunk`, of course. But it all feels good and comes very naturally:
let mut storage = ChunkedStorage::new(4000);
while let Ok(chunk) = grouped_chunks.recv() {
    let threshold = storage.len();
    for (term_id, listing) in chunk {
        let uterm_id = term_id as usize;
        // Get chunk to write to or create if unknown
        let result = {
            let stor_chunk = if uterm_id < threshold {
                storage.get_current_mut(term_id)
            } else {
                storage.new_chunk(term_id)
            };
            stor_chunk.append(&listing)
        };
        // Listing did not fit into current chunk completely
        // Get the next and put it in there
        // Repeat until done
        if let Err(mut position) = result {
            loop {
                let next_chunk = storage.next_chunk(term_id);
                if let Err(new_position) = next_chunk.append(&listing[position..]) {
                    position += new_position;
                } else {
                    break;
                }
            }
        }
    }
}
Results
Now, I hope you are as excited as I am. Let’s see what we’ve got. I am using the same machine and the same test collection as in the last blog post, so results should be comparable.
./testbench ~/test.bin
DONE! Indexed 100000 documents each 250 terms totalling at 200Mb in 2446ms
At a rate of 81Mb/s
That’s an improvement of 50% or about 25Mb/s over the last version. Let’s see if the cache misses and stalled cycles changed:
perf stat ./testbench ~/test.bin
DONE! Indexed 100000 documents each 250 terms totalling at 200Mb in 2445ms
At a rate of 81Mb/s
100000
Performance counter stats for './testbench /home/jdemler/test.bin':
9168.706474 task-clock:u (msec) # 3.690 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
342,090 page-faults:u # 0.037 M/sec
21,329,469,020 cycles:u # 2.326 GHz
9,293,938,235 stalled-cycles-frontend:u # 43.57% frontend cycles idle
5,693,540,682 stalled-cycles-backend:u # 26.69% backend cycles idle
37,224,017,304 instructions:u # 1.75 insn per cycle
# 0.25 stalled cycles per insn
5,635,594,793 branches:u # 614.655 M/sec
117,443,035 branch-misses:u # 2.08% of all branches
2.484579972 seconds time elapsed
That’s 10% less frontend idle time and +0.5 instructions per cycle.
perf stat -e L1-icache-loads,L1-icache-load-misses,L1-dcache-loads,L1-dcache-load-misses,
LLC-loads,LLC-load-misses ./testbench ~/test.bin
DONE! Indexed 100000 documents each 250 terms totalling at 200Mb in 2389ms
At a rate of 83Mb/s
100000
Performance counter stats for './testbench /home/jdemler/test.bin':
12,159,900,932 L1-icache-loads:u (66.56%)
12,582,764 L1-icache-load-misses:u (66.54%)
7,979,740,445 L1-dcache-loads:u (66.71%)
416,592,405 L1-dcache-load-misses:u # 5.22% of all L1-dcache hits (66.75%)
70,240,608 LLC-loads:u (66.77%)
9,644,759 LLC-load-misses:u # 0.16% of all LL-cache hits (66.77%)
2.429317333 seconds time elapsed
As expected, the LLC-load-misses are down from 56M to 10M. The L1-dcache-load-misses, though, stayed about the same: 400M in the old version vs 416M now.
Conclusion and Future Work
The indexing is now 50% faster, but everything else is broken. I prototyped this change into the existing code, leaving nothing but scorched earth. The next week will be spent cleaning this mess up and marrying the query system with the new chunked indexing system.
Nevertheless, L1-cache misses are still high and could probably be optimised further. But I think micro-optimisations beyond this point should only be considered when real-world applications see a need for them. The current indexing speed should be sufficient for most information retrieval needs in small projects. I have spent more than enough time on it anyway.
