Implementing a key-value database in Golang, Part 3 - Indexation
This is the third of several blog posts about implementing a simple key-value store in Golang. All code described in this blog post can be found here: https://github.com/olif/kvdb.
Previous posts:
Intro
In my last post we introduced the append-only log as a way to achieve durability. As we could see from the benchmark tests, the write performance was O(1), i.e. the performance was unaffected by the total number of records residing in the database.
The read performance was affected directly and degraded linearly with the number of existing records, i.e. the amortized time complexity for the read operation was O(n).
In this post (which I hope will be quite short) we will try to increase the read performance by adding an index.
Database indexes
A database index is an additional data structure, derived from the primary data, whose purpose is to support sub-linear time lookups and thereby improve read performance.
Any database index impacts write performance since, when writing data, the index also needs to be updated.
This is an important tradeoff in databases: indexes can significantly speed up reads but they slow down writes, and the tradeoff is left to you as the database maintainer to make.
Implementing a simple index
The aol engine from the previous blog post scaled extremely well for writes since we were just appending data to a file. The read performance, not so much.
We expected poor read behaviour because of how the implementation was made, and we could later confirm this with benchmark tests.
In fact, the average and the worst case scenarios were the same: since several records could contain the same key, we always needed to iterate over all records. This was true even when the key was missing.
So, how can we improve the read performance? What if we knew where in the log file to look?
A very simple indexing strategy is to store the position of each record, together with its key, in memory. When we add a new value, we append it to the log as before, but we also add the key and the offset of the record to a simple hashtable. When we want to look up a value, we check the hashtable to see if the key exists and, if it does, seek to the offset given by the hashtable entry and read the record from there.
Since the index only lives in memory, it needs to be rebuilt from the append-log at startup so that the database can survive a system reboot, for instance.
Implementation
The implementation is straightforward, taking the learnings from the previous blog posts into account.
We first define our index as:
type index struct {
	mutex  sync.RWMutex
	table  map[string]int64
	cursor int64
}

func (i *index) get(key string) (int64, bool) {
	i.mutex.RLock()
	defer i.mutex.RUnlock()

	val, ok := i.table[key]
	return val, ok
}

func (i *index) put(key string, written int64) {
	i.mutex.Lock()
	defer i.mutex.Unlock()

	i.table[key] = i.cursor
	i.cursor += written
}
which stores the current cursor position (the next position in the append-log to which we will append data) and a hashtable that we make safe for concurrent use.
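As a quick illustration of how the cursor advances (the record sizes below are made up, and the snippet assumes the index type defined above):

idx := &index{table: map[string]int64{}}

idx.put("a", 20) // record for "a" starts at offset 0, cursor advances to 20
idx.put("b", 35) // record for "b" starts at offset 20, cursor advances to 55
idx.put("a", 25) // "a" now points at offset 55, the latest record for that key

offset, ok := idx.get("a")
fmt.Println(offset, ok) // 55 true

_, ok = idx.get("missing")
fmt.Println(ok) // false

Note that a key that is written several times simply gets its offset overwritten, so the index always points at the most recent record for that key.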
Rebuilding the index is straightforward:
func buildIndex(filePath string, maxRecordSize int) (*index, error) {
	idx := index{
		cursor: 0,
		table:  map[string]int64{},
	}

	f, err := os.OpenFile(filePath, os.O_RDONLY|os.O_CREATE, 0600)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	scanner, err := record.NewScanner(f, maxRecordSize)
	if err != nil {
		return nil, err
	}

	for scanner.Scan() {
		record := scanner.Record()
		idx.put(record.Key(), int64(record.Size()))
	}

	if err := scanner.Err(); err != nil {
		return nil, fmt.Errorf("could not scan entry, %w", err)
	}

	return &idx, nil
}
The buildIndex method is called when initializing the database. We simply iterate over all records in the append-log and add each key with its corresponding offset to the index.
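For completeness, wiring this together at startup could look roughly like the sketch below. The constructor name and the exact Store fields are illustrative here and not necessarily identical to the ones in the repository:

// NewStore opens (or creates) the append-log at storagePath and
// rebuilds the in-memory index from it.
func NewStore(storagePath string, maxRecordSize int) (*Store, error) {
	idx, err := buildIndex(storagePath, maxRecordSize)
	if err != nil {
		return nil, err
	}

	return &Store{
		storagePath:   storagePath,
		maxRecordSize: maxRecordSize,
		index:         idx,
	}, nil
}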
We are now ready to implement the rest of the engine, and this time we will start with the Put and Delete operations.
// Put stores the value for the given key
func (store *Store) Put(key string, value []byte) error {
	record := record.NewValue(key, value)
	return store.append(record)
}

// Delete deletes the data for the given key
func (store *Store) Delete(key string) error {
	record := record.NewTombstone(key)
	return store.append(record)
}

func (store *Store) append(record *record.Record) error {
	if record.Size() > store.maxRecordSize {
		msg := fmt.Sprintf("key-value too big, max size: %d", store.maxRecordSize)
		return kvdb.NewBadRequestError(msg)
	}

	store.writeMutex.Lock()
	defer store.writeMutex.Unlock()

	file, err := os.OpenFile(store.storagePath, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0600)
	if err != nil {
		return fmt.Errorf("could not open file: %s for write, %w", store.storagePath, err)
	}
	defer file.Close()

	n, err := record.Write(file)
	if err != nil {
		return fmt.Errorf("could not write record to file: %s, %w", store.storagePath, err)
	}

	if !store.async {
		if err := file.Sync(); err != nil {
			return err
		}
	}

	if err := file.Close(); err != nil {
		return err
	}

	// Only update the index once the record has been written (and synced).
	store.index.put(record.Key(), int64(n))
	return nil
}
The only thing that differs from the aol implementation is that we update the
index just before returning from the append
method.
The Get operation is also very similar to the aol implementation:
// Get returns the value for the given key, a kvdb.NotFoundError if the
// key was not found or other errors encountered
func (store *Store) Get(key string) ([]byte, error) {
	offset, ok := store.index.get(key)
	if !ok {
		return nil, kvdb.NewNotFoundError(key)
	}

	f, err := os.OpenFile(store.storagePath, os.O_CREATE|os.O_RDONLY, 0600)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	if _, err := f.Seek(offset, io.SeekStart); err != nil {
		return nil, err
	}

	scanner, err := record.NewScanner(f, store.maxRecordSize)
	if err != nil {
		return nil, err
	}

	if scanner.Scan() {
		record := scanner.Record()
		if record.IsTombstone() {
			return nil, kvdb.NewNotFoundError(key)
		}
		return record.Value(), nil
	}

	return nil, kvdb.NewNotFoundError(key)
}
We first look up the offset of the record for the given key. If the key does not exist in the index, it does not exist in the append-log either, so we can return early.
If the key exists, we use the offset and `Seek` to that position in the append-log before reading the record.
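With Put, Delete and Get in place, using the engine end to end could look something like this (NewStore is still the hypothetical constructor sketched earlier, and error handling is shortened for brevity):

store, err := NewStore("/tmp/kvdb.log", 1024*1024)
if err != nil {
	log.Fatal(err)
}

if err := store.Put("greeting", []byte("hello")); err != nil {
	log.Fatal(err)
}

value, _ := store.Get("greeting")
fmt.Println(string(value)) // hello

_ = store.Delete("greeting")
_, err = store.Get("greeting") // returns a kvdb.NotFoundError
fmt.Println(err)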
Performance
By adding the index, we expect the read performance to improve significantly while hopefully having only a limited impact on the write performance.
Let's compare the performance with the aol implementation:
❯ go test -run=^$ -bench='Benchmark(Aol|IndexedAol)(Write)[0-9]*' ./test | prettybench
goos: darwin
goarch: amd64
pkg: github.com/olif/kvdb/test
PASS
benchmark iter time/iter throughput
--------- ---- --------- ----------
BenchmarkAolWrite100Db100-4 12118 95.99 μs/op 1.23 MB/s
BenchmarkIndexedAolWrite100Db100-4 10000 104.12 μs/op 1.13 MB/s
BenchmarkAolWrite1000Db100-4 11229 111.35 μs/op 9.14 MB/s
BenchmarkIndexedAolWrite1000Db100-4 9364 109.09 μs/op 9.33 MB/s
ok github.com/olif/kvdb/test 6.491s
Here we are measuring the time and throughput for writing a single record with a value size of 100 and 1000 bytes respectively, without syncing to disk. As we can see, there is a small difference in favor of the aol implementation, but it is not significant.
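For reference, the write benchmarks essentially measure a loop of Put calls. A rough sketch of such a benchmark (assuming the hypothetical NewStore constructor from above; the actual test code in the repository may differ) could look like this:

func BenchmarkIndexedAolWrite(b *testing.B) {
	store, err := NewStore(filepath.Join(b.TempDir(), "kvdb.log"), 1024*1024)
	if err != nil {
		b.Fatal(err)
	}

	value := make([]byte, 100) // 100-byte values, as in the Write100 benchmarks
	b.SetBytes(int64(len(value)))
	b.ResetTimer()

	for i := 0; i < b.N; i++ {
		if err := store.Put(fmt.Sprintf("key-%d", i), value); err != nil {
			b.Fatal(err)
		}
	}
}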
Now let's check out the difference in Read performance:
❯ go test -run=^$ -bench='Benchmark(Aol|IndexedAol)(Read)[0-9]*' ./test | prettybench
goos: darwin
goarch: amd64
pkg: github.com/olif/kvdb/test
PASS
benchmark iter time/iter
--------- ---- ---------
BenchmarkAolRead10000Db100-4 3 495.21 ms/op
BenchmarkIndexedAolRead10000Db100-4 814 1.53 ms/op
ok github.com/olif/kvdb/test 10.470s
Here we are reading a random value from a database containing 10,000 records and, as we can see, the performance gain from using the index is huge.
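A read benchmark in the same spirit might look roughly like the sketch below (again assuming the hypothetical NewStore from above): it pre-populates the store and then reads random keys inside the timed loop.

func BenchmarkIndexedAolRead(b *testing.B) {
	store, err := NewStore(filepath.Join(b.TempDir(), "kvdb.log"), 1024*1024)
	if err != nil {
		b.Fatal(err)
	}

	// Pre-populate the database with 10000 records of 100-byte values.
	const nRecords = 10000
	value := make([]byte, 100)
	for i := 0; i < nRecords; i++ {
		if err := store.Put(fmt.Sprintf("key-%d", i), value); err != nil {
			b.Fatal(err)
		}
	}
	b.ResetTimer()

	for i := 0; i < b.N; i++ {
		key := fmt.Sprintf("key-%d", rand.Intn(nRecords))
		if _, err := store.Get(key); err != nil {
			b.Fatal(err)
		}
	}
}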
Comparing how the indexed aol scales against the plain aol implementation is also very encouraging:
❯ go test -run=^$ -bench='Benchmark(IndexedAol|Aol)ReadFrom*' ./test | prettybench
goos: darwin
goarch: amd64
pkg: github.com/olif/kvdb/test
PASS
benchmark iter time/iter
--------- ---- ---------
BenchmarkIndexedAolReadFrom10Db-4 78820 13.06 μs/op
BenchmarkIndexedAolReadFrom100Db-4 99619 12.79 μs/op
BenchmarkIndexedAolReadFrom1000Db-4 100755 13.24 μs/op
BenchmarkIndexedAolReadFrom10000Db-4 82573 15.08 μs/op
BenchmarkAolReadFrom10Db-4 76866 14.18 μs/op
BenchmarkAolReadFrom100Db-4 24978 50.28 μs/op
BenchmarkAolReadFrom1000Db-4 3510 383.05 μs/op
BenchmarkAolReadFrom10000Db-4 253 4652.85 μs/op
ok github.com/olif/kvdb/test 19.207s
Here we are reading a random value from a database containing N records, for N = {10, 100, 1000, 10000}. As we can see, the index gives us near constant-time reads, compared to the linear behaviour of the aol without an index.
Success.
Summary
In this post we tried to improve the read performance of our database engine by introducing an index.
Indexes always come with a performance tradeoff between reads and writes. In our case the impact on the write performance was negligible, while the read performance went from O(n) to O(1).
We now have a very simple database engine with good scalability, but it still has a couple of issues that we need to resolve. The most important one is that the overall size of the database is monotonically increasing. Even a Delete operation will increase the overall size, since a Delete is represented by a tombstone record that is appended to the log. Eventually we will run out of disk space.
As usual, we will try to solve this in an upcoming blog post.