Implementing a key-value database in Golang, Part 3 - Indexation

/images/kvdb-3/kvdb-3.jpg

This is the third of several blog posts about implementing a simple key-value store in Golang. All code described in this blog post can be found here: https://github.com/olif/kvdb.

Previous posts:

Intro

In my last post we introduced the append-only log as a way to achieve durability. As we could see from the benchmark tests, the write performance was O(1), i.e. the performance was unaffected by the total number of records residing in the database.

The read performance was affected directly and degraded linearly with the number of existing records, i.e. the time complexity for the read operation was O(n).

In this post (which I hope will be quite short) we will try to increase the read performance by adding an index.

Database indexes

A database index is an additional data structure, derived from the primary data, whose purpose is to support sub-linear time lookups and thereby improve read performance.

All database indexes impact write performance since, when writing data, the index also needs to be updated.

This is an important tradeoff in databases: indexes can significantly speed up reads, but they slow down writes, and it is up to you as the database maintainer to decide whether the tradeoff is worth it.

Implementing a simple index

The aol engine from the previous blog post scaled extremely well for writes since we were just appending data to a file. The read performance, not so much. We did expect poor read behaviour because of how the implementation was made, and we could later confirm this with benchmark tests.

In fact, the average and the worst case were the same since several records could contain the same key, which meant that we always needed to iterate over all records. This was true even when the key was missing.

So, how can we improve the read performance? What if we knew where in the log file to look?

A very simple indexing strategy could be to store the position of each record together with its key in memory. When we add a new value, we append it to the log as before, but we also add the key and the record's offset to a simple hashtable. When we want to look up a value we check the hashtable to see if the key exists and, if it does, we seek to the offset given by the hashtable value and then read the record.
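
A read then becomes just a hashtable lookup followed by a seek. The snippet below is only a conceptual sketch of that idea (readRecordAt is a made-up helper, not part of the kvdb code); the real implementation follows further down.

func lookup(logFile *os.File, offsets map[string]int64, key string) ([]byte, error) {
	// Look up the byte offset of the record in the in-memory hashtable.
	offset, ok := offsets[key]
	if !ok {
		return nil, fmt.Errorf("key %q not found", key)
	}

	// Jump straight to the record instead of scanning the whole log.
	if _, err := logFile.Seek(offset, io.SeekStart); err != nil {
		return nil, err
	}

	// readRecordAt is a hypothetical helper that decodes a single record
	// starting at the current file position.
	return readRecordAt(logFile)
}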

/images/kvdb-3/kvdb-3-simpleindex.svg
Simple hash index with append log where the crc and type fields have been omitted

Since the index is only kept in memory, it needs to be rebuilt from the append-log when the database starts, so that the database can survive a system reboot, for instance.

Implementation

The implementation is straightforward, taking the learnings from the previous blog posts into account.

We first define our index as:

type index struct {
	mutex  sync.RWMutex
	table  map[string]int64 // key -> byte offset of its latest record in the append-log
	cursor int64            // next write position in the append-log
}

func (i *index) get(key string) (int64, bool) {
	i.mutex.RLock()
	defer i.mutex.RUnlock()
	val, ok := i.table[key]
	return val, ok
}

func (i *index) put(key string, written int64) {
	i.mutex.Lock()
	defer i.mutex.Unlock()
	i.table[key] = i.cursor
	i.cursor += written
}

which stores the current cursor position (the next position in the append-log to which we will append data) and a hashtable that we make safe for concurrent use.
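
To make the cursor arithmetic concrete, here is a small illustrative example (the record sizes are made up):

func main() {
	idx := &index{table: map[string]int64{}}

	idx.put("a", 25) // "a" stored at offset 0, cursor advances to 25
	idx.put("b", 40) // "b" stored at offset 25, cursor advances to 65
	idx.put("a", 30) // "a" now points at offset 65, its latest record

	fmt.Println(idx.get("a")) // 65 true
}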

Rebuilding the index is straightforward:

func buildIndex(filePath string, maxRecordSize int) (*index, error) {
	idx := index{
		cursor: 0,
		table:  map[string]int64{},
	}

	f, err := os.OpenFile(filePath, os.O_RDONLY|os.O_CREATE, 0600)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	scanner, err := record.NewScanner(f, maxRecordSize)
	if err != nil {
		return nil, err
	}

	for scanner.Scan() {
		record := scanner.Record()
		idx.put(record.Key(), int64(record.Size()))
	}

	if err := scanner.Err(); err != nil {
		return nil, fmt.Errorf("could not scan entry, %w", err)
	}

	return &idx, nil
}

The buildIndex function is called when initializing the database. We just iterate over all the data in our append-log and add each key with its corresponding offset to our index.
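
To give an idea of how this fits together, here is a sketch of how a store constructor could rebuild the index at start-up. It assumes the Store fields referenced in the snippets below (storagePath, maxRecordSize, async and index); the actual constructor in the repository may look different.

func NewStore(storagePath string, maxRecordSize int, async bool) (*Store, error) {
	// Rebuild the in-memory index from the append-log before serving requests.
	idx, err := buildIndex(storagePath, maxRecordSize)
	if err != nil {
		return nil, fmt.Errorf("could not build index: %w", err)
	}

	return &Store{
		storagePath:   storagePath,
		maxRecordSize: maxRecordSize,
		async:         async,
		index:         idx,
	}, nil
}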

We are now ready to implement the rest of the engine and this time we will start with the Put / Delete operations.

// Put stores the value for the given key
func (store *Store) Put(key string, value []byte) error {
	record := record.NewValue(key, value)
	return store.append(record)
}

// Delete deletes the data for the given key
func (store *Store) Delete(key string) error {
	record := record.NewTombstone(key)
	return store.append(record)
}

func (store *Store) append(record *record.Record) error {
	if record.Size() > store.maxRecordSize {
		msg := fmt.Sprintf("key-value too big, max size: %d", store.maxRecordSize)
		return kvdb.NewBadRequestError(msg)
	}

	store.writeMutex.Lock()
	defer store.writeMutex.Unlock()

	file, err := os.OpenFile(store.storagePath, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0600)
	if err != nil {
		return fmt.Errorf("could not open file: %s for write, %w", store.storagePath, err)
	}
	defer file.Close()

	n, err := record.Write(file)
	if err != nil {
		return fmt.Errorf("could not write record to file: %s, %w", store.storagePath, err)
	}

	if !store.async {
		if err := file.Sync(); err != nil {
			return err
		}
	}

	if err := file.Close(); err != nil {
		return err
	}

	store.index.put(record.Key(), int64(n))
	return nil
}

The only thing that differs from the aol implementation is that we update the index just before returning from the append method.

The Get operation is also very similar to the aol implementation:

// Get returns the value for the given key, a kvdb.NotFoundError if the
// key was not found or other errors encountered
func (store *Store) Get(key string) ([]byte, error) {
	offset, ok := store.index.get(key)
	if !ok {
		return nil, kvdb.NewNotFoundError(key)
	}

	f, err := os.OpenFile(store.storagePath, os.O_CREATE|os.O_RDONLY, 0600)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	_, err = f.Seek(offset, io.SeekStart)
	if err != nil {
		return nil, err
	}

	scanner, err := record.NewScanner(f, store.maxRecordSize)
	if err != nil {
		return nil, err
	}

	if scanner.Scan() {
		record := scanner.Record()

		if record.IsTombstone() {
			return nil, kvdb.NewNotFoundError(key)
		}

		return record.Value(), nil
	}

	return nil, kvdb.NewNotFoundError(key)
}

We first try to get the offset of the record for the given key. If the key does not exist in our index, it does not exist in the append-log either, so we can return early.

In case the key exists, we use the offset to `Seek` to that position in the append-log and read the record from there.
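
Putting it all together, using the store could look something like this (the constructor here is the hypothetical sketch from above, not necessarily the repository's actual API):

func main() {
	store, err := NewStore("./kvdb.log", 1024*1024, false)
	if err != nil {
		log.Fatal(err)
	}

	if err := store.Put("greeting", []byte("hello")); err != nil {
		log.Fatal(err)
	}

	value, err := store.Get("greeting") // seeks directly to the record's offset
	fmt.Println(string(value), err)     // hello <nil>

	_ = store.Delete("greeting")   // appends a tombstone and updates the index
	_, err = store.Get("greeting") // now returns a kvdb.NotFoundError
	fmt.Println(err)
}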

Performance

By adding the index we expect our read performance to improve significantly, hopefully with only a very limited impact on our write performance.

Let's compare the performance with the aol implementation:

❯ go test -run=^$ -bench='Benchmark(Aol|IndexedAol)(Write)[0-9]*' ./test | prettybench
goos: darwin
goarch: amd64
pkg: github.com/olif/kvdb/test
PASS
benchmark                              iter       time/iter   throughput
---------                              ----       ---------   ----------
BenchmarkAolWrite100Db100-4           12118     95.99 μs/op    1.23 MB/s
BenchmarkIndexedAolWrite100Db100-4    10000    104.12 μs/op    1.13 MB/s
BenchmarkAolWrite1000Db100-4          11229    111.35 μs/op    9.14 MB/s
BenchmarkIndexedAolWrite1000Db100-4    9364    109.09 μs/op    9.33 MB/s
ok  	github.com/olif/kvdb/test	6.491s

Here we are measuring the time and throughput for writing a single record with a value size of 100 and 1000 bytes respectively, without syncing to disk. As we can see, there is a small difference in favor of the aol implementation, but it is not significant.
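
For reference, a write benchmark of this kind can be expressed with Go's built-in testing package roughly as follows. This is a sketch rather than the exact benchmark code from the repository, and it reuses the hypothetical NewStore constructor from above:

func BenchmarkIndexedAolWrite100(b *testing.B) {
	// Open the store against a temporary directory, with async writes
	// (no fsync) to match the benchmark setup described above.
	store, err := NewStore(filepath.Join(b.TempDir(), "kvdb.log"), 1024*1024, true)
	if err != nil {
		b.Fatal(err)
	}

	value := make([]byte, 100) // 100-byte values, matching the table above
	b.SetBytes(int64(len(value)))
	b.ResetTimer()

	for i := 0; i < b.N; i++ {
		if err := store.Put(fmt.Sprintf("key-%d", i), value); err != nil {
			b.Fatal(err)
		}
	}
}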

Now let's check out the difference in Read performance:

❯ go test -run=^$ -bench='Benchmark(Aol|IndexedAol)(Read)[0-9]*' ./test | prettybench
goos: darwin
goarch: amd64
pkg: github.com/olif/kvdb/test
PASS
benchmark                             iter      time/iter
---------                             ----      ---------
BenchmarkAolRead10000Db100-4             3   495.21 ms/op
BenchmarkIndexedAolRead10000Db100-4    814     1.53 ms/op
ok  	github.com/olif/kvdb/test	10.470s

We are reading a random value from a database containing 10,000 records and, as we can see, the performance gain from using the index is huge.

Comparing how the indexed aol scales against the plain aol implementation also looks very promising:

❯ go test -run=^$ -bench='Benchmark(IndexedAol|Aol)ReadFrom*' ./test | prettybench
goos: darwin
goarch: amd64
pkg: github.com/olif/kvdb/test
PASS
benchmark                                iter        time/iter
---------                                ----        ---------
BenchmarkIndexedAolReadFrom10Db-4       78820      13.06 μs/op
BenchmarkIndexedAolReadFrom100Db-4      99619      12.79 μs/op
BenchmarkIndexedAolReadFrom1000Db-4    100755      13.24 μs/op
BenchmarkIndexedAolReadFrom10000Db-4    82573      15.08 μs/op
BenchmarkAolReadFrom10Db-4              76866      14.18 μs/op
BenchmarkAolReadFrom100Db-4             24978      50.28 μs/op
BenchmarkAolReadFrom1000Db-4             3510     383.05 μs/op
BenchmarkAolReadFrom10000Db-4             253    4652.85 μs/op
ok  	github.com/olif/kvdb/test	19.207s

Here we are reading a random value from a database that contains N records, N = {10, 100, 1000, 10000}, and as we can see, the index gives us near constant-time performance compared to the linear performance of the aol without an index.

Success.

Summary

In this post we tried to improve the read performance of our database engine by introducing an index.

Indexes always come with a performance tradeoff between reads and writes. In our case the impact on the write performance was negligible, while the read performance went from O(n) to O(1).

We now have a very simple database engine with good scalability, but it still has a couple of issues that we need to resolve. The most important one is that the overall size of the database is monotonically increasing. Even a Delete operation increases the overall size since a delete is represented by a tombstone record appended to the log. Eventually we will run out of disk space.

As usual, we will try to solve this in an upcoming blog post.
