Wednesday, December 09, 2020

Memory Allocation in Rust

I am porting a project from C to Rust. The C project manages all memory allocation using chunk allocators. All memory is freed at the end. Another property of the implementation is that a memory reference to an object is never reused or changed, so that any code holding a reference doesn't need to worry that the object it refers to will somehow be deleted and the reference will become invalid.

When porting to Rust, I had to decide whether to keep the existing model or adopt a different strategy where each allocation is freed as soon as it is no longer needed. In the end I decided to retain the existing design, because freeing objects at intermediate stages is not only inefficient, it is also not always feasible.

In the interest of learning Rust I decided to implement my own allocators in Rust, mimicking how they work in the C code.

One of the first things I needed was a string allocator. In the C version a chunk allocator hands out string slices which are then used to cache all possible strings. I won't show the C code here, but it is quite a straightforward implementation.

One key point here is that the string slices are guaranteed to be stable, i.e. they do not change or move around.

My first attempt to write this in Rust looked like this:
struct StringChunk {
    chunk: [u8; 1024],
}

struct StringAllocator {
    chunks: Vec<Box<StringChunk>>,
    pos: usize
}

impl StringAllocator {
    fn new() -> Self {
        StringAllocator {
            chunks : vec![Box::new(StringChunk{ chunk: [0; 1024] })],
            pos: 0
        }
    }

    fn alloc_string(&mut self, n: usize) -> &mut [u8] {
        let cur = self.chunks.last_mut();
        match cur {
            None => {
                self.chunks.push(Box::new(StringChunk{ chunk: [0; 1024] }));
                self.pos = 0;
            }
            Some(p) => {
                if self.pos + n > p.chunk.len() {
                    self.chunks.push(Box::new(StringChunk{ chunk: [0; 1024] }));
                    self.pos = 0;
                }
            }
        }
        let i = self.chunks.len()-1;
        let top = &mut self.chunks[i];
        let pos = self.pos;
        self.pos = self.pos + n;
        &mut top.chunk[pos .. self.pos]
    }
}

This version has the issue that strings are allocated only from the topmost chunk, so even if space is available in earlier chunks it is not used.

The allocator returns a slice into a byte array. To ensure that we never invalidate a reference, the vector in StringAllocator holds boxed chunks rather than the chunks themselves. When the vector grows and reallocates, only the pointers to the chunks move; the chunks stay put, so references to slices within them remain valid.
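To convince myself of this, I wrote a small standalone check (not part of the allocator itself) that records the address of the first chunk's data and verifies it has not changed after the vector has grown many times:

```rust
fn main() {
    // A Vec of boxed chunks: the Vec may reallocate as it grows, but each
    // chunk lives in its own heap allocation and never moves.
    let mut chunks: Vec<Box<[u8; 1024]>> = vec![Box::new([0u8; 1024])];
    let first = chunks[0].as_ptr();
    for _ in 0..100 {
        chunks.push(Box::new([0u8; 1024])); // likely forces the Vec to reallocate
    }
    // The first chunk's data still lives at the same address.
    assert_eq!(first, chunks[0].as_ptr());
    println!("chunk address is stable");
}
```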

One curiosity in Rust I discovered is this:

Box::new(StringChunk{ chunk: [0; 1024] })

The above seems to create the byte array of 1024 elements on the stack and then pass it to Box::new(), which presumably copies the array to heap memory. There appears to be no placement-new equivalent in Rust, although I think some library functions are available to work around this.
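One workaround I am aware of (a sketch on my part, not something the port relies on) is to build the buffer with vec!, which allocates directly on the heap, and then convert it into a boxed slice:

```rust
fn main() {
    // vec! allocates its buffer on the heap, so no 1024-byte stack temporary
    // is constructed; into_boxed_slice() hands over ownership as a Box<[u8]>.
    let chunk: Box<[u8]> = vec![0u8; 1024].into_boxed_slice();
    assert_eq!(chunk.len(), 1024);
}
```

The trade-off is that the chunk is now a slice rather than a fixed-size array, so its length is carried at runtime instead of in the type.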

Sunday, December 06, 2020

C Enum equivalent in Rust

In C code we often use enums to represent constants like so:

enum TokenType {
	TOK_OFS = 256,
	TOK_and,
	TOK_break,
	TOK_STRING,

	FIRST_RESERVED = TOK_OFS + 1,
	LAST_RESERVED = TOK_while - TOK_OFS
};

Rust also has an enum type, but it is not at all like the C enum; the Rust enum is more like a discriminated union in C.

Superficially though, you can almost write something like the above in Rust.

enum TokenType {
    TOK_OFS = 256,
    TOK_and,
    TOK_break,
    TOK_STRING,

    FIRST_RESERVED = TOK_OFS + 1,
    LAST_RESERVED = TOK_while - TOK_OFS
}

But this will not compile. Firstly, Rust expects an explicit conversion from an enum variant to an integer, so you can try:

enum TokenType {
    TOK_OFS = 256,
    TOK_and,
    TOK_break,
    TOK_STRING,

    FIRST_RESERVED = TOK_OFS as isize + 1,
    LAST_RESERVED = TOK_while as isize - TOK_OFS as isize
}

However this will not work either, because an enum variant in Rust is not really a constant. Discriminant values must be unique, so we cannot have two variants TOK_and and FIRST_RESERVED with the same value.

Perhaps we could try this:

enum TokenType {
    TOK_OFS = 256,
    TOK_and,
    TOK_break,
    TOK_STRING,

}

const FIRST_RESERVED: isize = TokenType::TOK_OFS as isize + 1;
const LAST_RESERVED: isize = TokenType::TOK_while as isize - TokenType::TOK_OFS as isize;

This compiles, but the enums are a pain to use as constants because of the need to explicitly convert to integer values.
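To illustrate the friction, here is a small sketch (the is_and helper is hypothetical, not from my port) showing that every comparison against a raw token value needs a cast:

```rust
#[allow(non_camel_case_types, dead_code)]
enum TokenType {
    TOK_OFS = 256,
    TOK_and,
    TOK_break,
    TOK_STRING,
}

// Every comparison site needs an explicit `as` cast to reach the integer value.
fn is_and(tok: i32) -> bool {
    tok == TokenType::TOK_and as i32
}

fn main() {
    assert!(is_and(257)); // TOK_and follows TOK_OFS, so its value is 257
    assert!(!is_and(258));
}
```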

In the end I ended up doing:

const TOK_OFS: i32 = 256;

const TOK_and: i32 = 257;
const TOK_break: i32 = 258;
const TOK_STRING: i32 = 301;

const FIRST_RESERVED: i32 = TOK_OFS + 1;
const LAST_RESERVED: i32 = TOK_while - TOK_OFS;

Not great but it gives me what I need.

Beginning Rust

I am translating one of my projects to Rust as a way of learning Rust. To make things more interesting, I am trying to implement my own memory allocators and data structures. After all, in C or C++ that is something I can do easily, so it is worthwhile figuring out how much effort this will be in Rust.

Here is my first piece of code. Let's first look at the C version and then at my attempt to do this in Rust.

struct lexer_state {
	const char *buf;
	size_t bufsize;
	size_t n;
	const char *p;
};  
struct lexer_state *raviX_init_lexer(const char *buf, size_t buflen)
{
	struct lexer_state *ls = (struct lexer_state *)calloc(1, sizeof(struct lexer_state));
	ls->buf = buf;
	ls->bufsize = buflen;
	ls->n = ls->bufsize;
	ls->p = ls->buf;
	return ls;
}
enum { EOZ = -1 }; /* end of stream */
#define cast_uchar(c) cast(unsigned char, c)
static inline int zgetc(struct lexer_state *z) { return z->n-- > 0 ? cast_uchar(*z->p++) : EOZ; }

The goal here is to take a buffer as the input source and return one character at a time, or EOZ when the input is exhausted.

What you can observe above is that the buffer is supplied by the caller, and the C code assumes that the caller will ensure that the buffer is valid as long as lexer_state is active.

Here is my attempt to do this in Rust. Bear in mind that I am new to Rust and this is my first ever Rust code, so I may not be doing this in the best possible way.

pub struct Source<'a> {
    len: usize,
    bytes: &'a [u8],
    n: usize,
}

pub const EOZ: i32 = -1;

impl<'a> Source<'a> {
    pub fn new(input: &'a str) -> Source<'a> {
        Source {
            len: input.len(),
            bytes: input.as_bytes(),
            n: 0,
        }
    }

    pub fn getc(&mut self) -> i32 {
        if self.n >= self.len {
            // Input exhausted: return EOZ without advancing, so repeated
            // calls cannot keep incrementing the position counter.
            EOZ
        } else {
            let ch = self.bytes[self.n] as i32;
            self.n += 1;
            ch
        }
    }
}

The Rust version takes a string as input, and every time getc() is called, it returns a byte from the input string, or EOZ if the input is exhausted.

The main difference in the Rust version is that the compiler tracks that the input is being referenced in the Source struct so that it can ensure that the input is valid as long as the Source struct is active.

The Rust code contains lifetime annotations such as 'a, which is not something I could have worked out as a beginner; fortunately the Visual Studio Code Rust plugin is helpful enough to let me know that the annotation is necessary and to insert it in the right place. I believe the goal of the lifetime annotation is to tie the lifetime of the input string to the byte-array reference inside the Source struct.
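To see it in action, here is a standalone usage sketch (repeating the definitions so it compiles on its own, with the position bump kept inside the else branch):

```rust
pub struct Source<'a> {
    len: usize,
    bytes: &'a [u8],
    n: usize,
}

pub const EOZ: i32 = -1; // end of stream, as in the C version

impl<'a> Source<'a> {
    pub fn new(input: &'a str) -> Source<'a> {
        Source { len: input.len(), bytes: input.as_bytes(), n: 0 }
    }

    pub fn getc(&mut self) -> i32 {
        if self.n >= self.len {
            EOZ
        } else {
            let ch = self.bytes[self.n] as i32;
            self.n += 1;
            ch
        }
    }
}

fn main() {
    let input = String::from("ab");
    let mut src = Source::new(&input);
    assert_eq!(src.getc(), 'a' as i32);
    assert_eq!(src.getc(), 'b' as i32);
    assert_eq!(src.getc(), EOZ); // input exhausted
    assert_eq!(src.getc(), EOZ); // stays at EOZ on repeated calls
}
```

The borrow checker will refuse to compile any code that drops the input string while a Source borrowing it is still alive, which is exactly the guarantee the C version leaves to the caller's discipline.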

Saturday, July 06, 2019

Better late than never: making Linux the main development platform

Although I have been developing on Linux for many years, Windows has always been my primary development environment. Usually I develop first on Windows, and then test on Mac OSX and Linux. But recently as I started to get more into Docker, it has started to make more sense to develop on Linux first.

I have chosen RHEL 7.x as my main development platform. This is an unusual choice, I guess; the availability of the developer edition made it possible. I like that Red Hat makes the latest versions of the programming languages available via their developer tool sets. On Ubuntu I was always using older versions.

Friday, March 29, 2019

Back from C# to Java

I wanted to write a quick post about my move back from C# to Java.

In 2016 I chose to use C# as the language for a system I was developing. The release of .Net core prompted this move. The key benefits I expected from C# were:
  • Higher productivity
  • Ability to generate native executables - unfortunately, this did not materialize as it seems the AOT functionality in .Net Core was not a priority on server platforms
  • Easier integration with external C libraries
  • More memory efficient implementation due to support for primitive types in containers, Struct types etc.
Apart from the disappointment with AOT compilation, C# met my expectations. So then why move back to Java?

Well, primarily because of the Java eco-system. Of course Java has improved significantly in terms of developer productivity since Java 8. It even has variable declarations with type inference now. However, what sets Java apart is the huge eco-system for server side development, largely because of the Apache and Spring projects. With .Net Core I struggled even with basic stuff such as application logging. Maybe this has changed now, but .Net Core 1.0 didn't have any standard way of logging, which meant I had to roll my own.

One thing I am not convinced about is the proliferation of 'async/await' style programming in C#. It just seems wrong that your program will be converted to a state machine. I think if this has to be done, then the approach adopted by Go is better. Anyway, I am digressing.

Thursday, December 22, 2016

SimpleDBM - a NoSQL Transactional DB in Java

The goal of the SimpleDBM project was primarily to teach myself how DBMSes work. In that goal it succeeded, I think, and it was also great fun researching the computer science literature on database technology and applying the techniques invented by the great pioneers in this area.

I highlighted some of the sources I used in the implementation of SimpleDBM in this blog post.

Unfortunately I have not been able to devote much time to SimpleDBM in the past few years, so new features are not being implemented and the project is in maintenance mode; that is, I will fix reported bugs.

It is not possible to know whether anyone is using SimpleDBM. I have not used it in anger in a production environment, so in that sense it is not production software. However, I think its main value today is pedagogical: it can be used to understand and learn the traditional techniques used to implement database engines. The implementation is much better documented than any other open-source DBMS I have come across, partly because I once thought of writing a book on how to implement a DBMS.

The implementation handles some of the hard problems, such as transactions, write ahead logging, concurrent and recoverable BTREE operations, deadlock detection, etc.

The project is now hosted on GitHub. For anyone just wanting to use it, Maven packages are available.

Wednesday, December 07, 2016

What's great about Lua?

Lua is an amazing programming language implementation. I say implementation because it is not just the language itself but how it is implemented that is particularly impressive.

As a programming language, Lua can be characterised as a small but powerful language. The power comes from clever use of a few core meta-mechanisms, as the Lua authors like to put it. A nice introduction to some of these is in the recent talk by Roberto Ierusalimschy at the Lua Workshop 2016.

I used to think that Lua is a simple language, but appearances are deceptive. I now think of Lua as a 'small' and 'powerful' language rather than a 'simple' one.

The language design is clever, but the implementation is what makes it great.

Firstly it is a very compact implementation, just a few C source files, and that's it. No dependencies other than an ANSI C compiler.

Secondly, despite the compact implementation, it features:

  • A byte-code compiler and Virtual Machine.
  • An incremental garbage collector.
  • Extremely fast parser and code generator.
  • And the language is delivered as a library with an extremely well designed C API, that makes it easy to embed Lua as well as extend it.
It is this combination of economical design and beautiful implementation that makes Lua great.

Lua 5.3 Bytecode Reference

Lua bytecodes are not officially documented as they are considered to be an implementation detail. The best attempt to document Lua bytecodes is A No-Frills Introduction to Lua 5.1 VM Instructions by Kein-Hong Man. However, this document is quite old now and does not reflect the changes made since Lua 5.2.

Some time ago I started an attempt to bring this document up-to-date for Lua 5.3. I recently managed to spend a few hours updating the new Lua 5.3 bytecode reference. This is still not complete but the most important bytecodes are covered.

I want to eventually produce a version that is like a specification, i.e., one that allows independent implementations to replicate the bytecode generation. My interest in this is due to my desire to create a new parser and code generator for Lua. 

Thursday, August 25, 2016

Unique problems of a start-up

Well, recently I took the plunge and started my own tech start-up after years of thinking about it. So I have been battling the usual things that many start-ups face, I guess.

Of great help are inspirational talks such as these.

First is the talk by the creator of Ruby on Rails, David Heinemeier Hansson, at Startup School '08, where he talks about how to create a successful start-up.


Next is this talk by Walter Bright where he talks about how he started the programming language D, and he describes the way he works.


My journey has only just started. I would love to discuss some of the issues that start-ups face and how I am trying to solve the ones I face.