What is a domain-specific language and why would I want one?
Domain-specific languages are languages designed for working in a specific domain, so they typically have specialised syntax or structures appropriate to that domain. Think of SQL: its declarative nature maps naturally onto the structured data that relational databases are composed of. You could imagine a database being queried using an imperative language like Python, but it's hardly suited to the task (at least in its traditional form).
What we're talking about today is an implementation of a domain-specific language within Python, that is, leveraging some of the features Python provides to create a syntax that is more conducive to a particular task. There are many applications for this style of programming, but in this case we'll implement a data query-like language, similar to that provided by SQLAlchemy, using descriptors, operator overloading, and a few other tricks.
In most respects this is just syntactic sugar, but as you'll see, the powerful and transparent nature of Python's data model actually affords us quite a bit of customisability.
As we go through the article we'll build up some snippets of source code. If you'd like to have the full source to refer to as you go (or you just like skipping to the end without having to scroll) you can find it here: https://bitbucket.org/hatchd/hatchd-blog-dsls
What are descriptors?
Descriptors are the primary language feature we'll use to implement our syntax.
In essence a descriptor allows you to control access to attributes on an object. Because they are implemented as attributes on a class (as distinct from a property on an object) they enable some of the fundamental techniques that we use to construct a domain-specific language. If you've used Python before you've no doubt come across the @property decorator - this is close to the simplest implementation of a descriptor, causing getter and setter methods to be called when an attribute is accessed:
    >>> class MyClass(object):
    ...     @property
    ...     def my_prop(self):
    ...         print('getting my_prop')
    ...         return 1
    ...     @my_prop.setter
    ...     def my_prop(self, value):
    ...         print('setting my_prop to', value)
    >>> o = MyClass()
    >>> print(o.my_prop)
    getting my_prop
    1
    >>> o.my_prop = 1
    setting my_prop to 1
    >>> print(MyClass.my_prop) #doctest: +ELLIPSIS
    <property object at...>
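To make the link between @property and the descriptor protocol concrete, here is a small sketch that looks the property object up in the class __dict__ and calls its __get__ method by hand. The Temperature class and its attribute names are hypothetical, chosen only for illustration.

```python
class Temperature(object):
    """Hypothetical class used only to illustrate the protocol."""

    def __init__(self, celsius):
        self._celsius = celsius

    @property
    def celsius(self):
        return self._celsius


# The property object lives on the class, not on the instance.
prop = Temperature.__dict__['celsius']

# property implements the descriptor methods.
has_get = hasattr(prop, '__get__')
has_set = hasattr(prop, '__set__')

t = Temperature(21.5)
via_attribute = t.celsius                    # normal attribute access
via_protocol = prop.__get__(t, Temperature)  # the same lookup, made by hand
```

Both lookups go through the same __get__ method, which is the hook we'll be customising in the rest of the article.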
Let's get started!
First, let's take a look at what we're aiming for. We are going to build a portion of an ORM, just the part that we would use to create a query (the actual database access and query emitter will be left as an exercise for the reader :)). Let's assume we have an Order class and a corresponding table in our imaginary database. This is what we want to be able to do:
    from ql import Query

    my_query = Query(Order).where(Order.value >= 500, Order.value / Order.item_count > 100)
Here we're asking our imaginary database for all orders that have a total value of at least 500 imaginary dollars, and an average item value of more than 100 imaginary dollars. Points of note:
- We're using attributes on the Order class to refer to columns in the database. This has the advantage that your IDE / linter will catch typos.
- The where() method accepts a series of expressions. "Order.value >= 500" is one expression, "Order.value / Order.item_count > 100" is another expression.
We will need to teach Python how an expression like "Order.value >= 500" can evaluate to something that our query parser can understand. The magic at play here is - you guessed it - descriptors.
Let's start with a plain old Python object - an Order class. We want our ORM to work with nice Python objects that we can use normally so this is a logical starting point:
    class Order(object):
        def __init__(self, value=None, item_count=None):
            self.value = value
            self.item_count = item_count

        def __repr__(self):
            return 'Order(%s, %s)' % (repr(self.value), repr(self.item_count))
Simple stuff. Note that instances of Order have a value attribute:
    >>> Order(1).value
    1
but of course Order itself does not:
    >>> Order.value
    AttributeError: type object 'Order' has no attribute 'value'
What we need is a value attribute on the Order type that lets us control an expression like "Order.value > 1", and an attribute on Order instances that acts like a normal data attribute. Descriptors can help us here: they are present at the class level, and can control attribute access at the instance level. Let's implement a simple descriptor that controls access to our value attribute first, before we look at what we can do with descriptors when they are referenced at the class level.
We'll simply control access to 'value' by reading it from, and writing it to, '_value' instead. Paraphrasing the documentation, a descriptor is an object with a __get__ and a __set__ method (and a __delete__ method, but let's not worry about that):
    class ValueDescriptor(object):
        def __get__(self, obj, objtype=None):
            print('__get__', self, obj, objtype)
            return getattr(obj, '_value', None)

        def __set__(self, obj, value):
            print('__set__', self, obj, value)
            setattr(obj, '_value', value)

    class Order(object):
        value = ValueDescriptor()

        def __init__(self, value=None, item_count=None):
            self.value = value
            self.item_count = item_count
Now we should have a value attribute that acts (relatively) normally on instances of Order, and does something when referenced at the class level. What is that something? Let's see:
    >>> o = Order()
    __set__ <__main__.ValueDescriptor object at ...> <__main__.Order object at ...> None
    >>> print(o.value)
    __get__ <__main__.ValueDescriptor object at ...> <__main__.Order object at ...> <class '__main__.Order'>
    None
    >>> o.value = 5.0
    __set__ <__main__.ValueDescriptor object at ...> <__main__.Order object at ...> 5.0
    >>> print(o.value)
    __get__ <__main__.ValueDescriptor object at ...> <__main__.Order object at ...> <class '__main__.Order'>
    5.0
    >>> print(Order.value)
    __get__ <__main__.ValueDescriptor object at ...> None <class '__main__.Order'>
    None
The last part is the most interesting. The descriptor's __get__() method was called, but when accessed in this way obj is set to None (which caused a return of None, since getattr(None, '_value', None) returns None). This is useful because it lets us determine how the attribute is being accessed: if it's being accessed from the class context, we can return something which will allow us to control how an expression like Order.value > 5 is evaluated.
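The difference between the two kinds of access can be sketched in isolation. The Recorder descriptor below is a hypothetical helper, not part of the article's source code; it simply reports whether it was reached through the class or through an instance.

```python
class Recorder(object):
    """Hypothetical descriptor that reports how it was accessed."""

    def __get__(self, obj, objtype=None):
        # obj is None when the attribute is looked up on the class itself.
        kind = 'class' if obj is None else 'instance'
        return (kind, objtype)


class Order(object):
    value = Recorder()


class_access = Order.value        # obj is None, objtype is Order
instance_access = Order().value   # obj is the instance, objtype is Order

# What Python does for Order.value, written out by hand.
manual = Order.__dict__['value'].__get__(None, Order)
```

In both cases objtype is the Order class; only obj tells the two situations apart.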
The seasoned OO programmers among you will see where we need to go next: operator overloading.
Before we move on let's take a brief look at what we might want in our Query objects so that we could use them to generate a database query. Let's keep it simple with the bare minimum:
- Our Order class will most likely correspond to a table or collection or whatever our target database would like, so we need to know that in our query object.
- We need one or more conditions or predicates in order to specify which records should be returned.
- For use in our predicates we need operators like equals or greater than.
- To represent an expression like "Order.value / Order.item_count" we need something to represent a computed value or an operation between fields and constants, which we could then pass to our database. This is a bit more advanced so let's tackle this last.
A real query language needs things like ANDs and ORs, sorting and limits, but let's forget about that for now and concentrate on tables and predicates. Here are some simple classes we'll use:
    class Predicate(object):
        def __init__(self, field, operator, value):
            self.field = field
            self.operator = operator
            self.value = value

    class Query(object):
        def __init__(self, query_class, predicates=None):
            self.query_class = query_class
            self.predicates = predicates or []

        def where(self, *predicates):
            return Query(self.query_class, self.predicates + list(predicates))
Note that where() returns a new Query, so it can be chained with itself or any other such methods we add. This is key to the syntax we want to provide.
Although Python includes few facilities for controlling mutability, we ensure here that the methods we provide return new instances rather than mutating existing ones. This is not essential, but it is recommended as it improves predictability - the functions we provide will never alter an existing Query you might have a reference to.
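A short sketch of what that guarantee means in practice. The classes are repeated so the snippet runs on its own, and the string 'Order' stands in for the real class purely for brevity.

```python
class Predicate(object):
    def __init__(self, field, operator, value):
        self.field = field
        self.operator = operator
        self.value = value


class Query(object):
    def __init__(self, query_class, predicates=None):
        self.query_class = query_class
        self.predicates = predicates or []

    def where(self, *predicates):
        # Build a new list and a new Query; self is left untouched.
        return Query(self.query_class, self.predicates + list(predicates))


base = Query('Order')  # placeholder for the Order class
filtered = base.where(Predicate('value', 'gte', 500))
narrower = filtered.where(Predicate('item_count', 'eq', 2))

counts = (len(base.predicates), len(filtered.predicates), len(narrower.predicates))
```

Each call to where() adds to a copy, so base can safely be reused as the starting point for other queries.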
We'll also add a couple of operators to use as constants. See the accompanying source code for a more complete implementation (these are in the ql.op module):
    OP_EQ = 'eq'
    OP_GTE = 'gte'
So to construct a simplified version of our original query using this purely Python syntax we would do something like:
Query(Order).where(Predicate('value', OP_GTE, 500))
Acceptable, but you can see why our original syntax is preferred. Let's see how to achieve this.
In this simple case really all we need is that Order.value >= 500 returns a Predicate instance. This is simply achieved using operator overloading in combination with our checks for how the descriptor is being used:
    class ValueDescriptor(object):
        def __get__(self, obj, objtype=None):
            if obj is not None:
                return getattr(obj, '_value', None)
            else:
                # Descriptor is being accessed from class context
                return self

        def __set__(self, obj, value):
            setattr(obj, '_value', value)

        def __eq__(self, other):
            return Predicate('value', OP_EQ, other)

        def __ge__(self, other):
            # Lets Order.value >= 500 build a Predicate as well
            return Predicate('value', OP_GTE, other)
Note that we could return anything in __get__, it doesn't need to return the descriptor instance itself, however it's convenient to do so as it's a natural place to implement the methods for operator overloading.
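Putting the pieces together, the following self-contained sketch shows the class-level expression producing a Predicate while instance access keeps behaving like a plain attribute. It includes a __ge__ overload so that the >= form from our target syntax works; the trimmed-down Order class is an assumption made to keep the example short.

```python
OP_EQ = 'eq'
OP_GTE = 'gte'


class Predicate(object):
    def __init__(self, field, operator, value):
        self.field = field
        self.operator = operator
        self.value = value


class ValueDescriptor(object):
    def __get__(self, obj, objtype=None):
        if obj is None:
            # Accessed as Order.value: hand back the descriptor itself.
            return self
        return getattr(obj, '_value', None)

    def __set__(self, obj, value):
        setattr(obj, '_value', value)

    def __eq__(self, other):
        return Predicate('value', OP_EQ, other)

    def __ge__(self, other):
        return Predicate('value', OP_GTE, other)


class Order(object):
    value = ValueDescriptor()


predicate = Order.value >= 500   # class context: builds a Predicate

order = Order()
order.value = 750                # instance context: stores a number
stored = order.value
```

One caveat worth knowing: in Python 3, a class that defines __eq__ without also defining __hash__ becomes unhashable, so descriptors written this way cannot be used as dictionary keys or set members unless you add a __hash__ method.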
And that's the core of it. But let's take it a bit further: first we're going to do a bit of refactoring to make our descriptors reusable, and then we'll look at adding a few more useful features.
Nice and tidy
What's in the way of our descriptor being reusable? Mainly it's that it needs to know which attribute it's controlling, so that it can store it, so let's fix that. The easy way to do this is to set it in the constructor; this introduces some duplication, but let's keep it simple and look at that later. We'll also change our attribute storage to use the __dict__ attribute, which is where our attribute data would go if we weren't messing with it:
    class DataDescriptor(object):
        def __init__(self, name):
            self.name = name

        def __get__(self, obj, objtype=None):
            if obj is not None:
                return obj.__dict__[self.name]
            else:
                return self

        def __set__(self, obj, value):
            obj.__dict__[self.name] = value

        def __eq__(self, other):
            return Predicate(self.name, OP_EQ, other)

    class Order(object):
        item_count = DataDescriptor('item_count')
        value = DataDescriptor('value')
Great! But that duplication of the field name is ugly and error-prone. If you're using Python 3.6 or later the solution here is really easy: there's a method called __set_name__ that Python will call on a descriptor when the class it is assigned to is created, passing in the attribute name. If you have the option of using Python 3.6 or later I strongly recommend it; however, it's not always practical, so let's use an alternate method.
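For readers who can rely on Python 3.6 or later, the __set_name__ route might look like the sketch below. It is a minimal illustration rather than the code from the accompanying repository, and it reads with dict.get() so that an unset attribute returns None instead of raising.

```python
class DataDescriptor(object):
    def __set_name__(self, owner, name):
        # Called once by Python while the owning class is being created.
        self.name = name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return obj.__dict__.get(self.name)

    def __set__(self, obj, value):
        obj.__dict__[self.name] = value


class Order(object):
    # No need to repeat the attribute names here.
    item_count = DataDescriptor()
    value = DataDescriptor()


order = Order()
order.value = 5
names = (Order.item_count.name, Order.value.name)
```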
Ideally the name setting needs to happen when the class is created, not when instances of the class are created. This sounds suspiciously like customising class creation, which is in the realm of metaclasses. The full power of Python's metaclasses is beyond the scope of this article, but just tuck this thought into the back of your mind: if you want to customise the creation of a class, look into metaclasses.
Here's a snippet that will do what we want:
    import six

    class Queryable(type):
        def __new__(cls, name, bases, namespace, **kwargs):
            for k, v in namespace.items():
                if isinstance(v, DataDescriptor):
                    v.name = k
            result = type.__new__(cls, name, bases, dict(namespace))
            return result

    # Assumes DataDescriptor.__init__ now accepts name=None
    class Order(six.with_metaclass(Queryable)):
        item_count = DataDescriptor()
        value = DataDescriptor()
six is used here so the code works equally well with Python 2 or Python 3. Note also that Queryable is a subclass of type, rather than of object. I encourage you to read about Python's metaclasses to understand why.
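If Python 2 support is not needed, the metaclass can be attached with the metaclass keyword instead of six. The sketch below is self-contained and Python 3 only; it assumes a DataDescriptor whose name defaults to None, and leaves out the operator overloads for brevity.

```python
class DataDescriptor(object):
    def __init__(self, name=None):
        self.name = name  # filled in later by the metaclass

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return obj.__dict__.get(self.name)

    def __set__(self, obj, value):
        obj.__dict__[self.name] = value


class Queryable(type):
    def __new__(cls, name, bases, namespace, **kwargs):
        # Tell each descriptor which attribute name it was bound to.
        for key, attr in namespace.items():
            if isinstance(attr, DataDescriptor):
                attr.name = key
        return type.__new__(cls, name, bases, dict(namespace))


class Order(metaclass=Queryable):
    item_count = DataDescriptor()
    value = DataDescriptor()


order = Order()
order.item_count = 3
names = (Order.item_count.name, Order.value.name)
```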
Note that we could also do this using a class decorator or perhaps provide a registration API (both of which are perfectly valid), but metaclasses are a bit more powerful so their use is worth demonstrating.
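For comparison, a class decorator version could be sketched as follows. The queryable function is a hypothetical name, and the DataDescriptor is again a cut-down stand-in with a name that defaults to None.

```python
class DataDescriptor(object):
    def __init__(self, name=None):
        self.name = name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return obj.__dict__.get(self.name)

    def __set__(self, obj, value):
        obj.__dict__[self.name] = value


def queryable(cls):
    """Hypothetical decorator: name every DataDescriptor on the class."""
    for key, attr in vars(cls).items():
        if isinstance(attr, DataDescriptor):
            attr.name = key
    return cls


@queryable
class Order(object):
    item_count = DataDescriptor()
    value = DataDescriptor()


names = (Order.item_count.name, Order.value.name)
```

Unlike a metaclass, a decorator like this is not inherited, so each subclass that declares new descriptors would need to be decorated as well.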
The home stretch
Remember our friend, computed attributes? Let's take a look at that, add in a few small niceties, then stand back and admire our work. Here's the full query we want to handle:
    Query(Order).where(Order.value >= 500, Order.value / Order.item_count > 100)
Again, the key is operator overloading. First let's handle Order.value / Order.item_count by overloading the division operator to return an object we can work with. Division is a binary operation so we'll name our class accordingly (perhaps in future we'll want to handle unary operations like negation):
    class BinaryOperation(object):
        def __init__(self, left, op, right):
            self.left = left
            self.op = op
            self.right = right

    class DataDescriptor(object):
        def __init__(self, name):
            self.name = name

        def __get__(self, obj, objtype=None):
            if obj is not None:
                return obj.__dict__[self.name]
            else:
                return self

        def __set__(self, obj, value):
            obj.__dict__[self.name] = value

        def __eq__(self, other):
            return Predicate(self.name, OP_EQ, other)

        def __div__(self, other):
            # OP_DIV is another operator constant, like OP_EQ
            return BinaryOperation(self, OP_DIV, other)

        __truediv__ = __div__  # Division operator calls different methods in 2/3
And finally, BinaryOperation needs to learn comparison. For the sake of convenience we'll allow our Predicate type to hold BinaryOperations as well; in reality this might not be very convenient and you might prefer to create a better abstraction that covers attributes, operations, and even things like and and or, but this will do for our purposes:
    class BinaryOperation(object):
        def __init__(self, left, op, right):
            self.left = left
            self.op = op
            self.right = right

        def __gt__(self, other):
            return Predicate(self, OP_GT, other)
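To see the structure this produces, here is a self-contained sketch that evaluates the average item value expression, Order.value / Order.item_count > 100. The string values given to OP_DIV and OP_GT are assumptions, and the descriptor is reduced to the parts needed for this expression.

```python
OP_DIV = 'div'  # assumed constant values, for illustration only
OP_GT = 'gt'


class Predicate(object):
    def __init__(self, field, operator, value):
        self.field = field
        self.operator = operator
        self.value = value


class BinaryOperation(object):
    def __init__(self, left, op, right):
        self.left = left
        self.op = op
        self.right = right

    def __gt__(self, other):
        return Predicate(self, OP_GT, other)


class DataDescriptor(object):
    def __init__(self, name):
        self.name = name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return obj.__dict__.get(self.name)

    def __set__(self, obj, value):
        obj.__dict__[self.name] = value

    def __truediv__(self, other):
        return BinaryOperation(self, OP_DIV, other)

    __div__ = __truediv__  # Python 2 spelling of the same operator


class Order(object):
    value = DataDescriptor('value')
    item_count = DataDescriptor('item_count')


# Division binds tighter than comparison, so this is (value / item_count) > 100.
predicate = Order.value / Order.item_count > 100
operation = predicate.field
```

The result is a small tree: a Predicate whose field is a BinaryOperation, whose left and right sides are the two descriptors.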
So now we can run our query, and it creates a structure that we could use to build a query for our database. To illustrate the point in the attached repo we've implemented a method that translates your query into quasi-English (as well as filling in all those missing operators):
    >>> query = Query(Order).where(Order.item_count > 1, Order.item_count < 4)
    >>> print(query)
    Return rows from Order where item_count is greater than 1 and item_count is less than 4
Generating a database query is of course going to be a lot more complicated than this, but you can see how achievable it is once you've built the correct structures. This is all done in the __str__ methods of our classes; check the source code for the implementation.
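As a rough idea of how such __str__ methods might be written, consider the sketch below. It is not the implementation from the accompanying repository: the OP_WORDS table is an assumption, and the predicates are built by hand with plain field names to keep the example short.

```python
# Assumed mapping from operator constants to English phrases.
OP_WORDS = {
    'gt': 'is greater than',
    'lt': 'is less than',
}


class Predicate(object):
    def __init__(self, field, operator, value):
        self.field = field
        self.operator = operator
        self.value = value

    def __str__(self):
        return '%s %s %s' % (self.field, OP_WORDS[self.operator], self.value)


class Query(object):
    def __init__(self, query_class, predicates=None):
        self.query_class = query_class
        self.predicates = predicates or []

    def where(self, *predicates):
        return Query(self.query_class, self.predicates + list(predicates))

    def __str__(self):
        text = 'Return rows from %s' % self.query_class.__name__
        if self.predicates:
            text += ' where ' + ' and '.join(str(p) for p in self.predicates)
        return text


class Order(object):
    pass


query = Query(Order).where(
    Predicate('item_count', 'gt', 1),
    Predicate('item_count', 'lt', 4),
)
rendered = str(query)
```

A real query emitter would walk the same structure but produce SQL (with proper parameter binding) instead of English.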
And just for fun I've added a quick implementation of ordering. By now you'll see at a glance how this is done, but it's of course in the source code as well:
    >>> query = Query(Order).where(Order.item_count > 1, Order.item_count < 4)
    >>> query = query.order_by(Order.item_count.desc())
    >>> print(query)
    Return rows from Order where item_count is greater than 1 and item_count is less than 4, order by item_count descending
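One possible shape for the ordering support is sketched below. The Ordering class, the asc() method, and the orderings attribute are hypothetical names chosen for this illustration; the repository may structure things differently.

```python
class Ordering(object):
    """Hypothetical record of a sort key and its direction."""

    def __init__(self, field, descending=False):
        self.field = field
        self.descending = descending


class DataDescriptor(object):
    def __init__(self, name):
        self.name = name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return obj.__dict__.get(self.name)

    def __set__(self, obj, value):
        obj.__dict__[self.name] = value

    def asc(self):
        return Ordering(self.name)

    def desc(self):
        return Ordering(self.name, descending=True)


class Query(object):
    def __init__(self, query_class, predicates=None, orderings=None):
        self.query_class = query_class
        self.predicates = predicates or []
        self.orderings = orderings or []

    def where(self, *predicates):
        return Query(self.query_class,
                     self.predicates + list(predicates),
                     self.orderings)

    def order_by(self, *orderings):
        # Same pattern as where(): return a new Query, never mutate self.
        return Query(self.query_class,
                     self.predicates,
                     self.orderings + list(orderings))


class Order(object):
    item_count = DataDescriptor('item_count')


base = Query(Order)
ordered = base.order_by(Order.item_count.desc())
first = ordered.orderings[0]
```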
We hope you can see how powerful these techniques can be in combination - take, for example, PonyORM's absolutely fantastic use of generator expressions to mimic SQL:
select(c for c in Customer if sum(c.orders.price) > 1000)
or the way Plumbum overloads the pipe and greater than operators to create a shell-like syntax:
    >>> chain = ls["-a"] | grep["-v", "\\.py"] | wc["-l"]
    >>> print(chain)
    /bin/ls -a | /bin/grep -v '\.py' | /usr/bin/wc -l
    >>> (ls["-a"] > "file.list")()
Thanks for reading. You can find the source code here: https://bitbucket.org/hatchd/hatchd-blog-dsls