Once you have saved a row or two, you will probably want to recall them at some point. Table objects possess a handful of methods to make retrieving their data easy. Database objects have a couple more methods for retrieving data from multiple, joined Tables. But first, we need to talk about Geniusql's real genius: the use of Python expressions instead of SQL.
Almost all languages use expressions, and SQL is no exception. When you write an SQL statement, you use expressions in several places:
SELECT attribute-expression FROM join ON relation-expression WHERE restriction-expressionWhat Geniusql allows you to do is write all of those expressions in Python instead of SQL. It achieves this by providing an Expression class in the logic module.
There are several shortcuts built into Geniusql's API, but eventually, you're going to end up with a logic.Expression object. You can build an Expression directly, passing a single lambda as an argument:
>>> from geniusql import logic >>> import datetime >>> f = lambda x: x.Date >= datetime.date(2004, 3, 1) >>> e = logic.Expression(f) >>> e logic.Expression(lambda x: x.Date >= datetime.date(2004, 3, 1))Neat, eh? I worked hard on that __repr__. ;)
It may be obvious, but we'll be explicit, here. The lambda which you pass into an Expression must possess a positional argument, which will always be bound to a Table. In the example above, it's named 'x', but you can use any name you like. Using lambdas as a base means that we can simply call e(row object), and receive a boolean value indicating whether our row "passes the test". Attribute lookups on our 'x' object will apply to column values for that row object. That is, x.Date maps to row['Date'].
You can also do fancier things with Expressions (although the vast majority of the time, you won't need to in order to use Geniusql):
>>> logic.Expression(lambda x, y, z: "Dave" in x.Name and y.Age > 65)
logic.Expression(lambda x, y, z: ('Dave' in x.Name) and (y.Age > 65))
>>> logic.Expression(lambda *units, **kw: units and
... (units[0].Width > units[0].Height or
... units[0].Color in kw['Colors']))
logic.Expression(lambda *units, **kw: (units) and
((units[0].Width > units[0].Height) or
(units[0].Color in kw['Colors'])))
What is not obvious from the above code snippet is perhaps the most important aspect of Expressions: any globals or cell references (from closures) in the supplied lambda get bound early. Compare the following disassemblies:
>>> import dis
>>> dis.dis(f)
1 0 LOAD_FAST 0 (x)
3 LOAD_ATTR 1 (Date)
6 LOAD_GLOBAL 2 (datetime)
9 LOAD_ATTR 3 (date)
12 LOAD_CONST 1 (2004)
15 LOAD_CONST 2 (3)
18 LOAD_CONST 3 (1)
21 CALL_FUNCTION 3
24 COMPARE_OP 5 (>=)
27 RETURN_VALUE
>>> dis.dis(e.func)
1 0 LOAD_FAST 0 (x)
3 LOAD_ATTR 1 (Date)
6 LOAD_CONST 6 (datetime.date(2004, 3, 1))
9 COMPARE_OP 5 (>=)
12 RETURN_VALUE
As you can see, the function itself references the global 'datetime' module.
Once we wrap it in the Expression, however, it becomes a constant! Thanks to
Raymond Hettinger for inspiring this solution [1].
Early binding, however, implies two consequences:
First, any globals or cell references must be present in the lambda's scope when it is passed into Expression(). This is the norm and shouldn't require too much thought from you when you write Expressions. In the example above, we simply imported datetime as you would expect.
Second, any globals or cell references must also be present in the logic module's environment when the Expression is unpickled. Pickling occurs when Expressions are sent over sockets, and also when Expressions are themselves persisted to storage. This means your application must add any globals into the logic.builtins dict. Note that the logic module already tries to import datetime, fixedpoint and decimal.
Geniusql provides additional functions which can be used in Expressions. For example, you can construct an Expression like:
logic.Expression(lambda x: x.Size < 3 and x.Date > today())In this example, the today() function breaks convention and is actually bound late. That is, if you construct this Expression now and use it six months later, the value of today() will change. Expressions "know about" these geniusql functions, and can use them to build more appropriate queries. Here are the functions supplied by the geniusql.logicfuncs module:
| Function | Late bound? | Description |
|---|---|---|
| icontains(a, b) | Case-insensitive test b in a. Note the operand order. | |
| icontainedby(a, b) | Case-insensitive test a in b. Note the operand order. | |
| istartswith(a, b) | True if a starts with b (case-insensitive), False otherwise. | |
| iendswith(a, b) | True if a ends with b (case-insensitive), False otherwise. | |
| ieq(a, b) | True if a == b (case-insensitive), False otherwise. | |
| year(value) | The year attribute of a date. If value is None, return None. | |
| month(value) | The month attribute of a date. If value is None, return None. | |
| day(value) | The day attribute of a date. If value is None, return None. | |
| now() | Y | datetime.datetime.now() |
| utcnow() | Y | datetime.datetime.utcnow() |
| today() | Y | datetime.date.today() |
| iscurrentweek(value) | Y | If value is in the current week, return True, else False. |
It is possible for you, the application developer, to define your own external functions by adding them to the builtins dict of the logic module. For example, logic.builtins['odd'] = lambda x: (x.num % 2) == 1. However, because the builtin providers are unaware of your new functions, they will not be able to optimize their use; instead, they will simply retrieve a larger set of rows from storage, evaluate each one against the function you provide, and return those rows which match your function. This isn't necessarily a bad thing; it provides the same functionality as if you wrote the test inline within your own code. By making that test a logic function, you allow it to be shared across your codebase. You may, of course, create your own provider which understands your external function (and can translate its logic into SQL), and thereby achieve end-to-end functionality.
The logic module also provides convenient methods to create common types of Expression objects via the filter and comparison factory functions.
The filter(**kwargs) function produces an Expression by taking the keyword arguments you supply, and rewriting them in lambda form. The only operator allowed is therefore the equals '==' operator. For example:
>>> logic.filter(Type='Cat', Mutation='Atomic') logic.Expression(lambda x: (x.Type == 'Cat') and (x.Mutation == 'Atomic'))
The comparison(attr, cmp_op, criteria) function allows you to form Expressions with dynamic operators. This can come in handy when you are constructing Expressions on the fly from user input. For example, a search page might prompt users for an attribute name, an operator, and an operand (the criteria).
Borrowing from opcode.cmp_op, the allowed values for our cmp_op argument are as follows:
| Numeric Value (cmp_op) | Operator |
|---|---|
| 0 | < |
| 1 | <= |
| 2 | == |
| 3 | != |
| 4 | > |
| 5 | >= |
| 6 | in |
| 7 | not in |
| Most providers only support the following with None: | |
| 8 | is |
| 9 | is not |
Here's an example of using comparison:
>>> logic.comparison('Name', 3, 'Mr. Kamikaze')
logic.Expression(lambda x: x.Name != 'Mr. Kamikaze')
Although the comparison function only allows a single comparison at a time,
the resulting Expressions can be combined with the & and |
operators (see next) to produce more complex Expressions.
Expressions are combinable; by using the & operator, the two expressions are combined with an adjoining logical "and". For example:
>>> a = logic.Expression(lambda x: x.Size > 3) >>> b = logic.Expression(lambda x: x.Size <= 15) >>> c = a & b >>> c logic.Expression(lambda x: (x.Size > 3) and (x.Size <= 15))The + operator works just like the & operator. The | operator combines the two Expressions with a logical 'or'.
When you combine two Expressions with dissimilar argument lists, what happens? The Expression class doesn't really care what the argument names are, just their order, so the names might not come out as you might expect; however, the logic is preserved:
>>> f = logic.filter(Name='Bruce')
>>> f
logic.Expression(lambda x: x.Name == 'Bruce')
>>> g = logic.Expression(lambda a, b, **kw: a.Name + b.Surname == kw['Full Name'])
>>>
>>> f + g
logic.Expression(lambda x, b, **kw: (x.Name == 'Bruce')
and (x.Name + b.Surname == kw['Full Name']))
>>> g + f
logic.Expression(lambda a, b, **kw: (a.Name + b.Surname == kw['Full Name'])
and (a.Name == 'Bruce'))
Up to now, we've constructed Expression objects with a single argument, the function which we're going to wrap. But Expression objects may take a second argument, called "kwtypes". This argument must be a dictionary of {keyword: type} pairs. Geniusql doesn't do anything internally with this information; it's simply a standard place to keep it for use by your own applications. However, the kwtypes attribute will be persisted when pickling and unpickling Expression objects, a very common operation.
The logic module (and codewalk, on which it is built) isn't limited to Geniusql. Feel free to use it in some other framework or script! The only change you may have to make (if you relocate the module outside of the geniusql package) would be to the single line: from geniusql import codewalk, to point to the new location.
In particular, logic.Expression objects can operate on any Python objects, not just geniusql data. If you wish to provide additional logic functions (as geniusql does), simply add them to logic.builtins.
You may also find the underlying codewalk module useful for other purposes on its own. The Visitor base class can be very convenient for building bytecode hacks.
To make a long story short, Geniusql depends on logic throughout, but the reverse is not true.
Now that we've discussed Expression objects in all their complexity, it's time to make things easy again.
Most functions that take Expression arguments are equally happy with a lambda and/or a set of keyword arguments, which will then be combined into an Expression for you. The following three examples are equivalent and all result in the same output:
>>> data = Book.select_all(Year=1928) >>> data = Book.select_all(lambda x: x.Year == 1928) >>> data = Book.select_all(logic.Expression(lambda x: x.Year == 1928)) >>> [row['Title'] for row in data] [u'The Giant Horse of Oz', u'Kai Lung Unrolls His Mat', u'Tarzan, The Lord of the Jungle']
First, the Table class has a simple method for retrieving a single row: the select(restriction=None, **kwargs) method. If a row which matches the given restriction-plus-kwargs can be found, the first such match is returned (as a dict). If no match is found, this method returns None.
The select_all(restriction=None, **kwargs) method works just like select, but returns multiple rows (as a list of dicts). If you do not supply a restriction Expression or any keyword args, all rows of the given Table will be retrieved. If no matches are found, an empty list is returned.
The select(query, order=None, distinct=False, limit=None, strict=True) method is a lot more powerful than the Table methods. Instead of a single restriction expression, therefore, it takes a Query object. In essence, a Query is an object which holds all three SELECT expressions: the relation, the attributes, and the restriction. We'll talk about Query objects next.
The db.select method doesn't return rows directly; instead, it returns a Dataset object. If you just want to iterate over the matching rows, do so directly:
attrs = Twits.keys()
for row in db.select((Twits, attrs, lambda t: t.class == 'Upper')):
row = dict(zip(attrs, row))
print row['Name']
Note that the Dataset yields lists, not dicts. It's up to you to zip the
attribute names you passed in with the returned values (if needed).
The Dataset object has additional attributes which allow nearly-complete knowledge about the SELECT; you can inspect input and output columns, as well as see any aliases used in the SQL.
If the 'strict' argument to db.select is False, the rows returned may be a superset of matching rows; you must filter them again, usually by forming a 'row object' and passing it through the restriction function. At first glance, this looks dangerous and not very useful, but it's actually a very important tool. It allows restriction expressions to contain logic which is not expressible in SQL (due to provider limitations), but which is expressible in Python. If the SELECT writer cannot perfectly translate a given subexpression from Python to SQL, it merely replaces the subexpression with TRUE, returns a larger set of rows, and asks you to use Python to do the final filtering. It does this by setting the flag Dataset.selector.imperfect = True. In this way, we can sometimes trade a bit of speed for correctness/robustness.
If the 'strict' argument is True (the default), any Python which cannot be perfectly converted to SQL will raise a ValueError.
Database objects can also take a SELECT and create a new table from the output (often called materialized views). The Database.insert_into(name, query, distinct=False, limit=None) method inserts data into a new table; the 'name' argument will be used to name the new table in the database. A new geniusql.Table object will be returned, and will also be added to the schema; the key will be the provided name. If you want to use a different key for the Table, call Schema.alias(name, newkey) after this method.
Query objects have the signature Query(relation, attributes, restriction=None). As a first argument, you pass it the Table (not the name of the Table, but the actual table object) for which you expect to retrieve data. If you want to retrieve data from multiple Tables in a join, then pass a Join object instead (see below).
The second argument is a set of attributes. This can take a couple of forms:
[(table1.col1.name, table1.col2.name, ...), (table2.col1.name, ...), ...]The order of sequences must match the order of tables given in the relation (and therefore the restriction args, if applicable).
The third argument should be an expression which encapsulates your restriction. This will be used to construct a WHERE clause. The args must be in the same order as the tables in the relation.
In the near future, we expect to allow callers to pass generator expressions which can then be transformed into Query objects behind the scenes. For example:
db.select((t1.a, t1.b + t2.a for t1, t2 in relation if t1.a > 13))Until then, you generally pass a tuple of (relation, attribute, restriction) values to any function that has a 'query' argument, and Geniusql forms a Query object for you from them.
In addition to providing a single Table to select, you have the option of providing a tree of Tables, a nested set of Join(table1, table2, leftbiased=None) instances.
The "leftbiased" argument specifies how the results will be joined:
| leftbiased | Join Type | Description | Operator |
|---|---|---|---|
| None | Inner Join | All related pairs of both tables will be returned. | & or + |
| True | Left Join | All related pairs of both tables will be returned. In addition, if any row in table1 has no match in table2, we return a single row from table1 and a "null row" for table2 (a row, all of whose values are None). | << |
| False | Right Join | All related pairs of both tables will be returned. In addition, if any row in table2 has no match in table1, we return a single "null row" for table1 (a row, all of whose properties are None) and the row from table2. | >> |
Look hard? Fear not. There's a much easier way to join tables than writing a big tree of Joins. Use the &, <<, and >> operators directly with Tables:
tree = (Book << Publisher) & Author
This example will automatically produce a Join tree for you, with Book 'left joined' to Publisher, and then 'inner joined' to Author.
When you provide multiple tables, the select method returns a list of rows. Each row will be a list of values, one per attribute in the attributes arg. The restriction arg should be a logic.Expression which can evaluate all of the units in any given row at once.
for row in db.select((Publisher & Book, [Publisher.keys(), Book.keys()],
lambda p, b: p.ID == 4)):
This example will retrieve a series of rows; all Publisher values
first, followed by all Book values. Note that all three constructs
(the Join, the lambda arguments, and the resulting rows) have the same
tables listed in order from left to right. The default relationship
between Publisher & Book is assumed to have been specified by a
reference).
e = logic.Expression(lambda x, y: x.Size > 3 and y.Age == x.Age) j = Donkey & Horse...which Table maps to the 'x' argument, and which to the 'y' argument?
Query objects (and most other parts of Geniusql) rely on a simple convention for matching tables to the argument list of Expressions: use the same order of evaluation as the relation (join). In the example above, the Donkey table is evaluated first, so it will map to 'x', and the Horse table will map to 'y'. When you have a single Table, of course, this is trivial, and the vast majority of remaining cases will be of the form: A & B or A << B. In the rare cases where you have crazy n-way joins, use parentheses liberally to help make the argument order (and even the join order) clear to yourself: ((A << B) & (C & D)) >> E.
Usually, your code will call transaction methods on the db.connections object. When you call connections.start(isolation=None), you are telling Geniusql to begin a transaction.
The isolation argument to the start method is very important. It determines the "isolation level" of the transaction; that is, the degree to which the current transaction can see changes made to a concurrent transaction. Use the connections.isolation_levels list to learn which isolation levels are available for your current provider. The ANSI/ISO SQL92 standard defines four isolation levels, based on three phenomena:
| Level | |||||
|---|---|---|---|---|---|
| Read Uncommitted | Read Committed | Repeatable Read | Serializable | ||
| P h e n o m e n a |
Dirty Read | Possible | Not possible | Not possible | Not possible |
| Fuzzy Read | Possible | Possible | Not possible | Not possible | |
| Phantom | Possible | Possible | Possible | Not possible | |
A "dirty read" occurs when TX 1 writes a value and TX 2 is able to read the change before TX 1 commits.
A "fuzzy read" (or "nonrepeatable read" occurs when TX 1 reads a value, TX 2 changes that value and commits, and TX 1 obtains the new value when it re-reads.
A "phantom" occurs when TX 1 reads a set, TX 2 adds to the set, and TX 1 obtains the new rows when it re-reads.
Geniusql supports a variety of stores, and not every store supports every isolation level. See the comparison chart for details. Some stores "prevent" the above phenomena in different ways. In some cases, the phenomena is simply not allowed to exist. In other cases, the phenomena raises an error immediately. In still other cases, the phenomena is prevented by waiting (up to a timeout) until one of the offending transactions completes.
Once you have finished executing statements, you should call connections.commit(). Alternately, you may call connections.rollback() if you need to drop your changes.
If you're using a store that supports implicit transactions (also sometimes called "autocommit"), you can skip calling start by setting the connections.implicit_trans attribute to True (it's False by default). This can be done in code or config. See the comparison chart for details.
[1] Python Cookbook,
Binding Constants at compile time