Five Common Django Mistakes
Table of Contents
Django is a fantastic framework for building web applications and takes a batteries included approach. With this approach, comes complexity. When you’re starting out with Django, you can introduce subtle bugs due to lack of knowledge. I wrote this post partially for myself to reference in my professional development, but also to help illuminate some mistakes that I see. As with anything, there are tradeoffs with this approach. Django contains a lot of great code that you would have to write for every web application and reinvent the wheel, but when you don’t fully understand the framework, you can write code with unintended effects. For this post, we’ll develop an example Django application that handles employee management for various organizations.
from django.contrib.auth import get_user_model from django.core.exceptions import ValidationError from django.db import models User = get_user_model() class Organization(models.Model): name = models.CharField(max_length=100) datetime_created = models.DateTimeField(auto_now_add=True, editable=False) is_active = models.BooleanField(default=True) class Employee(models.Model): user = models.ForeignKey( User, on_delete=models.CASCADE, related_name="employees" ) organization = models.ForeignKey( Organization, on_delete=models.CASCADE, related_name="employees" ) is_currently_employed = models.BooleanField(default=True) reference_id = models.CharField(null=True, blank=True, max_length=255) last_clock_in = models.DateTimeField(null=True, blank=True) datetime_created = models.DateTimeField(auto_now_add=True, editable=False) def clean(self): try: if self.last_clock_in < self.datetime_created: raise ValidationError( "Last clock in must occur after the employee entered" " the system." ) except TypeError: # Raises TypeError if there is no last_clock_in because # you cant compare None to datetime pass
Not using select_related and prefetch_related
Let’s say we write some code that loops over every active organization’s employees.
for org in Organization.objects.filter(is_active=True): for emp in org.employees.all(): if emp.is_currently_employed: do_something(org, emp)
This loop results in a query for every employee in our database. This could lead to tens of thousands of queries, which
slows down our application. However, if we add a
prefetch_related to the organization query, we minimize the amount
for org in Organization.objects.filter(is_active=True).prefetch_related( "employees" ):
Adding these methods results in large performance improvements without much work, but adding them can be easy to forget.
select_related. For a reverse
ForeignKey or a
ManyToManyField, use prefetch_related.
We could make this more efficient by starting at the employee table and using the database to filter out results. We still
need to add
select_related since the function
do_something uses the employee’s organization. If we don’t, the loop
can result in thousands of queries to the organization table.
for emp in Employee.objects.filter( organization__is_active=True, is_currently_employed=True ).select_related("organization"): do_something(emp.organization, emp)
Adding null to a CharField or TextField
Django’s documentation recommends not adding
null=True to a
CharField. Looking at our example code, the employee’s reference id contains
null=True. In our
example application, we optionally integrate with our customer’s employee tracking system, and we use
the integrated system’s id.
reference_id = models.CharField(null=True, blank=True, max_length=255)
null=True means the field has two “no data” values, null and the empty string. Conventionally, Django uses the
empty string to represent no data. By also having null as a “no data” value, we can introduce subtle bugs. Let’s say we
need to write some code to fetch data from our customer’s system.
if employee.reference_id is not None: fetch_employee_record(employee)
Ideally, you write the if statement to handle any “no data” values by using
if employee.reference_id:, but I’ve
found this does not happen in practice. Since a
reference_id could be either null or the empty string, we’ve created
a bug here where the system attempts to fetch the employee record if the
reference_id is the empty string. Clearly,
this doesn’t work and will cause an error in our system. According to Django’s documentation, one exception exists for
null=True to a
CharField. When you need to add both
unique=True to a
null=True is required.
Descending versus ascending with order_by or latest
Django’s order_by defaults to ascending order.
By adding a
- in front of your keywords, you tell Django to provide descending order. Let’s look at an example.
oldest_organization_first = Organization.objects.order_by("datetime_created") newest_organization_first = Organization.objects.order_by("-datetime_created")
With the minus in front of
datetime_created, Django gives us the newest organization first. Conversely without the minus,
we get the oldest organization first. Mistaking the default ascending order can result in very subtle bugs. Django
querysets also come with latest, which gives us
the latest object in a table based on the passed keyword fields. The
latest method defaults to descending order as
order_by which defaults to ascending order.
oldest_organization_first = Organization.objects.latest("-datetime_created") newest_organization_first = Organization.objects.latest("datetime_created")
In multiple projects, I have introduced bugs because of the different defaults between
latest queries with caution. Let’s look at equivalent queries using
>>> oldest_org = Organization.objects.order_by("datetime_created")[:1] >>> oldest_other_org = Organization.objects.latest("-datetime_created") >>> oldest_org == oldest_other_org True >>> newest_org = Organization.objects.order_by("-datetime_created")[:1] >>> newest_other_org = Organization.objects.latest("datetime_created") >>> newest_org == newest_other_org True
Forgetting that clean is not called on save
According to Django’s documentation,
model validation methods, such as
clean_fields, do not get called automatically by a
save method. In our example code, the employee model contains a clean method that says the
shouldn’t occur before the employee entered the system.
def clean(self): try: if self.last_clock_in < self.datetime_created: raise ValidationError( "Last clock in must occur after the employee entered" " the system." ) except TypeError: # Raises TypeError if there is no last_clock_in because # you cant compare None to datetime pass
Let’s say we have a view that updates the
last_clock_in time for our employee, and as a part of that view, we update
the employee by calling
from django.http import HttpResponse from django.shortcuts import get_object_or_404 from django.views.decorators.http import require_http_methods from example_project.helpers import parse_request from example_project.models import Employee @require_http_methods(["POST"]) def update_employee_last_clock_in(request, employee_pk): clock_in_datetime = parse_request(request) employee = get_object_or_404(Employee, pk=employee_pk) employee.last_clock_in = clock_in_datetime employee.save() return HttpResponse(status=200)
In our example view, we call
save without calling
full_clean, meaning that the
into our view could occur before the employee’s
datetime_created and still be saved into the database. This results in
invalid data entering our database. Let’s fix our bug.
employee.last_clock_in = clock_in_datetime employee.full_clean() employee.save()
Now if the
clock_in_datetime comes before the employee’s
full_clean raises a
which prevents the invalid data from entering our database.
Not including update_fields on save
Django Model’s save method includes a
keyword argument called update_fields.
In a typical production set up for Django, people use gunicorn to run multiple
Django server processes on the same machine and use celery to run background
processes. When you call
update_fields, the entire model updates with the values in memory. Let’s look
at the actual SQL to illustrate.
>>> user = User.objects.get(id=1) >>> user.first_name = "Steven" >>> user.save() UPDATE "users_user" SET "password" = 'some_hash', "last_login" = '2021-02-25T22:43:41.033881+00:00'::timestamptz, "is_superuser" = false, "username" = 'stevenapate', "first_name" = 'Steven', "last_name" = '', "email" = 'firstname.lastname@example.org', "is_staff" = false, "is_active" = true, "date_joined" = '2021-02-19T21:08:50.885795+00:00'::timestamptz, WHERE "users_user"."id" = 1 >>> user.first_name = "NotSteven" >>> user.save(update_fields=["first_name"]) UPDATE "users_user" SET "first_name" = 'NotSteven' WHERE "users_user"."id" = 1
The first call to
update_fields results in every field on the user model being saved. With
first_name updates. In a production environment with frequent writes, calling
update_fields can result in a race condition. Let’s say we have two processes running, a gunicorn worker running our
Django server and a celery worker. On a set schedule, the celery worker queries an external API and potentially updates
from celery import task from django.contrib.auth import get_user_model from example_project.external_api import get_user_status User = get_user_model() @task def update_user_status(user_pk): user = User.objects.get(pk=user_pk) user_status = get_user_status(user) if user_status == "inactive": user.is_active = False user.save()
The celery worker starts the task, loads the entire user object into memory, and queries the external API,
but the external API takes longer than expected. While the celery worker waits for the external API, the same user
connects to our gunicorn worker and submits an update to their email, changing it from email@example.com to
firstname.lastname@example.org. After the email update commits to the database, the external API responds, and the celery
worker updates the user’s
In this case, the celery worker overwrites the email update because the worker loaded the entire user object into memory
before the email update committed. When the celery worker loaded the user into memory, the user’s email is
email@example.com. This email sits in memory until the external API responds and overwrites the email update. In the end,
the row representing the user inside the database contains the old email address, firstname.lastname@example.org, and
is_active=False. Let’s change the code to prevent this race condition.
if user_status == "inactive": user.is_active = False user.save(update_fields=["is_active"])
If the previous scenario occurred with the updated code, the user’s email remains email@example.com after the
celery worker updates
is_active because the update only writes the
is_active field. Only on rare occasions, such as
creating a new object, should you call
update_fields. For more reading on this topic, I recommend
this ticket on the Django bug tracker and
this pull request for Django REST Framework.
While you can solve this problem in your code base by not calling a plain
save method, third party Django packages might
contain this issue. For instance, Django REST Framework
does not use
update_fields on PATCH requests. Django REST Framework is a fantastic package that I love to use, but it
does not handle this issue. Keep this in mind when adding third party packages to your Django project.
I’ve made all of these mistakes, some multiple times. I hope this post sheds light on potential bugs and helps you prevent these mistakes. I enjoy using Django, and I think it’s a great framework for building web applications. However, with any big framework, the complexity can obfuscate, and until you know what to look out for, mistakes can be made.