Using Markdown in Django
Originally published at https://hakibenita.com on February 11, 2020.
Read this article on my blog with proper syntax highlighting.
As developers, we rely on static analysis tools to check, lint and transform our code. We use these tools to help us be more productive and produce better code. However, when we write content using markdown the tools at our disposal are scarce.
In this article we describe how we developed a Markdown extension to address challenges in managing content using Markdown in Django sites.
Like every website, we have different types of (mostly) static content in places like our home page, FAQ section and “About” page. For a very long time, we managed all of this content directly in Django templates.
When we finally decided it was time to move this content out of templates and into the database, we thought it best to use Markdown. It’s safer to produce HTML from Markdown, it provides a certain level of control and uniformity, and it is easier for non-technical users to handle. As we progressed with the move, we noticed we were missing a few things:
Internal Links
Links to internal pages can get broken when the URL changes. In Django templates and views we use reverse and {% url %}, but this is not available in plain Markdown.
Copy Between Environments
Absolute internal links cannot be copied between environments. This can be resolved using relative links, but there is no way to enforce this out of the box.
Invalid Links
Invalid links can harm user experience and cause the user to question the reliability of the entire content. This is not something that is unique to Markdown, but HTML templates are maintained by developers who know a thing or two about URLs. Markdown documents on the other hand, are intended for non-technical writers.
When I was researching this issue, I searched for Python linters, Markdown preprocessors and extensions to help produce better Markdown. I found very few results. One approach that stood out was to use Django templates to produce Markdown documents.
Preprocess Markdown using Django Template
Using Django templates, you can use template tags such as {% url %} to reverse URL names, as well as conditions, variables, date formats and all the other Django template features. This approach essentially uses the Django template engine as a preprocessor for Markdown documents.
I personally felt like this may not be the best solution for non-technical writers. In addition, I was worried that providing access to Django template tags might be dangerous.
With a better understanding of the problem, we were ready to dig a bit deeper into Markdown in Python.
To start using Markdown in Python, install the package:
$ pip install markdown
Collecting markdown
Installing collected packages: markdown
Successfully installed markdown-3.2.1
Next, create a Markdown object and use the function convert to turn some Markdown into HTML:
>>> import markdown
>>> md = markdown.Markdown()
>>> md.convert("My name is **Haki**")
<p>My name is <strong>Haki</strong></p>
You can now use this HTML snippet in your template.
The basic Markdown processor provides the essentials for producing HTML content. For more “exotic” options, the Python markdown package includes some built-in extensions. A popular extension is the "extra" extension that adds, among other things, support for fenced code blocks:
>>> import markdown
>>> md = markdown.Markdown(extensions=['extra'])
>>> md.convert("""```python
... print('this is Python code!')
... ```""")
<pre><code class="python">print(\'this is Python code!\')\n</code></pre>
To extend Markdown with our unique Django capabilities, we are going to develop an extension of our own.
If you look at the source, you’ll see that to convert Markdown to HTML, Markdown uses different processors. One type of processor is an inline processor. Inline processors match specific inline patterns such as links, backticks, bold text and underlined text, and convert them to HTML.
The main purpose of our Markdown extension is to validate and transform links. So, the inline processor we are most interested in is the LinkInlineProcessor. This processor takes markdown in the form of [Haki's website](https://hakibenita.com), parses it and returns a tuple containing the link and the text.
To extend the functionality, we extend LinkInlineProcessor and create a markdown.Extension that uses it to handle links:
import markdown
from markdown.inlinepatterns import LinkInlineProcessor, LINK_RE

def get_site_domain() -> str:
    # TODO: Get your site domain here
    return 'example.com'

def clean_link(href: str, site_domain: str) -> str:
    # TODO: This is where the magic happens!
    return href

class DjangoLinkInlineProcessor(LinkInlineProcessor):
    def getLink(self, data, index):
        href, title, index, handled = super().getLink(data, index)
        site_domain = get_site_domain()
        href = clean_link(href, site_domain)
        return href, title, index, handled

class DjangoUrlExtension(markdown.Extension):
    def extendMarkdown(self, md, *args, **kwargs):
        # Registering under the name 'link' replaces the built-in link processor.
        md.inlinePatterns.register(DjangoLinkInlineProcessor(LINK_RE, md), 'link', 160)
Let’s break it down:
- The extension DjangoUrlExtension registers an inline link processor called DjangoLinkInlineProcessor. This processor will replace any other existing link processor.
- The inline processor DjangoLinkInlineProcessor extends the built-in LinkInlineProcessor, and calls the function clean_link on every link it processes.
- The function clean_link receives a link and a domain, and returns a transformed link. This is where we are going to plug in our implementation.
How to get the site domain
To identify links to your own site you must know the domain of your site. If you are using Django’s sites framework you can use it to get the current domain.
I did not include this in my implementation because we don’t use the sites framework. Instead, we set a variable in Django settings.
Another way to get the current domain is from an HttpRequest object. If content is only edited in your own site, you can try to plug the site domain from the request object. This may require some changes to the implementation.
To use the extension, add it when you initialize a new Markdown instance:
>>> md = markdown.Markdown(extensions=[DjangoUrlExtension()])
>>> md.convert("[haki's site](https://hakibenita.com)")
<p><a href="https://hakibenita.com">haki\'s site</a></p>
Great, the extension is being used and we are ready for the interesting part!
Now that we got the extension to call clean_link on all links, we can implement our validation and transformation logic.
To get the ball rolling, we’ll start with a simple validation. mailto links are useful for opening the user's email client with a predefined recipient address, subject and even message body.
A common mailto link can look like this:
<a href="mailto:support@service.com?subject=I need help!">Help!</a>
This link will open your email client set to compose a new email to “support@service.com” with subject line “I need help!”.
mailto links do not have to include an email address. If you look at the "share" buttons at the bottom of this article, you'll find a mailto link that looks like this:
<a
href="mailto:?subject=Django Markdown by Haki Benita&body=http://hakibenita.com/django-markdown"
title="Email">
Share via Email
</a>
This mailto link does not include a recipient, just a subject line and message body.
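To make the anatomy of these links concrete, here is a small sketch that pulls a mailto link apart using only Python's standard urllib.parse module. The helper name split_mailto is hypothetical and is not part of the extension; it only illustrates which part is the recipient and which parts are extra fields.

```python
from typing import Dict, Tuple
from urllib.parse import parse_qs, urlsplit


def split_mailto(href: str) -> Tuple[str, Dict[str, str]]:
    """Return the recipient and the extra fields of a mailto link."""
    parts = urlsplit(href)
    if parts.scheme != 'mailto':
        raise ValueError(f'Not a mailto link: {href!r}')
    # Everything after "?" is a regular query string (subject, body, ...).
    fields = {name: values[0] for name, values in parse_qs(parts.query).items()}
    # For mailto links the recipient ends up in the "path" component.
    return parts.path, fields


recipient, fields = split_mailto('mailto:support@service.com?subject=I need help!')
print(recipient)  # support@service.com
print(fields)     # {'subject': 'I need help!'}

# A link without a recipient yields an empty string.
recipient, fields = split_mailto('mailto:?subject=Hello')
print(repr(recipient))  # ''
```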
Now that we have a good understanding of what mailto links look like, we can add the first validation to the clean_link function:
import re
from typing import Optional

from django.core.exceptions import ValidationError
from django.core.validators import EmailValidator

class Error(Exception):
    pass

class InvalidMarkdown(Error):
    def __init__(self, error: str, value: Optional[str] = None) -> None:
        self.error = error
        self.value = value

    def __str__(self) -> str:
        if self.value is None:
            return self.error
        return f'{self.error} "{self.value}"'

def clean_link(href: str, site_domain: str) -> str:
    if href.startswith('mailto:'):
        # Everything between "mailto:" and the first "?" is the recipient.
        email_match = re.match(r'^(mailto:)?([^?]*)', href)
        if not email_match:
            raise InvalidMarkdown('Invalid mailto link', value=href)
        email = email_match.group(2)
        if email:
            try:
                EmailValidator()(email)
            except ValidationError:
                raise InvalidMarkdown('Invalid email address', value=email)
        return href
    # More validations to come...
    return href
To validate a mailto link we added the following code to clean_link:
- Check if the link starts with mailto: to identify relevant links.
- Split the link to its components using a regular expression.
- Yank the actual email address from the mailto link, and validate it using Django's EmailValidator.
Notice that we also added a new type of exception called InvalidMarkdown. We defined our own custom Exception type to distinguish it from other errors raised by markdown itself.
Custom error class: I wrote about custom error classes in the past, why they are useful and when you should use them.
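One practical benefit of a dedicated exception type is that calling code can catch it without swallowing unrelated errors. The sketch below is only an illustration of that idea: check_content and fake_convert are hypothetical names, and fake_convert stands in for md.convert so the example runs without Django or the markdown package installed.

```python
from typing import Callable, Optional


class InvalidMarkdown(Exception):
    """Same shape as the InvalidMarkdown exception defined above."""

    def __init__(self, error: str, value: Optional[str] = None) -> None:
        super().__init__(error, value)
        self.error = error
        self.value = value

    def __str__(self) -> str:
        if self.value is None:
            return self.error
        return f'{self.error} "{self.value}"'


def check_content(convert: Callable[[str], str], text: str) -> Optional[str]:
    """Return a message for the writer, or None if the content is valid."""
    try:
        convert(text)
    except InvalidMarkdown as exc:
        # Only our own validation errors are reported; anything else propagates.
        return str(exc)
    return None


def fake_convert(text: str) -> str:
    # Hypothetical stand-in for md.convert that rejects one known-bad address.
    if 'mailto:invalidemail' in text:
        raise InvalidMarkdown('Invalid email address', value='invalidemail')
    return f'<p>{text}</p>'


print(check_content(fake_convert, '[Help](mailto:invalidemail)'))
# Invalid email address "invalidemail"
print(check_content(fake_convert, 'No links here'))
# None
```

A form or admin screen could use a helper of this kind to show the message next to the content field instead of failing with a server error.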
Before we move on, let’s add some tests and see this in action:
>>> md = markdown.Markdown(extensions=[DjangoUrlExtension()])
>>> md.convert("[Help](mailto:support@service.com?subject=I need help!)")
'<p><a href="mailto:support@service.com?subject=I need help!">Help</a></p>'
>>> md.convert("[Help](mailto:?subject=I need help!)")
<p><a href="mailto:?subject=I need help!">Help</a></p>
>>> md.convert("[Help](mailto:invalidemail?subject=I need help!)")
InvalidMarkdown: Invalid email address "invalidemail"
Great! Worked as expected.
Now that we got our toes wet with mailto links, we can handle other types of links:
External Links
- Links outside our Django app.
- Must contain a scheme: either http or https.
- Ideally, we also want to make sure these links are not broken, but we won’t do that now.
Internal Links
- Links to pages inside our Django app.
- Link must be relative: this will allow us to move content between environments.
- Use Django’s URL names instead of a URL path: this will allow us to safely move views around without worrying about broken links in markdown content.
- Links may contain query parameters (?) and a fragment (#).
SEO: From an SEO standpoint, public URLs should not change. When they do, you should handle it properly with redirects, otherwise you might get penalized by search engines.
With this list of requirements we can start working.
To link to internal pages we want writers to provide a URL name, not a URL path. For example, say we have this view:
from django.urls import path
from app.views import home
urlpatterns = [
    path('', home, name='home'),
]
The URL path to this page is https://example.com/, the URL name is home. We want to use the URL name home in our markdown links, like this:
Go back to [homepage](home)
This should render to:
<p>Go back to <a href="/">homepage</a></p>
We also want to support query params and hash:
Go back to [homepage](home#top)

Go back to [homepage](home?utm_source=faq)
This should render to the following HTML:
<p>Go back to <a href="/#top">homepage</a></p>
<p>Go back to <a href="/?utm_source=faq">homepage</a></p>
Using URL names, if we change the URL path, the links in the content will not be broken. To check if the href provided by the writer is a valid url_name, we can try to reverse it:
>>> from django.urls import reverse
>>> reverse('home')
'/'
The URL name “home” points to the url path “/”. When there is no match, an exception is raised:
>>> from django.urls import reverse
>>> reverse('foo')
NoReverseMatch: Reverse for 'foo' not found.
'foo' is not a valid view function or pattern name.
Before we move forward, what happens when the URL name includes query params or a hash?
>>> from django.urls import reverse
>>> reverse('home#top')
NoReverseMatch: Reverse for 'home#top' not found.
'home#top' is not a valid view function or pattern name.
>>> reverse('home?utm_source=faq')
NoReverseMatch: Reverse for 'home?utm_source=faq' not found.
'home?utm_source=faq' is not a valid view function or pattern name.
This makes sense because query parameters and hash are not part of the URL name.
To use reverse and support query params and hashes, we first need to clean the value. Then, check that it is a valid URL name and return the URL path including the query params and hash, if provided:
import re
from django.urls import NoReverseMatch, reverse

def clean_link(href: str, site_domain: str) -> str:
    # ... Same as before ...
    # Remove fragments or query params before trying to match the URL name.
    href_parts = re.search(r'#|\?', href)
    if href_parts:
        start_ix = href_parts.start()
        url_name, url_extra = href[:start_ix], href[start_ix:]
    else:
        url_name, url_extra = href, ''
    try:
        url = reverse(url_name)
    except NoReverseMatch:
        pass
    else:
        return url + url_extra
    return href
This snippet uses a regular expression to split href at the first occurrence of either ? or #, and keeps both parts: the URL name, and the trailing query string or fragment.
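The splitting step does not depend on Django, so it can be pulled into a small helper and tried on its own. This is only a sketch, and the name split_url_name is hypothetical:

```python
import re
from typing import Tuple


def split_url_name(href: str) -> Tuple[str, str]:
    """Split href into a candidate URL name and the trailing ?query/#fragment."""
    match = re.search(r'[#?]', href)
    if match is None:
        return href, ''
    return href[:match.start()], href[match.start():]


print(split_url_name('home'))                     # ('home', '')
print(split_url_name('home#top'))                 # ('home', '#top')
print(split_url_name('home?utm_source=faq#top'))  # ('home', '?utm_source=faq#top')
```

Because the trailing part is kept verbatim, it can be appended to whatever path reverse returns for the name.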
Make sure that it works:
>>> md = markdown.Markdown(extensions=[DjangoUrlExtension()])
>>> md.convert("Go back to [homepage](home)")
<p>Go back to <a href="/">homepage</a></p>
>>> md.convert("Go back to [homepage](home#top)")
<p>Go back to <a href="/#top">homepage</a></p>
>>> md.convert("Go back to [homepage](home?utm_source=faq)")
<p>Go back to <a href="/?utm_source=faq">homepage</a></p>
>>> md.convert("Go back to [homepage](home?utm_source=faq#top)")
<p>Go back to <a href="/?utm_source=faq#top">homepage</a></p>
Amazing! Writers can now use URL names in Markdown. They can also include query parameters and a fragment to be added to the URL.
To handle external links properly we want to check two things:
- External links always provide a scheme, either http: or https:.
- Prevent absolute links to our own site. Internal links should use URL names.
So far, we handled URL names and mailto links. If we passed these two checks it means href is a URL. Let's start by checking if the link is to our own site:
from urllib.parse import urlparse

def clean_link(href: str, site_domain: str) -> str:
    parsed_url = urlparse(href)
    if parsed_url.netloc == site_domain:
        # TODO: URL is internal.
        ...
The function urlparse returns a named tuple that contains the different parts of the URL. If the netloc property equals the site_domain, the link is really an internal link.
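To get a feel for what the parsed result contains, the following sketch prints the parts for a few values. The URLs are made up for illustration; only the standard library is used.

```python
from urllib.parse import urlparse

absolute = urlparse('https://example.com/pricing?utm_source=faq#plans')
print(absolute.scheme)    # https
print(absolute.netloc)    # example.com
print(absolute.path)      # /pricing
print(absolute.query)     # utm_source=faq
print(absolute.fragment)  # plans

# Without a scheme, the whole value is treated as a path.
bare = urlparse('external.com')
print(repr(bare.scheme), repr(bare.netloc), bare.path)  # '' '' external.com

# netloc keeps the port and the original casing; hostname normalizes both.
local = urlparse('http://Example.com:8000/')
print(local.netloc)    # Example.com:8000
print(local.hostname)  # example.com
```

A strict comparison of netloc against the site domain would therefore miss links that include a port or different casing. Whether that matters depends on how writers paste links, so comparing against hostname may be worth considering.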
If the URL is in fact internal, we need to fail. But, keep in mind that writers are not necessarily technical people, so we want to help them out a bit and provide a useful error message. We require that internal links use a URL name and not a URL path, so it’s best to let writers know what is the URL name for the path they provided.
To get the URL name of a URL path, Django provides a function called resolve:
>>> from django.urls import resolve
>>> resolve('/')
ResolverMatch(
func=app.views.home,
args=(),
kwargs={},
url_name=home,
app_names=[],
namespaces=[],
route=,
)
>>> resolve('/').url_name
'home'
When a match is found, resolve returns a ResolverMatch object that contains, among other information, the URL name. When a match is not found, it raises an error:
>>> resolve('/foo')
Resolver404: {'tried': [[<URLPattern '' [name='home']>]], 'path': 'foo'}
This is actually what Django does under the hood to determine which view function to execute when a new request comes in.
To provide writers with better error messages we can use the URL name from the ResolverMatch object:
from urllib.parse import urlparse
from django.urls import Resolver404, resolve

def clean_link(href: str, site_domain: str) -> str:
    # ...
    parsed_url = urlparse(href)
    if parsed_url.netloc == site_domain:
        try:
            resolver_match = resolve(parsed_url.path)
        except Resolver404:
            raise InvalidMarkdown(
                "Should not use absolute links to the current site.\n"
                "We couldn't find a match to this URL. Are you sure it exists?",
                value=href,
            )
        else:
            raise InvalidMarkdown(
                "Should not use absolute links to the current site.\n"
                'Try using the url name "{}".'.format(resolver_match.url_name),
                value=href,
            )
    return href
When we identify that the link is internal, we handle two cases:
- We don’t recognize the URL: The URL is most likely incorrect. Ask the writer to check the URL for mistakes.
- We recognize the URL: The URL is correct, so tell the writer what URL name to use instead.
Let’s see it in action:
>>> clean_link('https://example.com/', 'example.com')
InvalidMarkdown: Should not use absolute links to the current site.
Try using the url name "home". "https://example.com/"
>>> clean_link('https://example.com/foo', 'example.com')
InvalidMarkdown: Should not use absolute links to the current site.
We couldn't find a match to this URL. Are you sure it exists? "https://example.com/foo"
>>> clean_link('https://external.com', 'example.com')
'https://external.com'
Nice! External links are accepted and internal links are rejected with a helpful message.
The last thing we want to do is to make sure external links include a scheme, either http: or https:. Let's add that last piece to the function clean_link:
def clean_link(href: str, site_domain: str) -> str:
    # ...
    parsed_url = urlparse(href)
    # ...
    if parsed_url.scheme not in ('http', 'https'):
        raise InvalidMarkdown(
            'Must provide an absolute URL '
            '(be sure to include https:// or http://)',
            href,
        )
    return href
Using the parsed URL we can easily check the scheme. Let’s make sure it’s working:
>>> clean_link('external.com', 'example.com')
InvalidMarkdown: Must provide an absolute URL (be sure to include https:// or http://) "external.com"
We provided the function with a link that has no scheme, and it failed with a helpful message. Cool!
Putting it All Together
This is the complete code for the clean_link function:
import re
from urllib.parse import urlparse

from django.core.exceptions import ValidationError
from django.core.validators import EmailValidator
from django.urls import NoReverseMatch, Resolver404, resolve, reverse

def clean_link(href: str, site_domain: str) -> str:
    if href.startswith('mailto:'):
        email_match = re.match(r'^(mailto:)?([^?]*)', href)
        if not email_match:
            raise InvalidMarkdown('Invalid mailto link', value=href)
        email = email_match.groups()[-1]
        if email:
            try:
                EmailValidator()(email)
            except ValidationError:
                raise InvalidMarkdown('Invalid email address', value=email)
        return href

    # Remove fragments or query params before trying to match the url name
    href_parts = re.search(r'#|\?', href)
    if href_parts:
        start_ix = href_parts.start()
        url_name, url_extra = href[:start_ix], href[start_ix:]
    else:
        url_name, url_extra = href, ''
    try:
        url = reverse(url_name)
    except NoReverseMatch:
        pass
    else:
        return url + url_extra

    # Not a mailto link and not a URL name, so href must be a URL.
    parsed_url = urlparse(href)
    if parsed_url.netloc == site_domain:
        try:
            resolver_match = resolve(parsed_url.path)
        except Resolver404:
            raise InvalidMarkdown(
                "Should not use absolute links to the current site.\n"
                "We couldn't find a match to this URL. Are you sure it exists?",
                value=href,
            )
        else:
            raise InvalidMarkdown(
                "Should not use absolute links to the current site.\n"
                'Try using the url name "{}".'.format(resolver_match.url_name),
                value=href,
            )

    if parsed_url.scheme not in ('http', 'https'):
        raise InvalidMarkdown(
            'Must provide an absolute URL '
            '(be sure to include https:// or http://)',
            href,
        )
    return href
To get a sense of what a real use case for all of these features looks like, take a look at the following content:
# How to Get Started?
Download the [mobile app](https://some-app-store.com/our-app) and log in to your account.
If you don't have an account yet, [sign up now](signup?utm_source=getting_started).
For more information about pricing, check our [pricing plans](home#pricing-plans)
This will produce the following HTML:
<h1>How to Get Started?</h1>
<p>Download the <a href="https://some-app-store.com/our-app">mobile app</a> and log in to your account.
If you don't have an account yet, <a href="/signup/?utm_source=getting_started">sign up now</a>.
For more information about pricing, check our <a href="/#pricing-plans">pricing plans</a></p>
Nice!
We now have a pretty sweet extension that can validate and transform links in Markdown documents! It is now much easier to move documents between environments and keep our content tidy and most importantly, correct and up to date!
Source: The full source code can be found in this gist.
The capabilities described in this article worked well for us, but you might want to adjust them to fit your own needs.
If you need some ideas, then in addition to this extension we also created a markdown Preprocessor that lets writers use constants in Markdown. For example, we defined a constant called SUPPORT_EMAIL, and we use it like this:
Contact our support at [$SUPPORT_EMAIL](mailto:$SUPPORT_EMAIL)
The preprocessor will replace the string $SUPPORT_EMAIL with the text we defined, and only then render the Markdown.
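The preprocessor itself is not shown in this article. As a rough sketch of how such a replacement step might work, and not necessarily how ours is written, the function below substitutes constants in a list of lines. The CONSTANTS mapping, the naming rule for constants and the function name are all assumptions made for illustration. A list of lines in and a list of lines out is the shape that the run method of a markdown Preprocessor works with, so a function like this could be called from there.

```python
import re
from typing import Dict, List

# Hypothetical constants; in a real project these might come from settings.
CONSTANTS: Dict[str, str] = {
    'SUPPORT_EMAIL': 'support@service.com',
}

# Assumed naming rule: "$" followed by an upper-case identifier.
CONSTANT_RE = re.compile(r'\$([A-Z][A-Z0-9_]*)')


def replace_constants(lines: List[str], constants: Dict[str, str]) -> List[str]:
    """Replace $NAME with its value; unknown names are left untouched."""

    def substitute(match: 're.Match[str]') -> str:
        name = match.group(1)
        return constants.get(name, match.group(0))

    return [CONSTANT_RE.sub(substitute, line) for line in lines]


print(replace_constants(
    ['Contact our support at [$SUPPORT_EMAIL](mailto:$SUPPORT_EMAIL)'],
    CONSTANTS,
))
# ['Contact our support at [support@service.com](mailto:support@service.com)']
```

Because the substitution happens before the Markdown is rendered, the resulting mailto link still passes through the link validation described above.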
Originally published at https://hakibenita.com on March 30, 2020.