In this post I will address a common performance issue: heavy calculations, data processing, and other tasks that happen at run time but could be deferred until later.
How Stuff Works
Sites usually function quite well: a couple of SQL queries, hopefully some memcached/redis hits, and a bit of string manipulation to put data into templates and serve them to the user. But from time to time a new user registers, and that means processing and sending the registration email, possibly fetching their Facebook image and storing it locally, and collecting all their Facebook friends or their last dozen tweets. All of this takes time. Time the user spends staring at a spinner, wondering if something has gone wrong and whether they should close the site.
Another example is multiple image uploads happening at the same time. The application can take the entire server down if it processes images immediately: if, say, 100 people decide to upload an image at the same moment, you have just created 100 processes (or more, if you're creating multiple variations of each image), all using as much CPU as they can get.
What Can Be Done
To offload the work, you can put it in a queue. The simplest one is a cron job: after a user registers, you store a row in the database with the flag "send_email = 1". When the cron runs (every minute? every hour?), it collects all the new users who need to receive the email and sends it to them. Easy, right?
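As a sketch of that cron approach (the table, column, and database credentials here are hypothetical, and mail() stands in for whatever mailer you actually use):

```php
<?php
// send_pending_emails.php - run by cron, e.g. * * * * *

$db = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Collect all users flagged at registration time
$users = $db->query(
    "SELECT id, email FROM users WHERE send_email = 1"
)->fetchAll(PDO::FETCH_ASSOC);

foreach ($users as $user) {
    mail($user['email'], 'Welcome!', 'Thanks for registering.');

    // Clear the flag so the next cron run skips this user
    $stmt = $db->prepare("UPDATE users SET send_email = 0 WHERE id = ?");
    $stmt->execute([$user['id']]);
}
```

Note that a single unhandled exception in this loop kills the whole run, which is exactly the weakness discussed below.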
Let's stop for a second and take a closer look. What happens if 10 or 1000 tasks arrive within one minute, as on a high-traffic site with lots of image uploads? How long do we need to wait for our image to show up if 100 images are already waiting to be processed and cron sits there, doing nothing, for a whole minute?
What happens if the script run by cron has a bug that prevents it from ever proceeding to the next task, or throws an exception that isn't handled? The run will die and all those people will not get their email. Bad, huh? Cron jobs are fine for regular maintenance, but choose the right tool for the job.
Gearman, Beanstalkd, Resque & QuTee
There are a couple of solutions that allow us to do the heavy lifting in the background.
Gearman is one of the most popular job queue processors. It provides APIs for many popular platforms (Java, Python, PHP, Perl). You can use it for background jobs as well as for messaging another service to do the work at run time (say, your PHP application asks a program written in C, running on another server, to do some task better suited to C).
Another popular queuing system is Beanstalkd, originally created to power the backend of the 'Causes' Facebook app. Its list of client libraries is impressive; almost every major language is supported.
As a side project, I have started working on QuTee, a queue manager and task processor for PHP. To consider it feature complete, I have set these goals:
- it has to have a good API to be easy to use,
- it has to have as few dependencies as possible,
- it has to be easy to install and configure, and, with multiple backends in mind, it has to solve background task processing on dedicated machines as well as on shared hosting,
- it has to provide some kind of interface for monitoring task status and starting/stopping workers.
Currently only Redis is supported as a backend, and supervisord is recommended for keeping worker processes alive (I didn't want to go into forking and add another dependency).
Adding a task to the queue is really easy. After creating and configuring the queue, adding a task can be a one-liner:
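The original snippet did not survive the conversion; here is a sketch of what configuring the queue and enqueuing a task could look like. The class and method names (Queue::factory, Task::create) and the Redis options are assumptions reconstructed from the surrounding description, not verified against QuTee's actual API:

```php
<?php
use Qutee\Queue;
use Qutee\Task;

// Configure the queue once, backed by Redis
// (connection options are illustrative)
$queue = Queue::factory('Redis', [
    'host' => '127.0.0.1',
    'port' => 6379,
]);

// Adding a task is then a one-liner: the task class name
// plus the data it needs to run
Task::create('Acme\SendRegistrationEmail', ['user_id' => 123]);
```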
In the above example, we have a class in the Acme namespace that implements TaskInterface. When the worker instantiates the task, it knows to call its ->run() method. The task class is not required to implement TaskInterface, however; in that case we need to specify which method to run:
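Again the snippet is lost; a sketch of what it might have shown, with the setter names (setName, setData, setMethodName) and the Acme\ImageResizer class being assumptions rather than QuTee's confirmed API:

```php
<?php
use Qutee\Task;

// Acme\ImageResizer does not implement TaskInterface, so we
// state explicitly which method the worker should call
$task = new Task;
$task->setName('Acme\ImageResizer')
     ->setData(['path' => '/tmp/upload.jpg'])
     ->setMethodName('resize');

$queue->addTask($task);
```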
Tasks can have a unique identifier, so the same task will not run multiple times. Workers are even simpler: they only listen to the queues (or to one in particular) and run tasks, nothing else:
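A sketch of what the lost worker snippet might have looked like, matching the description that follows; the method names (setInterval, setQueue) and the PRIORITY_HIGH constant are assumptions, not verified against QuTee's API:

```php
<?php
use Qutee\Task;
use Qutee\Worker;

$worker = new Worker;
$worker->setInterval(30);               // poll the queue every 30 seconds
$worker->setQueue(Task::PRIORITY_HIGH); // only the high-priority queue
$worker->run();                         // block, running tasks as they arrive
```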
This example creates a worker that polls the queue every 30 seconds and is interested only in tasks from the high-priority queue.
Since QuTee is at version 0.7.0 right now, it still lacks some functionality (a task status web interface, logging, more backends), but it can already be used for background job processing.
When code that a task or worker depends on changes, the worker needs to be restarted; keep this in mind when pushing a fix to production. One way around the problem is to have the worker exit every hour or so and let supervisord restart it.
Another thing to keep in mind with background workers is race conditions and how to avoid them. Let's say we have 100 workers and have created tasks to send a newsletter with a unique coupon to 1000 users. If each task selects the first coupon from the database at the same time (say you didn't implement read locking), there is a good chance that many users will get the same coupon. For that reason it is good practice to fetch all the necessary data when creating the task, so the task is as stateless as it can be, and the coupon is passed along with the rest of the data.
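The advice above can be sketched like this: reserve each coupon up front, in the request that builds the queue, so the worker never has to query for a "free" coupon. The reserveNextCoupon() helper and the task class name are hypothetical:

```php
<?php
use Qutee\Task;

// When creating the tasks: claim one coupon per user up front,
// inside the single process that enqueues the newsletter
foreach ($userIds as $userId) {
    // Hypothetical helper that atomically marks a coupon as taken
    // and returns its code
    $coupon = reserveNextCoupon($db, $userId);

    // The task carries everything it needs; the worker just sends,
    // so 100 workers can run this concurrently without collisions
    Task::create('Acme\SendNewsletter', [
        'user_id' => $userId,
        'coupon'  => $coupon,
    ]);
}
```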
One last thing that comes to mind is multiple database connections. If your tasks need a database connection, keep in mind that each task will connect to the database, perform its job, and quit. Remember to call mysqli_close() on the connection, since PHP's garbage collection isn't that good.
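In practice that means a task's run() method should look something like this (connection credentials are placeholders):

```php
<?php
// Inside a task's run() method: open the connection, do the work,
// and close it explicitly rather than relying on garbage collection
$db = new mysqli('localhost', 'user', 'pass', 'app');

// ... perform the task's queries here ...

$db->close(); // free the connection before the worker moves on
```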
Background workers and job queues greatly improve user experience and help reduce overall server load. If you haven't used background workers until now because of the hassle of setting everything up, I hope this article has given you an idea of where to start and will motivate you to consider QuTee for its quick setup and the features it will grow, so go ahead and fork it :)