Everybody working in web development knows about this, right?
I recently realized that this is far from true. While frameworks and ORMs make web developers' jobs easier and easier, more and more people come to build websites without knowing the real basics. They use relational databases without knowing anything about SQL, and they develop websites without understanding how the client talks to their website.
I don't pretend to start a revolution here, but if I take my car to a mechanic, I hope they will know a bit more about it than "the engine makes power, and so the car works". I expect the same when I ask a specialist to work on a website.
To draw a parallel with music, you can learn to play some songs without knowing anything about harmony and rhythm, but if you want to get serious about what you play, it becomes an absolute requirement.
So I decided to write some short articles explaining in simple terms what those basics are, in my humble opinion. The first one is about HTTP.
So what is HTTP?
HTTP is an application-level communication protocol. In simpler terms, it is a language that allows two applications to communicate by exchanging messages. How those messages travel between the two applications is not HTTP's concern, and in a network-less world you could even imagine exchanging messages on floppy disks. In the real world, we prefer to send them over networks, using lower-level network protocols like TCP/IP.
Since this is by far the most common case, we will assume it always holds: HTTP is the communication language, and TCP/IP handles the transport of HTTP messages between actors.
Let's go back to the original definition.
HTTP 1.1 was originally defined by RFC 2616 (since replaced by newer RFCs, and followed by HTTP/2 and HTTP/3), which says: "The Hypertext Transfer Protocol (HTTP) is an application-level protocol for distributed, collaborative, hypermedia information systems. HTTP has been in use by the World-Wide Web global information initiative since 1990."
"The HTTP protocol is a request/response protocol. A client sends a request to the server in the form of a request method, URI, and protocol version, followed by a MIME-like message containing request modifiers, client information, and possible body content over a connection with a server. The server responds with a status line, including the message's protocol version and a success or error code, followed by a MIME-like message containing server information, entity metainformation, and possible entity-body content."
This is the absolute minimum to know. HTTP is a communication language, and it works in a request-message, response-message way. This is extremely simple. The RFC then explains that it can do much more complicated things, like handling messages that go through many nodes before reaching their destination, but let's keep it simple and say that, from a client or server point of view, the result is exactly the same.
The RFC also makes clear that HTTP does not depend on the means of transport:
"HTTP communication usually takes place over TCP/IP connections. The default port is TCP 80, but other ports can be used. This does not preclude HTTP from being implemented on top of any other protocol on the Internet, or on other networks. HTTP only presumes a reliable transport; any protocol that provides such guarantees can be used; the mapping of the HTTP/1.1 request and response structures onto the transport data units of the protocol in question is outside the scope of this specification."
And it will be out of the scope of this blog post too. HTTP doesn't care about how the message transport is done, but it's usually handled by TCP/IP.
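To make the transport independence a bit more concrete, here is a minimal Python sketch that exchanges one request and one response over socket.socketpair() instead of a TCP connection. On Unix-like systems this gives a pair of local sockets, so no TCP/IP is involved. The function names (read_until, exchange_without_tcp) and the message contents are assumptions made up for the illustration, and the "server" is only a few lines in the same process, not a real web server.

```python
import socket


def read_until(sock, marker=None):
    """Read from sock until marker is seen, or until the peer closes."""
    data = b""
    while marker is None or marker not in data:
        chunk = sock.recv(4096)
        if not chunk:
            break
        data += chunk
    return data


def exchange_without_tcp():
    """Send one HTTP request and one HTTP response over a local socket pair."""
    client, server = socket.socketpair()
    try:
        # Client side: a request is just text written to a reliable transport.
        client.sendall(b"GET /index.html HTTP/1.1\r\nHost: www.example.com\r\n\r\n")

        # Server side: read up to the blank line that ends the headers.
        request = read_until(server, b"\r\n\r\n")

        # Server side: write the response, then close to mark its end.
        body = b"Hello, world!"
        head = f"HTTP/1.1 200 OK\r\nContent-Length: {len(body)}\r\n\r\n"
        server.sendall(head.encode("ascii") + body)
        server.close()

        # Client side: read everything until the server closes.
        response = read_until(client)
    finally:
        client.close()
        server.close()
    return request, response
```

The messages are exactly the same bytes that would travel over TCP; only the pipe underneath has changed.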
Messages, requests, responses
So the basic communication tool is the message. A client-to-server message is called a request, and the server-to-client "answer" is called a response. Even if it may seem obvious, a response can never be sent on the server's own initiative: a client must have made a request first, and the server's response answers that request.
Now that we're pretty confident that HTTP messages are Tupperware boxes containing information, let's take a step forward and look at how the content of this box is structured.
Messages consist of three parts: the start line, the headers, and the body. As each part has specific rules depending on whether the message is a request or a response, we'll look at both separately.
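The three-part structure can be sketched in a few lines of Python. This is a simplified illustration, not a complete parser: split_message is a name invented for this example, it assumes the whole message is already in memory, and it stores headers in a plain dictionary, so a header that appears twice would keep only its last value.

```python
def split_message(raw):
    """Split a raw HTTP/1.x message into (start_line, headers, body)."""
    # A blank line (CRLF CRLF) separates the head from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")

    # The first line is the start line, the remaining ones are headers.
    start_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so normalize them.
        headers[name.strip().lower()] = value.strip()
    return start_line, headers, body
```

The same function works for requests and responses, since both share this layout and only the start line differs.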
The request start line is pretty easy to understand: it tells the server what we are requesting.
GET /path/to/file/index.html HTTP/1.0
Here it says we're using the HTTP GET method to retrieve the resource located at the /path/to/file/index.html URI (uniform resource identifier). It also states which protocol version we're using, to be sure both sides speak exactly the same language.
I will come back to HTTP methods a bit later, and will simply refer to them as "methods" from now on.
After this, the client sends the headers. Headers are a list of key-value properties that convey various information about the client, its capabilities, what it expects, and the message itself. I won't go over every standard request header.
Depending on which method is used, the request body may or may not contain data. For example, if we simply get Google's homepage, it won't contain any data. If we're uploading a file, or POST-ing form data to the server, the body is the container for that data.
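As a rough illustration of how the three parts of a request fit together, here is a small Python sketch that assembles one by hand. The build_request helper and the /comments path are assumptions invented for the example. Real clients do much more, such as validating header values and handling encodings.

```python
def build_request(method, uri, headers, body=b"", version="HTTP/1.1"):
    """Assemble a raw HTTP request: start line, headers, blank line, body."""
    lines = [f"{method} {uri} {version}"]
    fields = dict(headers)
    if body:
        # Tell the server how many bytes of body follow the headers.
        fields["Content-Length"] = str(len(body))
    lines += [f"{name}: {value}" for name, value in fields.items()]
    return "\r\n".join(lines).encode("ascii") + b"\r\n\r\n" + body


# A GET request carries no body.
get_request = build_request("GET", "/", {"Host": "www.example.com"})

# A POST request carries the form data in its body.
post_request = build_request(
    "POST", "/comments", {"Host": "www.example.com"}, b"text=hi"
)
```

The only difference between the two is the method in the start line and the presence of a body, which is announced by the Content-Length header.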
Once the server has received the request, it passes it to a handler, which can be just about anything. In PHP applications hosted on an Apache server, chances are that Apache will call mod_php to handle the request and build a response that it will send back to the client. But that's only one possibility among countless others, which are outside the scope of this blog post.
Anyway, the handler takes an HTTP request as its input and builds an HTTP response, starting with the response start line.
HTTP/1.0 200 OK
This is a pretty standard response. The server starts by saying which protocol version it's using, then sends a machine-readable status code, followed by a human-readable translation of this status code.
A complete list of HTTP status codes is available in many places on the internet, and there is no need to know them all. Searching for "http status code" when you're looking for the appropriate status should be more than enough.
The most important thing to know is the different status code classes, defined by the first digit of the code:
- 1xx: Informational - Request received, continuing process.
- 2xx: Success - The action was successfully received, understood, and accepted.
- 3xx: Redirection - Further action must be taken in order to complete the request.
- 4xx: Client Error - The request contains bad syntax or cannot be fulfilled.
- 5xx: Server Error - The server failed to fulfill an apparently valid request.
A practical application of this is when you need to choose between sending back a 404 error and a 500 error. A 404 is the client's fault: it asked for something that does not exist, and you tell it so. No problem. A 500 means that the server tried hard to build a response, but something prevented it from achieving its goal. This is a server-side application problem, and you should monitor and fix it as soon as possible.
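The status classes lend themselves to a tiny Python sketch: read the status line, then look at the first digit of the code. The names parse_status_line and status_class are made up for this illustration, and the parsing assumes a well-formed status line.

```python
STATUS_CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}


def parse_status_line(line):
    """Split a response start line into (version, code, reason)."""
    parts = line.split(" ", 2)
    version, code = parts[0], int(parts[1])
    # The human-readable reason may contain spaces, or be missing.
    reason = parts[2] if len(parts) > 2 else ""
    return version, code, reason


def status_class(code):
    """Return the class of a status code, based on its first digit."""
    return STATUS_CLASSES.get(code // 100, "Unknown")


version, code, reason = parse_status_line("HTTP/1.1 404 Not Found")
print(code, reason, "->", status_class(code))
```

A client that meets a status code it does not know can still react sensibly by falling back on the class alone.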
After sending the start line, the response, being a faithful HTTP message, goes on by sending the headers. As with the request, headers give information about the server and the message content. You can think of them as the README file that helps the user agent (the HTTP client software) understand the response.
What is an HTTP method? Besides being one word in the request start line, the HTTP method is the client's way of telling the server what kind of operation it wants to perform on the requested resource. The method used in a request has direct implications for what kind of work you can do server-side, and should be chosen carefully (of course, you can break this rule, but it's a very bad idea).
Some methods are said to be "safe". This means that the request is a retrieval operation and should not do any other work. Note that "should" does not mean "must", but the user cannot be held responsible for breaking something if the method used was "safe".
For example, you should not allow GET requests to create, delete or change files, database records, or LDAP nodes (a non-exhaustive list, of course).
GET and HEAD are safe methods.
Idempotent methods are those for which repeating the same request several times has the same effect as making it once.
For example, the DELETE method, which asks for a resource to be deleted, should have exactly the same effect whether it's called once on resource A or ten times. The overall effect should be the deletion of the resource (if allowed), no less, no more.
This is the case for the GET, HEAD, PUT and DELETE methods.
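Idempotence can be illustrated with a toy in-memory store in Python. Everything here (the dictionary standing in for server-side resources, and the delete, put and apply_repeatedly helpers) is an assumption made up for the example, not how a real server stores things.

```python
def delete(store, path):
    """DELETE: remove the resource if it exists, do nothing otherwise."""
    store.pop(path, None)


def put(store, path, content):
    """PUT: create the resource or replace it with the given content."""
    store[path] = content


def apply_repeatedly(operation, store, times):
    """Apply operation to a copy of store `times` times, return the result."""
    state = dict(store)
    for _ in range(times):
        operation(state)
    return state


resources = {"/a": "resource A", "/b": "resource B"}

deleted_once = apply_repeatedly(lambda s: delete(s, "/a"), resources, 1)
deleted_ten_times = apply_repeatedly(lambda s: delete(s, "/a"), resources, 10)

put_once = apply_repeatedly(lambda s: put(s, "/a", "new A"), resources, 1)
put_ten_times = apply_repeatedly(lambda s: put(s, "/a", "new A"), resources, 10)
```

In both cases the state after ten calls is identical to the state after one, which is exactly what idempotent means. The response to the second DELETE might well differ (a 404 instead of a 200, for instance), but the state on the server does not.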
Non-safe and non-idempotent methods
But of course, some requests need to take risky and/or non-repeatable actions on the server side. When a user agent uses such a method, it is aware of the risks.
For example, POST-ing a comment will create a database record, and if you make the same request twice (unless some protection has been developed to prevent this), you will actually create two database records.
This is exactly why browsers ask for user confirmation before repeating a POST request when the user asks for a refresh. It's also why you should redirect to a safe, idempotent resource after a successful form submission, so that a refresh repeats a harmless GET rather than the submission itself.
This is the case for the POST method.
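Here is a minimal Python sketch of both points: repeating a POST creates a second record, and the handler answers with a redirect rather than a page. The handle_post_comment function, the list standing in for a database table, and the /comments URI are all invented for the illustration. Using status 303 (See Other) is one reasonable choice for the redirect, not the only one.

```python
def handle_post_comment(comments, text):
    """Store a comment, then redirect the client to the comment list."""
    # Not idempotent: every call adds a new record.
    comments.append(text)
    # Redirect instead of rendering a page, so that refreshing the
    # browser repeats a GET on /comments and not this POST.
    return 303, {"Location": "/comments"}, b""


comments = []
first = handle_post_comment(comments, "Nice article!")
second = handle_post_comment(comments, "Nice article!")
print(len(comments))  # two records for two identical requests
```

Note that the redirect only protects against accidental refreshes. If duplicates really must be prevented, the application has to detect them itself.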
And the real world
All those methods are defined by the HTTP RFC, but the real world is more restricted than that. HTML forms only support the GET and POST methods, so those are the two you will deal with most of the time.
You should just keep in mind that other HTTP methods exist, and that HTTP extensions like WebDAV (which is used by Microsoft and their "online folders") rely on them. A simple way to try out PUT, DELETE, or any other method is to use curl as an HTTP client.
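If you would rather experiment from a script than from the command line, Python's standard library can play both roles. The sketch below is only an illustration under several assumptions: it starts a throwaway server on a free local port, keeps "files" in a dictionary, and the Handler class, the request and demo helpers, and the /note.txt path are all made up for the example.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

RESOURCES = {}  # in-memory "files", keyed by request path


class Handler(BaseHTTPRequestHandler):
    def _reply(self, status, body=b""):
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        if self.path in RESOURCES:
            self._reply(200, RESOURCES[self.path])
        else:
            self._reply(404)

    def do_PUT(self):
        created = self.path not in RESOURCES
        length = int(self.headers.get("Content-Length", 0))
        RESOURCES[self.path] = self.rfile.read(length)
        self._reply(201 if created else 200)

    def do_DELETE(self):
        RESOURCES.pop(self.path, None)
        self._reply(200)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def request(port, method, path, body=None):
    """Send one request with any method, return (status, body)."""
    conn = http.client.HTTPConnection("127.0.0.1", port)
    try:
        conn.request(method, path, body=body)
        response = conn.getresponse()
        return response.status, response.read()
    finally:
        conn.close()


def demo():
    server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    port = server.server_address[1]
    try:
        return [
            request(port, "PUT", "/note.txt", b"hello"),
            request(port, "GET", "/note.txt"),
            request(port, "DELETE", "/note.txt"),
            request(port, "GET", "/note.txt"),
        ]
    finally:
        server.shutdown()
        server.server_close()
```

Calling demo() runs a PUT, a GET, a DELETE and a final GET against the local server and returns the status code and body of each response, which makes it easy to see what each method did.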
To learn more about the specifics of each method, you can still refer to the original RFC.
GET /index.html HTTP/1.1
Host: www.example.com
Nothing complicated here, but for curious people (and you should be curious):
- example.com is a special domain reserved by RFC 2606. It will never conflict with a real domain, so you should use it for all your tests and examples.
- The "Host" header is required by the HTTP 1.1 protocol, and is used by web servers to provide name-based virtual hosting of different websites on the same IP address.
and the response!
HTTP/1.1 200 OK
Date: Sun, 24 Jan 2010 22:38:34 GMT
Server: Apache (Boubounetout/Linux)
Last-Modified: Fri, 01 Jan 2010 23:11:55 GMT
Etag: "f3f80-1b6-de1b034b"
Accept-Ranges: bytes
Content-Length: 438
Connection: close
Content-Type: text/html; charset=UTF-8

<html>
<head></head>
<body>
<h1>Welcome</h1>
<p>Hello, world!</p>
</body>
</html>
I hope I'm far from alone in claiming that this is the very foundation of any serious web development, and that understanding the underlying communication language will make your life as a web developer easier. Yes, HTTP is definitely the web's language, and you should be aware of it.
OK folks, that will be all for today. But stay tuned, the next episodes are already in the pipeline.
Thanks to Bertrand Zuchuat for giving feedback and ideas before the article publication.