This quarter we had a course with an assignment to be developed in Erlang, a functional language oriented to concurrent programming and interprocess communication through message passing. The result is a crawler where each domain is assigned a “thread” that makes the requests to the web server, plus another one that downloads the images and indexes them using pHash. The program has more parts, but here we'll focus on this one.

(By the way, the project has been developed in the open; the code is available at its GitHub repository, EPC.)

At the beginning each thread simply called httpc:request, which is the way the standard library offers to make these requests, but it seems that concurrency is not handled very well there, and this caused starvation in the indexing process.

Further down in the documentation a possible solution is shown:

Option (option()) details:
sync
     Shall the request be synchronous or asynchronous.
     Defaults to true.

Anyway, that solution wasn't checked at the time; instead, two others were implemented. One was a dedicated process that makes sure the indexer's downloads have priority, so the indexer doesn't starve. The crawlers won't starve either, because the indexer needs some time (not much, but some) to extract the features of each downloaded image. This was implemented in pure Erlang and is the one merged into the master branch.

The other option was to implement the download as a port: an external program written in C that is called from an Erlang process. This option was kept in the GET-by-port branch.

C - Erlang Communication

The port is composed of two components, the C part and the Erlang part. The communication between them can be done in several ways, and is defined with PortSettings in open_port. The possibilities are:

  • {packet, N}

    Messages are preceded by their length, sent in N bytes, with the most significant byte first. Valid values for N are 1, 2, or 4.

  • stream

    Output messages are sent without packet lengths. A user-defined protocol must be used between the Erlang process and the external object.

  • {line, L}

    Messages are delivered on a per line basis. Each line (delimited by the OS-dependent newline sequence) is delivered in one single message. The message data format is {Flag, Line}, where Flag is either eol or noeol and Line is the actual data delivered (without the newline sequence).

    L specifies the maximum line length in bytes. Lines longer than this will be delivered in more than one message, with the Flag set to noeol for all but the last message. If end of file is encountered anywhere else than immediately following a newline sequence, the last line will also be delivered with the Flag set to noeol. In all other cases, lines are delivered with Flag set to eol.

    The {packet, N} and {line, L} settings are mutually exclusive.

In this case we'll use {packet, 4}, enough to send whole web pages.
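
To make the {packet, 4} framing concrete, here is a minimal sketch of how the 4-byte header can be built and parsed by hand, most significant byte first. The function names encode_length and decode_length are hypothetical and not part of the project; a 4-byte header allows messages of up to 2^32 - 1 bytes.

```c
#include <stdint.h>

/* Hypothetical helper: store len as 4 bytes, most significant byte first,
 * which is the header layout used by {packet, 4}. */
void encode_length(uint32_t len, unsigned char out[4])
{
    out[0] = (unsigned char)((len >> 24) & 0xFF);
    out[1] = (unsigned char)((len >> 16) & 0xFF);
    out[2] = (unsigned char)((len >> 8) & 0xFF);
    out[3] = (unsigned char)(len & 0xFF);
}

/* Hypothetical helper: rebuild the length from a 4-byte big-endian header. */
uint32_t decode_length(const unsigned char in[4])
{
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16)
         | ((uint32_t)in[2] << 8) | (uint32_t)in[3];
}
```

For example, a 258-byte message would be preceded by the bytes 0, 0, 1, 2. Shifting bytes like this gives the same result on any host, whatever its endianness.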

Communication - The C side

So let's focus on what happens in the C program when it receives data. The function that handles this is char* read_url().

The process is simple: read 4 bytes from stdin and store them as a uint32_t

uint32_t length;
if (fread(&length, 4, 1, stdin) != 1){
    return NULL;
}

Then it converts the data from big-endian to the host endianness, which is done with the function ntohl()

length = ntohl(length);

The rest is simply reading a string of known length from stdin, without any transformation

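/* One extra byte for the terminating '\0' */
char *url = malloc((size_t)length + 1);
if (url == NULL){
    return NULL;
}

unsigned int already_read = 0;
while (already_read < length){
    size_t got = fread(&url[already_read], sizeof(uint8_t),
                       length - already_read, stdin);
    if (got == 0){
        /* EOF or read error: give up instead of looping forever */
        free(url);
        return NULL;
    }
    already_read += got;
}
url[length] = '\0';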
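
The three steps above (read the header, convert it, read the payload) can also be folded into one reusable function. What follows is only a sketch under a few assumptions: read_exact and read_packet are hypothetical names, the stream is passed as a parameter instead of being fixed to stdin, and the length is decoded with shifts rather than ntohl() so the block needs no networking header.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Read exactly n bytes from in. Returns 0 on success, -1 on EOF or error. */
static int read_exact(FILE *in, void *buf, size_t n)
{
    unsigned char *p = buf;
    size_t done = 0;
    while (done < n) {
        size_t got = fread(p + done, 1, n - done, in);
        if (got == 0) {
            return -1; /* EOF or read error: stop instead of spinning */
        }
        done += got;
    }
    return 0;
}

/* Read one {packet, 4} message from in.
 * Returns a malloc'd, NUL-terminated buffer (the caller frees it),
 * or NULL on EOF, read error, or allocation failure. */
char *read_packet(FILE *in, uint32_t *out_len)
{
    unsigned char header[4];
    if (read_exact(in, header, sizeof header) != 0) {
        return NULL;
    }
    /* Big-endian header: most significant byte first */
    uint32_t len = ((uint32_t)header[0] << 24) | ((uint32_t)header[1] << 16)
                 | ((uint32_t)header[2] << 8) | (uint32_t)header[3];

    char *data = malloc((size_t)len + 1);
    if (data == NULL) {
        return NULL;
    }
    if (read_exact(in, data, len) != 0) {
        free(data);
        return NULL;
    }
    data[len] = '\0';
    if (out_len != NULL) {
        *out_len = len;
    }
    return data;
}
```

In the port program this would be called as read_packet(stdin, &len). Getting NULL back is also how the program would notice that Erlang closed the port, since stdin reaches end of file at that point.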

Returning the data to Erlang doesn't take much more effort, as can be seen in the void show_result(headers, body) procedure. The data is sent in two groups, first the headers and then the response body. Sending the headers means converting their size to big-endian using htonl and writing it to stdout, then writing the whole string directly

/* Strange things can happen if we forget this */
uint32_t headers_size = htonl(headers.size);
fwrite(&headers_size, 4, 1, stdout);

unsigned int written_head = 0;
while (written_head < headers.size){
    size_t n = fwrite(&(headers.memory[written_head]), sizeof(uint8_t),
                      headers.size - written_head, stdout);
    if (n == 0){
        break; /* write error: stop instead of looping forever */
    }
    written_head += n;
}

... and repeat this for the response body

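uint32_t body_size = htonl(body.size);
fwrite(&body_size, 4, 1, stdout);

unsigned int written_body = 0;
while (written_body < body.size){
    size_t n = fwrite(&(body.memory[written_body]), sizeof(uint8_t),
                      body.size - written_body, stdout);
    if (n == 0){
        break; /* write error: stop instead of looping forever */
    }
    written_body += n;
}
/* stdout is fully buffered when it is a pipe, so push the reply out now */
fflush(stdout);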
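
Since the headers and the body are sent the same way, the pattern can be sketched as a single helper. This is an illustration and not the project's code: write_packet is a hypothetical name, the output stream is a parameter, and the header is built with shifts instead of htonl. It relies on the fact that fwrite only returns fewer items than requested when a write error occurs, so no retry loop is used.

```c
#include <stdint.h>
#include <stdio.h>

/* Write one {packet, 4} message to out: a 4-byte big-endian length
 * followed by the payload. Returns 0 on success, -1 on write error. */
int write_packet(FILE *out, const void *data, uint32_t len)
{
    unsigned char header[4] = {
        (unsigned char)((len >> 24) & 0xFF),
        (unsigned char)((len >> 16) & 0xFF),
        (unsigned char)((len >> 8) & 0xFF),
        (unsigned char)(len & 0xFF)
    };

    if (fwrite(header, 1, sizeof header, out) != sizeof header) {
        return -1;
    }
    if (len > 0 && fwrite(data, 1, len, out) != len) {
        return -1;
    }
    /* Flush so the message is not left sitting in the stdio buffer */
    return fflush(out) == 0 ? 0 : -1;
}
```

With it, sending the reply would be two calls, one for the headers and one for the body, each arriving in Erlang as a separate message.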

This is all it takes to complete the interface with Erlang; the rest is ordinary C logic. In this case that means making HTTP requests, which is easy with cURL. Each time we want to receive a URL from Erlang we only need to call read_url(), and of course the program is compiled the usual way.

Communication - The Erlang side

The logic on the Erlang side isn't complex either. It only has to call open_port, which returns the port identifier to send the data to. This assumes we have defined HTTP_GET_BINARY_PATH with the path of the compiled binary that completes the port

Port = open_port({spawn, ?HTTP_GET_BINARY_PATH}, [{packet, 4}])

When data needs to be sent to the binary, it is sent through this port as a tuple like

{Actual_process_PID, {command, Message}}

For example

Port ! {self(), {command, Msg}},

This is converted to the data that the binary receives. In the same way, when the binary sends data back, it is received as a message like

{Port_PID, {data, Received_message}}

Since we're receiving two, the headers and the message body...

receive
    {Port, {data, Headers}} ->
        receive
            {Port, {data, Body}} ->
                From ! {self(), {http_get, {ok, {200,
                                                 process_headers(Headers),
                                                 Body}}}}
        end
end,

The result is sent as a message because this code is meant to run inside a loop, keeping the port active in an isolated process until the process that created it terminates. It doesn't have to be used this way, though: there's no reason it couldn't return the result directly as a “normal” function, although it might have trouble coordinating access to the input/output if multiple processes use it.

... and that's all: we have a piece of C code running from Erlang. Of course, since the interface is stdin/stdout, any language can be used. It's a very flexible design :)