Evolutionary computation · TEFcon 2014

by Guido García on 5/11/2014

This is my presentation (in Spanish) about evolutionary computation for TEFcon 2014. It was a talk about how we code and how genetic algorithms and genetic programming might help us. Because “programming should be more about the what and less about the how”.

 

DTrace para flipar pepinillos

by Guido García on 25/08/2014

This is my presentation (in Spanish) about DTrace, a really cool tracing framework created by Sun. It is available on “Solaris” systems, but also on OSX, BSD, and some Linux ports.

It is a really powerful tool once you get used to it.

 

How to catch an Internet troll

by Guido García on 25/04/2014

Some weeks ago I carried out a social experiment (dameunverso.com, in Spanish) that consisted of writing a poem in a collaborative and anonymous way. This means that anyone can add a new verse to the poem without identifying themselves or leaving any metadata (no cookies, no IP address tracking, etc).

Our first trolls didn’t take long to appear, mostly in the form of copyrighted material, spam and offensive content. Is it possible to automatically classify an anonymous verse as spammy?

Troll

Text classification

LingPipe is a powerful Java toolkit for processing text, free for research use under some conditions. I followed its Text Classification Tutorial to classify each verse into one of two categories: “spam” or “love”.

The classifier I built uses 80% of the poem (already classified by hand into the “spam” or “love” categories) as a training set to learn and build a language model. Then, it uses the remaining 20% of the poem (48 verses) for cross-validation of this model.

You can find the code in Annex I; it is less than 50 lines of code.

Classification results

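The classification accuracy is 75% ± 12.25%. The lower bound of that interval (62.75%) stays above the 50% expected from guessing at random between two categories, so we can say that our model performs better than a monkey at a significance level of 0.05.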

Categories=[spam, love]
Total Count=48
Total Correct=36
Total Accuracy=0.75
95% Confidence Interval=0.75 +/- 0.1225

Confusion Matrix
Macro-averaged Precision=0.7555
Macro-averaged Recall=0.7412
Macro-averaged F=0.7428
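
The ± 0.1225 figure is consistent with the usual normal approximation for a proportion. The snippet below is only a sketch of that arithmetic, assuming z = 1.96 for a 95% interval. The function name accuracyInterval is made up for this example and this is not LingPipe's own code.

```javascript
// Sketch: normal-approximation confidence interval for a classification accuracy.
// Assumption: z = 1.96 (95% confidence); accuracyInterval is a hypothetical helper.
function accuracyInterval(correct, total, z) {
  const accuracy = correct / total;
  // Standard error of a proportion, scaled by z to get the half-width.
  const halfWidth = z * Math.sqrt(accuracy * (1 - accuracy) / total);
  return {
    accuracy: accuracy,
    halfWidth: halfWidth,
    lower: accuracy - halfWidth,
    upper: accuracy + halfWidth
  };
}

// 36 correct verses out of 48, as in the evaluator output above.
const interval = accuracyInterval(36, 48, 1.96);
console.log(interval.accuracy.toFixed(4) + ' +/- ' + interval.halfWidth.toFixed(4));
// prints "0.7500 +/- 0.1225"
```

With only 48 verses the interval is wide, which is one more reason to take the result with a grain of salt.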

It seems pretty promising and it can serve as inspiration but, to be honest, I don’t think it is such a good model. With so few contributions to the poem, it is prone to overfitting, so it is probably just learning to classify our usual trolls, who are not original at all.

Moreover, we are not taking into account other factors that would greatly improve the results, such as the structure of the verse (length, number of words, etc), the relation between the verses (rhyme) or the presence of nonexistent words and typos. If you want to investigate further, I suggest taking a look at Logistic Regression to build better models that also include these kinds of factors.

On a practical note, if you ever plan to carry out a similar experiment, remember two rules. First, make it easier for you to revert vandalism than for the troll to vandalize your site. Second, don’t feed the troll. They will eventually get tired.

Annex I. Java Code

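// Imports needed by this fragment (LingPipe 4.x packages).
// The statements below belong inside a method that declares
// "throws IOException, ClassNotFoundException" (required by AbstractExternalizable.compile).
import com.aliasi.classify.Classification;
import com.aliasi.classify.Classified;
import com.aliasi.classify.ConfusionMatrix;
import com.aliasi.classify.DynamicLMClassifier;
import com.aliasi.classify.JointClassifier;
import com.aliasi.classify.JointClassifierEvaluator;
import com.aliasi.lm.NGramProcessLM;
import com.aliasi.util.AbstractExternalizable;

String[] CATEGORIES = { "spam", "love" };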
int NGRAM_SIZE = 6;

String textSpamTraining =
        "Ola ke Ase\n" +
        "censurator\n" +
        "...";

String textLoveTraining =
        "Me ahogo en un suspiro,\n" +
        "miro tus ojos de cristal\n" +
        "...";

String[] textSpamCrossValidation = {
        "os va a censurar",
        "esto es una mierda",
        "..."
};

String[] textLoveCrossValidation = {
        "el experimento ha revelado",
        "que el gran poeta no era orador",
        "y al no resultar como esperado",
        "se ha tornado en vil censor",
        "..."
};

// FIRST STEP - learn
DynamicLMClassifier<NGramProcessLM> classifier =
        DynamicLMClassifier.createNGramProcess(CATEGORIES, NGRAM_SIZE);

{
    Classification classification = new Classification("spam");
    Classified<CharSequence> classified =
        new Classified<CharSequence>(textSpamTraining, classification);
    classifier.handle(classified);
}

{
    Classification classification = new Classification("love");
    Classified<CharSequence> classified =
        new Classified<CharSequence>(textLoveTraining, classification);
    classifier.handle(classified);
}

// SECOND STEP - compile
JointClassifier<CharSequence> compiledClassifier =
                (JointClassifier<CharSequence>) 
                        AbstractExternalizable.compile(classifier);

JointClassifierEvaluator<CharSequence> evaluator =
        new JointClassifierEvaluator<CharSequence>(
                compiledClassifier, CATEGORIES, true);

// THIRD STEP - cross-validate
for (String textSpamItem: textSpamCrossValidation) {
    Classification classification = new Classification("spam");
    Classified<CharSequence> classified =
        new Classified<CharSequence>(textSpamItem, classification);
    evaluator.handle(classified);
}

for (String textLoveItem: textLoveCrossValidation) {
    Classification classification = new Classification("love");
    Classified<CharSequence> classified =
        new Classified<CharSequence>(textLoveItem, classification);
    evaluator.handle(classified);
}

ConfusionMatrix matrix = evaluator.confusionMatrix();
System.out.println("Total Accuracy: " + matrix.totalAccuracy());

System.out.println(evaluator);
 

Lazy loading of modules in nodejs

by Guido García on 27/03/2014

This is a pattern I found in pkgcloud to lazy-load nodejs modules. That is, to defer their loading until a module is actually needed.

var providers = [ 'amazon', 'azure', ..., 'joyent' ];
...

//
// Setup all providers as lazy-loaded getters
//
providers.forEach(function (provider) {
  pkgcloud.providers.__defineGetter__(provider, function () {
    return require('./pkgcloud/' + provider);
  });
});

It basically defines a getter, so modules won’t be loaded until you do:

var provider = pkgcloud.providers.amazon;
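
Note that `__defineGetter__` is a legacy, deprecated method. As a rough sketch of the same idea with the standard `Object.defineProperty`, the snippet below lazy-loads a few built-in node modules. The names lazyModules and loadLog are made up for this example, and the built-ins merely stand in for real provider adapters. It is not how pkgcloud does it.

```javascript
// Sketch: lazy-loaded getters with Object.defineProperty.
// Assumptions: lazyModules and loadLog are hypothetical names; the built-in
// modules 'os', 'path' and 'zlib' stand in for provider adapters.
const lazyModules = {};
const loadLog = []; // records which modules have actually been required

['os', 'path', 'zlib'].forEach(function (name) {
  Object.defineProperty(lazyModules, name, {
    enumerable: true,
    configurable: true, // needed so the getter can be replaced below
    get: function () {
      loadLog.push(name);
      const loaded = require(name);
      // Swap the getter for a plain value so later reads skip this function.
      Object.defineProperty(lazyModules, name, { value: loaded, enumerable: true });
      return loaded;
    }
  });
});

console.log(loadLog.length);      // 0, nothing has been loaded yet
console.log(typeof lazyModules.path.join); // "function", 'path' is loaded on first access
console.log(loadLog.join(','));   // "path"
```

Since require already caches modules, replacing the getter only saves a function call, but it makes the "load once" intent explicit.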

It might be useful in applications where you have different adapters (“providers” in the example above) offering different implementations of the same API, and you want to let the user choose which one to use at runtime. This is a common requirement in cloud environments but it could be applicable to other scenarios as well (e.g. choose a payment gateway).

This is the first time I have seen it, so please share your thoughts on it and any alternative approaches.

 

Cloud is not cheap

by Guido García on 18/03/2014

There is a myth about cloud computing. Many people think they will save money moving their services to the cloud, but the reality is that the cloud is not cheap.

Virtualization, one of the core parts of cloud computing, tries to meet the promise of elastic capacity and pay-as-you-go policies. Despite this promise, the true story is that today we are running virtual machines that don’t do much because, most of the time, our applications are not doing anything. Their processors are underutilized. While this is an opportunity for cloud providers to oversubscribe their data centers, it also means we are overpaying for it. There is still much untapped potential for applications running on the cloud.

Services in the 21st century

In the last few years we have seen many improvements in the way applications are packaged and deployed to the cloud and in how these processes are automated, and we have learnt that we have to build applications for failure (see “There will be no reliable cloud”).

But what I have not seen yet is anything about services communicating with each other to share their health status. I think services in the cloud should be able to expose their status in real time. This way they could talk to others and say “hey, I’m struggling to handle this load, who can help me out with 2 extra GB of RAM for less than 10 cents/hour?”.

How do you think cloud will change apps in the next 5-10 years?



Ryan Dahl – How do you see the future of PaaS (see 4:38)

 

The long tail in this blog

by Guido García on 23/01/2014

This blog is two years old, and I’d like to share how its >50K visits are distributed.

Long Tail

One single post drives 40% of the traffic to the blog. At the bottom, 70% of its posts represent 4% of the traffic.

In my opinion, the most popular ones are not the best ones. They are about very specific technical subjects, containing keywords in the title and in the URL slug. Google does the rest.

 

Performance is premature optimization

by Guido García on 18/01/2014

I will burn in hell, but performance is premature optimization nowadays. Although it is very interesting from an engineering perspective, from the practical point of view of someone who wants to follow the make-shit-happen startup mantra, my advice is not to worry much about it when it comes to choosing a programming language.

There are things that matter more than the technology stack you choose. In this post I will try to explain why; then you can vent your rage in the comments section.

Get Shit Done

Your project is not Twitter

It is not Facebook either, and it probably won’t be. I am sorry.

Chances of your next project being popular are slim. Even if you are that lucky, your app will not be popular from day one. Even if you are popular enough, hardware is so cheap at that point that it could be considered free for all practical purposes (around one dollar per day for a 1CPU/1GB machine; go compare that with our wages).

Your project will fail

Face it. You are not alone; most projects fail and there is nothing wrong with it. They fail before performance becomes an issue. I do not know a single project that has failed solely due to a bad choice of programming language.

So I think that, as a rule of thumb, it is a good idea to choose the technology that allows you to try and develop small components faster (nodejs, is that you?). You will have time to throw some of those components away and rebuild their ultra-efficient alternatives from scratch in the unlikely case of needing it.

Conclusion

You are not going to need performance; stop worrying and get shit done instead. I always have a Moët Et Chandon Dom Pérignon 1955 in the fridge to celebrate the day I face performance issues due to choosing X over Y.

 

Function parameters in Python, Java and Javascript

by Guido García on 18/01/2014

This is a short post about how these programming languages compare with each other when it comes to declaring functions with optional parameters and default values. Feel free to leave alternatives in other languages in the comments.

Python. The good.

Python is my favorite. Use your parameters in any order and define their default values as part of the function signature itself.

def foo(arg1, arg2="default"):
    print("arg1:", arg1, "arg2:", arg2)

The price to pay is that you cannot define two methods with the same name in the same class. In the example below, the second definition silently replaces the first one.

def sum(a, b):
    return a + b

def sum(a, b, c):
    return a + b + c

I am not a Python expert, but it does not seem such a big deal.

Java. The ugly.

Java is more verbose, but you have strong types and simple refactoring in exchange.

public void foo(String arg1) {
    foo(arg1, "default");
}

public void foo(String arg1, String arg2) {
    System.out.printf("arg1: %s arg2: %s", arg1, arg2);
}

Javascript. The bad.

Javascript is a little uglier.

function foo(arg1, arg2) {
    arg2 = arg2 || 'default';
    console.log('arg1 %s arg2 %s', arg1, arg2);
}
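
One caveat with the `||` idiom is that it also replaces legitimate falsy arguments such as an empty string or 0. The sketch below compares it with the default parameter syntax from ES2015, which only kicks in when the argument is undefined. The function names fooWithOr and fooWithDefault are made up for this comparison, and the second one assumes a runtime that supports ES2015.

```javascript
// Sketch: the "||" idiom versus ES2015 default parameters.
// Assumptions: fooWithOr and fooWithDefault are hypothetical names; both
// return the message instead of logging it so the results are easy to compare.
function fooWithOr(arg1, arg2) {
  arg2 = arg2 || 'default'; // any falsy value ('' or 0) is replaced
  return 'arg1 ' + arg1 + ' arg2 ' + arg2;
}

function fooWithDefault(arg1, arg2 = 'default') {
  // The default applies only when arg2 is undefined.
  return 'arg1 ' + arg1 + ' arg2 ' + arg2;
}

console.log(fooWithOr('a'));          // "arg1 a arg2 default"
console.log(fooWithDefault('a'));     // "arg1 a arg2 default"
console.log(fooWithOr('a', 0));       // "arg1 a arg2 default" (the 0 is lost)
console.log(fooWithDefault('a', 0));  // "arg1 a arg2 0"
```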

This is real code we use in Instant Servers, to have an optional first parameter:

CloudAPI.prototype.getAccount = function (account, callback, noCache) {
    if (typeof (account) === 'function') {
        callback = account;
        account = this.account;
    }
    if (!callback || typeof (callback) !== 'function')
        throw new TypeError('callback (function) required');
    ...
}

It is pure crap.

 

Give your configuration some REST

by Guido García on 2/01/2014

I have built a simple configuration server to expose your app’s configuration as a REST service. Its name is rest-confidence (github). In this post I will try to explain its basics and three use cases where it could be useful:

  1. To configure distributed services.
  2. As a foundation for A/B testing.
  3. As a simple service directory.

Install and run a basic rest-confidence configuration server

The first step is installing the configuration server:

git clone https://github.com/palmerabollo/rest-confidence.git
cd rest-confidence
npm install

After that, you are ready to edit your config.json configuration file. For example:

{
  "mongodb": {
    "host": "localhost",
    "user": "root"
  },
  "redis": {
    "host": "redis-server",
    "port": 6379
  },
  "logging": {
    "appender": {
      "type": "file",
      "filename": "log_file.log",
      "maxSize": 10240
    }
  }
}

Launch the configuration server (npm start) and you are done. You are now ready to start retrieving the values associated with any key, in a hierarchical way:

# curl http://localhost:8000/logging/appender
{"type":"file","filename":"log_file.log","maxSize":10240}

or

# curl http://localhost:8000/logging/appender/maxSize
10240
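
To make the hierarchical lookup more concrete, here is a minimal sketch of how a URL path could be resolved against a configuration tree. This is not rest-confidence's actual implementation, and the lookup function is a hypothetical helper. The config object is a subset of the example file above.

```javascript
// Sketch: resolve a slash-separated path against a nested configuration object.
// Assumption: lookup is a hypothetical helper, not part of rest-confidence.
const config = {
  redis: { host: 'redis-server', port: 6379 },
  logging: {
    appender: { type: 'file', filename: 'log_file.log', maxSize: 10240 }
  }
};

function lookup(tree, path) {
  // '/logging/appender' -> ['logging', 'appender']
  const keys = path.split('/').filter(function (key) { return key.length > 0; });
  let node = tree;
  for (const key of keys) {
    const isObject = node !== null && typeof node === 'object';
    if (!isObject || !Object.prototype.hasOwnProperty.call(node, key)) {
      return undefined; // unknown key: nothing to return
    }
    node = node[key];
  }
  return node;
}

console.log(JSON.stringify(lookup(config, '/logging/appender')));
// prints {"type":"file","filename":"log_file.log","maxSize":10240}
console.log(lookup(config, '/logging/appender/maxSize'));
// prints 10240
```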

Use case #1: Configure distributed services

In my last post I wrote about why I like nodejs, a great platform for building micro-service-based architectures. However, these kinds of architectures also come with their own drawbacks. One of them is that they are more difficult to deploy and configure.



Micro Service Architecture. Image courtesy of James Hughes


With a centralized configuration server such as rest-confidence everything becomes easier. Instead of configuring hundreds of settings on each component, you only need to configure the URL of your configuration server. Your service will go there to look up any configuration property it needs.

Use case #2: A/B testing

A/B testing is a simple way to test different changes to your application and determine which ones produce positive results.

As a simplistic example, imagine you want to test an alternative color for your blue sign-up button, and check how it affects the conversion rate. You can define a $filter with a $range limit in your configuration:

{
  "color": {
    "$filter": "random",
    "$range": [
      { "limit": 10, "value": "red" }
    ],
    "$default": "blue"
  }
}

So when you retrieve the “color” property value using a random filtering criterion, you’ll get different colors depending on the ranges.

# curl http://localhost:8000/?random=5
{"color":"red"}

And with a different filtering value out of the range you will get the default value.

# curl http://localhost:8000/?random=15
{"color":"blue"}
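
The range selection can be sketched in a few lines. The snippet below assumes that a value matches the first range whose limit it does not exceed, and that anything else falls back to $default. That boundary rule is an assumption, and resolveRange is a hypothetical helper rather than the server's real code.

```javascript
// Sketch: pick a value from a "$range" node given some filtering criteria.
// Assumptions: resolveRange is hypothetical; a value matches the first range
// whose limit is greater than or equal to it (the real rule may differ).
function resolveRange(node, criteria) {
  const value = criteria[node.$filter];
  if (typeof value === 'number') {
    for (const range of node.$range) {
      if (value <= range.limit) {
        return range.value;
      }
    }
  }
  return node.$default; // no criterion given, or value above every limit
}

const colorNode = {
  $filter: 'random',
  $range: [{ limit: 10, value: 'red' }],
  $default: 'blue'
};

console.log(resolveRange(colorNode, { random: 5 }));  // prints "red"
console.log(resolveRange(colorNode, { random: 15 })); // prints "blue"
```

If the random value is drawn uniformly between 1 and 100, roughly 10% of the requests would see the red button under this rule.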

Use case #3: Simple service directory

You can use rest-confidence as a simple service directory, that is, a centralized server that facilitates dynamic location of other services’ endpoints, based on different criteria.

{
  "myservice": {
    "$filter": "env",
    "production": {
      "url": {
        "$filter": "country",
        "ES": "http://myservice-production.es",
        "UK": "http://myservice-production.co.uk",
        "$default": "http://myservice-production.co.uk"
      }
    },
    "development": {
      "url": "http://myservice-production.com" 
    }
  }
}

With some criteria applied (for example, env=production and country=ES) you will get the proper service endpoint, or any other information you need:

# curl "http://localhost:8000/myservice?country=ES&env=production"
{"url":"http://myservice-production.es"}
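
As a rough illustration of how nested filters could be resolved, the sketch below walks the "myservice" node from the example and applies the criteria at each $filter it finds. The resolve function is a hypothetical helper written for this post, not the code that rest-confidence runs.

```javascript
// Sketch: resolve nested "$filter" nodes against a set of criteria.
// Assumption: resolve is a hypothetical helper, not rest-confidence's code.
function resolve(node, criteria) {
  if (node === null || typeof node !== 'object') {
    return node; // plain value (or undefined when nothing matched)
  }
  if (typeof node.$filter === 'string') {
    const choice = criteria[node.$filter];
    if (choice !== undefined && Object.prototype.hasOwnProperty.call(node, choice)) {
      return resolve(node[choice], criteria);
    }
    return resolve(node.$default, criteria);
  }
  // Regular object: resolve every property.
  const result = {};
  Object.keys(node).forEach(function (key) {
    result[key] = resolve(node[key], criteria);
  });
  return result;
}

const myservice = {
  $filter: 'env',
  production: {
    url: {
      $filter: 'country',
      ES: 'http://myservice-production.es',
      UK: 'http://myservice-production.co.uk',
      $default: 'http://myservice-production.co.uk'
    }
  },
  development: { url: 'http://myservice-production.com' }
};

console.log(JSON.stringify(resolve(myservice, { env: 'production', country: 'ES' })));
// prints {"url":"http://myservice-production.es"}
```

A country without its own entry (say, FR) falls back to the $default URL in this sketch.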

I hope you find it useful. There is also a nodejs client. Contributions are welcome.

 

Why is node.js so cool? (from a Java guy)

by Guido García on 9/12/2013

I confess: I am a Java guy

At least I used to be, until I met node.js. I still think the JVM is one of the greatest pieces of technology ever created by man, and I love the Spring Framework, the hundreds of Apache Java libraries and the over-six-hundred-page books about JEE patterns. It is great for big applications that are created by many developers, or applications that are made to last.

Java

But many applications today are not made to last. Sometimes you just want to test something fast. Fail fast, fail cheap, keep it simple… the “be lean” mantra, you know.

Moreover, open source has completely changed the way we build applications, moving from developing tons of code in monolithic applications to assembling small programs that use third-party components as middleware (nosql databases, queues, caches).

Second confession: I hate(d) Javascript

Yes, Internet Explorer 4 made me hate Javascript. So the first time I heard about node.js and server-side Javascript I felt a shiver down my spine. It got worse when I started to play with the unfamiliar continuation-passing style; the asynchronous callback hell did not take long to appear.

Node is Asynchronous

A simple pattern: function(err, result) {}

But the absence of rules does not necessarily mean chaos. In fact, there is one pattern in node.js: your callbacks will have two arguments; the first argument will be an error object, the second one will be the result. This is your contract with the platform and, more importantly, with the community. Stick with it and you will be fine.
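
Here is a small sketch of that contract. The parsePort and report functions are made up for this example. To keep it short the callback is invoked synchronously, whereas real I/O functions in node.js would call it later, once the operation finishes.

```javascript
// Sketch: the error-first callback convention, function(err, result) {}.
// Assumptions: parsePort and report are hypothetical; the callback is called
// synchronously here only to keep the example short.
function parsePort(text, callback) {
  const port = Number(text);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    // Failure: the error goes first and there is no result.
    return callback(new Error('invalid port: ' + text));
  }
  // Success: the error slot is null and the result comes second.
  callback(null, port);
}

const outcomes = [];

function report(err, port) {
  if (err) {
    outcomes.push('error: ' + err.message);
    return;
  }
  outcomes.push('port: ' + port);
}

parsePort('6379', report);
parsePort('redis', report);

console.log(outcomes.join('\n'));
// prints:
// port: 6379
// error: invalid port: redis
```

Because every module follows the same argument order, a caller can check the error the same way whoever wrote the function.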

Using such a popular programming language plus this simple convention is what makes it so easy to start working with node.js. It makes building small modules that work together with other developers’ modules surprisingly easy. This is why we have more than 50K modules in the npm registry. Most of them are probably worthless, but natural selection also applies here, and this evolutionary process is much faster than the Java Community Process (JCP).

With node.js I feel like a productive anarchist. I get shit done.

You should also read “Broken Promises”, “Why is node.js becoming so popular” (quora), and watch Mikeal Rogers’ talk on why node is so successful (24 min).

 