Thursday, October 20, 2016

Stop Writing Dead Papers

The idea struck me while listening to Bret Victor's talk "Stop Drawing Dead Fish" (hence the title of this post).

We have been writing and publishing papers for more than 300 years (according to Wikipedia, one of the earliest journals started in the 17th century!) and yet, somehow, we still use the same format and write our papers as if hard copies were the main, if not the only, medium for distributing them. This is really disturbing, as we now live in the 21st century and have a far more powerful medium available to us: interactive digital interfaces.

Most academic papers written nowadays are dead: they are static, with no interactive content. I'm talking particularly about papers that report empirical results and show graphs and tables filled with numbers and statistics, supported by long discussions to help the reader understand and visualise (in her head) what cannot be fully articulated by static content.

Take for instance the figure below, which appeared in the Cho et al. (2014) paper and is meant to visualise the space of representations of four-word phrases learned by a recurrent neural network. The authors clearly put a lot of effort into visualising the space and presenting their results in a convincing and expressive way. But because they lacked an interactive medium, they had to present the full graph (clearly hard to understand) and some close-ups (not fully representative of the space).

2–D embedding of the phrase representation learned with RNN. Cho et al. (2014)
Some zoom-ins from the figure above. Cho et al. (2014)
This is not only inadequate; presenting a number of figures to support an argument also takes up much of the limited space available in academic papers, space that could be put to better use. Moreover, despite all these static illustrations, one still wishes she could hover over some points to see what they represent, or zoom in to get a better understanding.

While this format was perfectly acceptable in the 17th century, it is badly outdated in the 21st and no longer enough!

To compensate for the limitations of a medium that poorly accommodates our goals, a number of researchers have started a tradition of writing blog posts that serve as richer versions of their papers, usually supported by interactive visualisations and analytics that are easier to access and understand. Take for instance this great blog with many interactive examples of some of the results in the Dai et al. (2014) paper, which tries to cluster Wikipedia articles. You can still see the same figures presented in the paper (like the one below), while also being able to interact with them and play with the parameters.
 Visualisation of Wikipedia paragraph vectors using t-SNE. Dai, et al. (2014)
Now I understand that this does not apply equally to all fields (I don't expect researchers working in the field of literature to move directly to, or accept, such a new medium). But I believe that researchers with a computer science background are capable of making, and would arguably welcome, such a move. I believe such functionality could be integrated into new editing tools or traditional ones (such as web-based LaTeX editors). Ultimately, if papers could be submitted in a scripting language format, say in PHP (or an editor built on top of it), in which many of the interactive tools already available can be easily integrated, one would have the opportunity to take the creativity and accessibility of academic papers to a whole new level.

As someone who reads, writes and reviews papers, I'm really looking forward to the day when academic papers become more interactive, and I strongly believe that this will lead to research that is highly accessible, easier to understand and evaluate, and more fun to work with.

Tuesday, October 4, 2016

Experience-Driven Content Generation: A Brief Overview

Exploring and implementing methods for measuring experience, understanding and quantifying emotions, and personalising users' experience has been the focus of my research for quite some time. In this post, I will try to summarise some of my knowledge in this area.

The Big Picture

My theory is this: if I can accurately predict a user's affective state at any point during her interaction with a digital system, I can ultimately implement affect-aware methods that automatically personalise the content, leading to an improved and deeper user experience. I'm particularly interested in studying these aspects within the computer games domain, as I believe games are a rich medium for expressing emotions, an interesting platform for collecting, recognising and modelling experience, and an easy-to-control environment for personalising content.

These ideas have gathered interest from numerous researchers working in interdisciplinary areas and trying to solve parts of the puzzle. There is, for instance, a whole field of research trying to measure and quantify emotions; a relatively new, but very fast-growing, field on automatic generation of content in games (there is a new book about this subject here); and a growing interest in linking ideas from these two areas so that we can build content generators with users' experience as a core component of the content creation process.


None of the above is easy, and implementing a complete working system in which all modules work together effectively is a big challenge. In my own work, I'm interested in realising the affective loop in games (see the figure below). I have a working implementation of what I believe is a simple, yet easily extendable, prototype of how the whole framework works in the game Super Mario Bros. (skip ahead to the end of the post if you are eager to play!). My work revolves around revising and improving the different parts of the system so that the framework becomes general enough to be applied to predicting users' affect and personalising experience in any game (or, more broadly, any digital interface).

The components of the experience-driven content generation framework.
Ultimately, we want the system to be active (accurately choosing what information is important to learn from), adaptive (continuously learning and improving), reactive (acting in real time), multimodal (utilising information about the user from different sources) and generic (working well in various applications). I have made progress along a number of these lines and I will be sharing it in individual posts that will follow. For now, I want to share insights about some of the main considerations in realising the framework.

Features for Measuring Player Experience

If you survey the literature you will find numerous methods for gathering information about users' experience or emotion. Here are the three dominant types:

Subjective measures: The most obvious measure is to ask players about their experience. This method is the easiest to implement, and hence the most widely used, but it comes with a number of limitations (including subjectivity, cognitive load and interruption of the experiment) as well as other concerns related to the nature of the experimental design protocol. So, to compensate for these drawbacks, complementary or alternative methods are usually employed to gather information from other modalities.

Objective measures: These are usually harder for the subject to control and therefore more reliable. Most of them are also universal, which makes them highly scalable. Your heart rate, for instance, can reveal information about your excitement, and your brain activity can tell whether you are surprised, under cognitive load, or relaxed. Your facial expressions can tell if you are happy or angry, and your head pose can tell whether you are engaged, attentive or bored. Information gathered by such measures is more reliable than subjective reports but obviously harder to collect, annotate and analyse. Moreover, some of the equipment used for collection is so intrusive that it can't be used in real-life interaction settings. Therefore, most of the widely used methods rely on accessible and unintrusive devices such as web cameras and Kinect sensors to analyse facial expressions, extract gaze information and capture body gestures to infer emotion.

Interaction data: The interaction between the users and the digital interface also holds patterns that can help us understand users' experience. Gameplay data, for instance, is a rich, easy-to-collect and reliable source of information for profiling players. Relying only on this modality, methods have been developed for predicting retention, analysing progression, discovering design flaws and clustering players for targeted segment marketing and content customisation.

When it comes to modelling players' experience in games, I believe a multimodal approach that combines and aligns information from multiple sources is the most effective. Gameplay data is the main source of information about experience analysed by most studies in academia and the industry. I believe information coming from relatively cheap sources such as the camera (which is already available in most gaming platforms) will soon become another standard for analysing emotion and improving the prediction of experience, especially with the recent advances in accurate real-time prediction of emotion from videos of faces. I believe that in the not-too-distant future there will be no need to ask users about their experience, as other reliable modalities will provide accurate, less intrusive sources.

Facial reaction of players playing Super Mario Bros. when losing, winning and faced with a challenging encounter.

Methods for Feature Processing 

The above types of features come in different forms: some are discrete numbers while others are sequences with temporal or spatial relationships. This means that different methods should be employed to handle each type, and special care should be taken when combining sources of different nature. For instance, gameplay data can be collected as discrete statistics about the different actions taken or as continuous sequences of the actions taken at each timestamp. Different methods can be applied in each case, leading to different insights.

Discrete features are the most common and can be directly processed by most machine learning methods (they should of course be cleaned and normalised in most cases). 

Features of continuous nature, such as objective measures of experience can be processed with methods that are sensitive to time-series data such as recurrent neural networks and regression analysis. Sequences can also be processed to extract meta data such as frequently occurring patterns. This can be done using frequent pattern mining methods. 
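As a rough illustration of extracting frequently occurring patterns from sequences, here is a minimal sketch that counts contiguous action n-grams across play sessions and keeps those that appear in enough sessions. This is a deliberately simplified stand-in for a proper frequent pattern mining algorithm; the function name, the action names and the session data are all made up for the example.

```python
from collections import Counter


def frequent_ngrams(sequences, n, min_support):
    """Return contiguous action n-grams found in at least min_support sequences."""
    support = Counter()
    for seq in sequences:
        # Count each n-gram once per sequence, so support = number of sessions.
        seen = {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}
        support.update(seen)
    return {gram: count for gram, count in support.items() if count >= min_support}


# Hypothetical gameplay sessions, one list of actions per session.
sessions = [
    ["run", "jump", "run", "jump", "stomp"],
    ["run", "jump", "stomp", "run"],
    ["duck", "run", "jump", "run"],
]

patterns = frequent_ngrams(sessions, n=2, min_support=3)
print(patterns)
```

Only the pair ("run", "jump") occurs in all three sessions here, so it is the only pattern that survives the support threshold.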

Combining features from multiple modalities can be tricky especially if they are of different nature. The signals should be aligned and either transformed into the same space or handled on multiple resolutions. Take for instance a system receiving a continuous signal from a facial-emotion recogniser and discrete statistics about the keyboard buttons pressed. To combine such information, one option would be to process the continuous signal and transform it to discrete values of emotions calculated within specific intervals. Another option would be to handle each signal by an appropriate method and combine the results in a later stage. A third option would be to sync the features so that we can extract the facial reactions around each keyboard press action.  
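To make the third option (syncing the features) a little more concrete, here is a minimal sketch that averages a continuous signal in a short window around each discrete event. It assumes the signal is a list of timestamped values sorted by time; the function name, the window size and the sample numbers are invented for illustration and do not come from any particular system.

```python
from bisect import bisect_left, bisect_right


def window_means(signal_times, signal_values, event_times, half_width):
    """For each event, average the signal samples within +/- half_width seconds.

    signal_times must be sorted in ascending order. Events with no samples
    in their window get None.
    """
    means = []
    for t in event_times:
        lo = bisect_left(signal_times, t - half_width)
        hi = bisect_right(signal_times, t + half_width)
        window = signal_values[lo:hi]
        means.append(sum(window) / len(window) if window else None)
    return means


# Hypothetical "joy" score from a facial-emotion recogniser, sampled every 0.5 s.
times = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
joy = [0.1, 0.2, 0.8, 0.9, 0.3, 0.2, 0.1]

# Hypothetical timestamps of keyboard presses (the last one has no nearby samples).
key_presses = [1.25, 2.75, 10.0]

aligned = window_means(times, joy, key_presses, half_width=0.5)
print(aligned)
```

Each key press now carries a summary of the facial reaction around it, so both modalities can be fed to the same model as discrete features.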

Methods for Modelling User Experience

You can imagine user experience models as magic black boxes: you feed them information about your users and the interface they interact with and, in return, you get useful categorisations or profiles that you can use for decision making. The input can be one or a combination of the features presented above. The output can be an estimate of how well the content fits a particular user, which profile best matches the user (is she a buyer on Amazon, a fighter in a first-person shooter, a puzzle-solver in an online course?) or a recommendation for the adjustment of the interface that will most increase the user's engagement.

Any machine learning method that can accurately estimate the mapping between your input and output can be used. The most widely used methods for profiling, for instance, are supervised or unsupervised methods for clustering and classification such as support vector machines, self-organizing maps or regression models. Non-linear regression models are more powerful when attempting to predict affective states based on behavioural information. One can use neural networks, multivariate regression spline models or Random Forest to reduce the size of the search space while optimising the mapping functions. When trying to come up with recommendations or personalised content, efficient search and filtering methods can be applied, such as collaborative filtering, genetic algorithms or reinforcement learning. There are many interesting applications for each approach, and the choice of the appropriate method depends on the type of data you have, the type of problem you are trying to solve and the kind of insights you are interested in.

Adaptive User Modelling

The experience models I talked about so far are average models, meaning: they apply equally to all players and they are not tied to a certain individual. They serve a very good purpose if we want to ship them with the system and if we are looking for methods that work well with the majority of users. But we can do better.

The accuracy of the models is very much confined by the data used for training; your data needs to be diverse enough to include representatives of the majority of your users. Even when your data distribution is wide enough, it is very likely that the method will not recognise every individual. People are different, and each of us has her unique preferences and ways of interacting. To accommodate different personalities, one could implement adaptive systems that keep learning, improving and personalising as the user interacts with the system. The models learned offline form a good starting point for an initial rich experience and for learning more powerful personalised versions. Model improvement can be achieved through a branch of methods called active learning. Active learners attempt to improve their performance by sampling the instances from the space that lead to the fastest improvement. This means that they try to learn as much as possible about the user in the quickest way possible so that they become more accurate predictors of experience. In doing so, they also become more personalised for a specific user.
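One common way to pick the instances to learn from is uncertainty sampling, where the learner asks about the example it is least sure of. The sketch below illustrates that specific strategy only (it is not necessarily the one used in the framework described here); the toy model, the "number of gaps" feature and the candidate levels are hypothetical.

```python
def most_uncertain(candidates, predict_proba):
    """Uncertainty sampling for a binary model.

    Returns the candidate whose predicted probability is closest to 0.5,
    i.e. the one the current model is least sure about.
    """
    return min(candidates, key=lambda c: abs(predict_proba(c) - 0.5))


# Hypothetical model: probability that a player finds a level engaging,
# as a function of the number of gaps in the level.
def toy_model(num_gaps):
    return min(1.0, num_gaps / 10)


# Candidate levels, described only by their number of gaps.
levels = [1, 3, 4, 8, 10]

next_query = most_uncertain(levels, toy_model)
print(next_query)
```

The level chosen this way would be presented to the player next, and her reaction would be used to update the model before the following query.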

The Future

We live in a time when more and more data about users is becoming available and when people from academia and the industry are eager to understand users and make better decisions. We are also equipped with powerful methods that help us realise such goals. There are indeed a number of interesting research directions that can improve our understanding of users, emotions, behaviour and how emotion is manifested through behaviour. Moreover, data-driven content personalisation is also a hot and interesting topic where a lot of improvement could be made. I'm confident, however, that a lot can already be achieved with what we currently have in terms of data and methods.

Now just for fun, I will leave you with an example where many of the ideas presented are implemented to improve the experience of the interaction with the system.   

Example: Content Personalisation in Super Mario Bros.

You will play an initial level of Super Mario Bros. and the system will collect information about how well you did. This information, along with your choice of the type of experience you would like to explore, is used by the system, through machine learning methods, to explore the space of possible content you would prefer. The best piece of content is then chosen and presented to you.

Let the fun begin! Try it here (PS: you will need a browser with support for Java; otherwise, you can download the demo jar file from here (option 4)).


If you would like to read more about the subject, there is a nice paper by Georgios Yannakakis and Julian Togelius that you can find here. The demo above is described in the paper here.

Wednesday, September 7, 2016

Content-based Game Classification with Deep Convolutional Neural Network

One thousand video clips of one hundred games. The games are clustered according to their t-SNE components applied on the output of a CNN trained to classify RTS games. You can interact with the interface here.
The figure above generated in fast motion.
The figure above is from an interactive demo for this article that you can find here. I recommend that you have a look at it before you continue reading.


A while ago, I started working on convolutional neural networks within the computer game domain. I was particularly interested in extending their success to video games and investigating whether they can be used to learn features about games, similar to what they do with images and videos in many other areas. In this post, I will explain what I have done so far and show some recent results.


There has been a lot of work recently on video classification, tagging and labelling. My interest lies in bringing these ideas to games. My hypothesis is that video game trailers and gameplay videos provide rich information about the games in terms of visual appearance and game mechanics that would allow CNNs to detect similarities along a number of dimensions by "watching" short video clips.

Gameplay 2M dataset

As you already know, CNNs are data hungry, so I started by collecting the data I needed. I was looking for videos of gameplay classified according to a number of categories. The easiest way I found to collect the data was to prepare a list of game titles, download YouTube videos of gameplay from different channels and associate each game with a set of categories that I eventually got from Steam.

So, I started the process and began running experiments once I had data for 200 games ready. For each game, I downloaded 10 gameplay videos. Since those vary in length, I cropped a 5-minute segment from each of them. Then, for each segment, I randomly sampled 10 half-second clips. Finally, from each of these short clips I extracted 100 frames. If you do the calculation, you will see that I ended up with 100*10*10 = 10000 gameplay images per game, so the dataset I will be using for this post contains 2M gameplay images.
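The dataset size follows directly from the numbers in the paragraph above; here is the same arithmetic spelled out (the variable names are mine):

```python
games = 200
videos_per_game = 10    # one 5-minute segment is cropped from each video
clips_per_video = 10    # half-second clips sampled from each segment
frames_per_clip = 100

images_per_game = videos_per_game * clips_per_video * frames_per_clip
total_images = games * images_per_game

print(images_per_game)  # images for a single game
print(total_images)     # images in the whole dataset
```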

As for the game classes, I queried Steam for the categories assigned to each game by the users. I ended up with a 24-D vector of categories indicating whether the game is action, single-player, real-time strategy, platformer, indie, first-person shooter, etc. Each game is assigned to one or more of these categories. To create one category vector for each game, I averaged the assignments per category, applied a simple step function with a threshold of 0.5 and assigned the final vector to each game (more specifically, to each image).
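Here is a minimal sketch of the averaging and thresholding step, assuming each game comes with several binary category assignments that need to be merged into one vector. The input format, the choice of ">=" at the threshold and the example votes are assumptions made for illustration; only the averaging and the 0.5 threshold come from the description above.

```python
def category_vector(votes, threshold=0.5):
    """Average binary assignments per category, then apply a step function."""
    n = len(votes)
    means = [sum(column) / n for column in zip(*votes)]
    return [1 if m >= threshold else 0 for m in means]


# Hypothetical assignments over three categories: [RTS, Action, Single-player]
votes = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
]

label = category_vector(votes)
print(label)
```

In this toy case RTS and Action average 0.75 and pass the threshold, while Single-player averages 0.25 and does not.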

Here are some short clips from some of the games I used for training and the categories they belong to according to Steam users:

Full Spectrum Warrior: RTS = 1, Action = 1, Single-player = 0

Empire: Total War: RTS = 1, Action = 1, Single-player = 1

Team Fortress: RTS = 0, Action = 0, Single-player = 1


Most recent work in deep learning relies on established state-of-the-art models and fine-tunes them on a new dataset. I follow this line of work, as training from scratch is very time- and resource-consuming. Some state-of-the-art CNNs are very good at extracting visual feature representations from raw pixel data. In my work, I use the convolutional layers of the VGG-16 model to extract generic descriptors from the gameplay images.

I train on static images of gameplay extracted from the videos (I believe adding temporal information will improve the results, but I wanted to start simple and build from there). I built classifiers for only three categories (RTS games, action games and single-player games), as those provided the most balanced data in terms of positive and negative examples, but I will be running more experiments once I have more data.

To build the classifiers, I first pass all images through the convolutional layers of the popular VGG-16 model to extract the visual feature descriptors that I later use to train NN classifiers. Each classifier consists of the convolutional layers from VGG-16 followed by two dense layers of 512 nodes each. Finally, I use a sigmoid function that outputs the probability of an image belonging to a class.
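To show the shape of the classifier head, here is a small forward-pass sketch in plain Python: a feature vector goes through two dense layers of 512 nodes and a single sigmoid output. It is only an illustration under several assumptions: the weights are random rather than trained, the ReLU activation in the hidden layers is my choice (the text does not name one), and the 64-dimensional input is a stand-in for the much larger VGG-16 convolutional features.

```python
import math
import random


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_layer(n_in, n_out, rng):
    """Random weights (one row per output node) and zero biases."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(inputs, weights, biases, activation):
    """Fully connected layer: activation(W x + b)."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


rng = random.Random(0)
feature_size = 64   # assumption: placeholder for the VGG-16 feature size
hidden_size = 512   # two dense layers of 512 nodes, as in the text

w1, b1 = init_layer(feature_size, hidden_size, rng)
w2, b2 = init_layer(hidden_size, hidden_size, rng)
w3, b3 = init_layer(hidden_size, 1, rng)

features = [rng.random() for _ in range(feature_size)]  # fake image descriptor
h1 = dense(features, w1, b1, relu)
h2 = dense(h1, w2, b2, relu)
probability = dense(h2, w3, b3, sigmoid)[0]
print(probability)
```

In practice one would build this with a deep learning library and train the dense layers on the extracted descriptors; the sketch only shows how the pieces connect.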

I trained three binary classifiers to learn each category independently (I could as well have used other multilabel learning methods but this is what I use for now). I split the data into three sets for training (70%), validating (20%) and testing (10%).
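A 70/20/10 split like the one above can be sketched as follows. The text does not say whether the split was made per image, per clip or per game, so this example simply shuffles a flat list of items; the function name and the fixed seed are my own choices.

```python
import random


def split_dataset(items, train=0.7, val=0.2, seed=0):
    """Shuffle items and split them into train, validation and test lists.

    The test set receives whatever remains after the train and validation parts.
    """
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = round(len(items) * train)
    n_val = round(len(items) * val)
    return (
        items[:n_train],
        items[n_train:n_train + n_val],
        items[n_train + n_val:],
    )


train_set, val_set, test_set = split_dataset(range(1000))
print(len(train_set), len(val_set), len(test_set))
```

If frames from the same clip or game end up in different sets, the test accuracy may look better than it would on unseen games, so splitting at the clip or game level is worth considering.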

VGG-16 architecture with two dense layers of 512 nodes each.

Analysis, how good are the classifiers?

The three classifiers performed remarkably well in terms of classification accuracy. I got an accuracy of up to 85% when classifying action games at the image level, and the results for RTS and single-player games were slightly lower, reaching 76% and 72%. I also calculated the accuracies in other settings, where I averaged the performance per 0.5-sec clip, per 5-min clip and per game. In some cases, it seems that looking at multiple images does indeed increase the accuracy, while in others (when classifying action games) the model was just as accurate on individual images as it was on the whole game.
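One possible reading of "averaging per clip" is to average the per-image probabilities within each group (clip, segment or game), threshold the mean and then measure accuracy over groups. The sketch below follows that reading; the exact aggregation used for the reported numbers is not specified, and the probabilities here are invented.

```python
from collections import defaultdict


def grouped_accuracy(predictions, threshold=0.5):
    """Accuracy after averaging probabilities within each group.

    predictions: iterable of (group_id, probability, true_label) tuples,
    where all items in a group share the same true label.
    """
    probs = defaultdict(list)
    labels = {}
    for group_id, prob, label in predictions:
        probs[group_id].append(prob)
        labels[group_id] = label
    correct = sum(
        (sum(p) / len(p) >= threshold) == bool(labels[g])
        for g, p in probs.items()
    )
    return correct / len(probs)


# Hypothetical per-frame outputs: (clip id, predicted probability, true label).
frames = [
    ("clip_a", 0.9, 1), ("clip_a", 0.4, 1), ("clip_a", 0.8, 1),
    ("clip_b", 0.6, 0), ("clip_b", 0.2, 0), ("clip_b", 0.1, 0),
]

# Image level: every frame is its own group.
image_accuracy = grouped_accuracy((i, p, y) for i, (_, p, y) in enumerate(frames))
# Clip level: frames are grouped by clip id.
clip_accuracy = grouped_accuracy(frames)
print(image_accuracy, clip_accuracy)
```

In this toy example two individual frames are misclassified, but both clips come out right once their frames are averaged, which is the kind of effect described above.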

Following some inspiring work (here and here), I further looked at the distribution of the classes according to the first two t-SNE components (performed on the PCA results of the output of the first dense layer of the classifiers). I did this for a sample of the dataset (neither my machine nor t-SNE has enough power to process the whole dataset) and you can clearly see the classification boundary between positive and negative samples on the 5-min clips. 

t-SNE visualisation of the distribution of 15000 half-second clips classified by the RTS classifier.
t-SNE visualisation of the distribution of 15000 half-second clips classified by the single-player classifier.

I also looked at the distribution of games, which I thought was particularly interesting because the network has no explicit information during training about which game the images come from (it only knows whether an image is from a particular class or not). If my generic image descriptors are powerful enough, I would expect images/clips of the same game to cluster together. So I regenerated the same figures as above, but this time I colour-coded the points by game title so that images or clips belonging to the same game are given the same colour.
Same figure as above but points are coloured by game title (RTS classifier).

Same figure as above but points are coloured by game title (single-player classifier).

You can clearly see that some clusters of clips belonging to the same game are preserved quite well. This is a really interesting finding, as it seems that the models somehow learned an implicit representation of the games although they weren't trained to recognise them.

This last finding suggests that games with similar visual features according to a given category should also be projected close to each other. So this time, I visualised the distribution of 5-min clips from the RTS classifier while showing the titles of the games. Here is what the figure looks like, with some zoom-ins.

Some zoom-ins from the t-SNE distribution of the output of the RTS classifiers.

Analysis, how different is the data?

Of course, some videos are more representative of a game than others, and I therefore expect variations in accuracy at the image and video levels. To give you an idea of how the accuracy changes per image, here are some of the results from the action-games classifier for seven games. The performance clearly differs among games, but there are also clear fluctuations within the same game. For some games, such as Hexen II and Team Fortress (numbers one and five in the figure), you can confidently tell by looking at the graph that they have a strong action element.
Accuracy per image by the action game classifier for seven games. 
So, why do some images give high accuracies while others don't? What is it that the network is interested in? Since I'm using a pre-trained model for visual feature extraction, visualising the convolutional layers won't really help. I instead looked at the individual images with high and low accuracies for some games. Here is an example from the game Hexen II when the classifier is trained to see it as an action game.
Accuracy per image for the game Hexen II by the classifier of action games.
What I can tell for now (from these snapshots and many others I visualised) is that the amount of lighting matters quite a lot: the more light, the higher the predicted action. A similar analysis of RTS games showed that panels such as those below, even when only partially shown, are what contribute the most to recognising games as RTS.

For some videos, the models are more confused. This happens a lot when the category being classified is a minor feature of the game and not one of its main characteristics. This is in fact the main reason I prefer to use a sigmoid function as the output of the classifiers. I can then interpret the output in a probabilistic form and say that a low probability translates to showing a small amount of a specific feature. This allows me to better understand the games and means I can define a similarity function on these vectors to find out which games are similar to each other and in what aspects, but more on that in the future.
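As one example of what such a similarity function could look like, here is a sketch using cosine similarity between per-game vectors of classifier probabilities. Cosine similarity is just one reasonable choice, not necessarily the one that will be used, and the three game vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical classifier outputs per game: [P(RTS), P(Action), P(Single-player)]
game_a = [0.9, 0.7, 0.2]
game_b = [0.8, 0.6, 0.3]
game_c = [0.1, 0.9, 0.9]

sim_ab = cosine_similarity(game_a, game_b)
sim_ac = cosine_similarity(game_a, game_c)
print(sim_ab, sim_ac)
```

Games A and B have similar profiles and score higher than A and C; comparing the vectors one dimension at a time would show in which aspects two games differ.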

Finally, here are some snapshots from the demo you saw at the top of the page. Here, I tried to visualise the five-minute clips according to their t-SNE dimensions. Since I only care about their clusters, and not their exact position in the space, I calculated the distances between all of them and connected each node to its 10 nearest neighbours. To make the graph easier to understand, I also gave the nodes belonging to the same game the same colour. If you zoom in you can see the titles of the games and which games are connected to each other. The figures below are from the results of the RTS classifier.
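The nearest-neighbour graph described above can be sketched with a brute-force search over all pairs of points. This is an illustration only: the function name and the toy coordinates are made up, Euclidean distance is assumed, and k is set to 1 instead of 10 to keep the example small.

```python
import math


def knn_edges(points, k):
    """Connect each point to its k nearest neighbours (Euclidean distance).

    Returns a set of undirected edges as (i, j) index pairs with i < j.
    """
    edges = set()
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        for j in others[:k]:
            edges.add((min(i, j), max(i, j)))  # store each undirected edge once
    return edges


# Hypothetical 2-D t-SNE coordinates forming two small clusters.
coords = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]

edges = knn_edges(coords, k=1)
print(sorted(edges))
```

The brute-force search compares every pair of points, which is fine for a few thousand clips; larger collections would call for a spatial index.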

Now this certainly doesn't yet allow me to draw conclusions about which games are similar and how, but I believe that with more data and classification along more dimensions, we can build a powerful tool for automatic content-based classification of games.

This work was done in collaboration with Mohammed Abou-Zleikha.

Tuesday, August 9, 2016

Summary: How to Start a Startup: Lecture 1 (by Y Combinator)

A year ago, I got an idea for an app that I believe can improve parents' lives by making it easier for them to connect with old friends and make new ones. I named it Menura, and as I started working on it, I wanted to learn about the process of starting a new business and building a network. So I attended some events in Copenhagen, where I live, and I met some great people. One of them was David Helgason, the founder and former CEO of Unity. We talked about best practices when starting out and the best resources to learn from. We mentioned reading books and meeting people, among other things, but the one thing he highly recommended was that I go and listen to the "How to Start a Startup" lectures by Sam Altman, the president of Y Combinator, and so I did.

The series contains 20 lectures of about 45 minutes each, initially presented at Stanford University in 2014. Sam brought together a great group of experienced and successful people to talk and share lessons from their own experience of starting startups (some now worth millions). Speakers include, for instance, Peter Thiel, known as the co-founder of PayPal; Reid Hoffman, co-founder of LinkedIn; Aaron Levie, co-founder of Box; Ben Silbermann, co-founder and CEO of Pinterest; and Paul Graham, the co-founder of Y Combinator.

I listened to the lectures while Menura was in its early stages, and I enjoyed and learned a lot from every one of them. But now that I'm almost done with the development phase and need to execute the next steps, I feel like I don't recall many of the details in the lectures. So I decided to listen to them once more, this time taking notes of the important points to keep as a reference for the future. I will be sharing my notes so anyone interested can benefit from them. Note, however, that these are my personal notes, which means they are subjective, and you might end up focusing on other ideas if you listen to the lectures yourself (which I highly recommend). Nevertheless, I think they are interesting and worth sharing.

Without further ado, let's get started!
Lecture 1: How to Start a Startup
To start a successful startup, you need to excel in four main areas:
1. Idea:
  • Execution is harder and 10 times more important
  • Bad ideas are still bad (even with great execution)
  • Think long term
  • Should be difficult to replicate
  • Needs critical evaluation that includes:
    • Market: size, growth
    • Company: growth strategy
    • ...
2. Product
3. Team
4. Execution
Success can then be written as: success = idea * product * team * execution * (w * luck),
where w is a number in the range [0, 10000]; what is nice about it is that it is somewhat controllable :).

Starting a startup is really hard:
1. Do not do it to become rich (there are easier ways)
2. Do it if you have a solution to a problem
3. Ideas first and startup second
4. The good idea is the one you think about frequently when not working
You should focus on a mission-oriented startup:
  • You are committed = you love what you are doing
  • You have great patience: startups take about 10 years
Good ideas are unpopular but right:
  • You can practice identifying them
  • They look terrible at the beginning
  • Start with a small market to create a monopoly and then expand
  • You will sound crazy but be right
  • Look for an evolving market (big in 10 years)
  • The market is better when it is small and growing rapidly: it means the market is more tolerant and hungry for a solution
  • You can change everything but the market
  • Answer the question: why now?
  • Building something you yourself need helps you understand the problem better
  • The idea should be explainable in one sentence
  • Think about the market (what people want) first
Good practice:
  • Be confident
  • Stay away from nay-sayers (most people if it is a good idea)
A good product is something users love:
  • Until you build it, nothing else matters
  • Spend your time building and talking to customers
  • Marketing is easy when you have a great product
  • Better to build something a small number of users love than to build something a large number of users like. Easy to expand from there
  • Find a small set of users and make them love what you are doing. 
  • Build a product that is so good that it will grow by word of mouth
  • Most companies die because they didn't make something users love, not because of competition
  • Start with something simple (I like what Leonardo Da Vinci said about this "simplicity is the ultimate sophistication" and Steve Jobs' famous quote "Simple can be harder than complex")
  • Quality and small details matter
  • Be there for your customers (even at midnight)
  • Recruit feedback users by hand (this is the stage I am at with Menura right now and I can't tell you how hard it is: you literally have to send personal emails and messages to every single one of your potentially interested users, and you have to keep the conversation going)
  • Do not do ads to get initial users, you don't need many, you need committed ones
  • Loop from feedback to product decisions by asking the users:
    • What do they like/dislike?
    • Would they recommend it to others?
    • Have they recommended it already?
    • What features would they pay for?
  • Make the feedback loop as tight as possible for rapid progress
  • Do it yourself (that includes everything, from development to marketing to customer support...)
  • Startups are built on growth, so monitor it
  • If this (your product) is not right, nothing else will matter
Discussion of team and execution is left to the next lecture, and we will now move on to answer the most important question (in my opinion): why should you start a startup?
Why should you start a startup?
You probably thought of it as being glamorous (you will be the boss, it has attractive flexibility, you will be making an impact and $$). In reality, however, it is a lot of hard work and it is pretty stressful. Here is why:

You will be:
1. having a lot of responsibility
2. always on call
3. taking care of fundraising
4. gathering media attention (not always what you like to see)
5. strongly committed
6. managing your own psychology
And here is a more elaborate explanation of what you might think is attractive about it:
1. Being the boss: not really true (you will be listening to and acting on everyone else's needs and feedback)
2. Flexibility: also not true, as you will always be on call, you are the role model, and you are always working
3. Having more impact and more $$: you might actually make more money joining Facebook or Dropbox, and you get to work with a team, so you might end up making more impact
After these thought-provoking points, it is now time to find out the real reason you should have for starting your own startup. It is actually pretty obvious: you simply
"can't not do it"
This means:
1. You are passionate about it
2. You are the right person
3. You gotta make it happen
4. You can't stop working on it
5. You will force yourself into the world to achieve your vision
A pretty nice and thoughtful introduction. That was the end of the first lecture, and the closing slide recommended some books.

I have personally started reading Zero to One and I'm really enjoying it (the rest are on my reading list, which is growing very fast :)). I will probably share some summaries of it in another post, but that's it for now.

Main takeaways:
(These are the main points that stuck in my mind after listening to the whole lecture)
1. Ideas are important but execution is vital
2. Make something people love
3. A small number of people loving your product is more important than a large number liking it
4. Get your product right and everything else will follow smoothly from there
5. Build your own small community of product lovers and rely on the power of word of mouth (WoM)
6. If starting a startup is the thing you can't do without, then you are on the right track (good luck, enjoy the journey!). Otherwise, join one of the great companies.
See you in the next lecture :)!