Guessing Bitcoin’s P2P Connections

The paper Deanonymisation of clients in Bitcoin P2P network (2014) by Biryukov, Khovratovich and Pustogarov (BKP), which describes an attack on Bitcoin Core clients, has recently sparked some discussion. The main idea of the paper is to first obtain the set of nodes $E_v$ to which your victim $v$ is directly connected ("entry nodes"). Second, for each transaction $t$, record the $10$ nodes $P_t$ which propagate $t$ first. The authors show experimentally that if $|E_v \cap P_t|$ is greater than $3$, there is a pretty high chance that $t$ actually originated from $v$. However, both attack stages essentially require a Sybil attack: the attacker has to be connected to a lot of nodes a lot of times. 'A lot' means that in their experiments they held 50 connections to each full node (~250) in the test network. As a result, such an attack seems to be powerful, but certainly won't go undetected.
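The correlation test at the heart of the second stage is a simple set intersection; a sketch with made-up node identifiers (the threshold of $3$ is the one reported in the paper):

```python
# Hypothetical sketch of the BKP correlation heuristic; node identifiers
# are made up, the threshold of 3 comes from the paper's experiments.
def likely_originator(entry_nodes, first_propagators, threshold=3):
    """entry_nodes: the victim's entry nodes E_v.
    first_propagators: the ~10 nodes P_t that relayed t first."""
    return len(entry_nodes & first_propagators) > threshold

E_v = {"n1", "n2", "n3", "n4", "n5"}
P_t = {"n1", "n2", "n3", "n4", "x1", "x2", "x3", "x4", "x5", "x6"}
assert likely_originator(E_v, P_t)  # overlap of 4 > 3
```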

In this post I show that the first stage of the attack, namely learning the nodes a victim is directly connected to, can be done with a single connection to the victim. Beyond BKP's attack, knowing all outbound peers of a client could significantly increase the success probability of a double spend. Note that all experiments are based on Bitcoin Core 0.9.4, but 0.10.0 shows the same behavior.

TLDR The attacker can reliably guess all outbound connections of a victim by selecting from the victim's known addresses based on the addresses' timestamps.
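The selection can be sketched roughly as follows; the helper, the freshness window and the addr handling are illustrative assumptions, not the post's exact rules:

```python
# Hypothetical sketch: the timestamps of addresses belonging to currently
# connected peers keep getting refreshed, so the freshest addresses in a
# victim's addr messages are good guesses for its outbound peers.
# The 90-minute window is an illustrative assumption.
def guess_outbound_peers(addrs, now, window=90 * 60, max_outbound=8):
    """addrs: list of (address, timestamp) pairs learned from the victim."""
    fresh = [(addr, ts) for addr, ts in addrs if now - ts < window]
    fresh.sort(key=lambda pair: pair[1], reverse=True)  # freshest first
    return [addr for addr, _ in fresh[:max_outbound]]
```

The default of `max_outbound=8` reflects the number of outbound connections Bitcoin Core maintains.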

Update A fix has been merged into bitcoind. The timestamp is no longer updated when receiving a message from a connected peer; instead, it is only updated when the peer disconnects. The fix is released in Bitcoin Core 0.10.1.

Privacy in BitcoinJ

As part of my epic quest to apply supervised machine learning to the blockchain in order to discover transaction patterns, I reviewed various wallet implementations in the hope of finding privacy leaks.

tl;dr If you are using a wallet built upon BitcoinJ, such as Android Wallet, Multibit or Hive Wallet, you have almost zero wire privacy. An attacker who manages to connect to your wallet can easily figure out all the addresses you control. This is unlikely to be fixed in the near future.

Update: Mike Hearn’s reply addresses additional problems and improvements. There was also accompanying discussion on reddit.

FCW Kernel

A Feature Coinciding Walk Kernel classifies nodes in a graph.

This project is a Python port of Coinciding Walk Kernels (CWK) [1] and introduces an extension of the model called Feature-CWK (FCWK). If you want to jump right into some code, see the benchmark.

CW-Kernels deal with the problem of node classification (aka link-based classification), in which a set of features and labels for items is given just as in regular classification. In addition, a node classification algorithm accepts a graph of items and item-item links. It has been shown that the additional information inherent in the network structure improves performance for certain algorithms and datasets.
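Not the coinciding-walk kernel itself, but a minimal sketch of the underlying intuition that labels diffuse along walks in the graph (a plain label-propagation scheme on a toy path graph; all data here is made up):

```python
import numpy as np

# Labels of known nodes diffuse to unlabeled nodes along edges;
# known labels are clamped after every step.
def propagate_labels(adj, labels, steps=10):
    """adj: (n, n) adjacency matrix; labels: (n, k) one-hot rows for
    labeled nodes, all-zero rows for unlabeled nodes."""
    P = adj / np.maximum(adj.sum(axis=1, keepdims=True), 1)  # row-stochastic
    known = labels.sum(axis=1) > 0
    state = labels.astype(float)
    for _ in range(steps):
        state = P @ state
        state[known] = labels[known]  # clamp the known labels
    return state.argmax(axis=1)

# Path graph 0-1-2-3; node 0 labeled class 0, node 3 labeled class 1.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
labels = np.zeros((4, 2))
labels[0, 0] = 1
labels[3, 1] = 1
pred = propagate_labels(adj, labels)  # -> [0, 0, 1, 1]
```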

Read More

Reading Noun-Noun Compounds

Early Influences of Compound Frequency and Semantic Transparency

My bachelor's thesis in Cognitive Science. Unfortunately, I am currently not allowed to release either the data or the analysis scripts, because the dataset is still under active research.

Abstract: This thesis evaluates psycholinguistic theories about the cognitive processing of words. To this end, the time-course of compound reading is analyzed using generalized additive models in a dataset of eye movements. The theories contrasted are sublexical (Taft and Forster, 1975), supralexical (Giraudo and Grainger, 2001) and dual-route processing (Schreuder and Baayen, 1995), as well as form-then-meaning (e.g. Rastle and Davis, 2008) vs. form-and-meaning (e.g. Feldman et al., 2009) processing.

As the goal is to find the best model given various predictors, some general mechanisms of eye movements are demonstrated first: the position in the line has substantial effects, and single fixations last longer, land on shorter words, lie closer to the center of the word, and are influenced differently by frequency measures.

Inspired by Kuperman et al. (2009), it is shown that even the early eye fixations on words are guided by first-constituent and compound frequency, providing evidence for parallel dual-route models.

Similar to Baayen et al. (2013), Latent Semantic Analysis (LSA) similarity scores (Landauer and Dumais, 1997) permit investigating the time point of semantic processing. The effect of LSA similarity not only shows up in the earliest word fixations; the data reveal that semantics plays a role even before a word is fixated. In particular, the fixation position in the word is further to the right when the semantic transparency, i.e. the similarity between the compound and its second constituent, is high. This evidence of parafoveal semantic processing challenges opposing findings obtained with the eye-contingent boundary paradigm (Rayner et al., 1986). In the framework of naive discriminative learning (Baayen et al., 2011), the effect of transparency on fixation position reflects optimization of the landing position for accessing the orthographic information that is most discriminative for the compound.

Keywords: reading, eye-movements, compounds, semantic similarity, morphological processing, generalized additive model

Read PDF

Influence of Reputation on Gricean Maxims

As part of the course “Game Theory and Pragmatics” by Jason Quinley at the Institute of Linguistics, I wanted to explore the influence of reputation on obeying the Gricean Maxims using data from the Q&A website Stackoverflow.

Pragmatics is a subfield of linguistics, defined as "dealing with the origins, uses, and effects of signs within the total behavior of the interpreters of signs" (Morris, 1946). Pragmatics tries to explain why a simple sentence like "It's raining." has a lot of different interpretations, for example (Franke, 2009):

  • The speaker advises we should take an umbrella.
  • The speaker declares the picnic cancelled.
  • The speaker is sick of living in Amsterdam.

Regression Models and Parameter Estimation Methods

An introduction, written as a report for the seminar "Maschinelles Lernen" (Machine Learning) at the University of Tübingen. It covers the basics of linear, nonlinear, logistic and Bayesian regression, and presents methods for estimating the model parameters from statistical assumptions. Finally, logistic regression is applied to the Titanic dataset, showing among other things that the motto "women and children first" held during the disaster.

Every plot is produced using the open source statistics software R inside the $\LaTeX$ file (Sweave). Code is here.

Characteristic of supervised machine learning, to which regression belongs, is describing the relationship between a target variable and explanatory variables from the available data, i.e. from realizations of random variables.

The regression model represents $y$ as the sum of a hypothesis applied to $x$ and an error term $\epsilon$:

$$ y = h(x) + \epsilon $$
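As a minimal illustration of this model with a linear hypothesis $h(x) = ax + b$ (the report itself uses R; this Python sketch with synthetic data is only illustrative):

```python
import numpy as np

# Illustrative sketch of y = h(x) + eps with a linear hypothesis
# h(x) = a*x + b, fitted by ordinary least squares on synthetic data
# (the report itself uses R / Sweave; data and parameters are made up).
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
eps = rng.normal(scale=0.5, size=x.size)  # the error term
y = 2.0 * x + 1.0 + eps                   # true slope 2, intercept 1

a, b = np.polyfit(x, y, deg=1)            # least-squares estimates
```

With enough data and small noise, the estimates recover the true slope and intercept closely.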

Read PDF

Visualizing Receptive Fields

The following article visualizes some mathematical models of brain cell activity. In some regions of the brain, neurons are excited or inhibited by neurons of a preceding input layer; these input neurons form the receptive field of that neuron. Since the visual areas use receptive fields as feature detectors (such as edge and edge-orientation detectors) for natural images, the effect of different receptive field functions applied to images can be examined nicely. The IPython notebook to play with the parameters can be found on GitHub.

import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import scipy.signal as signal
import numpy as np

We examine the effect on the following images. Along the visual pathway, these images can be seen as input from the retina to higher visual areas.
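For instance, a center-surround (difference-of-Gaussians) receptive field acts as a contrast and edge detector when convolved with an image; a minimal sketch on an artificial step edge (the parameters are illustrative, not those used in the notebook):

```python
import numpy as np
import scipy.signal as signal

# Difference-of-Gaussians kernel: excitatory center minus inhibitory surround.
def dog_kernel(size=21, sigma_c=1.0, sigma_s=3.0):
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    r2 = xx ** 2 + yy ** 2
    center = np.exp(-r2 / (2 * sigma_c ** 2)) / (2 * np.pi * sigma_c ** 2)
    surround = np.exp(-r2 / (2 * sigma_s ** 2)) / (2 * np.pi * sigma_s ** 2)
    return center - surround

image = np.zeros((40, 40))
image[:, 20:] = 1.0  # vertical step edge
response = signal.convolve2d(image, dog_kernel(), mode="same")
# The response is strong near the edge and close to zero in uniform regions.
```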

A* for MarioAI


Check out my A* implementation. Your own nodes can be added easily by inheriting from the ASearchNode class. The purpose of the A* is to handle WorldNodes in the interface of the Mario AI Competition. The aim of the competition is to develop an artificial agent for the game Infinite Mario Bros that completes a level as fast as possible. As in Robin Baumgarten's competition submission, the WorldNodes are generated using a simulation and then searched with A*. Our current goal at the Chair of Cognitive Modeling Tübingen is to connect the planning with a rudimentary brain module capable of learning object interactions. The agent will then wander around to collect knowledge about its environment and exploit it according to its motivation.
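The search itself follows the standard A* scheme; a minimal Python sketch of the idea (the actual implementation is in Java, where ASearchNode subclasses supply the neighbors and costs; here a 5x5 grid stands in):

```python
import heapq

# Standard A* over an abstract neighbor function, with a priority queue
# ordered by f = g + h.
def a_star(start, goal, neighbors, heuristic):
    open_heap = [(heuristic(start), 0, start)]
    g = {start: 0}
    parent = {start: None}
    while open_heap:
        _, cost, node = heapq.heappop(open_heap)
        if node == goal:  # reconstruct the path back to start
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt, step in neighbors(node):
            new_g = cost + step
            if new_g < g.get(nxt, float("inf")):
                g[nxt] = new_g
                parent[nxt] = node
                heapq.heappush(open_heap, (new_g + heuristic(nxt), new_g, nxt))
    return None

# Toy example: 4-connected 5x5 grid, Manhattan-distance heuristic.
def grid_neighbors(p):
    x, y = p
    return [((x + dx, y + dy), 1) for dx, dy in [(1, 0), (-1, 0), (0, 1), (0, -1)]
            if 0 <= x + dx < 5 and 0 <= y + dy < 5]

path = a_star((0, 0), (4, 4), grid_neighbors,
              lambda p: abs(p[0] - 4) + abs(p[1] - 4))
```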

Update Some of my successors enhanced Mario’s capabilities and submitted a great video demonstration that went viral to the AAAI Video Competition 2015.

Signaling Games and Language


Two papers I wrote with Roland Mühlenbernd last winter. We mainly explored what happens when the players are able to invent new messages in signaling games. The important finding of the first paper is that the players' ability to make up new messages overcomes signaling bottlenecks and leads to successful communication even when the number of strategies is high. When coupled with a reasonable mechanism for message extinction, the agents end up playing perfect strategies. The second paper explores the evolution of the vowel space with this technique. The resulting partitions of the vowel space resemble those of actual languages.

Evolving Bitcoin Trade Bot

Bagalute is a trade bot designed to buy a currency (bitcoin) at the right time. As its decision criterion it uses the relative strength index (RSI). The bot looks at the price development of the last twelve hours and runs an evolutionary algorithm to determine the optimal RSI parameters for that time frame; it then uses these parameters for the next time frame. Currently the bot trades against a dummy interface for debugging purposes. The actual success of the bot was only sparsely tested, with negative results, perhaps due to a general downtrend in prices. I wrote it during my second semester break.
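The RSI itself (Wilder's formulation) can be sketched as follows; the 14-period default and the smoothing are the conventional textbook choices, not necessarily the parameters the evolutionary algorithm would find:

```python
# Sketch of Wilder's relative strength index; the bot's evolutionary
# parameter search is omitted, and period=14 is the conventional default.
def rsi(prices, period=14):
    gains, losses = [], []
    for prev, cur in zip(prices, prices[1:]):
        change = cur - prev
        gains.append(max(change, 0.0))
        losses.append(max(-change, 0.0))
    avg_gain = sum(gains[:period]) / period
    avg_loss = sum(losses[:period]) / period
    for g, l in zip(gains[period:], losses[period:]):  # Wilder smoothing
        avg_gain = (avg_gain * (period - 1) + g) / period
        avg_loss = (avg_loss * (period - 1) + l) / period
    if avg_loss == 0:
        return 100.0
    rs = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + rs)
```

A common reading is RSI below 30 as "oversold" (a buy candidate) and above 70 as "overbought".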