Synthetic data for financial applications has its supporters and detractors, but the more deep learning takes hold in finance, the more it’s going to be considered for training machines.
Maurice Chia, a Fellow of the Institute of Mathematics and Its Applications, said that as the financial industry continues to embrace alternative data, such as real-world data scraped from the web at scale, it will be wary of legislation that could prevent such scraping.
As a result, synthetics might then become more useful. As it stands today, practitioners are still exploring how to integrate alternative data with traditional data in the investment process, he noted.
One area where he said synthetic data might help is in “extreme” situations where standard Monte Carlo techniques don’t cut it.
“Monte Carlo runs a random generator which may not test a specific extreme event, like say if cryptocurrencies all plunge and they bring the stock market down with it. So synthetic data can be used to simulate this scenario,” Chia said.
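The distinction Chia draws can be sketched in a few lines of Python. A plain Monte Carlo run draws returns at random and may never produce a joint crypto-and-equity crash, while a synthetic stress path simply injects the chosen extreme day. All names, magnitudes and parameters below are illustrative assumptions, not a production risk model.

```python
import random

random.seed(0)

def monte_carlo_paths(n_paths=1000, n_days=250, mu=0.0003, sigma=0.01):
    """Plain Monte Carlo: daily returns drawn from a normal distribution."""
    return [[random.gauss(mu, sigma) for _ in range(n_days)]
            for _ in range(n_paths)]

def inject_crash(path, day=100, equity_shock=-0.08):
    """Synthetic scenario: overwrite one day with a chosen extreme move
    (say, a crypto collapse dragging equities down) that random sampling
    may never generate on its own."""
    stressed = list(path)
    stressed[day] = equity_shock
    return stressed

paths = monte_carlo_paths()
worst_random = min(min(p) for p in paths)

stressed = [inject_crash(p) for p in paths]
worst_stressed = min(min(p) for p in stressed)
# The stressed set is guaranteed to contain the -8% day;
# with sigma = 1%, the purely random set almost surely is not.
```

The point is not realism but coverage: the stress path guarantees that the model is evaluated against the scenario you care about, rather than hoping the random generator stumbles onto it.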
Companies are sprouting up offering synthetic data, and Chia noted Alegion as a case in point.
Another company, Neuromation, is taking a somewhat more novel approach. We spoke with chairman and founder Constantine Goltsev to get his take on how research into synthetic data is unfolding.
MarketBrains: You’ve recently had an oversubscribed fundraising, can you tell me a bit more about your plans for development off the back of that?
Constantine Goltsev: We just had a fundraising round; the total was about US$55 million (US$34.2 million at time of publication), or 46,000 ETH in ether terms.
We’re going to spend a certain percentage on the platform, development, on reserve for tokens to support token economics, and a certain percentage is going to be just company reserve, a cash reserve, for acquisitions for example.
We don’t make explicit commitments because the situation is fluid, but all the money raised is going to go to company and platform development.
MB: Why did you start Neuromation?
CG: We incubated Neuromation over the last year as a project meant to solve one of the key problems in artificial intelligence: the lack of proper datasets to realistically train a lot of models.
Typical deep learning classification problems need from several hundred to several thousand labelled examples.
“We’re building a marketplace for AI services, based around synthetic data and computing capacity”
MB: And that’s built on synthetic data? What is that exactly?
CG: Real data has to be sampled, measured and labelled: photos, for example, have to be snapped, then labelled, then presented to the algorithm to learn on. And the problem with real-world data is that it’s difficult to collect.
If you need pictures of people, you need to go and get copious amounts of them; if you need pictures of emotional states, you need to sit thousands of volunteers down and have them smile, yawn, and look bored.
As an industrial applications example: if you want to measure the level of rust on a steel beam you have to find 10,000 steel beams with all different ways they can rust, and label them.
Even trivial things, like recognizing dogs from cats and cats from flowers, are still an enormous amount of work. Google labelled 10 million images, but they spent several million dollars and several years doing that.
And it turns out the dataset is a major roadblock, because deep learning algorithms can learn to associate almost anything, but to do that, you have to present them with good examples.
So then there’s artificial data, or synthetic data. In the case of computer vision, you simulate 3D graphics: it’s a video game that snaps these pictures and then feeds them into a computer. For sound, these are manufactured sounds: you record the sounds of birds, tweak them, add a bit of randomness, simulate them, and the computer will make the sounds of generic birds.
If it’s rust on a beam, an industrial application, say checking structural integrity, looking at a wall or an airplane part to see whether it needs additional maintenance, you can set up a physics model of metal corrosion, and that is going to produce billions and billions of examples on which artificial intelligence can learn.
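The workflow Goltsev describes, generating labelled examples from a model rather than collecting them, can be illustrated with a toy version of the rust example. The “physics” below is a deliberately crude stand-in (rust coverage growing with age and humidity, plus noise); the point is only that every example leaves the generator already labelled. All function names and thresholds are hypothetical.

```python
import random

random.seed(42)

def simulate_rust(age_years, humidity):
    """Toy stand-in for a physics model of corrosion: rust coverage grows
    with age and humidity, plus measurement noise. A real model would be
    far richer, but the shape of the pipeline is the same."""
    coverage = 1.0 - (1.0 - humidity * 0.1) ** age_years
    coverage += random.gauss(0, 0.02)
    return min(max(coverage, 0.0), 1.0)

def generate_dataset(n=10_000):
    """Every synthetic example arrives pre-labelled: the generator knows
    the ground truth, so no manual annotation is needed."""
    data = []
    for _ in range(n):
        age = random.uniform(0, 30)
        humidity = random.uniform(0, 1)
        coverage = simulate_rust(age, humidity)
        label = "needs_maintenance" if coverage > 0.5 else "ok"
        data.append({"age": age, "humidity": humidity,
                     "rust_coverage": coverage, "label": label})
    return data

dataset = generate_dataset()
```

Where collecting 10,000 real labelled beams takes months of fieldwork, the generator produces them in seconds, and can be re-run for any parameter regime you want the model to see.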
“What can happen with synthetic data is, you can potentially create a market simulation.”
MB: Anything very specific to finance?
CG: Finance is extremely random, and when you have your dataset from the markets, it’s a complete random walk most of the time.
What people do with this is generate parameters, run some kind of Monte Carlo simulation, and then work with pure statistics.
What can happen with synthetic data is, you can potentially create a market simulation. You can take all of these micro models of stock trading, set up artificial traders and have them generate artificial markets.
And once this artificial market is generated, you calibrate it to the basic statistical parameters of the real market.
The beauty of the artificial market you have created is that the artificial traders trade and generate the price curves. Because you know the state of each and every one of them at any given time, you can train an artificial intelligence algorithm on this data and have it correctly identify the underlying states of these traders.
Then you go to real data, you splice real data with artificial data and you train the algorithm on that.
What’s going to happen is the algorithm is going to find the presupposed patterns in the real data. That might not be 100% accurate, but it might be insightful, because now the algorithm can find structure in what looks like a purely stochastic data series in the real market.
And to be more concrete: in your generator you create a certain state where some market participants start panic selling while others are in a different mode, the AI algorithm learns to distinguish those modes in the simulated data, and then you work with real data to test this hypothesis.
“…expect us to have these blended financial datasets and models that are going to be the example for people to develop systems further”
Imagine we have trader A, trader B and trader C; their behaviour generates some underlying state of the order book, plus some kind of market-conditions simulation. And each of these traders has an agenda: they all want to make more money, but they have different styles.
Because you know the state of these actors deterministically, this one is in a state of fear, this is a buyer, this is a holder, those states will generate certain price points.
Then you train the deep learning algorithm to understand these underlying factors at each state, throughout the system.
So, if you have trained it well, then any time you feed a data point to the AI, it will give you the states of the underlying players in the pools.
What you do with this AI algorithm, if you see that the association is good, is calibrate the statistics of the series to match the statistics of the real data and blend them together. On the blended data you run the algorithm and see how it handles artificial data, and then the superposition of artificial and real data.
It will give you some interesting hypotheses about the real market as well, so this way you train the algorithm on the real market’s complexity.
If you can recover just the underlying state of the market, these drivers, synthetic data can be applied here. This is the research I am going to be doing; these types of problems apply to every single system where you blend synthetic data with real-world data.
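The pipeline Goltsev outlines, simulating traders whose internal states are known, recording the prices they generate, and then training a model to recover those states from price data alone, can be sketched as follows. The three-state agents and the nearest-centroid “model” are minimal stand-ins for the richer agent simulations and deep networks he has in mind; every name and number is an illustrative assumption.

```python
import random

random.seed(1)

STATES = ["panic_selling", "buying", "holding"]
# Mean daily return each (hypothetical) dominant trader state imposes.
STATE_DRIFT = {"panic_selling": -0.03, "buying": 0.02, "holding": 0.0}

def simulate_market(n_days=2000):
    """Artificial market: each day a dominant trader state is drawn and
    the price move it generates is recorded. Unlike real data, the state
    is known, so it serves as the ground-truth label."""
    days = []
    for _ in range(n_days):
        state = random.choice(STATES)
        ret = random.gauss(STATE_DRIFT[state], 0.01)
        days.append((ret, state))
    return days

def train_centroids(days):
    """'Training' here is just the mean return per state, a toy stand-in
    for a deep model that would use far richer features."""
    sums = {s: 0.0 for s in STATES}
    counts = {s: 0 for s in STATES}
    for ret, state in days:
        sums[state] += ret
        counts[state] += 1
    return {s: sums[s] / counts[s] for s in STATES}

def infer_state(ret, centroids):
    """Given only a price move (e.g. from real or blended data), recover
    the most likely underlying trader state."""
    return min(centroids, key=lambda s: abs(ret - centroids[s]))

days = simulate_market()
centroids = train_centroids(days)
accuracy = sum(infer_state(r, centroids) == s for r, s in days) / len(days)
```

Calibrating the simulated series to real-market statistics and splicing the two datasets, as described above, would then let the same inference step be run against real price moves to generate hypotheses about hidden market states.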
MB: And the research on that, do you have financial participants on the development platform now?
CG: We plan to. For now, we are hiring academics. So, expect us to have these blended financial datasets and models that are going to be the example for people to develop systems further.
MB: And you’re choosing to go with a tokenized platform system for this development?
CG: The blockchain gives it global transactability and accountability. With something like airline points, they can always erase an entry in a database and it’s gone. With crypto you can’t do that, so the agreement can have some kind of lasting value.
MB: How concerned are you about regulations?
CG: We are very concerned about regulations. We employ one of the best legal firms in Estonia, where we are incorporated, and we observe all regulations, current and future. We are worried about how they might change, but we are within the current framework.
MB: Ideally what would the system look like for the benefit/reward structure you are trying to set up?
CG: We’re building a marketplace for AI services, based around synthetic data and computing capacity. Our clients are going to come and order services and these services are going to be paid for in Neurotokens (NTK), which are going to be distributed to vendors who provide the services.
And whoever offers the best price is going to complete the task and get rewarded in NTKs. It is analogous to mining: you have a transaction, you put it in a pool with some kind of bid-ask price, and the miners take transactions out of the pool, process them, and get ether, for instance.
In our case, a request, say for a synthetic dataset, is going to come on our platform and a reward in NTKs is going to be attached to it, and this is going to be broadcast throughout the blockchain to vendors.
Once miners have their NTKs, they can spend them on an exchange, for dollars, ether, whatever; and if I am a client, I can go to the exchange, buy NTKs, and order services through the platform.
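The marketplace flow Goltsev describes, a client attaching an NTK reward to a request, vendors bidding, and the best-priced vendor completing the task for the reward, can be sketched off-chain as follows. The class and function names are hypothetical; the actual platform would settle this through the blockchain.

```python
from dataclasses import dataclass, field

@dataclass
class Request:
    """A client's service request, with an NTK reward attached."""
    task: str
    reward_ntk: float
    bids: dict = field(default_factory=dict)  # vendor -> asking price

def broadcast(request, vendors):
    """Vendors see the broadcast request and submit their asking prices."""
    for vendor, price in vendors.items():
        request.bids[vendor] = price

def settle(request):
    """The lowest bid wins; that vendor completes the task and is paid
    the attached reward in NTKs."""
    winner = min(request.bids, key=request.bids.get)
    return winner, request.reward_ntk

req = Request(task="synthetic dataset: 10k labelled images", reward_ntk=500.0)
broadcast(req, {"vendor_a": 480.0, "vendor_b": 450.0, "vendor_c": 510.0})
winner, paid = settle(req)
# winner == "vendor_b", paid == 500.0
```

The blockchain component replaces the central matcher here with a distributed one, which is what lets orders eventually be placed without going through Neuromation’s own platform.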
And blockchain makes this component distributed, so all Neuromation does is handle these transactions; eventually you won’t even have to go on our platform to order something.
This interview has been edited and condensed.