DATA INFRASTRUCTURE
A panel of business-savvy technologists was assembled to discuss what you should expect as your firm builds out its data effort.
Our panelists
John Urbanik spearheaded engineering and data science efforts at a social media analytics firm acquired by Palantir, and for several Fortune 500 companies post-acquisition. He is currently Lead Data Engineer at Predata, a platform generating market insights from social and web traffic metadata.
David Cheng was a developer with King Street Capital Management, a $20bn credit-focused fund. He is now the Chief Technology Officer at System2, a consultancy providing sourcing, engineering, and analysis of big data as a service.
Webb Dryfoos oversaw a $42m data budget, and in that capacity developed an unparalleled depth of expertise around how data is generated, sourced, delivered, and processed across several industries. He is now the Chief Data Officer at OnSpot Data, a geo-spatial mobile data provider.
The discussion involved going through the costs and benefits of alternative data work across four stages of a data effort's development:
STAGE 0: OFF-THE-SHELF SOLUTIONS
Benefits
Investment team gets a sense for the kinds of questions alternative data can help them answer
Costs
Headcount: 1-2 interns OR part-time help from a data-savvy analyst
Data: usually start with third-party scraped data and pre-processed transaction data
Infrastructure: none needed
STAGE 1: PROOF OF CONCEPT
Incremental benefits
Data begins to be surfaced in firm’s usual Excel sheets, emails, etc
Business builds comfort around cadence of its interaction with the data team
Roadmap is drawn up for how the effort can be developed further
Incremental costs
Headcount: 1-3 people at $120-200k each or 1 senior hire at $300-500k
Data:
Scraping moves in-house
Might buy raw transaction data at $2m+ per year
Other: locations, emails, clickstream
Infrastructure:
Data storage on Redshift, BigQuery, PostgreSQL, etc; roughly $1k / TB; expect to spend anywhere from under $1k to a few thousand dollars per month.
Analysis:
BI tool like Tableau ($700 / seat / year) or Power BI
SQL engine like Presto
Spark for heavy processing
For web scraping:
Services: ParseHub, ScrapingHub
Open source tools: Portia, Scrapy, Apache Nutch (a minimal Scrapy sketch follows this list)
NLP / ETL: spaCy, NLTK, OpenNLP, CoreNLP
Hosting: (AWS) Lambda, EC2 / ECS
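To make the in-house scraping work concrete, here is a minimal sketch of a Scrapy spider; the target URL, item fields, and CSS selectors are hypothetical placeholders rather than a recommendation of a specific source.

# Minimal Scrapy spider sketch for Stage 1 in-house scraping.
# The start URL, item fields, and selectors below are hypothetical placeholders.
import scrapy

class PriceSpider(scrapy.Spider):
    name = "price_spider"
    start_urls = ["https://example.com/products?page=1"]  # illustrative listing page

    def parse(self, response):
        # Emit one record per product card on the page
        for card in response.css("div.product-card"):
            yield {
                "sku": card.attrib.get("data-sku"),
                "name": card.css("h2.title::text").get(),
                "price": card.css("span.price::text").get(),
            }
        # Follow pagination until no "next" link remains
        next_page = response.css("a.next::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

A spider like this can be run with scrapy runspider, with output written as JSON lines and loaded into the warehouse above; Lambda or small EC2 / ECS instances are typically sufficient to host jobs of this size.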
STAGE 2: GROWTH STAGE
Incremental benefits
Data architecture begins to take shape and can feed persistent dashboards
More sophisticated analysis possible, e.g. by cohorts, across datasets, etc
Can start screening novel and less widely known datasets
Incremental costs
Headcount: 3-5 more people at $120-300k each; specialised hires in data engineering or visualisation become more necessary
Data: sourcing ramps up; ongoing work is needed to keep it aligned with investment team interests
Infrastructure:
Focus shifts from task management to orchestration; relevant tools include:
Airflow, Luigi, Oozie, and Azkaban are standard (a minimal Airflow sketch follows this list)
Ansible, Puppet, Chef, Salt, etc can also be used, but configuration is more painful
Databricks can be helpful - Jupyter-like notebooks on top of Spark; similar solutions can be created atop Presto or Impala
Collaboration tools are critical to keeping costs in check - that's why there has been a lot of interest in tools like Domino, Sense, Alpine Data, and Mode Analytics
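As a rough illustration of what orchestration means in practice, below is a minimal Airflow DAG sketch; the schedule, task names, and callables are assumptions standing in for a real ingest-transform-load pipeline.

# Minimal Airflow DAG sketch: a nightly ingest -> transform -> load pipeline.
# Task names, the schedule, and the callables are hypothetical placeholders.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest():
    pass  # e.g. pull the latest raw files into a staging bucket

def transform():
    pass  # e.g. clean and aggregate with Spark or SQL

def load():
    pass  # e.g. publish curated tables to the warehouse and dashboards

with DAG(
    dag_id="alt_data_nightly",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_ingest = PythonOperator(task_id="ingest", python_callable=ingest)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Dependencies: ingest must finish before transform, transform before load
    t_ingest >> t_transform >> t_load

The gain over cron-style task management is that retries, backfills, and failure alerts for each step are handled by the scheduler rather than by hand.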
STAGE 3: MATURE EFFORT
Incremental benefits
Cutting-edge nowcasting machine built to be intuitive to your firm’s analysts
Novel risk models and stock screens
Systematic trading strategy you can choose to allocate funds to
Incremental costs
Headcount: further hires; founding members need compensation bumps
Data: sourcing via VC - e.g. Two Sigma and SumAll
Infrastructure:
Emphasis on quality control of existing data universe and on making it easier to cross-link disparate datasets
Headcount considerations:
Effort is most efficient if some analysts have extensive experience with Python / R as well as statistics and can work directly with Spark
Minimal knowledge can do more harm than good - can lead to massive inefficiencies, basic conceptual mistakes, etc
Data headcount explodes if everything has to be surfaced using BI tools - templates pile up and they each need to be maintained
To keep compute costs in check and allow near-instant responses to queries, it is important to identify the datasets and intermediate outputs common to most queries and cache them in an intermediate layer
Relevant tools include MemSQL, Redis, and Cassandra (see the caching sketch below)
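A minimal sketch of that intermediate caching layer, assuming Redis as the cache; run_query is a hypothetical stand-in for the expensive Presto / Spark / warehouse call, and the key scheme and TTL are illustrative.

# Sketch of an intermediate cache: results of common queries are kept in Redis
# so repeated dashboard requests do not recompute against the warehouse.
# run_query(), the key scheme, and the TTL are hypothetical placeholders.
import hashlib
import json
import redis

cache = redis.Redis(host="localhost", port=6379)
CACHE_TTL_SECONDS = 3600  # refresh cached intermediates roughly hourly

def run_query(sql):
    # Stand-in for the expensive call to Presto / Spark / the warehouse
    raise NotImplementedError

def cached_query(sql):
    # Key the cache on a hash of the query text
    key = "q:" + hashlib.sha256(sql.encode()).hexdigest()
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)
    result = run_query(sql)
    cache.set(key, json.dumps(result), ex=CACHE_TTL_SECONDS)
    return result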
DATABRICKS DEMO
Outside some specialty cases, at present the most efficient platform for big data work is Spark. Databricks was founded by the creators of Spark and contributes over 75% of the project's open source code. Databricks has developed a platform with proprietary tools and resources to enhance and complement core Spark features.
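For orientation, here is a minimal PySpark sketch of the kind of job a Databricks-style notebook typically hosts; the input path and column names are illustrative assumptions, not a reference to any particular dataset.

# Minimal PySpark sketch: aggregate weekly spend per merchant from a
# (hypothetical) transaction dataset. Paths and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("weekly_spend").getOrCreate()

txns = spark.read.parquet("s3://bucket/transactions/")  # hypothetical input

weekly = (
    txns
    .withColumn("week", F.date_trunc("week", F.col("txn_date")))
    .groupBy("merchant", "week")
    .agg(
        F.sum("amount").alias("total_spend"),
        F.countDistinct("card_id").alias("unique_cards"),
    )
)

weekly.write.mode("overwrite").parquet("s3://bucket/curated/weekly_spend/")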
Further information on Databricks is available in this deck. To discuss how investment firms are leveraging this platform, please contact Jason Ferrante and mention Augvest.