DATA INFRASTRUCTURE
A panel of business-savvy technologists was assembled to discuss what you should expect as your firm builds out its data effort.
Our panelists
John Urbanik spearheaded engineering and data science efforts at a social media analytics firm acquired by Palantir, and for several Fortune 500 companies post-acquisition. He is currently Lead Data Engineer at Predata, a platform generating market insights from social and web traffic metadata.
David Cheng was a developer with King Street Capital Management, a $20bn credit-focused fund. He is now the Chief Technology Officer at System2, a consultancy providing sourcing, engineering, and analysis of big data as a service.
Webb Dryfoos oversaw a $42m data budget, and in that capacity developed an unparalleled depth of expertise around how data is generated, sourced, delivered, and processed across several industries. He is now the Chief Data Officer at OnSpot Data, a geo-spatial mobile data provider.
The discussion involved going through the costs and benefits of alternative data work across four stages of a data effort's development:
STAGE 0: OFF-THE-SHELF SOLUTIONS
Benefits
Investment team gets a sense for the kinds of questions alternative data can help them answer
Costs
Headcount: 1-2 interns OR part-time help from a data-savvy analyst
Data: usually start with third-party scraped data and pre-processed transaction data
Infrastructure: none needed
STAGE 1: PROOF OF CONCEPT
Incremental benefits
Data begins to be surfaced in firm’s usual Excel sheets, emails, etc
Business builds comfort around cadence of its interaction with the data team
Roadmap is drawn up for how the effort can be developed further
Incremental costs
Headcount: 1-3 people at $120-200k each or 1 senior hire at $300-500k
Data:
Scraping moves in-house
Might buy raw transaction data at $2m+ per year
Other: locations, emails, clickstream
Infrastructure:
Data storage on Redshift, BigQuery, PostgreSQL, etc; roughly $1k / TB; expect to spend anywhere from under $1k to a few thousand dollars per month.
Analysis:
BI tool like Tableau ($700 / seat / year) or Power BI
SQL engine like Presto
Spark for heavy processing
For web scraping:
Services: ParseHub, ScrapingHub
Open source tools: Portia, Scrapy, Apache Nutch (a minimal Scrapy sketch follows this list)
NLP / ETL: spaCy, NLTK, OpenNLP, CoreNLP
Hosting: (AWS) Lambda, EC2 / ECS
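To make the in-house scraping work concrete, here is a minimal sketch of a Scrapy spider; the target URL, item fields, and CSS selectors are hypothetical placeholders rather than a recommendation of a specific source.

# Minimal Scrapy spider sketch for Stage 1 in-house scraping.
# The start URL, item fields, and selectors below are hypothetical placeholders.
import scrapy

class PriceSpider(scrapy.Spider):
    name = "price_spider"
    start_urls = ["https://example.com/products?page=1"]  # illustrative listing page

    def parse(self, response):
        # Emit one record per product card on the page
        for card in response.css("div.product-card"):
            yield {
                "sku": card.attrib.get("data-sku"),
                "name": card.css("h2.title::text").get(),
                "price": card.css("span.price::text").get(),
            }
        # Follow pagination until no "next" link remains
        next_page = response.css("a.next::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

A spider like this can be run with scrapy runspider, with output written as JSON lines and loaded into the warehouse above; Lambda or small EC2 / ECS instances are typically sufficient to host jobs of this size.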
STAGE 2: GROWTH STAGE
Incremental benefits
Data architecture begins to take shape and can feed persistent dashboards
More sophisticated analysis possible, e.g. by cohorts, across datasets, etc
Can start screening novel and less widely known datasets
Incremental costs
Headcount: 3-5 more people at $120-300k each; specialised hires in data engineering or visualisation become more necessary
Data: sourcing ramps up; ongoing work is needed to keep it aligned with investment team interests
Infrastructure:
Focus shifts from task management to orchestration; relevant tools include:
Airflow, Luigi, Oozie, and Azkaban are standard (a minimal Airflow sketch follows this list)
Ansible, Puppet, Chef, Salt, etc can also be used, but configuration is more painful
Databricks can be helpful - Jupyter-like notebooks on top of Spark; similar solutions can be created atop Presto or Impala
Collaboration tools are critical to keeping costs in check - that's why there has been a lot of interest in tools like Domino, Sense, Alpine Data, and Mode Analytics
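As a rough illustration of what orchestration means in practice, below is a minimal Airflow DAG sketch; the schedule, task names, and callables are assumptions standing in for a real ingest-transform-load pipeline.

# Minimal Airflow DAG sketch: a nightly ingest -> transform -> load pipeline.
# Task names, the schedule, and the callables are hypothetical placeholders.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest():
    pass  # e.g. pull the latest raw files into a staging bucket

def transform():
    pass  # e.g. clean and aggregate with Spark or SQL

def load():
    pass  # e.g. publish curated tables to the warehouse and dashboards

with DAG(
    dag_id="alt_data_nightly",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_ingest = PythonOperator(task_id="ingest", python_callable=ingest)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Dependencies: ingest must finish before transform, transform before load
    t_ingest >> t_transform >> t_load

The gain over cron-style task management is that retries, backfills, and failure alerts for each step are handled by the scheduler rather than by hand.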
STAGE 3: MATURE EFFORT
Incremental benefits
Cutting-edge nowcasting machine built to be intuitive to your firm’s analysts
Novel risk models and stock screens
Systematic trading strategy you can choose to allocate funds to
Incremental costs
Headcount: further hires; founding members need compensation bumps
Data: sourcing via VC - e.g. Two Sigma and SumAll
Infrastructure:
Emphasis on quality control of existing data universe and on making it easier to cross-link disparate datasets
Headcount considerations:
Effort is most efficient if some analysts have extensive experience with Python / R as well as statistics and can work directly with Spark
Minimal knowledge can do more harm than good - can lead to massive inefficiencies, basic conceptual mistakes, etc
Data headcount explodes if everything has to be surfaced using BI tools - templates pile up and they each need to be maintained
To keep compute costs in check and allow near-instant responses to queries, it is important to identify the datasets and intermediate outputs common to most queries and cache them in an intermediate layer
Relevant tools include MemSQL, Redis, and Cassandra (see the caching sketch below)
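A minimal sketch of that intermediate caching layer, assuming Redis as the cache; run_query is a hypothetical stand-in for the expensive Presto / Spark / warehouse call, and the key scheme and TTL are illustrative.

# Sketch of an intermediate cache: results of common queries are kept in Redis
# so repeated dashboard requests do not recompute against the warehouse.
# run_query(), the key scheme, and the TTL are hypothetical placeholders.
import hashlib
import json
import redis

cache = redis.Redis(host="localhost", port=6379)
CACHE_TTL_SECONDS = 3600  # refresh cached intermediates roughly hourly

def run_query(sql):
    # Stand-in for the expensive call to Presto / Spark / the warehouse
    raise NotImplementedError

def cached_query(sql):
    # Key the cache on a hash of the query text
    key = "q:" + hashlib.sha256(sql.encode()).hexdigest()
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)
    result = run_query(sql)
    cache.set(key, json.dumps(result), ex=CACHE_TTL_SECONDS)
    return result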
DATABRICKS DEMO
Outside some specialty cases, at present the most efficient platform for big data work is Spark. Databricks was founded by the creators of Spark and contributes over 75% of the project's open source code. Databricks has developed a platform with proprietary tools and resources to enhance and complement core Spark features.
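For orientation, here is a minimal PySpark sketch of the kind of job a Databricks-style notebook typically hosts; the input path and column names are illustrative assumptions, not a reference to any particular dataset.

# Minimal PySpark sketch: aggregate weekly spend per merchant from a
# (hypothetical) transaction dataset. Paths and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("weekly_spend").getOrCreate()

txns = spark.read.parquet("s3://bucket/transactions/")  # hypothetical input

weekly = (
    txns
    .withColumn("week", F.date_trunc("week", F.col("txn_date")))
    .groupBy("merchant", "week")
    .agg(
        F.sum("amount").alias("total_spend"),
        F.countDistinct("card_id").alias("unique_cards"),
    )
)

weekly.write.mode("overwrite").parquet("s3://bucket/curated/weekly_spend/")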
Further information on Databricks is available in this deck. To discuss how investment firms are leveraging this platform, please contact Jason Ferrante and mention Augvest.