When I started in security analytics several years ago, the choice of tool and platform was typically dictated for you, usually based on earlier investments the company had already made. These days, scientists have the opposite problem: a dizzying array of tools in a variety of licensing modes. The frustrations of limited toolsets have been replaced by the anxiety of choice. As wonderful as unlimited options may seem, in reality we must limit our options in order to be successful. Ideally, an organization can converge on a single choice: not perfect, but one that allows maximizing benefit while decreasing the challenges of maintenance.
At JASK, we have chosen a toolset that we think does that: Google Tensorflow. At a high level these were the reasons:
- Data Science needs a toolset that can take advantage of either CPU’s or GPU’s, or a mix of them.
- A product for model building must recognize that the best language for modelling is not the best language for algorithms.
- The experiences of local development and cluster development should be the same
We need more cowbell.
It seems intuitive to use as much processing power as a piece of hardware offers; unfortunately, we rarely have this option. Most notebooks and workstations either have a combined GPU/CPU on board (not always NVidia), and high performing GPU’s are a special-option only on most servers. On the other hand, while a GPU is fantastic at certain problems (matrix multiplication, for example) no class on GPU programming would tell you to do everything on a GPU. If you did hear this in a class, I recommend a supplement Heterogeneous Parallel Programming. Tensorflow meets this requirement: I can develop on a laptop with no GPU’s, then run the same node on a cloud instance with an array of GPU’s installed.
A statistician and a mathematician walk into a bar …
Back at University, Computational Finance and Applied Mathematics shared some faculty, even attended the same graduation ceremony. Yet, all their coursework was in R and ours was in Matlab, which I think is the most concise illustration of model vs algorithm building in terms of software tools. Here’s another one: some believe in having a minimal knowledge of each algorithm’s inner workings and a wide view of all the possibilities and available tools, while others believe in the need to understand fewer algorithms but deep enough to program them yourself. I now have a theory for a likely reason behind this: your position on the spectrum I described, is a function of how much hate and fear you have for C and C++ programming. To unite these examples, the Quant’s and the Amath’s both knew python, and to take advantage of decades of numerical optimization you have to do it in C (or let’s face it, Fortran). ML solutions must be built on something that can bridge these two worlds: Tensorflow’s Python code for the model, which is compiled into C builds that bridge.
Anyone know a pop culture reference about parallel programming?
As much as I would like every data scientist in the world to have their own Hadoop cluster, we know that’s not going to happen. Also, in line with Moore’s law, today’s laptop surpasses the main frame I helped my father load punch cards into when I was little. Doing your development on clusters is expensive, and debugging and testing become problematic as well. I have found that I am more willing to give up some application performance than to pay the price of easy debugging and testing. I find that with some education, data scientists can be persuaded to do their development with “small data”, and we can treat cluster paralleling and performance in a separate step. The ability to develop, test, and run on a local machine and then treat parallelization as a configuration step is a very nice thing about Tensorflow.
Does Tensorflow have everything we need? While baked-in visualization and a large user community are very beneficial, I would trade that for a tool that ran GPU’s from different vendors in a heartbeat. And while it was our choice, there are other good ones to evaluate for yourself. Your mileage may vary, when deciding whats the best tool for you, I recommend also looking at Theano, DSSTNE, and sklearn to see if they are a better fit for you.
But as a team, you have to start somewhere, and my experience has shown that “somewhere” should be somewhat close to what it will look like in production, and something that has enough capability so that you are not limited greatly or required to have 50 different software packages for 50 problems.