Is “data scientist” the new “programmer”?

[+] thidr0|7 years ago|reply

Something about this article strikes me as a thinly-veiled complaint about poorly designed object-oriented systems. Take, for example this comment by the author:

>Even if the money were half of what today’s coder gets paid it might still be a better job because one is spared the tedium of looking at millions of lines of Java that do almost nothing!

What all those millions of lines of code are is abstractions, decoupling, and modularization of logic/responsibilty. This is hard-won knowledge from the field of software engineering. Granted, a lot of it is probably very poorly designed or organized. But the problem is the design and not the philosophy.

Because scientists all use the same basic rules of math, but each business will each have it's own special rules (i.e. not all payroll software implements the same policy/axioms), this makes it really easy for the hard logic of scientific work to be in a general-purpose library. "Normal" developers need to customize their own rules, or in other words, develop their own services unlike the data scientists.

Now if every data-scientist had to roll his/her own version of numpy, pandas, sci-kit learn, tensorflow, etc. the author would probably be decrying the deluge of procedural spaghetti produced by data scientists. The data scientists' notebooks look simple because of all that indirection is hidden away in the many libraries.

[+] skybrian|7 years ago|reply

All of today's software is built on millions of lines of code. This isn't much of a problem when it's hidden away behind a good abstraction. Being written in another language forces the API to be documented well enough so you don't need to go deeper: "native code", "kernel code", "part of the browser".

Crappy million-line Java apps are generally crappy not due to raw line count but rather due to leaky abstractions and badly designed APIs, so you do have to navigate through a lot of that code.

[+] romaniv|7 years ago|reply

>But the problem is the design and not the philosophy.

If the philosophy prescribes dozens of tools for managing complexity and no tools at all for reducing it than it is the problem.

"Abstractions, decoupling, and modularization of logic/responsibilty" are not some kind of universal good. They are only useful within specific contexts. A lot of software engineers do not understand this and routinely engage in premature abstraction. As a result they produce systems that are 10 times more complicated than they need to be for absolutely no reason.

Java definitely encourages this kind of mentality, because the language itself and its standard library lack in some fundamental areas. Introduction of lambdas and streams helped, a lot, but the overall mentality is still well-entrenched.

[+] wiz21c|7 years ago|reply

I've been hearing the same complaint between Cobol and Java for years : it was simpler before, more efficient, etc.

Of course it was, but you were tied to one system (no application server), security was login/pw, database had no constraint, typing systems were ultra limited, everybody has its own way of writing batches (no Spring), business code was mixed with tons of technical code (no JPA).

Now, sure, if you glue some R, some SQL, etc. you can extract insights worht millions of dollars. But all of that exist just because we have digitalized all of the processes, data collection, etc. And the rise of data scientists will continue only if there are more stuff put in the databases, thanks to you plain, regular, normal programmers...

[+] watwut|7 years ago|reply

Personally, I am increasingly convinced that a lot of this hate comes from programmers with weak abstract thinking who simply cant do it. Instead of admitting that there is learning cure involved, they will claim the system is bad and everyone else is bad. Compounding factor is difficulty dealing with system that was written by different people who holded different opinions.

Yes, there are badly designed large system. No large system is perfect. However, there are also reasonably designed large systems, including in Java and Java is used in such system for a reason. It is more challenging to write large system. Yes, it is harder when parts of system are written in style that was considered best practice few years ago but was abandoned since then.

If you are spending a lot of time looking at millions of lines of Java that do almost nothing, them you likely dont really know what it what and need to read up more. At least that is my experience.

[+] Cthulhu_|7 years ago|reply

The problem isn't the abstractions, it's the sheer size of the codebase. The author - and I think most people - prefer a codebase they can grok. Nobody can grok millions of LOC, at best they can have a high level overview of what does what.

At that point you need the abstractions and practices that make code boring.

[+] sowhatquestion|7 years ago|reply

It would be far from the first HN article (or comment) that failed to make a distinction between "poorly designed object-oriented systems" and the very idea of OOP, design patterns, etc

[+] quadyeast|7 years ago|reply

nit: is it "about poorly-designed, object-oriented systems"?

[+] vinayms|7 years ago|reply

The take down on abstraction and software engineers (by using Java as an example) is similar to saying "back in the day to find a prime number we would simply use a sieve, but today it is a tedium, what with all the pi's and e's and thetas that get in the way, and what are geometry and polynomials doing here, and what in the god's name is this i, I just want to count the prime numbers which are nice round whole numbers".

That's what happens when a topic grows from being a curiosity where dilettantes dabble into a proper field that is applied to solve problems. Granted, some of the developments can indeed be tedious and self indulgent, but otherwise this is the natural progression. Its sad and frustrating when people who ought to know better make such statements. Is it done to provoke a critical analysis, positive trolling if you will?

About the role of data scientist, I find it both amusing and disappointing that just about anyone with a three week MOOC gets to work in this field, who otherwise had never dealt with statistics before. I mean, statistics is a three year long grueling applied maths degree, and condensing it to three weeks is silly. It is actually in this way that it is similar to progamming job of the 90s (I don't know how it was in the 70's, I wasn't born yet). Just about anyone who could learn Java or VisualBasic, or the self taught cowboys who used C, ended up programming professionally. Actually it was not that bad, for coding is not as hard as its made to be, but that until they got sucker punched by the n squared complexity, to say the least, on a big data. Coding couldn't help them and they realized programming was more than learning to code and using some API's and system calls. (I was one of them in a way, when I started to code in C++ to model and simulate my mechanical engineering project, and it lead me to the path of enlightenment.) So, today's data scientists who are not bonafide statistics graduates or statisticians have it coming as well, whatever the analogue is, unless they are merely "data monkeys" in which case all is well and as expected.

[+] dnautics|7 years ago|reply

I'm a scientist (wet lab) by training, a programmer (back end) by profession, and a data scientist by hobby (I have a machine learning project that I'm working on), and most of "data science" is not really stats... There will be a bit of stats at the end product but really the bulk of the necessary work is data curation. Annoying stuff like making sure my data fit into the right buckets.

I did have to debug a memory leak that only showed up when I deployed my data pipeline on 22 cores.

[+] karmakaze|7 years ago|reply

"data monkey" is even more of a fitting name than the corresponding "code monkey". Much of so called data science leans more on the side of data engineer where one fits existing solutions to your specific data. The split of data scientist and data engineer is the most unfortunate. It's like splitting programming into program design and development (opposite of devops) in a specific language. That's done too but usually the spec is functional (behavior not FP sense) and not algorithmic.

If this pattern works for anyone, great keep running find it he limitations. I just believe that there will be a better structuring and selective application movement.

A true data scientist would be doing research into new solutions or high level improvements. This can't happen at typical sized companies unless it's the core product and not a feature of one.

The big data, data scientist/engineer bandwagon is a little like blockchain. Everyone wants to leverage it, there are places where it is suitable but not everywhere where it's applied.

[+] oldandtired|7 years ago|reply

A good programmer is someone who can communicate with the problem domain experts and provide an a solution to fit their problem. Someone who understands the limitations of the computing environment and can engineer a solution that is adequate for the problem space problem.

Many who consider themselves programmers produce solution space solutions that just don't get to the core of the problem space problems. This is a function of the simple fact that many programmers never have the opportunity to see what the problem space experts are doing or actually need. This is a real shortcoming in the education of programmers.

We don't have to be subject matter experts in all fields, we just need to become competent in being able to understand the kinds of problems that are being faced by the various subject matter experts that we build systems for.

On the other side of that coin are those who are subject matter experts who think it is easy enough to become competent programmers. What they miss is the essential problem that programming is, itself, a field that requires a subject matter expert. I have come across too many systems that have been developed by the subject matter experts that were just wrong. Wrong in design, wrong in understanding the limitations of the tools being used, wrong in oh so many ways.

To build properly functional and functioning systems requires the cooperation, input and continual communication between those who are subject matter experts facing problem space problems and those who are subject matter experts in computing systems. This is a rare event and so we see the problems in every field with the computing systems that currently exist.

[+] FranzFerdiNaN|7 years ago|reply

I have never seen someone with “a 3 week mooc” getting a data science job. In fact, those jobs are being gatekeeped to a ridiculous degree, suddenly asking for PhDs for jobs that are barely more than a regular BI job.

[+] thisisit|7 years ago|reply

> who otherwise had never dealt with statistics before.

And why is statistics required? Let's face it most companies who need "Data Scientists" are looking for regular BI guys with fancy terms. Most of the problems are solvable using out of box functionalities in python/keras etc. Sure there are places and problems which require hard mathematics and stats but those are few and far between.

[+] hef19898|7 years ago|reply

Did you ever the pleasure to work with a Six Sigma Black Belt who had the most of one week statistics training? I am one, but honestly this is just enough to do some back-of napkin number crunching for purely operational purposes. That you need a math PhD is maybe exaggerating as well on the end of the spectrum.

That being said, the abilit to talk to domain experts and accept their experience is one of the most important skills for a true data scientist. Without proper context all the data in the world gets you nowhere.

[+] digitalzombie|7 years ago|reply

> statistics is a three year long grueling applied maths degree, and condensing it to three weeks is silly.

I agree. I was in a very prestige organization and they didn't know what a statistician really do and just hired CS machine learning PHDs. Even those people don't even know what statistician does. One person gave me ISLR when I ask for advice to get hire at this prestige place (I did an equivalent of this over several graduate courses in statistic program).

Another person proudly told me that in his project, he was using GLM stating he knows GLM. I asked what was the link function and that person stated he didn't know and it's somewhere in the code...

I've since then double down on statistic and will be going into Biostatistic field instead of data science. It feels like there are a lot of impostors especially in start ups and government organizations in data science. I have no clue why but there is just this culture in tech industry that have made me left it for better field. I've intern in the biostat field it is much better. CNN and other have listed with high quality of life.

[+] analog31|7 years ago|reply

What I took from the post was not that a "data scientist" qualifies as a "programmer" in the modern sense, but in the sense of the kinds of things programmers did in the 1970s. And maybe he's expressing some nostalgia for those times.

I learned programming around 1982. I didn't pursue a programming career, but went to college and majored in math and physics. Today I often use programming in the way that a data scientist might, solving problems using high level tools. The data that I deal with are physical measurements. I'm not employed as a programmer.

I also work with a lot of programmers, so I get a glimpse of what they're doing, maintaining a million-line code base. And I have to admit that being thrust into that environment would have me waxing nostalgic about the good old days too. I'm happy doing what I'm doing, and happy that someone knows how to turn my stuff into production code if it ever gets to that point.

What I'm really doing is applying my domain knowledge in a domain that happens to depend heavily on computation. To answer Greenspun's question, what I'm doing is certainly more interesting -- to me. I have colleagues for whom wrestling with the monster code base, and the kinds of engineering it requires, are their source of fascination.

[+] minimaxir|7 years ago|reply

> Consider that the “data scientist” uses compact languages such as SQL and R. An entire interesting application may fit in one file. There is an input, some processing, and an output answer.

The argument this post is making is reductive. Yes, sometimes data science is simple. Sometimes it isn't, and that's when you really need someone with the appropriate skillset.

[+] portal_narlish|7 years ago|reply

Data scientist, programmer and software engineer are different things. They are not disjoint by any means, but this guy is conflating them in a way that's totally wrong.

Software engineers have to engineer things. They deal with production applications, distributed systems, concurrency, build systems, microservices... coding is sometimes only a small part of the job.

Data scientists nowadays do programming in interest of research, modeling and data visualization. But they are not only programmers - they are usually supposed to have an applied statistics or research background. Some also do software engineering, especially at companies serving data science/ML in their products.

A programmer is actually someone like a data analyst or business systems developer. They don't have to build systems themselves, they just write loosely structured code against existing systems. Like writing SQL queries for dashboards, or drop-in code for things like Salesforce. This is probably the closest thing to what he's describing as the "70s archetype". Minus the deep optimization stuff.

[+] Fiahil|7 years ago|reply

I agree with you. I've seen brilliant Data scientists struggling to understand how git branching works. But, as you say, their principal focus is applied statistics, not programming.

My role as a software engineer is to create a good enough architecture so their can use properly the information contained in their 60 GB CSVs.

As a side note, I also noticed that clients have no issue paying a lot for _Data Science_, but for the "software guys" ? That's a whole other story, despite being of equal importance to the project.

[+] kosei|7 years ago|reply

I think you're taking this analogy too literally.

[+] Jtsummers|7 years ago|reply

Deming, from Out of the Crisis (1986):

  People with master's degrees in statistical theory accept
  jobs in industry and government to work with computers. It is
  a vicious cycle. Statisticians do not know what statistical
  work is, and are satisfied to work with computers. People
  that hire statisticians likewise have no knowledge about
  statistical work, and somehow suppose that computers are the
  answer. Statisticians and management thus misguide each other
  and keep the vicious cycle rolling. (p. 133)

This is what today's data scientists are. Last century's statisticians, similarly hired for misguided reasons (we need them because our competitors have them!).

[+] thousandautumns|7 years ago|reply

I'm not sure I follow. Specifically, what is meant by, "Statisticians do not know what statistical work is, and are satisfied to work with computers."?

[+] r-bit-rare-e|7 years ago|reply

[deleted]

[+] manfredo|7 years ago|reply

No, at least in my understanding data scientists specialize in the analysis of data rather than the development of software. You'd hire a data scientist to look for interesting patterns in data, or create machine learning models, and other data analysis tasks. These tasks may involve writing code, but it's usually specific to data analysis, often in R or Matlab or similar. A lot like how many people in the natural sciences pick up coding to enhance their capability, but the software writing is a means to an end.

I wouldn't hire a data scientist to build a web app (well, I would if he or she had the necessary knowledge and skills - the job title wouldn't be "data scientist" though). "Software developer" is much closer to "programmer".

[+] mr_toad|7 years ago|reply

I think the point of the article was that it used to be more common back in the 60’s and 70’s for programmers to work on data problems. From basic stuff like census tabulation or designing file systems, to creating trigonometry or t-statistic tables, to AI.

There was less specialisation, less of a divorce between programmers and users.

There also seemed to be a conflation of computing and AI back then. Lisp was considered AI. And the early computing pioneers and theorists were strongly interested in AI, logic, and mathematics.

[+] sixdimensional|7 years ago|reply

This post states that a data scientist uses compact languages such as SQL and R.

Genuine question - do people really believe that being able to write and understand complex SQL makes you a data scientist?

I ask because, I've been writing some of the nastiest, most difficult looking SQL around for probably at least 15 years. And yet, I would NOT call myself a data scientist because I know and can work with data and use SQL. It might make me a data engineer.

What would make me a scientist is the process, method and rigor I apply to data-driven research and in practice. It's not about what tool I use or how complicated that tool is.

I often get a whiff of imposter syndrome over this because, if being "great at SQL and R" is enough to get the big bucks as a data scientist, then I'm clearly doing it wrong. But, then again, maybe I'm being too literal thinking that a scientist means something different.

[+] telchar|7 years ago|reply

I've been working as a data scientist for several years and have written some pretty gnarly looking SQL myself. I have a background in math and hard science so I have some understanding of the scientific method as well. While I respect our DBAs I wouldn't call any of them qualified to be data scientists.

While I have been able to hold my own in this job I went back to school to pursue a graduate degree (partly) because being in the field has shown me how much more there is to know. While it's easy enough to train a simple model in R there are so many ways to fool yourself and produce an invalid analysis and so many variations on otherwise-simple problems.

It seems this field has a lot of variation. A glorified report writer might get the DS title but they're not going to get the really cool jobs.

If you're interested in data science try out a kaggle competition and try to place high. The variety of methods and tricks people try to improve their entries can be illuminating, I think.

[+] thousandautumns|7 years ago|reply

Firstly, it states that "a data scientist uses compact languages such as SQL and R". It doesn't state "everyone who uses SQL is a data scientist".

That said, the term data scientist itself is a bit frustrating. It gets thrown around a lot as if it is a well-defined role, and it is anything but. In my experience, the role of a "data scientist" is about as well defined as the role of an "engineer": it has connotations about the type of work and maybe a few shared skills, but the specifics of what an "engineer" does and their skillset varies widely depending on if they are a software engineer, an electrical engineer, or a civil engineer.

So while I think that most data scientists know SQL or use SQL frequently, I don't think that all data scientists use it, nor do I think that everyone who uses SQL works in a role that would probably be considered that of a data scientist.

[+] WhompingWindows|7 years ago|reply

That covers the "data" aspect, for my work however, the "scientist" aspect is just as important. While I'm expected to use SQL and R to generate reports, I need the thought process of an epidemiologist to construct my analytic samples. I also require the scientific knowledge and background to interface with MDs and clinical PhDs, who need me to bridge the gap between data and science.

[+] talltimtom|7 years ago|reply

> do people really believe that being able to write and understand complex SQL makes you a data scientist?

Many data scientists use R and SQL, that does not mean that many of those who use R and or SQL are data scientists.

Many lawyers use word. Yes I’m not a lawyer just because I use word.

[+] jiggunjer|7 years ago|reply

Your second sentence does not follow from your first. Just because Y's do X, doesn't mean everyone who does X is a Y.

[+] anoncoward111|7 years ago|reply

youre being too hard on yourself and you should go apply for the big bucks. most scientists barely deserve the title

[+] booleandilemma|7 years ago|reply

Sure, we’ve stolen the term “engineer” for long enough, let’s bother the scientists now.

[+] 3chelon|7 years ago|reply

I've interviewed a few "data scientists". Some of them were pretty arrogant. Their idea of a "close to the metal" language was Numerical Python. I don't think these guys are going to be writing the next generation of OS anytime soon.

[+] danso|7 years ago|reply

This is such a bizarre post. The reason why people use a language like R is because it is easy to learn and use (and install, via RStudio) for data analysis without having to be a well-trained programmer. I can’t recall ever hearing from anyone who has relied on R, doing so because it was computationally efficient. The point of the language is convenience — particularly with how easy it is to create attractive graphics using ggplot2’a defaults.

It’s a testament to the R library’s developers (particularly Hadley Wickhan) for making APIs that do so well in streamlining data work. But I’m willing to bet a majority of R users, particularly in academia, could not load a simple delimited data file without a high-level call such as read.csv.

(By “simple”, I mean a delimited text file that could be parsed with regex or even split. I don’t expect the average person to be able to write a parser that dealt with CSV’s actual complexity)

[+] EdwardDiego|7 years ago|reply

The fact that R has such buy-in despite being a rather awful programming language (a friend of mine worked on the next Lisp-like version of R under Ross Ihaka, and the next version is based on the fact that current R is a bit awful) is precisely because it offers such convenience to non-programmers.

In my sister company, they have data scientists, and data engineers. The data scientists write their algorithms in the language they're most comfortable with (typically JS), and the data engineers rewrite to perform efficiently in the application that's applying them.

Data scientist and programmer are two very different specialisations.

[+] minimaxir|7 years ago|reply

A trend I've been noticing (especially as ML/AI tooling becomes more accessible) is that people believe the quality of data science code and workflows is proportionate to its complexity/LOC (since complex problems require complex code, right?).

It's a toxic perspective that ignores recent and pragmatic innovations in the field.

[+] pizzazzaro|7 years ago|reply

I think the post is referring to some idea of "glamour" or the lucrative nature of a rapidly emerging field.

Meanwhile, both demand that the employee spend all day telling a computer what to do.

[+] aogl|7 years ago|reply

In my company I am a software engineer and my colleague is a data scientist, our current project that we work together on does a lot of NLU and NLP type work (think bots) and our skillsets often don't overlap and are both equally valuable to the projects success. That is, I tend to write the infrastructure and platform code that ties everything together and deal with all the software engineering type work, while my data scientist feeds in trained models and the likes. Both are necessary to handle contractual requests/responses as per our scope design.

[+] gnulinux|7 years ago|reply

My experience is very similar to this as a "software engineer" in a company who has 50% 50% split software engineers and data scientists.

[+] DonHopkins|7 years ago|reply

There was a sign on the door to the Vax Lab at the University of Maryland that said "Department of Research Simulation".

[+] dekhn|7 years ago|reply

A data scientist is just a statistician who works in the Bay Area.

[+] screye|7 years ago|reply

As someone who has been looking for Data Scientist jobs in the past few months, I can reliably say that the term can mean everything from software engineer for big data systems, SQL guy or a person that builds complex machine learning models.

It is just as vague as the job profile of a "programmer". In that sense, the title is right. But, in the context of the article's content, I disagree.

The job done by a data scientist in demanding roles, requires a strong grasp on undergrad level statistics. But because of the recent trends towards ML, the person also needs to have a strong grasp of linear algebra, vectorization and software engineering / undergrad algorithms.

While it is unlikely that one data scientist may need to summon the whole skill set, an interviewee will never know which subset of these skills you will be asked to demonstrate to get hired.

Modern software jobs have figured out distinct subset of skills needed to differentiate between different software roles for experienced employees. Junior level employees are barely even expected to know anything other than algorithms, data structures and high level system design (at least during interviews)

Another funny observation (anecdotal) is there seem to be more openings for "senior data scientist" (who is expected to know everything), than "junior data scientists" whom the company is willing to mentor.

As of now, I find myself scrambling to decide which skills I need to prioritize, often feeling like I am being pulled in opposite directions. Almost of which require formal instruction (the maths), and can't be picked like software skills through youtube and online projects. This isn't a knock against software, just different type of subject matter.

Companies interviewing for these roles may ask everything from leetcode algorithms questions to statistics to questions about modern ML algorithms and domain specific models (in NLP, Vision, finance, recommenders)

I personally find a "junior" Data Scientist's role (in expectations) to be harder than that of a junior SDE. There is a reason many these jobs will put phD into preferred qualifications. It is ironic that there has been such a massive surge of people without the necessary background, who do a couple of MOOCs and crown themselves data scientists. Being good at any software & math heavy domain is hard. Data Science is no exception.

[+] rawoke083600|7 years ago|reply

Forgot who said it but it was great: a "Data scientist" is a programmer better at stats than any 'normal' programmer and better at programming than any 'normal' statistician." :P

[+] jeffreyrogers|7 years ago|reply

There is a good comment on the original article by a user named LauraConrad. I'm excerpting it so HN readers will see it:

> I was a “Programmer” in the ’70’s, and I keep thinking how much of what my early programs did would be done by a spreadsheet now (or any time since the late ’80’s).

[+] alkonaut|7 years ago|reply

I thought data scientist was closer to doing the work of a statistician than a progrrammer. Visualizing data, and analyzing data. Programming becomes part of it by necessity.

Data science is also a much sexier term than statistics, just like "Machine learning" and "Artificiall intelligence" is a lot sexier than, say, "Regression".

As someone funnier than me put it: "A data scientist is just a statistician with a mac".

[+] theoh|7 years ago|reply

The basic premise of the article is that

systems programming=irrelevant bloat and abstraction

while

data reduction=definite purpose and utility

People writing python notebooks to do data analysis are probably fairly comparable to the scientific computing programmers of the past, but I feel like this picture tends to dismiss the computer science side of systems programming:things like GUIs, network code, processes and virtual memory, all the architectural aspects of computing.

One might prefer APL or Forth for writing one-page programs, and it's probably true that systems now are bloated relative to what they could be. Still, there is much of interest going on in a typical operating system, compiler or video game, while a typical data analysis notebook is IMO fairly dull and even basic, from a software angle.

[+] andyburke|7 years ago|reply

Yeah, the author's take is myopic. What they call bloat, people from the 70s would call wondrous: ubiquitous networking with and without wires, beautiful graphical interfaces, encryption everywhere (and expanding), far more open systems than proprietary re-engineered ones, the list goes on and on.

[+] coryfklein|7 years ago|reply

If it means we now get a term that, at least for a couple of years, filters out all the garbage roles recruiters throw at me then I'm on board with adopting this terminology.

239 comments