My experience is in quant hedge funds, where sometimes you get some guys who develop the strategy and some guys who put it into production.
Yes, I do admit there can be some specialization in terms of time spent on science vs engineering.
But you really need people who understand both. Particularly if you have a strategist who thinks his job is just to dream up profitable models, he ends up carving that role out in a way that's detrimental to the rest of the team. You get people who just don't appreciate that there's other work to do than finding models, and that models depend on that other work to function.
You also get a huge prestige gap, because inevitably management will think that there's a magician and a blacksmith. One guy needs to be paid a lot, and the other guy needs to be paid enough.
These effects feed each other. Magician will say "where's my data" and expect blacksmith to make it, promptly. He won't do it himself, because spending time on mundane stuff makes the magic disappear. And not doing it yourself, or taking the time to understand it, will eventually lead to problems with the magic.
> Particularly if you have a strategist who thinks his job is just to dream up profitable models, he ends up carving that role out in a way that's detrimental to the rest of the team.
My god, this. These people make me bonkers. Especially because I feel like I have a bit of this tendency myself, the desire just to think big thoughts and do no actual work. Happily, I long ago learned that ideas were approximately worthless without labor, and that I anyway had much better ideas when laboring because it forced me to engage with the details.
And yes, those people can poison a team. My best working experiences have all been with people who a) all valued actual work and b) believed that everybody could have good ideas.
Also, a lot of data scientists find the science fun and the engineering boring. But they have overlapping skill sets - if you aren't good at one, you're probably not good at the other either. Somebody who shows up to a team with the goal of only modeling and pushing all the dirty engineering work to their teammates is basically a worst case scenario because
1) They probably aren't going to produce good models since they're not sensitive to data nuances, but now they've taken over ALL the modeling work.
2) They bring down the job satisfaction of everyone else on the team who would like to be doing at least some modeling.
3) They're sucking up the prestige that should be distributed over the entire team and management thinks they should be paid more for work that it turns out everybody thinks is more fun anyway.
My number one advice to entry level data scientists is to not be this guy. Don't give your interviewers the impression that you won't do your own engineering work because they won't want someone who brings negative value to the team.
I worked in investment banking (as an analyst, not an engineer), so a very different part of finance, but this was my take as well. Companies might love to talk about how important engineers are, but at the end of the day, if you can't directly link someone to revenue, they get viewed as a cost center and take on second-tier status in the organization. Then the same companies complain that they can't find (or retain) enough engineering talent. Not many places get the balance right. Silicon Valley treats engineers well because for the most part, the value they bring is more obvious (and also, they don't threaten the existing hierarchy in the company). Curious to hear if anyone has had the opposite experience.
I'm in industrial automation, but it's much the same. Projects where someone developed a strategy but has never been involved in the details of a machine are doomed to failure (or at best to be unreliable and producing low quality parts). Projects built by machine fabricators are over-engineered, frequently late, and sometimes unprofitable, but damn if they don't work well.
The main trouble, I think, is that when a shiny new contraption is brought to the king, it's too often the magicians doing the talking - whether they're speaking words of power or Common, their job is to talk. Meanwhile, the blacksmith is probably busy in his workshop on some ornate scrollwork for the next thing, or repairing the previous gizmo, because he'd rather be hammering away at his anvil than talking.
The higher you go in an org chart, the fewer the number of people who understand the work their company actually does, and the more voices you have between the workers and the decision-makers to take some of the credit for work as it passes up the chain.
To add, quants that can't do the data engineering work are always crappy quants. I haven't seen a counter-example to that. Profitable models aren't going to be delivered on a silver platter. They need to be able to process pretty low level data effectively and build ad-hoc custom tools and data pipelines around that to test out their ideas. Otherwise they're constrained to the tools others have built and that massively narrows the search space that they're capable of traversing.
The best quants are 1/3 statistician, 1/3 developer and 1/3 trader, in my view.
What must be communicated to management: It is easy to find other magicians. It is not easy to find another blacksmith. Without the right blacksmith, there can be no magic.
Magicians will be magicians, always hustling (bullshitting), but they will never have the value and job security of the blacksmith. The blacksmith can see the fruits of her own labour, whilst the magician must lie to herself and others in order to claim the blacksmith's value as her own.
If the blacksmith is good enough, she will earn the trust of management and management may consult the blacksmith in the selection of magicians. Management may ask the blacksmith to interview magicians and seek her advice on the final hiring decision.
The blacksmith may not carry the "prestige" of the hustling, bullshitting magician but she can command a high salary and dictate her own working conditions. This is only if management understands her value. What the magician thinks of the blacksmith is irrelevant.
Reliable blacksmiths are hard to find. Magicians are a dime-a-dozen.
I see this same attitude about TDD adoption - teams in my company say things like “testing is for lackeys / that work is beneath us”, i.e. they see that as the responsibility of QA testers, who are less important in their view. This is short-sighted and arrogant, and it encourages similar problems with superiority complexes. TDD is still controversial in some circles, but engineers who have a deep understanding of both tests and implementation are far more valuable than those who only understand one side. Anyway, sorry for the somewhat off-topic rant, but a lot of what you said resonated with me.
I think this insight exists across a lot of fields. Basically, if you want to be a really excellent magician you also better be a decent blacksmith. More concretely in this case, if you’re unable to do the data “engineering” yourself then it will close a lot of doors for interesting and novel work on the “science” side. Beyond that, if the scientist’s job just involves gluing sklearn models together I think that job is more on the engineering side of things than the supposed scientist usually wants to admit.
This problem only grows as the company scales and the science and engineering pieces are formally split along some role guideline.
Inevitably, if you treat a job role as a support role, you'll attract weaker individuals into that role than you would get if it wasn't considered a support role. The problem with science-oriented teams is that all roles other than the science role morph into science support roles over time. The same pattern used to occur with engineers and QA, or engineers and ops.
How do you find people like this? From my limited experience (college senior joining an HFT firm shortly, so I've recently been in several quant finance SWE interview loops), firms seem to vastly downplay the financial aspects of the job for software engineers. On top of that, firms don't expect or encourage financial backgrounds for engineers (at least new grads); the expectation is that whatever limited financial background we'll need will be given to us when it becomes necessary.
Is this because it's easier (obviously) to teach a quant engineering than it is to teach an engineer quant finance? Or rather because it's expected now that traders will become the bridge between researcher models and implementation, and engineers will simply provide the underlying infrastructure to power these implementations?
As I see it you need people who have shallow knowledge of many areas and deep knowledge of one area. That lets you have a group of experts but ones that know enough about other areas of expertise to work with those other experts.
This perception of classes within engineering is the greatest frustration of my career. People with a PhD or “scientist” in their title are not more valuable than the engineers who end up being the ones to get things to work.
From experience, the magician will take every chance to make this divide greater, and sell their expertise, rather than grow with (and help grow in domain) the blacksmith skills.
You end up with magic: closed siloed knowledge.
“How would the blacksmith ever understand magic?” was often something thrown around by magicians at meetings.
The (repetitive) blacksmith role is not an interesting one; a digital revolution needs to take place. Architects who build tools and self-service systems are much more interesting.
That's interesting - I just finished a book on Jim Simons/Renaissance (The Man Who Solved the Market). One of their early advantages was having a person who was focused solely on acquiring and cleaning data. I expect that advantage has largely gone away at this point due to the wide availability of market data, but I thought it was interesting in the context of this article.
The data lifecycle is waaay overpopulated with Data Scientists who are not empowered or knowledgeable enough to work with product designers and engineers to do everything that empowers Data Science and ML.
We need more Data Engineers involved at time zero in projects to help:
1. Plan out what data should be produced/captured by the product
2. Instrument systems to actually generate data consistently and effectively
3. Build ETL pipelines and data management systems
4. Manage enterprise data sharing and resiliency
etc...
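For what it's worth, here is a minimal sketch of what points 1 and 2 could look like if events are validated at the moment they are captured rather than cleaned up months later. Everything in it is an assumption for illustration: the field names, the `validate_event` and `load` helpers, and the idea of a separate rejects list are hypothetical, not something from a particular product or library.

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical event schema: these field names are illustrative only.
REQUIRED_FIELDS = {"event_name": str, "user_id": str, "occurred_at": str}


@dataclass
class ValidationResult:
    ok: bool
    errors: list


def validate_event(event: dict) -> ValidationResult:
    """Check one event against the agreed schema at capture time."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in event:
            errors.append(f"missing field: {field}")
        elif not isinstance(event[field], expected_type):
            errors.append(f"wrong type for {field}")
    if not errors:
        try:
            # Timestamps are assumed to be ISO 8601 strings with an offset.
            datetime.fromisoformat(event["occurred_at"])
        except ValueError:
            errors.append("occurred_at is not ISO 8601")
    return ValidationResult(ok=not errors, errors=errors)


def load(events, sink, rejects):
    """Route valid events to the sink; keep bad ones with their reasons."""
    for event in events:
        result = validate_event(event)
        if result.ok:
            sink.append(event)
        else:
            rejects.append((event, result.errors))


if __name__ == "__main__":
    sink, rejects = [], []
    load(
        [
            {"event_name": "click", "user_id": "u1",
             "occurred_at": "2020-05-01T12:00:00+00:00"},
            {"event_name": "click"},  # incomplete on purpose
        ],
        sink,
        rejects,
    )
    print(len(sink), "accepted;", len(rejects), "rejected")
```

The point of keeping the rejects with their reasons is that someone can fix the instrumentation at the source instead of every downstream consumer patching around it.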
What ends up happening is you have a bunch of Data Scientists just handed a pg_dump or flat file from some ops team. It's typically missing data or poorly formatted, and they spend 90% of their time cleaning it up, then running some basic regression with numpy or whatever.
Need better understanding of the data lifecycle by organizations and investment in instrumentation and data management.
> What ends up happening is you have a bunch of Data Scientists just handed a pg_dump or flat file from some ops team
Not to disparage the amazing data scientists I've worked with, but I've been on teams where this is very much the approach to operationalizing models. It's basically, "Here's the sklearn model and some fragile featurization scripts we built. Can you take this to prod ASAP?"
The problem I've seen is that DS & DE teams were in different parts of the org and had their own sprints that were in no way connected. So they kept chucking models over the wall and we kept trying to faithfully operationalize. Once we convinced leadership that we had to collaborate from the get-go, things went a whole lot better. It also improved the working relationship of engineers and scientists.
I learned a hell of a lot from the scientists; they learned how to write better code. They also learned what code they didn't need to write because I could do it faster or better than them, leaving them to focus on more important things. It was pretty amazing to find what manual processes they would setup in lieu of proper (or even any) engineering support. Again, these are amazingly smart people, but they were being square-pegged into a lot of round-hole engineering tasks.
Now, the much more frustrating issue I had was being in a very data-heavy organization and being told by a distinguished engineer (my skip-level) plus my direct manager that, "data engineering isn't a real discipline." I left that org very shortly thereafter.
>The data lifecycle is waaay overpopulated with Data Scientists who are not empowered or knowledgeable enough to work with product designers and engineers to do everything that empowers Data Science and ML.
Reading this thread has made me realize just how lucky I am to work very closely with a very strong Data Scientist, who is complemented by a very strong Data Engineer. Conversations with the Data Scientist are always about strategy, product alignment, and ensuring we're optimizing what we build for learning. The Data Engineer works very closely with us to ensure we're actually capturing the data we think we are, getting it to analysis systems, and making sure those data pipelines stay healthy.
I can't recommend the Data Engineer career enough for junior developers. It's how I started and what I pursued for 6 years (and I would love doing it again), and I feel like it gave me such an incredible foundation for future roles:
- Actual big data (so, not something you could grep...) will trigger your code in every possible way. You quickly learn that with trillions of inputs, the probability of hitting a bug is either 0% or 100%. In turn, you quickly learn to write good tests.
- You will learn distributed processing at a macro level, which in turn enlightens your thinking at a micro level. For example, even though the orders of magnitude are different, hitting data over the network versus on disk is very much like hitting data on disk versus in cache. Except that when the difference ends up being hours or days, you become much more sensitive to it, so it's good training for your thinking.
- Data engineering is full of product decisions. What's often called data "cleaning" is in fact one of the important product decisions made in a company, and a data engineer will be consistently exposed to their company's product, which I think makes for great personal development.
- Data engineering is fascinating. In adtech for example, logs of where ads are displayed are an unfiltered window on the rest of humanity, for better or worse. But it definitely expands your views on what the "average" person actually does on their computer (spoiler: it's mainly watching porn...), and challenges quite a bit what you might think is "normal".
- You'll be plumbing technologies from all over the web, which might or might not be good news for you.
So yeah, data engineering is great! It's not harder than other specialties for developers, but imo, it's one of the fun ones!
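To put a rough number on the "0% or 100%" point above: if a given code path misbehaves on a fraction p of inputs, and inputs are treated as independent (a simplifying assumption, real logs are correlated), the chance of hitting it at least once in n inputs is 1 - (1 - p)^n. A small sketch, with a hypothetical helper name:

```python
import math


def p_at_least_one(p: float, n: int) -> float:
    """Probability of at least one hit in n independent trials.

    Computes 1 - (1 - p)**n via log1p/expm1 so that tiny p and
    huge n do not lose precision.
    """
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    return -math.expm1(n * math.log1p(-p))


if __name__ == "__main__":
    n = 10**12  # "trillions of inputs"
    for p in (0.0, 1e-15, 1e-12, 1e-9):
        print(f"p={p:g}  P(at least one hit)={p_at_least_one(p, n):.6f}")
```

At a trillion inputs, even a one-in-a-billion edge case is all but guaranteed to fire, so in practice only bugs that truly cannot happen stay hidden.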
The other thing I'd emphasize here is dealing with "state". Data is effectively state.
As application engineers build increasingly "stateless" code (e.g. pure functions, serverless deployments, etc), that state gets pushed elsewhere. Someone has to manage the queues, file versions/locations, logs, databases, configurations and so on. That is all "data".
State management is a tricky problem even in a single-threaded application. It's doubly so in distributed systems, where state can be inconsistent between all the moving pieces. This is the source of endless data integrity issues. I think data engineering is a great way to get some exposure to all of this.
A couple of us inherited a machine learning project a while back. The code was horrible. Riddled with copy pasta (nearly half of the entire thing was copy-paste with no code reuse). We basically refactored everything and standardized input and output file names. We put up a small Flask service to allow outside services to hit it easily and wrapped it up in a Docker container so it was ultimately easy to deploy. Yes, it was all the plumbing. However, we also looked at the code and the ML strategies, and while there was "some" level of competence, it was nothing more than word2vec add and divide. Totally horrible for actually finding key phrases that matter to the subject we're matching. So we started tackling that too with LSTM, but our time got cut short and we were shifted off to another area. So not only was the "scientist" they hired completely crappy at the engineering, they weren't really helpful in the ML either.
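For readers who haven't seen it, "word2vec add and divide" most likely means averaging the word vectors of a text and comparing the averages, and that is how the sketch below reads it. This is not the inherited project's code: the 3-dimensional vectors are made-up toy values standing in for a trained model, and the helper names are hypothetical.

```python
import math

# Toy stand-ins for trained word2vec vectors (made-up values).
TOY_VECTORS = {
    "lstm": [1.0, 0.0, 0.0],
    "the": [0.0, 1.0, 0.0],
    "of": [0.0, 1.0, 0.0],
    "a": [0.0, 1.0, 0.0],
}


def mean_vector(tokens, vectors):
    """Add the vectors of known tokens and divide by how many there were."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    query = mean_vector(["lstm"], TOY_VECTORS)
    doc = mean_vector(["the", "lstm", "of", "a"], TOY_VECTORS)
    # The key term is diluted by filler words, and word order is ignored.
    print(round(cosine(query, doc), 3))
```

With these toy values the one key term is outvoted by three filler words, which is the kind of weakness described above when the goal is to find the phrases that matter.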
This is obviously of lesser value to the topic at hand, and more about making sure you hire good people I think.
I always felt that tech-focused data scientists should also be required to know how to process data end-to-end; at minimum, from a SQL database to a deployed model, but knowing how to collect & clean data is important too. It seems like the industry is trying to fill the gap that was created by a glut of people without math/CS backgrounds going into 5-week data science courses who then need hand-holding when they get real jobs.
Data science & engineering should be treated as a single collection of skill-sets. Lacking ETL experience is a major deficit, considering how prevalent that kind of work is.
This might just be my personal biases coming through. I consider myself a "full-stack" data scientist & engineer. But because data scientists who can work on the backends are rare, I always end up doing the plumbing while other people do the fun analysis work.
I think companies that are data "science" heavy are going to be at huge disadvantage soon. Tools like Rekognition and Google AI APIs are making the model training & deployment aspect almost trivial. At some point, the only real work involved in this space will be the data "engineering."
I teach engineers for a living. I struggle to see how this is not just a straw man argument based on colloquial usage of terms. It is just inferences drawn based on job ads that are rarely written by people doing the job and instead are effectively human-as-seo-optimized so the best candidates can find the job they hopefully fit for and not be too confused to apply for it.
I'm late to the comment party, but: this is classic "commoditize your complement".
This guy would have you believe that Pytorch has Solved the entire, vast field of data analysis as inherited from Newton, de Moivre, Laplace, Bayes, Fisher, Neyman, Pearson, Wald, Savage, Jaynes, Breiman, Pearl.
This is a lot like saying that photography has Solved art, and now we need people who can climb ladders and glue the posters on them big billboards. It would be delusional if it didn't have a self-interested angle.
Meanwhile, those of us with math degrees are fully confident that the plumbing problem is easier to commoditize than the problem of making sense of data.
One annoying thing about being a generalist is that domain experts in any given area that you need familiarity with can't help but complain about how little you know about that domain, ignoring the fact that your job requires equally deep knowledge of several other domains simultaneously.
In the case of data scientists, I think the business folks that want them to understand the business domain better generally have the strongest argument, followed by the statisticians - good data scientists need to personally understand both of those things well, while the engineering and ops stuff that data scientists are also expected to do is easier to compartmentalize on other teams. So I agree that we should have more data engineers, but apparently for the opposite reason as most people in this thread.
Having to deal with data scientists, I absolutely agree. The thing that I've seen that lands in the "lab" vs production distinction is that these people expect their data to be pristine. They flip out when the world isn't as perfect as their models want. Leads to me as just a normal software developer having to do the data analysis and figure out how to clean it up.
I also end up having to be the one to talk to data vendors to understand their data feeds and essentially translate that for the data scientists. Having to sit in the middle is annoying for me and suboptimal for the business.
Genuine question: why is there so much pure teeming hatred for data scientists in this comment thread? Almost every comment comes off as full of snark and vitriol against data scientists.
My view is from a small startup with little to no room for single purpose employees.
When I first started hiring and working with data scientists, my view was this: if you can only manipulate data and run it through pipelines to generate models, then you can't do enough to be highly valuable. You either need a strong enough background in CS to build the pipelines / tools or a strong enough mathematics background to be able to propose cutting-edge new ideas. From my experience it is hard to find someone who has one of these skills just from a university "data science" program. At a small company (at least the ones that I have worked with), being proficient only in R and basic Python isn't enough. That being said, I have met a handful of Data Scientists who were very smart and self-motivated enough to pick up the lacking skills when given the chance.
My question to HN is this: are there roles at these larger companies for a Data Scientist who primarily just crunches data in R and Python without the ability to actually build the pipelines / tools or conduct research?
I would be cautious about that. I've worked in the startup space for over 10 years now as a data scientist, often the first one hired on, working on the pipes.
From my experience, there are two types of data scientists who do infrastructure work: 1) Those who do not make the best data scientists because their skill set is too far in engineering land, leaving them weak where it counts. If the startup is relying on the data scientist to be profitable, I'd be cautious with these types. Or 2) someone who is senior, beyond senior really, who has worked both jobs and doesn't mind doing both jobs. This unicorn is so rare it is mythical. The joke when the terminology was created was that they're so rare no one has ever seen one, hence unicorn.
Me, I can not do the work I need to do if I'm on call. That is where I draw the line. That means hiring someone to monitor the infrastructure. Furthermore, I'm an okay architect, but you really do want to hire a specialist if you can help it for that. Do I help them with the infrastructure? Absolutely, but they're on call if a server is on fire. They have the admin login credentials, not me.
I get wearing multiple hats, but keep in mind to be a data scientist you're already wearing multiple hats. Being a data scientist is like double majoring and getting a phd. At what point are they stretched too thin? The consensus in the industry is they're already stretched too thin and should be broken up into different specialized roles.
>My question to HN is this: are there roles at these larger companies for a Data Scientist who primarily just crunches data in R and Python without the ability to actually build the pipelines / tools or conduct research?
That is the standard role, even at startups. However, the industry consensus these days is data scientists should have more responsibility when it comes to deploying models than previous standards.[1] So data scientists are being pushed in a more engineering direction, not with hosting sql servers and infrastructure, but with working with engineers to make sure the models are monitored properly. This change comes from model deployment being further automated as time goes on, making it easier for the data scientist to have more responsibility during this stage.
There are certainly roles out there for a Data Scientist who just crunches numbers. A good friend of mine does exactly that for a large traditional retail corporation. Just by using standard ML tools he replaced a whole team of analysts for pricing items. Maybe not in cutting edge tech companies, but roles like that are all over the economy still.
4 years ago I moved from a role where I primarily wrote C# as an architect on a web application, to an architect helping to build a data warehouse. The contrast in tooling, discipline and information available to build anything in the data world is so stark it had me questioning my career decisions. Sure, you can read Kimball and Inmon and I'm sure there are a handful of others out there - but there are drastically fewer than what you can find in the application development space.
Things are getting better: visual ETL tools are falling out of favor compared with properly coded ETL (spark, dbt, etc.), and data teams are starting to see the value of actually engineering a solution instead of just throwing it over the wall to a DBA to deal with. But tooling, and general information on the web, is still lacking. Pushing data engineers over "etl developers" or "bi developers" (or "data scientists") will drastically improve any organization's ability to actually deliver real analytics, and hopefully an industry-wide push will raise all ships.
My first job as a "Data Scientist" (it wasn't called that, but the work was the same) was for a small gaming shop, around 2011. It involved applying econometric analysis and doing simple statistical testing on the player data sets. I realized quickly that knowing how to do statistical testing was only a very small portion of what it took to create value in such a role. At the time, I didn't even know (but learned) SQL. Everything I wanted to do involved teaming up with a developer, which wasn't efficient in a small operation. So I learned to program. I continue to enjoy skilling-up, most recently learning cloud-tech to enable me to deploy data tools I develop.
The most valuable people in the data chain will be those that can take idea to near-production. Running ML libraries over clean datasets is overrated. The fact is, 80% of the value of "Data Science" comes from KPIs and basic stuff.
Agree with the general idea. Data engineers are like the essential workers of the data world, people who today may not receive the appropriate level of appreciation that they deserve.
I think the glut of data scientists occurs because we clump so many different skills and disciplines under the single term "data scientist". Data scientists today come from so many different backgrounds that the definition means something different to everyone. Because of this, the surface area of possible skills that could be expected of a data scientist is vast, to the point where anyone is pretty unlikely to be sufficiently competent in all of them, or even in a majority.
I'd like to go back to a world where we had a little more specificity about what kind of data scientist you are (e.g. I had no problem with terms like statistician and data miner), which could help ground expectations that others have of us, and it'd also help clearly define the scope of various career paths for the next generation.
Agreed. In my decently long career the types of data problems I've seen be most impactful on the business are not head-in-the-clouds ML issues, but more mundane yet more far-reaching:
1. Appropriately identifying what data needs to be captured from a product to correctly operationalize it.
2. Understanding and modeling data structures in internal applications to identify and tune backend data storage mechanisms (including DBMS). Inclusive in this is helping the application development team pick the correct structure and implement it correctly.
3. Validating implementation of instrumentation within the application so that data cleaning isn't necessary and that telemetry can be appropriately reported on. Building said reports.
4. Doing ETL and taking care of out of band data management to link disparate systems within the business to help build holistic views of the business overall.
5. Being a safeguard against the over-collection of data, because data engineers understand that data isn't an asset, it's a liability that increases costs and risks as a business or product scales, and when there's not a specific need that can be articulated clearly for that data, collecting it is a user/customer-hostile action.
My experience has been that data is a crucial element to understand the health and state of the business with both breadth and depth at a given point in time and identify trends. However, it's mostly used by folks in management as a crutch to try to de-risk decision making, or worse as a political tool to give a faux support to a decision that's already been made but not yet publicized. Decisions carry inherent risk, including the decision to do nothing, you cannot eliminate this, it's one of the components of decision trade-offs. This sort of broken use of data by management is supported by "Data Scientists" that see the field as a cash-cow they can milk while they work on pie-in-the-sky ML strategies which are often unnecessary, even when they actually work.
Done correctly a strong data culture in a company can increase decision velocity, empower engineers, and reduce overhead on management to understand the business. Done improperly, data culture in a business can easily destroy decision velocity, empower dysfunctional politics, and increase engineering overhead to understand systems. Getting it right is the main test for businesses in the new era.
I recommend the book: Agile Data Science by Russell Jurney[1]. The tech stack is circa 2017, but the chapters on the Agile Data Science Process and Teams are timeless.
He clearly articulates team roles: from Biz Devs, marketers, PMs, UX designers, UI designers, Web Developers, API Engineers, Data scientists, Applied researchers, Platform/Data engineers, QA engineers, DevOps Engineers.
Then he talks about different ways to increase agility by combining these roles into generalists empowered to iteratively explore the "pyramid of data value" until the right product-market fit is found.
Building Data-science Intensive Web Applications is inherently waterfall, not agile, and I find this book to be a fascinating reference.
I'm a data engineer for most of my day right now, and a lot of it is done with ruby/python/shell scripts into postgres DBs.
What learning path should I go down? I'm a solo actor at work with a lot of agency to decide my workflows.
I see myself building small to medium size data collections over the next year or two at my job.
Can someone point me to some learning?
I have a CS degree etc. and my title is software engineer etc. etc.
End users of my data usually like their data as a CSV that is then read using R or Python. However there is also a use case where I will build an app to view my data in a simple way.
All of this is completely doable with my current knowledge/workflow, but I can't help feeling like I do a data engineering job with very different tools than I see "data engineers" speaking about online.
As an industry we're letting history repeat itself and making all the same mistakes.
There are different kinds of developers. At its most basic, you have systems-focused developers and algorithm-focused developers. Sure, there is a grey area, but I think those two buckets are pretty defensible.
In the data science world you have an exact parallel. Those who build the systems and those who optimize the thing the system supports.
In the ML world you have another parallel. Those who build the systems and those who optimize and pioneer the model architectures and parameters.
We never reached consensus on the titles for different kinds of developer/programmer/computer scientists. And we're failing now to reach consensus on sane titles for ML and DS.
[+] [-] wpietri|5 years ago|reply
My god, this. These people make me bonkers. Especially because I feel like I have a bit of this tendency myself, the desire just to think big thoughts and do no actual work. Happily, I long ago learned that ideas were approximately worthless without labor, and that I anyway had much better ideas when laboring because it forced me to engage with the details.
And yes, those people can poison a team. My best working experiences have all been with people who a) all valued actual work and b) believed that everybody could have good ideas.
[+] [-] noodlenotes|5 years ago|reply
1) They probably aren't going to produce good models since they're not sensitive to data nuances, but now they've taken over ALL the modeling work.
2) They bring down the job satisfaction of everyone else on the team who would like to be doing at least some modeling.
3) They're sucking up the prestige that should be distributed over the entire team and management thinks they should be paid more for work that it turns out everybody thinks is more fun anyway.
My number one advice to entry level data scientists is to not be this guy. Don't give your interviewers the impression that you won't do your own engineering work because they won't want someone who brings negative value to the team.
[+] [-] chadash|5 years ago|reply
[+] [-] LeifCarrotson|5 years ago|reply
I'm in industrial automation, but it's much the same. Projects where someone developed a strategy but has never been involved in the details of a machine are doomed to failure (or at best to be unreliable and producing low quality parts). Projects built by machine fabricators are over-engineered, frequently late, and sometimes unprofitable, but damn if they don't work well.
The main trouble, I think, is that when a shiny new contraption is brought to the king, it's too often the magicians doing the talking - whether they're speaking words of power or Common, their job is to talk. Meanwhile, the blacksmith is probably busy in his workshop at some ornate scrollwork for the next thing, or repairing the previous gizmo, because he'd rather be hammering away at his anvil than talking.
The higher you go in an org chart, the fewer the number of people who understand the work their company actually does, and the more voices you have between the workers and the decision-makers to take some of the credit for work as it passes up the chain.
[+] [-] hntrader|5 years ago|reply
The best quants are 1/3 statistician, 1/3 developer and 1/3 trader, in my view.
[+] [-] 1vuio0pswjnm7|5 years ago|reply
Magicians will be magicians, always hustling (bullshitting), but they will never have the value and job security of the blacksmith. The blacksmith can see the fruits of her own labour, whilst the magician must lie to herself and others in order to claim the blacksmith's value as her own.
If the blacksmith is good enough, she will earn the trust of management and management may consult the blacksmith in the selection of magicians. Management may ask the blacksmith to interview magicians and seek her advice on the final hiring decision.
The blacksmith may not carry the "prestige" of the hustling, bullshitting magician but she can command a high salary and dictate her own working conditions. This is only if management understands her value. What the magician thinks of the blacksmith is irrelevant.
Reliable blacksmiths are hard to find. Magicians are a dime-a-dozen.
[+] [-] schanq|5 years ago|reply
[+] [-] oivey|5 years ago|reply
[+] [-] lumost|5 years ago|reply
Inevitably, if you treat a job role as a support role, you'll attract weaker individuals into that role than you would get if it wasn't considered a support role. The problem with Science oriented teams is that all roles other than the science role morph into science support roles over time. The same pattern used to occur with Engineers and QA, or Engineers and ops.
[+] [-] leapis|5 years ago|reply
Is this because it's easier (obviously) to teach a quant engineering than it is to teach an engineer quant finance? Or rather because it's expected now that traders will become the bridge between researcher models and implementation, and engineers will simply provide the underlying infrastructure to power these implementations?
[+] [-] marcinzm|5 years ago|reply
[+] [-] p5a0u9l|5 years ago|reply
[+] [-] Borlands|5 years ago|reply
The (repetitive) blacksmith role is not an interesting one, digital revolution needs to come into place. Architects that build tools, self service systems are much more interesting.
[+] [-] inthewoods|5 years ago|reply
[+] [-] unknown|5 years ago|reply
[deleted]
[+] [-] zzzeek|5 years ago|reply
[+] [-] kyawzazaw|5 years ago|reply
[+] [-] notretarded|5 years ago|reply
[deleted]
[+] [-] AndrewKemendo|5 years ago|reply
The data lifecycle is waaay overpopulated with Data Scientists who are not empowered or knowledgeable enough to work with product designers and engineers to do everything that empowers Data Science and ML.
We need more Data Engineers involved at time zero in projects to help:
1. Plan out what data should be produced/captured by the product
2. Instrument systems to actually generate data consistently and effectively
3. Build ETL pipelines and data management systems
4. Manage enterprise data sharing and resiliency
etc...
What ends up happening is you have a bunch of Data Scientists just handed a pg_dump or flat file from some ops team. It is typically missing data or poorly formatted, so they spend 90% of their time cleaning it up and then run some basic regression with numpy or whatever.
Need better understanding of the data lifecycle by organizations and investment in instrumentation and data management.
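To make the flat-file problem concrete, here is a minimal sketch (Python, standard library only) of the kind of check that otherwise gets done by hand after the fact: counting how many rows lack each field the analysis depends on. The column names and the sample rows are hypothetical, made up purely for illustration; a real export would have its own schema.

```python
import csv
import io

# Hypothetical column names, assumed for this sketch only.
REQUIRED_COLUMNS = ("user_id", "event_time", "amount")


def profile_rows(csv_text, required=REQUIRED_COLUMNS):
    """Return (row count, {column: number of rows with no value for it})."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = {col: 0 for col in required}
    total = 0
    for row in reader:
        total += 1
        for col in required:
            value = row.get(col)  # None if the column or field is absent
            if value is None or value.strip() == "":
                missing[col] += 1
    return total, missing


# Made-up sample: row 2 has no event_time, row 3 has no amount.
sample = (
    "user_id,event_time,amount\n"
    "1,2020-01-01T00:00:00,9.99\n"
    "2,,5.00\n"
    "3,2020-01-02T00:00:00,\n"
)
total, missing = profile_rows(sample)
print(total, missing)
```

If the product were instrumented with these fields in mind from time zero, a check like this could run where the data is produced rather than after someone is handed a dump.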
[+] [-] mynameisash|5 years ago|reply
Not to disparage the amazing data scientists I've worked with, but I've been on teams where this is very much the approach to operationalizing models. It's basically, "Here's the sklearn model and some fragile featurization scripts we built. Can you take this to prod ASAP?"
The problem I've seen is that DS & DE teams were in different parts of the org and had their own sprints that were in no way connected. So they kept chucking models over the wall and we kept trying to faithfully operationalize. Once we convinced leadership that we had to collaborate from the get-go, things went a whole lot better. It also improved the working relationship of engineers and scientists.
I learned a hell of a lot from the scientists; they learned how to write better code. They also learned what code they didn't need to write because I could do it faster or better than them, leaving them to focus on more important things. It was pretty amazing to find what manual processes they would set up in lieu of proper (or even any) engineering support. Again, these are amazingly smart people, but they were being square-pegged into a lot of round-hole engineering tasks.
Now, the much more frustrating issue I had was being in a very data-heavy organization and being told by a distinguished engineer (my skip-level) plus my direct manager that, "data engineering isn't a real discipline." I left that org very shortly thereafter.
[+] [-] fatnoah|5 years ago|reply
Reading this thread has made me realize just how lucky I am to work very closely with a very strong Data Scientist, who is complemented by a very strong Data Engineer. Conversations with the Data Scientist are always about strategy, product alignment, and ensuring we're optimizing what we build for learning. The Data Engineer works very closely to ensure we're actually capturing the data we think we are, getting it to analysis systems, and making sure those data pipelines stay healthy.
[+] [-] C4stor|5 years ago|reply
- Actually big data (so, not something you could grep...) will trigger your code in every possible way. You quickly learn that with trillions of inputs, the probability of hitting a bug is either 0% or 100%. In turn, you quickly learn to write good tests.
- You will learn distributed processing at a macro level, which in turn enlightens your thinking at a micro level. For example, even though the orders of magnitude are different, hitting data over the network versus on disk is very much like hitting data on disk versus in cache. Except that when the difference ends up being hours or days, you become much more sensitive to it, so it's good training for your thoughts.
- Data engineering is full of product decisions. What's often called data "cleaning" is in fact one of the important product decisions made in a company, and a data engineer will be consistently exposed to their company's product, which I think makes for great personal development.
- Data engineering is fascinating. In adtech for example, logs of where ads are displayed are an unfiltered window on the rest of humanity, for better or worse. But it definitely expands your views on what the "average" person actually does on their computer (spoiler: it's mainly watching porn...), and challenges quite a bit what you might think is "normal".
- You'll be plumbing technologies from all over the web, which might or might not be good news for you.
So yeah, data engineering is great! It's not harder than other specialties for developers, but imo, it's one of the fun ones!
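As a rough illustration of the first point (at scale, every edge case your parser can hit, it will hit), here is a small sketch that throws random junk at a parser and treats any exception as a bug. The tab-separated log format and the `parse_log_line` and `fuzz_parser` names are assumptions invented for this example, not any real adtech format or library.

```python
import random


def parse_log_line(line):
    """Parse 'timestamp<TAB>user_id<TAB>url'; return None for malformed input."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 3:
        return None
    timestamp, user_id, url = parts
    # Require plain ASCII digits so that int() below cannot raise.
    if not (timestamp.isascii() and timestamp.isdigit()) or not user_id:
        return None
    return {"timestamp": int(timestamp), "user_id": user_id, "url": url}


def fuzz_parser(trials=10_000, seed=0):
    """Feed random strings to the parser; an exception here is a bug real traffic would find."""
    rng = random.Random(seed)  # seeded so a failure can be reproduced
    alphabet = "abc123\t\n \x00,"
    for _ in range(trials):
        junk = "".join(rng.choice(alphabet) for _ in range(rng.randint(0, 40)))
        result = parse_log_line(junk)
        assert result is None or isinstance(result["timestamp"], int)
    return trials


print(parse_log_line("1600000000\tu1\thttp://example.com\n"))
print(fuzz_parser())
```

The design choice worth noting is that malformed lines return `None` instead of raising, so a single bad record cannot take down a job that has been running for hours.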
[+] [-] alexpetralia|5 years ago|reply
As application engineers build increasingly "stateless" code (e.g. pure functions, serverless deployments, etc), that state gets pushed elsewhere. Someone has to manage the queues, file versions/locations, logs, databases, configurations and so on. That is all "data".
State management is a tricky problem even in a single-threaded application. It's doubly so in distributed systems, where state can be inconsistent between all the moving pieces. This is the source of endless data integrity issues. I think data engineering is a great way to get some exposure to all of this.
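One common way to keep retries from corrupting state is to make writes idempotent. Below is a minimal sketch, using the standard-library `sqlite3` module as a stand-in for a real database; the `events` table and the `load_batch` helper are hypothetical names chosen for illustration. In PostgreSQL the analogous statement would be `INSERT ... ON CONFLICT`.

```python
import sqlite3


def load_batch(conn, rows):
    """Upsert (event_id, payload) rows so that replaying a batch is harmless."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(
            "INSERT OR REPLACE INTO events (event_id, payload) VALUES (?, ?)",
            rows,
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_id TEXT PRIMARY KEY, payload TEXT)")

batch = [("e1", "click"), ("e2", "view")]
load_batch(conn, batch)
load_batch(conn, batch)  # simulated retry, e.g. after a timeout upstream

count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(count)
```

Because each row is keyed on `event_id`, delivering the same batch twice leaves the table in the same state as delivering it once, which removes one source of the integrity issues described above.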
[+] [-] pricci|5 years ago|reply
[+] [-] theflyinghorse|5 years ago|reply
[+] [-] secondcoming|5 years ago|reply
[+] [-] kjerzyk|5 years ago|reply
[+] [-] coding123|5 years ago|reply
This is obviously of lesser value to the topic at hand, and more about making sure you hire good people I think.
[+] [-] mywittyname|5 years ago|reply
Data science & engineering should be treated as a single collection of skill-sets. Lacking ETL experience is a major deficit, considering how prevalent that kind of work is.
This might just be my personal biases coming through. I consider myself a "full-stack" data scientist & engineer. But because data scientists who can work on the backends are rare, I always end up doing the plumbing while other people do the fun analysis work.
I think companies that are data "science" heavy are going to be at a huge disadvantage soon. Tools like Rekognition and Google AI APIs are making the model training & deployment aspect almost trivial. At some point, the only real work involved in this space will be the data "engineering."
[+] [-] avs733|5 years ago|reply
[+] [-] prionassembly|5 years ago|reply
This guy would have you believe that Pytorch has Solved the entire, vast field of data analysis as inherited from Newton, de Moivre, Laplace, Bayes, Fisher, Neyman, Pearson, Wald, Savage, Jaynes, Breiman, Pearl.
This is a lot like saying that photography has Solved art, and now we need people who can climb ladders and glue the posters on them big billboards. It would be delusional if it didn't have a self-interested angle.
What, we with math degrees are fully confident that the plumbing problem is easier to commoditize than the problem of making sense of data.
[+] [-] tfehring|5 years ago|reply
In the case of data scientists, I think the business folks that want them to understand the business domain better generally have the strongest argument, followed by the statisticians - good data scientists need to personally understand both of those things well, while the engineering and ops stuff that data scientists are also expected to do is easier to compartmentalize on other teams. So I agree that we should have more data engineers, but apparently for the opposite reason as most people in this thread.
[+] [-] ziml77|5 years ago|reply
I also end up having to be the one to talk to data vendors to understand their data feeds and essentially translate that for the data scientists. Having to sit in the middle is annoying for me and suboptimal for the business.
[+] [-] Ansil849|5 years ago|reply
[+] [-] a_zaydak|5 years ago|reply
When I first started hiring and working with data scientists, my view was this: if you can only manipulate data and run it through pipelines to generate models, then you can't do enough to be highly valuable. You either need to have a strong enough background in CS to build the pipelines / tools or a strong enough mathematics background to be able to propose cutting-edge new ideas. From my experience it is hard to find someone who has one of these skills just from a university "data science" program. At a small company (at least the ones that I have worked with) being proficient only in R and basic Python isn't enough. That being said, I have met a handful of Data Scientists who were very smart and self-motivated enough to pick up the lacking skills when given the chance.
My question to HN is this: are there roles at these larger companies for a Data Scientist who primarily just crunches data in R and Python without the ability to actually build the pipelines / tools or conduct research?
[+] [-] proverbialbunny|5 years ago|reply
From my experience, there are two types of data scientists who do infrastructure work: 1) Those who do not make the best data scientists because their skill set is too far in engineering land, leaving them weak where it counts. If the startup is relying on the data scientist to be profitable, I'd be cautious with these types. 2) Someone who is senior, beyond senior really, who has worked both jobs, and doesn't mind doing both jobs. This unicorn is so rare it is mythical. The joke when the terminology was created is they're so rare no one has ever seen one, hence unicorn.
Me, I can not do the work I need to do if I'm on call. That is where I draw the line. That means hiring someone to monitor the infrastructure. Furthermore, I'm an okay architect, but you really do want to hire a specialist if you can help it for that. Do I help them with the infrastructure? Absolutely, but they're on call if a server is on fire. They have the admin login credentials, not me.
I get wearing multiple hats, but keep in mind to be a data scientist you're already wearing multiple hats. Being a data scientist is like double majoring and getting a phd. At what point are they stretched too thin? The consensus in the industry is they're already stretched too thin and should be broken up into different specialized roles.
>My question to HN is this: are there roles at these larger companies for a Data Scientist who primarily just crunches data in R and Python without the ability to actually build the pipelines / tools or conduct research?
That is the standard role, even at startups. However, the industry consensus these days is data scientists should have more responsibility when it comes to deploying models than previous standards.[1] So data scientists are being pushed in a more engineering direction, not with hosting sql servers and infrastructure, but with working with engineers to make sure the models are monitored properly. This change comes from model deployment being further automated as time goes on, making it easier for the data scientist to have more responsibility during this stage.
[1] source: https://www.dominodatalab.com/static/gfx/uploads/domino-mana... page 9. Suboptimal organization and incentive structures.
[+] [-] jorpal|5 years ago|reply
[+] [-] jesseryoung|5 years ago|reply
Things are getting better: visual ETL tools are falling out of favor compared to properly coded ETL (spark, dbt, etc) and data teams are starting to see the value of actually engineering a solution instead of just throwing it over the wall to a DBA to deal with. But tooling and general information on the web are still lacking. Pushing data engineers over "etl developers" or "bi developers" (or "data scientists") will drastically improve any organization's ability to actually deliver real analytics, and hopefully an industry-wide push will raise all ships.
[+] [-] Avalaxy|5 years ago|reply
[+] [-] itsoktocry|5 years ago|reply
The most valuable people in the data chain will be those that can take idea to near-production. Running ML libraries over clean datasets is overrated. The fact is, 80% of the value of "Data Science" comes from KPIs and basic stuff.
[+] [-] tpoacher|5 years ago|reply
We need [brand new definition of the same, which most people are even more confused what it means and how it's different from the old]!
[+] [-] eVoLInTHRo|5 years ago|reply
I think the glut of data scientists occurs because we clump so many different skills and disciplines under the single term "data scientist". Data scientists today come from so many different backgrounds that the definition means something different to everyone. Because of this, the surface area of possible skills that could be expected of a data scientist is vast, to the point where it's pretty unlikely that anyone is sufficiently competent in all of them, let alone a majority.
I'd like to go back to a world where we had a little more specificity about what kind of data scientist you are (e.g. I had no problem with terms like statistician and data miner), which could help ground expectations that others have of us, and it'd also help clearly define the scope of various career paths for the next generation.
Sadly the individual who coined the term shows no contrition for the degree of confusion that the rest of us have been left to deal with: https://observer.com/2019/11/data-scientist-inventor-dj-pati...
[+] [-] slt2021|5 years ago|reply
[+] [-] tristor|5 years ago|reply
1. Appropriately identifying what data needs to be captured from a product to correctly operationalize it.
2. Understanding and modeling data structures in internal applications to identify and tune backend data storage mechanisms (including DBMS). Inclusive in this is helping the application development team pick the correct structure and implement it correctly.
3. Validating implementation of instrumentation within the application so that data cleaning isn't necessary and that telemetry can be appropriately reported on. Building said reports.
4. Doing ETL and taking care of out of band data management to link disparate systems within the business to help build holistic views of the business overall.
5. Being a safeguard against the over-collection of data, because data engineers understand that data isn't an asset, it's a liability that increases costs and risks as a business or product scales, and when there's not a specific need that can be articulated clearly for that data, collecting it is a user/customer-hostile action.
My experience has been that data is a crucial element to understand the health and state of the business with both breadth and depth at a given point in time and identify trends. However, it's mostly used by folks in management as a crutch to try to de-risk decision making, or worse as a political tool to give a faux support to a decision that's already been made but not yet publicized. Decisions carry inherent risk, including the decision to do nothing; you cannot eliminate this, as it's one of the components of decision trade-offs. This sort of broken use of data by management is supported by "Data Scientists" that see the field as a cash-cow they can milk while they work on pie-in-the-sky ML strategies which are often unnecessary, even when they actually work.
Done correctly, a strong data culture in a company can increase decision velocity, empower engineers, and reduce overhead on management to understand the business. Done improperly, data culture in a business can easily destroy decision velocity, empower dysfunctional politics, and increase engineering overhead to understand systems. Getting it right is the main test for businesses in the new era.
[+] [-] svordsmith918|5 years ago|reply
[+] [-] lamename|5 years ago|reply
[+] [-] afryer|5 years ago|reply
He clearly articulates team roles: from Biz Devs, marketers, PMs, UX designers, UI designers, Web Developers, API Engineers, Data scientists, Applied researchers, Platform/Data engineers, QA engineers, DevOps Engineers.
Then he talks about different ways to increase agility by combining these roles into generalists empowered to iteratively explore the "pyramid of data value" until the right product-market fit is found.
Building Data-science Intensive Web Applications is inherently waterfall, not agile, and I find this book to be a fascinating reference.
[1] https://www.oreilly.com/library/view/agile-data-science/9781...
[+] [-] zeku|5 years ago|reply
What learning path should I go down? I'm a solo actor at work with a lot of agency to decide my workflows.
I see myself building small to medium size data collections over the next year or two at my job.
Can someone point me to some learning?
I have a CS degree etc. and my title is software engineer etc. etc.
End users of my data usually like their data as a CSV that is then read using R or Python. However there is also a use case where I will build an app to view my data in a simple way.
All of this is completely doable with my current knowledge/workflow but I can't help but feel like I do a data engineering job with very different tools than I see "data engineers" speaking about online.
[+] [-] brd|5 years ago|reply
There are different kinds of developers. At its most basic, you have systems-focused developers and algorithm-focused developers. Sure, there is a grey area, but I think those two buckets are pretty defensible.
In the data science world you have an exact parallel. Those who build the systems and those who optimize the thing the system supports.
In the ML world you have another parallel. Those who build the systems and those who optimize and pioneer the model architectures and parameters.
We never reached consensus on the titles for different kinds of developer/programmer/computer scientists. And we're failing now to reach consensus on sane titles for ML and DS.