Does anyone have a list of high-quality data engineering resources that aren't free? I don't mind paying, but there is so much out there and most of it is not great.
Have you considered that the largest part of the target audience for becoming a better data engineer isn't people who are already great at data engineering?
Looking over the resources, it doesn't seem worthwhile.
The difficult part is not learning how to code or work with SQL. The hard part is learning the platform and tooling you need to operate at scale. The ecosystem is full of tools that are great for certain workloads but terrible for others.
Your best bet is to start by getting an overview of the tools available to your team. If you're using AWS, GCP, or Azure, they each have data-engineering-oriented certifications, so take a look at what tools those certification courses cover and start there.
If you are not in a cloud environment, take a look at Apache Airflow, Beam, Storm, or Hadoop. Most of the tooling provided by the big cloud providers is either a rip-off of one of these products or merely a hosted version (e.g., GCP Cloud Composer is managed Airflow).
My title says Senior Data Engineer*, and from what I can tell it looks like a collection of interesting things, but nothing revolutionary. You're better off identifying the areas where your team is weak and seeking out experts or training in those specifically.
Let’s say you are the Lead Data Engineer of a small data-driven company.
You need to define a strategy, pick a stack of tools, decide how data will be stored and normalised, and work out what the workflow will be from ad-hoc exploratory studies to productionized inference.
Are there any good resources out there that are useful to this person?
I am this person right now and I need to find some good guidance.
I am that person too, only for a big (non-data) money-driven company :-)
There are so many resources that you can easily get lost. Martin Kleppmann's Designing Data-Intensive Applications can be your starting point. It helps you establish a basic-to-advanced understanding of quite a few of the concepts that will come up, plus some key principles to drive your strategy from a technical perspective (you'll need a few facets of your strategy for different audiences).
Then move to a more corporate-focused presentation with Piethein Strengholt's Data Management at Scale, which covers the business-facing aspects of your strategy, incl. governance forums etc. - unless you are lucky enough in the short term not to have them due to size; in the long term you'll have to establish them or drive others to do so.
At this point, after a few discussions, you should be getting a feel for the direction: where your data will be stored, how you do data quality, how you process, how you expose, infrastructure, etc. There are dozens of books on the individual elements of your stack. Try to link them back to Kleppmann or to other more specialized but still conceptual books (e.g. if you do streaming, look into Flow Architectures by Urquhart, Streaming Systems by Akidau et al., etc.). Then you can move on to inference. I am not at that stage yet, so no specific advice there; in my case I see inference as something I can address after the data are on the platform, but I'm not sure what state you are facing. You could start looking into trendy stuff like MLOps.
Good luck! It's really exciting working in this domain!
I have been in the same position, and here's what my experience has been (sample size: 1).
Standardize the stack. Use one stack, one set of tooling, one set of practices, and libraries that are very well known and have decent community support. As you start delivering code, the knowledge builds upon itself.
For example, I pretty much standardized on REST APIs, microservices, Python, Flask, Redis, MongoDB, and Nginx for the first 2-3 years. As teammates joined, they were encouraged to reuse code. Slowly, as more senior engineers started working on the team, they brought their own processes, and consequently the organization became more flexible and robust.
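A minimal sketch of the kind of standardized service described above: a Flask REST endpoint with a cache-aside layer. Redis is stubbed with a plain dict here so the example is self-contained; the route, data, and `expensive_lookup` helper are all made up for illustration.

```python
from flask import Flask, jsonify

app = Flask(__name__)
cache = {}  # stand-in for a Redis connection, e.g. redis.Redis()


def expensive_lookup(user_id: str) -> dict:
    # placeholder for a MongoDB query or a downstream API call
    return {"id": user_id, "name": f"user-{user_id}"}


@app.route("/users/<user_id>")
def get_user(user_id):
    # cache-aside pattern: check the cache first, fall back to the lookup
    if user_id not in cache:
        cache[user_id] = expensive_lookup(user_id)
    return jsonify(cache[user_id])


client = app.test_client()
print(client.get("/users/42").get_json())  # {'id': '42', 'name': 'user-42'}
```

The value of standardizing on one such template is exactly what the comment describes: every new service looks the same, so code and knowledge transfer directly between teammates.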
They are basically recommending picking one of the big three: Snowflake, BigQuery, or Redshift, and using standard connectors. Once your data is in one of the big three, you can use any analysis tool.
I have been involved in this space for some time and am heavily involved with Salesforce. In my corner of the world, I still lean heavily on SQL Server and CData/DBAmp, which let me replicate and write back to Salesforce via SQL batch jobs. Tableau Online has also been a game changer, as it allows business folks to do a lot of the heavy lifting. Our BI team now consists of three senior tech folks and 12 business analysts.
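The "standard connectors" point is worth spelling out: Snowflake, BigQuery, and Redshift all expose SQL over standard interfaces (DB-API, SQLAlchemy, JDBC/ODBC), which is why any analysis tool can sit on top of them. A sketch of the idea, with stdlib sqlite3 standing in for the warehouse connection and a made-up `orders` table:

```python
import sqlite3

# In production this would be e.g. snowflake.connector.connect(...),
# but every DB-API driver exposes the same connect/execute/fetch shape.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 20.0)])

# Any DB-API-compatible tool can now run analytical SQL against it.
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(total)  # 29.5
```

Because the interface is the standard part, swapping the warehouse underneath mostly means changing the connection line, not the analysis code.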
frankbreetz|4 years ago
soobrosa|4 years ago
Or you can join our full-time/part-time course: https://www.dataengineering.academy/curriculum
lixtra|4 years ago
preetamjinka|4 years ago
primax|4 years ago
blowski|4 years ago
mywittyname|4 years ago
gigatexal|4 years ago
* titles are kinda shit analogs for skill
caffeine|4 years ago
cgio|4 years ago
chintler|4 years ago
rawgabbit|4 years ago
soobrosa|4 years ago
A small addition: one of our advisors, Dr. Martin Loetzsch, has shared for free the material he also teaches at Pipeline Data Engineering Academy: https://www.youtube.com/watch?v=8HlNG8bdlM0 https://www.youtube.com/watch?v=24Uvo5vZJWA
exdsq|4 years ago
wswope|4 years ago