
2 posts tagged with "structured data"


6 min read
Adrian Brudaru


Why is there a data engineer shortage?

  1. High Demand and Rapid Growth: The increasing reliance on data-driven decision-making and the rise of big data technologies have created a surge in demand for skilled data engineers.
  2. Skill Gap and Specialization: Data engineering requires a unique blend of technical skills, and finding individuals with the right combination of programming, database management, and cloud computing expertise can be challenging.
  3. Competition from Other Data Roles: The allure of data science and other data-related roles has attracted professionals, leading to a talent shortage in the data engineering field.

How big is the data engineer shortage?

💡 "In Europe there are 32K data engineers and 48K open positions to hire one. In the US the ratio is 41K to 79K" Source: Linkedin data analysis blog post

Well, that doesn’t look too bad - if only we could all be about 2x as efficient :)

Bridging the gap: How to make your data engineers 2x more efficient?

There are two ways to make data engineers more efficient:

Option 1: Give them more to do, tell them how to do their jobs better!

For some reason, this doesn’t work out great. All the great minds of our generation told us we should be more like them:

  • do more architecture;
  • learn more tech;
  • use this new toy!
  • learn this paradigm;
  • take a step back and consider your career choices;
  • write more tests;
  • test the tests!
  • analyze the tests :[
  • write a paper about the tests...
  • do all that while alerts go off 24/7 and you are the bottleneck for everyone downstream, with analysts and business people screaming. (┛ಠ_ಠ)┛彡┻━┻

“I can't do what ten people tell me to do. So I guess I'll remain the same”

  • Otis Redding, Sittin' On The Dock Of The Bay

Option 2: Take away unproductive work

A data engineer has a pretty limited task repertoire - so could we hand some of their work to roles that are easier to hire for?

Let’s see what a data engineer does, according to GPT:

  • Data curation: Ensuring data quality, integrity, and consistency by performing data profiling, cleaning, transformation, and validation tasks.
  • Collaboration with analysts: Working closely with data analysts to understand their requirements, provide them with clean and structured data, and assist in data exploration and analysis.
  • Collaboration with DWH architects: Collaborating with data warehouse architects to design and optimize data models, schemas, and data pipelines for efficient data storage and retrieval.
  • Collaboration with governance managers: Partnering with governance managers to ensure compliance with data governance policies, standards, and regulations, including data privacy, security, and data lifecycle management.
  • Structuring and loading: Designing and developing data pipelines, ETL processes, and workflows to extract, transform, and load data from various sources into the target data structures.
  • Performance optimization: Identifying and implementing optimizations to enhance data processing and query performance, such as indexing, partitioning, and data caching.
  • Data documentation: Documenting data structures, data lineage, and metadata to facilitate understanding, collaboration, and data governance efforts.
  • Data troubleshooting: Investigating and resolving data-related issues, troubleshooting data anomalies, and providing support to resolve data-related incidents or problems.
  • Data collaboration and sharing: Facilitating data collaboration and sharing across teams, ensuring data accessibility, and promoting data-driven decision-making within the organization.
  • Continuous improvement: Staying updated with emerging technologies, industry trends, and best practices in data engineering, and actively seeking opportunities to improve data processes, quality, and efficiency.

Let’s get a back-of-the-napkin estimate of how much time they spend on those areas.

Here’s an approximation as offered by GPT. Of course, actual numbers depend on the maturity of your team and their unique challenges.

  • Collaboration with others (including data curation): Approximately 40-60% of their working hours. This includes tasks such as collaborating with team members, understanding requirements, data curation activities, participating in meetings, and coordinating data-related activities.
  • Data analysis: Around 10-30% of their working hours. This involves supporting data exploration, providing insights, and assisting analysts in understanding and extracting value from the data.
  • Technical problem-solving (structuring, maintenance, optimization): Roughly 30-50% of their working hours. This includes solving data structuring problems, maintaining existing data structures, optimizing data pipelines, troubleshooting technical issues, and continuously improving processes.

By looking at it this way, solutions become clear:

  • Let someone else do curation: analysts could talk directly to producers. By removing the middleman, you also improve the speed and quality of the process.
  • Automate data structuring: while this is not as time-consuming as collaboration, it’s the second most time-consuming process.
  • Let analysts explore structured data during curation, not before load. This is a minor optimization, but 10-30% is still very significant towards our goal of reducing workload by 50%.

How much of their time could be saved?

ChatGPT thinks it is reasonable to expect significant time savings, with the following estimates:

  1. Automation of Structuring and Maintenance: By automating the structuring and maintenance of data, data engineers can save 30-50% or more of their time previously spent on these tasks. This includes activities like schema evolution, data transformation, and pipeline optimization, which can be streamlined through automation.
  2. Analysts and Producers Handling Curation: Shifting the responsibility of data curation to analysts and producers can save an additional 10-30% of the data engineer's time. This includes tasks such as data cleaning, data validation, and data quality assurance, which can be effectively performed by individuals closer to the data and its context.

It's important to note that these estimates are approximate and can vary based on the specific circumstances and skill sets within the team.

40-80% of a data engineer’s time could be spared


To achieve that:

  • Automate data structuring.
  • Govern the data without the data engineer.
  • Let analysts explore data as part of curation, instead of asking data engineers to do it.

This looks good enough for solving the talent shortage. Not only that, but doing things this way lets your team focus on what they do best.

A recipe to do it

  1. Use something with schema inference and evolution to load your data.
  2. Notify stakeholders and producers of data changes, so they can curate it.
  3. Don’t explore JSON with data engineers - let analysts explore structured data.

Ready to stop the pain? Read this explainer on how to do schema evolution with dlt. Want to discuss? Join our Slack.

6 min read
Adrian Brudaru

Schema evolution combines a technical process with a curation process, so let's understand both, and where technical automation needs to be combined with human curation.

Whether you are aware of it or not, you are always using structured data

Data is always structured at the point of use, but usually produced unstructured.

Structuring it implicitly during reading is called "schema on read", while structuring it upfront is called "schema on write".

To fit unstructured data into a structured database, developers have to perform this transition before loading. For data lake users who read unstructured data, their pipelines apply a schema during read - if this schema is violated, the downstream software will produce bad outcomes.

We tried running away from our problems, but it didn't work.

Because structuring data is difficult to deal with, people have tried to not do it. But this created its own issues.

  • Loading JSON into the database without typing or structuring - this anti-pattern shifts the structuring of data to the analyst. While this is a good move for curation, database support for structuring data is minimal and unsafe. In practice, this translates to the analyst spending their time writing lots of untested parsing code and pushing silent bugs to production.
  • Loading unstructured data to lakes - this pattern pushes the curation of data to the analyst. The problem here is similar to the one above: unstructured data is hard to analyze and curate, and the farther it is from the producer, the harder it is to understand.

So no, one way or another we are using schemas.

If curation is hard, how can we make it easier?

  • Make data easier to discover, analyze, explore. Structuring upfront would do that.
  • Simplify the human process by decentralizing data ownership and curation - the analyst can work directly with the producer to define the dataset produced.

Structuring & curating data are two separate problems. Together they are more than the sum of the parts.

The problem is that curating data is hard.

  • Typing and normalising data are technical processes.
  • Curating data is a business process.

Here's what a pipeline building process looks like:

  1. Speak with the producer to understand what the data is. Chances are the producer does not document it and there will be many cases that need to be validated analytically.
  2. Speak with the analyst or stakeholder to get their requirements. Guess which fields fulfill their requirements.
  3. Combine the 2 pieces of info to filter and structure the data so it can be loaded.
  4. Type the data (for example, convert strings to datetime).
  5. Load the data to the warehouse. The analyst can now validate whether this was the desired data with the correct assumptions.
  6. Analyst validates with stakeholder that this is the data they wanted. Stakeholder usually wants more.
  7. Possibly adjust the data filtering, normalization.
  8. Repeat entire process for each adjustment.

And when something changes:

  1. The data engineer sees something break.
  2. They ask the producer about it.
  3. They notify the analyst about it.
  4. The analyst notifies the business that data will stop flowing until adjustments are made.
  5. The analyst discusses with the stakeholder to get any updated requirements.
  6. The analyst offers the requirements to the data engineer.
  7. The data engineer checks with the producer (or the data itself) how the new data should be loaded.
  8. Data engineer loads the new data.
  9. The analyst can now adjust their scripts, re-run them, and offer the data to the stakeholder.

Divide et impera! The two problems are technical and communicational, so let's let computers solve tech and let humans solve communication.

Before we start solving, let's understand the problem:

  1. For usage, data needs to be structured.
  2. Because structuring is hard, we try to reduce the amount we do by curating first or deferring to the analyst by loading unstructured data.
  3. Now we are trying to solve two problems at once: structuring and curation, with each role functioning as a bottleneck for the other.

So let's de-couple these two problems and solve them appropriately:

  • The technical issue is that unstructured data needs to be structured.
  • The curation issue relates to communication - so taking the engineer out of the loop would make this easier.

Automate the tech: Structuring, typing, normalizing

The only reason to keep data unstructured was the difficulty of applying structure.

By automating schema inference, evolution, normalization, and typing, we can just load our JSON into structured data stores and curate it in a separate step.
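
To make this concrete, here is a minimal sketch of that idea using dlt; the pipeline name, dataset name, and sample records below are invented for illustration:

```python
import dlt

# sample nested JSON, standing in for whatever your source produces
users = [
    {"id": 1, "name": "alice", "address": {"city": "Berlin", "zip": "10115"}},
    {"id": 2, "name": "bob", "orders": [{"sku": "A-1", "qty": 2}]},
]

# dlt infers column types, flattens nested objects into columns
# (e.g. address__city), breaks lists of objects out into child tables,
# and versions the resulting schema on every run
pipeline = dlt.pipeline(
    pipeline_name="users_pipeline",  # illustrative name
    destination="duckdb",
    dataset_name="raw_users",
)
load_info = pipeline.run(users, table_name="users")
print(load_info)  # summarizes what was loaded, including schema updates
```

From here, curation happens on typed, structured tables in the warehouse rather than on raw JSON.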

Alert the communicators: When there is new data, alert the producer and the curator.

To govern how data is produced and used, we need to have a definition of the data that the producer and consumer can both refer to. This has typically been tackled with data contracts - a type of technical test that would notify the producer and consumer of violations.

So how would a data contract work?

  1. Human process:
    1. Humans define a data schema.
    2. Humans write a test to check if data conforms to the schema.
    3. Humans implement notifications for test fails.
  2. Technical process:
    1. Data is extracted.
    2. Data is staged somewhere it can be tested.
    3. Data is tested:
      1. If the test fails, we notify the producer and the curator.
      2. If the test succeeds, it gets transformed to the curated form.
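
As a rough sketch of what that looks like in code - the schema, field names, and webhook URL below are invented for illustration, and the validation leans on the jsonschema library - the technical half might be a test like this, run against staged data:

```python
import requests
from jsonschema import Draft7Validator

# The schema the producer and consumer agreed on - the "contract".
CONTRACT = {
    "type": "object",
    "required": ["id", "email", "created_at"],
    "properties": {
        "id": {"type": "integer"},
        "email": {"type": "string"},
        "created_at": {"type": "string"},
    },
}

SLACK_HOOK = "https://hooks.slack.com/services/..."  # placeholder incoming-webhook URL


def test_batch(records: list) -> list:
    """Return human-readable contract violations for a staged batch of records."""
    validator = Draft7Validator(CONTRACT)
    errors = []
    for i, record in enumerate(records):
        for err in validator.iter_errors(record):
            errors.append(f"record {i}: {err.message}")
    return errors


def notify(errors: list) -> None:
    # Post the violations to the channel shared by producer and curator.
    requests.post(SLACK_HOOK, json={"text": "Data contract violated:\n" + "\n".join(errors)})


staged = [{"id": "7", "email": "x@example.com"}]  # wrong type, missing field
violations = test_batch(staged)
if violations:
    notify(violations)
# otherwise, carry on transforming the staged data into its curated form
```

Notice how much of this is human work: someone has to write and maintain the schema, the test, and the notification. Schema evolution, described next, automates the schema half of that contract.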

So how would we do schema evolution with dlt?

  1. Data is extracted, dlt infers schema and can compare it to the previous schema.
  2. Data is loaded to a structured data lake (staging area).
  3. Destination schema is compared to the new incoming schema.
    1. If there are changes, we notify the producer and curator.
    2. If there are no changes, we carry on with transforming it to the curated form.

So, schema evolution is essentially a simpler way to do a contract on schemas. If you had additional business-logic tests, you would still need to implement them in a custom way.

The implementation recipe

  1. Use dlt. It will automatically infer and version schemas, so you can simply check if there are changes. You can just use the normalizer + loader or build extraction with dlt. If you want to define additional constraints, you can do so in the schema.
  2. Define your Slack hook or create your own notification function. Make sure the Slack channel contains the data producer and any stakeholders.
  3. Capture the load job info and send it to the hook - see the sketch below.
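
Here is a minimal sketch of that recipe. The pipeline name, dataset name, webhook URL, and sample records are placeholders, and the load_packages / schema_update attributes reflect my reading of dlt's LoadInfo object, so check the current dlt docs for the exact shape:

```python
import dlt
import requests

SLACK_HOOK = "https://hooks.slack.com/services/..."  # placeholder incoming-webhook URL


def notify_schema_changes(load_info) -> None:
    # Walk the load packages and post any new or changed tables/columns
    # to the channel shared by the producer and the curator.
    # Attribute names follow dlt's LoadInfo as I understand it.
    for package in load_info.load_packages:
        for table_name, table in (package.schema_update or {}).items():
            for column_name, column in table.get("columns", {}).items():
                text = (
                    f"Schema change in table `{table_name}`: "
                    f"column `{column_name}` is now `{column.get('data_type')}`"
                )
                requests.post(SLACK_HOOK, json={"text": text})


# illustrative records standing in for your extraction step
events = [{"user_id": 1, "event": "signup", "ts": "2023-06-01T10:00:00Z"}]

pipeline = dlt.pipeline(pipeline_name="events", destination="duckdb", dataset_name="staging")
load_info = pipeline.run(events, table_name="events")
notify_schema_changes(load_info)
```

Anyone in that Slack channel - producer, curator, stakeholders - now learns about schema changes as they land, without the data engineer relaying messages.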

This demo works on Codespaces. Codespaces is a development environment available for free to anyone with a GitHub account. You'll be asked to fork the demo repository, and from there the README guides you through further steps.
The demo uses the Continue VSCode extension.

Off to Codespaces!
