
3 posts tagged with "elt"


· 7 min read
Adrian Brudaru

Free APIs for Data Engineering

Practicing data engineering is better with real data sources. If you are considering doing a data engineering project, consider the following:

  • Ideally, your data has entities and activities, so you can model dimensions and facts.
  • Ideally, the APIs have no auth, so they can be easily tested.
  • Ideally, the API should have some use case that you are modelling and showing the data for.
  • Ideally, you build end-to-end pipelines to showcase extraction, ingestion, modelling and displaying data.

This article outlines 25 APIs, detailing their use cases, any free tier limitations, and authentication needs.

Material teaching data loading with dlt:

  • Data Talks Club data engineering zoomcamp
  • Data Talks Club open source spotlight
  • Docs

APIs Overview

1. PokeAPI

  • URL: PokeAPI.
  • Use: Import Pokémon data for projects on data relationships and stats visualization.
  • Free: Rate-limited to 100 requests/IP/minute.
  • Auth: None.
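
Since PokeAPI needs no auth, a quick sanity check takes a single request. A minimal sketch using the public /pokemon endpoint:

import requests

# fetch the first 20 Pokémon (no API key required)
resp = requests.get("https://pokeapi.co/api/v2/pokemon", params={"limit": 20})
resp.raise_for_status()
for entry in resp.json()["results"]:
    print(entry["name"], entry["url"])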

2. REST Countries API

  • URL: REST Countries.
  • Use: Access country data for projects analyzing global metrics.
  • Free: Unlimited.
  • Auth: None.

3. OpenWeather API

  • URL: OpenWeather.
  • Use: Fetch weather data for climate analysis and predictive modeling.
  • Free: Limited requests and features.
  • Auth: API key.

4. JSONPlaceholder API

  • URL: JSONPlaceholder.
  • Use: Ideal for testing and prototyping with fake data. Use it to simulate CRUD operations on posts, comments, and user data.
  • Free: Unlimited.
  • Auth: None required.

5. Quandl API

  • URL: Quandl.
  • Use: For financial market trends and economic indicators analysis.
  • Free: Some datasets require premium.
  • Auth: API key.

6. GitHub API

  • URL: GitHub API
  • Use: Analyze open-source trends, collaborations, or stargazers data. You can use it from our verified sources repository.
  • Free: 60 requests/hour unauthenticated, 5000 authenticated.
  • Auth: OAuth or personal access token.
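
For keyed APIs, authentication usually means one extra header. A hedged sketch for the GitHub API (the GITHUB_TOKEN environment variable is an assumption; any personal access token works):

import os
import requests

# unauthenticated calls are capped at 60/hour; a token raises that to 5000/hour
headers = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"}
resp = requests.get("https://api.github.com/repos/dlt-hub/dlt/stargazers", headers=headers)
resp.raise_for_status()
print(len(resp.json()))  # stargazers returned on the first page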

7. NASA API

  • URL: NASA API.
  • Use: Space-related data for projects on space exploration or earth science.
  • Free: Rate-limited.
  • Auth: API key.

8. The Movie Database (TMDb) API

  • URL: TMDb API.
  • Use: Movie and TV data for entertainment industry trend analysis.
  • Free: Requires attribution.
  • Auth: API key.

9. CoinGecko API

  • URL: CoinGecko API.
  • Use: Cryptocurrency data for market trend analysis or predictive modeling.
  • Free: Rate-limited.
  • Auth: None.

10. Public APIs GitHub list

  • URL: Public APIs list.
  • Use: Discover APIs for various projects. A meta-resource.
  • Free: Varies by API.
  • Auth: Depends on API.

11. News API

  • URL: News API.
  • Use: Get datasets containing current and historic news articles.
  • Free: Access to current news articles.
  • Auth: API-Key.

12. Exchangerates API

  • URL: Exchangerate API.
  • Use: Get realtime, intraday and historic currency rates.
  • Free: 250 monthly requests.
  • Auth: API-Key.

13. Spotify API

  • URL: Spotify API.
  • Use: Get spotify content and metadata about songs.
  • Free: Rate limit.
  • Auth: API-Key.

14. Football API

  • URL: FootBall API.
  • Use: Get information about Football Leagues & Cups.
  • Free: 100 requests/day.
  • Auth: API-Key.

15. Yahoo Finance API

  • URL: Yahoo Finance API.
  • Use: Access a wide range of financial data.
  • Free: 500 requests/month.
  • Auth: API-Key.

16. Basketball API

  • URL: Basketball API.
  • Use: Get information about basketball leagues & cups.
  • Free: 100 requests/day.
  • Auth: API-Key.

17. NY Times API

  • URL: NY Times API.
  • Use: Get info about articles, books, movies and more.
  • Free: 500 requests/day or 5 requests/minute.
  • Auth: API-Key.

18. Spoonacular API

  • URL: Spoonacular API.
  • Use: Get info about ingredients, recipes, products and menu items.
  • Free: 150 requests/day and 1 request/sec.
  • Auth: API-Key.

19. Movie database alternative API

  • URL: Movie database alternative API.
  • Use: Movie data for entertainment industry trend analysis.
  • Free: 1000 requests/day and 10 requests/sec.
  • Auth: API-Key.

20. RAWG Video games database API

  • URL: RAWG Video Games Database.
  • Use: Gather video game data, such as release dates, platforms, genres, and reviews.
  • Free: Unlimited requests for limited endpoints.
  • Auth: API key.

21. Jikan API

  • URL: Jikan API.
  • Use: Access data from MyAnimeList for anime and manga projects.
  • Free: Rate-limited.
  • Auth: None.

22. Open Library Books API

  • URL: Open Library Books API.
  • Use: Access data about millions of books, including titles, authors, and publication dates.
  • Free: Unlimited.
  • Auth: None.

23. YouTube Data API

  • URL: YouTube Data API.
  • Use: Access YouTube video data, channels, playlists, etc.
  • Free: Limited quota.
  • Auth: Google API key and OAuth 2.0.

24. Reddit API

  • URL: Reddit API.
  • Use: Access Reddit data for social media analysis or content retrieval.
  • Free: Rate-limited.
  • Auth: OAuth 2.0.

25. World Bank API

  • URL: World bank API.
  • Use: Access economic and development data from the World Bank.
  • Free: Unlimited.
  • Auth: None.

Each API offers unique insights for data engineering, from ingestion to visualization. Check each API's documentation for up-to-date details on limitations and authentication.

Using the above sources

You can create a pipeline for the APIs discussed above by using dlt's REST API source. Let’s create a PokeAPI pipeline as an example. Follow these steps:

  1. Create a Rest API source:

    dlt init rest_api duckdb
  2. The following directory structure gets generated:

    rest_api_pipeline/
    ├── .dlt/
    │   ├── config.toml          # configs for your pipeline
    │   └── secrets.toml         # secrets for your pipeline
    ├── rest_api/                # folder with source-specific files
    │   └── ...
    ├── rest_api_pipeline.py     # your main pipeline script
    ├── requirements.txt         # dependencies for your pipeline
    └── .gitignore               # ignore files for git (not required)
  3. Configure the source in rest_api_pipeline.py:

    import dlt
    from rest_api import rest_api_source

    def load_pokemon() -> None:
        pipeline = dlt.pipeline(
            pipeline_name="rest_api_pokemon",
            destination="duckdb",
            dataset_name="rest_api_data",
        )

        pokemon_source = rest_api_source(
            {
                "client": {
                    "base_url": "https://pokeapi.co/api/v2/",
                },
                "resource_defaults": {
                    "endpoint": {
                        "params": {
                            "limit": 1000,
                        },
                    },
                },
                "resources": [
                    "pokemon",
                    "berry",
                    "location",
                ],
            }
        )

        load_info = pipeline.run(pokemon_source)
        print(load_info)

For a detailed guide on creating a pipeline using the Rest API source, please read the Rest API source documentation here.
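
You can also sanity-check the load locally. A minimal sketch, assuming the duckdb destination's default behaviour of writing a rest_api_pokemon.duckdb file next to the script:

import duckdb

# run the pipeline first: python rest_api_pipeline.py
conn = duckdb.connect("rest_api_pokemon.duckdb")
print(conn.sql("SELECT * FROM rest_api_data.pokemon LIMIT 5"))

Alternatively, dlt pipeline rest_api_pokemon show should open dlt's built-in Streamlit app for browsing the loaded tables.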

Example projects

Here are some examples from dlt users and working students:

DTC learners showcase

Check out the incredible projects from our DTC learners:

  1. e2e_de_project by scpkobayashi.
  2. de-zoomcamp-project by theDataFixer.
  3. data-engineering-zoomcamp2024-project2 by pavlokurochka.
  4. de-zoomcamp-2024 by snehangsude.
  5. zoomcamp-data-engineer-2024 by eokwukwe.
  6. data-engineering-zoomcamp-alex by aaalexlit.
  7. Zoomcamp2024 by alfredzou.
  8. data-engineering-zoomcamp by el-grudge.

Explore these projects to see the innovative solutions and hard work the learners have put into their data engineering journeys!

Showcase your project

If you want your project to be featured, let us know in the #sharing-and-contributing channel of our community Slack.

· 6 min read
Adrian Brudaru
info

PSSSST! You do ELT, right? not ETL? asking for a friend...

ETL vs ELT? A vendor driven story.

One of the earliest tools for "ETL" was Pentaho Kettle. Kettle stands for "Kettle Extraction Transformation Transport Load Environment", signalling that it transforms the data before loading it. In practice, it was usually used to load data that was later transformed in SQL - via "SQL script" steps while still in the tool, or via database triggers or views outside of it.

Indeed, the tool's creators imagined some folks would write Java to transform data before loading, but the vast majority of data users just wanted to use SQL.

Sound familiar? This is not so different from today's "ELT", is it?

Why did we call it ELT?

The people

Well, first of all, SQL is far more accessible and very powerful for transforming tables, columns and rows, whereas general-purpose programming works value by value. So before purpose-built tooling existed, data people were already doing the transform in SQL - it just made sense.

The "EL" vendors

In the decade following Pentaho, SaaS solutions started offering pipelines that load data straight into your database, removing the option to tinker with it before loading. For this reason, they called it "ELT".

The db vendors

The concept also resonated with MPP databases (massively parallel processing), such as Snowflake, Redshift and BigQuery, which were more than happy to encourage doing all the compute on their side.

The "T in ELT" vendors

Another puzzle piece was dbt, a tool purpose-built for SQL transformation. So if there's a question of ETL or ELT, dbt can only answer ELT. In dbt's world view, data starts dirty in your warehouse, where you "rename, cast, join, enrich" - a true ELT. To make the drudgery of data cleaning in SQL easier, dbt offers some Python support for generating part of the typing and renaming SQL, and a little bit of Python support for scalar operations on some database vendors.

What do we really do?

Most of us do a little bit of both: we extract with Python, and the next steps are loading, cleaning and curation. In some cases, cleaning and curation are optional - when we load a ready-made report from another platform, we will probably not need to clean or curate anything.

Where do we clean data?

Data cleaning usually refers to normalising the data into correct types, usable names, etc. Doing this in SQL means writing a lot of manual code that then needs to be maintained. Structuring data in Python isn't effortless either - it's just less technically difficult - and when metadata is missing, it becomes guesswork.

So, technically the easier place to clean data is Python, but the majority will likely do it in SQL simply because that is what they are most practiced in.
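
To make "cleaning in Python" concrete, here is a hedged sketch of the kind of manual normalisation dlt automates away - hand-written renaming and casting over raw records (the field names are purely illustrative):

from datetime import date

raw = [{"User Name": "John", "signup-date": "2023-01-05", "Age": "42"}]

# manual cleaning: rename to snake_case and cast types explicitly
clean = [
    {
        "user_name": r["User Name"],
        "signup_date": date.fromisoformat(r["signup-date"]),
        "age": int(r["Age"]),
    }
    for r in raw
]
print(clean)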

Where do we transform data?

When it comes to working with tables, SQL is still the better place to be. Joins and aggregations are the core operations that will happen here and they would be much harder to handle scalably in python.

dlt puts the small t back in EtlT - let's see how.

So, Python is still superior at a few operations:

  • typing, renaming, normalising, unpacking
  • complex scalar operations

while we leave the aggregations and joins to the big T, SQL.

Normalisation, typing, unpacking

dlt does this well out of the box. Automatic typing, renaming, flattening, and DDL deployment are all handled by the schema inference and evolution engine. The engine is configurable both in how it works and in what it does; you can read more here: Normaliser, schema settings

Here is a usage example (it's built into the pipeline):


import dlt

# Json, dataframes, iterables, all good
# the data will be auto typed and normalised
data = [{'id': 1, 'name': 'John'}]

# open connection
pipe = dlt.pipeline(destination='bigquery',
                    dataset_name='raw_data')

# self-explanatory declarative interface
job_status = pipe.run(data,
                      write_disposition="merge",
                      primary_key="id",
                      table_name="users")

# optionally load schema and metadata
pipe.run([job_status],
         write_disposition="append",
         table_name="loading_status")

Scalar operations

Sometimes we need to edit a column's value in some very specific way for which SQL doesn't quite cut it. Sometimes, we have data we need to pseudonymise before loading for regulatory reasons.

Because dlt is a library, you can easily change how the data stream is produced or ingested. Besides your own customisations, dlt also supports injecting your transform code into the event stream - see an example here.

Here is a code example of pseudonymisation, a common case where data needs to be transformed before loading:

import dlt
import hashlib

@dlt.source
def dummy_source(prefix: str = None):
    @dlt.resource
    def dummy_data():
        for _ in range(3):
            yield {'id': _, 'name': f'Jane Washington {_}'}
    return dummy_data(),

def pseudonymize_name(doc):
    '''
    Pseudonymisation is a deterministic type of PII-obscuring.
    Its role is to allow identifying users by their hash,
    without revealing the underlying info.
    '''
    # add a constant salt to generate a deterministic hash
    salt = 'WI@N57%zZrmk#88c'
    salted_string = doc['name'] + salt
    sh = hashlib.sha256()
    sh.update(salted_string.encode())
    hashed_string = sh.digest().hex()
    doc['name'] = hashed_string
    return doc


# 1. Create an instance of the source so you can edit it.
data_source = dummy_source()
# 2. Modify this source instance's resource
data_resource = data_source.dummy_data().add_map(pseudonymize_name)
# 3. Inspect your result
for row in data_resource:
    print(row)
# {'id': 0, 'name': '96259edb2b28b48bebce8278c550e99fbdc4a3fac8189e6b90f183ecff01c442'}
# {'id': 1, 'name': '92d3972b625cbd21f28782fb5c89552ce1aa09281892a2ab32aee8feeb3544a1'}
# {'id': 2, 'name': '443679926a7cff506a3b5d5d094dc7734861352b9e0791af5d39db5a7356d11a'}

pipeline = dlt.pipeline(pipeline_name='example', destination='bigquery', dataset_name='normalized_data')
load_info = pipeline.run(data_resource)

The big T

Finally, once you have clean data loaded, you will probably prefer to use SQL and one of the standard tools. dlt offers a dbt runner to get you started easily with your transformation package.

pipeline = dlt.pipeline(
    pipeline_name='pipedrive',
    destination='bigquery',
    dataset_name='pipedrive_dbt'
)

# make or restore venv for dbt, using latest dbt version
venv = dlt.dbt.get_venv(pipeline)

# get runner, optionally pass the venv
dbt = dlt.dbt.package(
    pipeline,
    "pipedrive/dbt_pipedrive/pipedrive",  # or use public git "https://github.com/dbt-labs/jaffle_shop.git"
    venv=venv
)

# run the models and collect any info
# if running fails, the error will be raised with full stack trace
models = dbt.run_all()

# optionally log dbt status
pipeline.run([models],
             write_disposition="append",
             table_name="_models_log")

In conclusion

ETL vs ELT was never really a debate. With some exceptions, almost everyone transforms the data in SQL - but what they call this process depends on who's telling the story.

While it's easier to do most of the transformation in SQL, the tedious part is completely automatable in Python, and the dirty data doesn't need manual normalisation. With dlt, you can do ETL or ELT - or, even better, both, as EtLT.

Or, if you're feeling funny, you can add duckdb in the middle and go full EtLTLT where you have an additional T step in the middle for the kinds of operations that could be done locally. And afterwards you could load to operational systems to add one more L to the name :)
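
As a rough sketch of what that local T step could look like (the table names and the duckdb file path convention are assumptions, not a prescribed pattern): land raw data in a local DuckDB file with dlt, transform it there with SQL, then load the result onwards with a second pipeline.

import dlt
import duckdb

# E + L: land raw data in a local duckdb file
raw = [{"id": 1, "amount": "10.5"}, {"id": 2, "amount": "7.25"}]
local = dlt.pipeline(pipeline_name="local_stage", destination="duckdb", dataset_name="staging")
local.run(raw, table_name="orders")

# t: transform locally in duckdb before shipping the data anywhere else
# (the duckdb destination writes to <pipeline_name>.duckdb by default)
conn = duckdb.connect("local_stage.duckdb")
rows = conn.sql("SELECT id, CAST(amount AS DOUBLE) AS amount FROM staging.orders").fetchall()
cleaned = [{"id": r[0], "amount": r[1]} for r in rows]

# L: load the locally transformed data onwards, e.g. to the warehouse
warehouse = dlt.pipeline(pipeline_name="to_warehouse", destination="bigquery", dataset_name="curated")
warehouse.run(cleaned, table_name="orders_clean")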

Fundamentally, we all agree it's all ETL, with the flavors simply designating specific sub-types.

Start using dlt today

What are you waiting for?

· 5 min read
Tong Chen
info

💡Check out the accompanying colab demo: Google Colaboratory demo


Hi there! 👋 In this article, I will show you a demo of how to train ChatGPT on the open-source dlt repository. Here is the article structure, and you can jump directly to the part that interests you. Let's get started!

I. Introduction

II. Walkthrough

III. Result

IV. Summary

I. Introduction

Navigating an open-source repository can be overwhelming: comprehending a large, unfamiliar codebase is always a significant effort. As a person who just entered the IT industry, I found an easy way to address this problem with an ELT tool called dlt (data load tool) - the Python library for loading data.

In this article, I would love to share a use case - training GPT on the open-source dlt repository by using the dlt library. This way, I can write prompts about dlt and get personalized answers.

II. Walkthrough

The code provided below demonstrates training a chat-oriented GPT model using the dlt-hub repositories (dlt and pipelines). To train the GPT model, we utilized the assistance of two services: Langchain and Deeplake. In order to use these services for our project, you will need to create an account on both platforms and obtain the access tokens. The good news is that both services offer cost-effective options: OpenAI provides a $5 credit to test their API, while Deeplake offers a free tier.

The credit for the code goes to Langchain, which has been duly acknowledged at the end.

1. Run the following commands to install the necessary modules on your system.

python -m pip install --upgrade langchain deeplake openai tiktoken

# Create accounts on platform.openai.com and app.activeloop.ai (for Deeplake).
# After registering, retrieve the access tokens for both platforms,
# store them securely, and enter them when prompted in the next step.

import os
import getpass

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake

os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')
os.environ['ACTIVELOOP_TOKEN'] = getpass.getpass('Activeloop Token:')
embeddings = OpenAIEmbeddings(disallowed_special=())

2. Create a directory to store the code for training the model, and clone the desired repositories into it.

# making a new directory named dlt-repo
!mkdir dlt-repo
# changing the directory to dlt-repo
%cd dlt-repo
# cloning git repos into the dlt-repo directory
# dlt code base
!git clone https://github.com/dlt-hub/dlt.git
# example pipelines to help you get started
!git clone https://github.com/dlt-hub/pipelines.git
# going back to previous directory
%cd ..

3. Load the files from the directory

import os
from langchain.document_loaders import TextLoader

root_dir = './dlt-repo'  # load data from
docs = []
for dirpath, dirnames, filenames in os.walk(root_dir):
    for file in filenames:
        try:
            loader = TextLoader(os.path.join(dirpath, file), encoding='utf-8')
            docs.extend(loader.load_and_split())
        except Exception as e:
            pass

4. Split the files into chunks

# This code uses CharacterTextSplitter to split documents into smaller chunks
# based on character count and stores the resulting chunks in the texts variable.

from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(docs)

5. Create the Deeplake dataset

# Set up your Deeplake dataset by replacing the username with your Deeplake
# account name and choosing a dataset name, e.g. "dlt_gpt".

username = "your_deeplake_username"  # replace with your username from app.activeloop.ai
db = DeepLake(dataset_path=f"hub://{username}/dlt_gpt", embedding_function=embeddings, public=True)  # dataset will be publicly available
db.add_documents(texts)

# Re-open the same dataset read-only and assign it, together with the embeddings, to db
db = DeepLake(dataset_path=f"hub://{username}/dlt_gpt", read_only=True, embedding_function=embeddings)

# Create a retriever
retriever = db.as_retriever()
retriever.search_kwargs['distance_metric'] = 'cos'
retriever.search_kwargs['fetch_k'] = 100
retriever.search_kwargs['maximal_marginal_relevance'] = True
retriever.search_kwargs['k'] = 10

6. Initialize the GPT model

from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain

model = ChatOpenAI(model_name='gpt-3.5-turbo')
qa = ConversationalRetrievalChain.from_llm(model, retriever=retriever)
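
With the chain in place, asking a question is a single call; chat_history starts empty and can be carried over between turns (the question text below is just an example):

chat_history = []
question = "Why should data teams use dlt?"
result = qa({"question": question, "chat_history": chat_history})
print(result["answer"])
chat_history.append((question, result["answer"]))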

III. Result

After the walkthrough, we can start to experiment with different questions, and the model will output answers based on the dlt-hub repositories it was given.

Here, I asked: "Why should data teams use dlt?"

It outputted:

  1. It works seamlessly with Airflow and other workflow managers, making it easy to modify and maintain your code.
  2. You have complete control over your data. You can rename, filter, and modify it however you want before it reaches its destination.

Next, I asked: "Who is dlt for?"

It outputted:

  1. dlt is meant to be accessible to every person on the data team, including data engineers, analysts, data scientists, and other stakeholders involved in data loading. It is designed to reduce knowledge requirements and enable collaborative working between engineers and analysts.

IV. Summary

It worked! We can see how GPT can learn about an open-source library by using dlt and the assistance of Langchain and Deeplake. Moreover, by simply following the steps above, you can customize the GPT model training to your own needs.

Curious? Give the Colab demo💡 a try or share your questions with us, and we'll have ChatGPT address them in our upcoming article.


What's more?

  • Learn more about dlt 👉 here
  • Need help or want to discuss? Join our Slack community! See you there 😊

