Sync gharchive Website Data to Object Storage Using Python Tasks
The Python node in Lakehouse Studio provides Python code development, test execution, and scheduling capabilities. With scheduling, a single piece of code can handle both full-data backfill tasks and periodic scheduled tasks. By setting task dependencies, you can orchestrate hybrid workflows that combine Python tasks with SQL tasks, Shell scripts, data integration, and other task types.

Writing Python Code
import io
import subprocess
from datetime import datetime, timedelta
import oss2
# Alibaba Cloud OSS configuration. ak/sk are custom parameters. Modify ENDPOINT based on the actual OSS region.
ACCESS_KEY_ID = '${ak}'
ACCESS_KEY_SECRET = '${sk}'
BUCKET_NAME = 'YourBucketName'
ENDPOINT = 'oss-cn-shanghai-internal.aliyuncs.com'
ROOT_PATH = 'ghachive'
# Task time in Beijing time (UTC+8), taken from the datetime parameter
# beijing_time = datetime.now()
beijing_time = datetime.strptime('${datetime}', "%Y-%m-%d %H:%M:%S")
# Get the data file time: offset Beijing time by 9 hours
# (8 hours to convert UTC+8 to UTC, plus 1 hour delay for gharchive data file generation)
file_time = beijing_time - timedelta(hours=9)
# Format the time
year = file_time.strftime('%Y')
month = file_time.strftime('%m')
day = file_time.strftime('%d')
hour = file_time.strftime('%H')
# Print the converted time
print(f"Data file time (Beijing time minus 9 hours): {year}-{month}-{day} {hour}:00:00")
# gharchive file names use an unpadded hour, so if hour is in '0x' format, remove the leading zero
if hour.startswith('0') and len(hour) > 1:
    # Remove the leading '0'
    hour = hour[1:]
try:
    # Build the wget command
    url = f"https://data.gharchive.org/{year}-{month}-{day}-{hour}.json.gz"
    cmd = ["wget", "-qO-", url]
    print(f"wget cmd: {cmd}")
    # Run wget and capture the downloaded bytes from stdout
    wget_output = subprocess.check_output(cmd)
    print("Wget file done...")
    # Wrap the downloaded bytes in an in-memory file object
    file_obj = io.BytesIO(wget_output)
except Exception as e:
    print(f"An error occurred: {e}")
    file_obj = None
    # Re-raise the exception to cooperate with task retries. Scheduling is set to retry 3 times with 10-minute intervals,
    # to handle cases where the source file is not produced on time, improving robustness.
    raise
if file_obj:
    try:
        # Initialize Alibaba Cloud OSS
        auth = oss2.Auth(ACCESS_KEY_ID, ACCESS_KEY_SECRET)
        bucket = oss2.Bucket(auth, ENDPOINT, BUCKET_NAME)
        # Upload file to OSS
        oss_path = f"{ROOT_PATH}/{year}/{month}/{day}/{year}-{month}-{day}-{hour}.json.gz"
        print(f"osspath: {oss_path}")
        bucket.put_object(oss_path, file_obj)
        print("Put file to oss done...")
    except Exception as e:
        print(f"An error occurred: {e}")
        # Re-raise so that a failed upload also fails the task and goes through the same retries
        raise
    finally:
        # Close the in-memory file object
        file_obj.close()
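The time offset and naming rules used above can be isolated into a small function, which makes them easy to check without touching the network. The following is a minimal sketch, not part of the task itself: the function name gharchive_targets and its parameters are hypothetical, and it assumes the same 9-hour offset, unpadded hour, and year/month/day object layout as the task code.

```python
from datetime import datetime, timedelta
from typing import Tuple


def gharchive_targets(task_time: str, root_path: str) -> Tuple[str, str]:
    """Return (source URL, OSS object key) for a task time given in Beijing time.

    Hypothetical helper: mirrors the offset and naming logic of the task code.
    """
    beijing_time = datetime.strptime(task_time, "%Y-%m-%d %H:%M:%S")
    # 8 hours to reach UTC plus 1 hour for the file generation delay
    file_time = beijing_time - timedelta(hours=9)
    # The hour in the file name is not zero-padded (file_time.hour is an int)
    file_name = f"{file_time:%Y-%m-%d}-{file_time.hour}.json.gz"
    url = f"https://data.gharchive.org/{file_name}"
    oss_key = f"{root_path}/{file_time:%Y/%m/%d}/{file_name}"
    return url, oss_key


# The prefix below is the ROOT_PATH value from the task code.
for t in ("2024-06-18 11:00:00", "2024-06-18 08:00:00"):
    print(t, "->", gharchive_targets(t, "ghachive"))
```

The second sample time shows the day rollover: a task time of 08:00 Beijing time maps to hour 23 of the previous day.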
Running Tests
Click Run to test the code and verify that the results meet expectations.
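One way to check that a downloaded file looks as expected is to decompress it and count the events it contains. The sketch below assumes the archive is gzip-compressed, newline-delimited JSON with a "type" field on each event; the function name summarize_archive is hypothetical, and the sample payload is a stand-in rather than real gharchive data.

```python
import gzip
import io
import json


def summarize_archive(gz_bytes: bytes) -> dict:
    """Decompress an hourly archive and report its event count and first event type."""
    count = 0
    first_type = None
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8") as lines:
        for line in lines:
            if not line.strip():
                continue  # skip blank lines
            if count == 0:
                # Assumption: each line is one JSON event with a "type" field
                first_type = json.loads(line).get("type")
            count += 1
    return {"events": count, "first_type": first_type}


# Stand-in payload with two events, compressed the same way as an archive file
sample = gzip.compress(b'{"type": "PushEvent"}\n{"type": "WatchEvent"}\n')
print(summarize_archive(sample))
```

In the task code, the same function could be called on wget_output before the upload, since that variable holds the raw downloaded bytes.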
Schedule Configuration and Task Publishing
Since the gharchive website generates a new file every hour, set the scheduling interval to 1 hour.

Then click Submit to publish the task. Once published, the Python task syncs gharchive files to OSS cloud object storage every hour.
Full Sync via Backfill
A scheduled task only acquires data from its specified start time onward. To obtain the data generated before that time point, you can run a "backfill" with the same code and task, batch-syncing all data before the first scheduled cycle and thereby achieving a full sync. Because the backfill and the periodic runs share one codebase, their logic stays consistent.
Click Operations, enter the scheduled task's operations page, then click Backfill.
Files on gharchive started being generated on 2012-02-12, so set the backfill task start time to 2012-02-12 00:00:00.
The periodic scheduling of this task starts at 11:00 on 2024-06-18, so set the backfill task end time to 2024-06-18 11:00:00.

Preview the instances generated by the backfill task. A total of 108,251 task instances will be created. This means there are 108,251 hours in the above time range, and the backfill operation will sync 108,251 files from the gharchive website to cloud object storage.
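The instance count can be reproduced with a short calculation. This is a minimal sketch that assumes one instance per hour and treats the end time as exclusive, which is what the previewed total implies; the function name hourly_instances is hypothetical.

```python
from datetime import datetime, timedelta


def hourly_instances(start: datetime, end: datetime) -> int:
    """Number of hourly instances in the half-open range [start, end)."""
    return (end - start) // timedelta(hours=1)


start = datetime(2012, 2, 12, 0, 0, 0)   # backfill start time
end = datetime(2024, 6, 18, 11, 0, 0)    # backfill end time (first periodic run)
print(hourly_instances(start, end))  # 108251
```

The range spans 4,510 full days plus 11 hours, which gives 4,510 × 24 + 11 = 108,251.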

Task Orchestration
In subsequent task development, you can set this task as a dependency to implement workflow orchestration.