Solution 1 :

There is no built-in source PTransform for fetching the HTML content of websites. That said, you can write your own DoFn (

Your pipeline would then look something like the following:

import apache_beam as beam

with beam.Pipeline() as p:
    result = (p
        | "Create input data" >> beam.Create([list_of_urls_to_fetch])
        # A DoFn is applied with ParDo (not Map) and must be instantiated.
        | "Fetch HTML Content" >> beam.ParDo(CustomDoFn()))

Here, CustomDoFn receives the list of URLs as input and fetches their HTML content using the library of your choice.

Problem :

I would like to use Apache Beam to read input data from a URL instead of a file. I could not find any built-in method for this. Is there a way to do it?


Comment posted by AMargheriti

Can you provide some more info on what type of data is sent from the URL?

Comment posted by…

@AMargheriti Thanks for your comment. For instance this URL (