Solution 1 :

There is no built-in source PTransform for fetching the HTML content of websites. That said, you can write your own DoFn (https://beam.apache.org/documentation/programming-guide/#requirements-for-writing-user-code-for-beam-transforms).

Your pipeline would then look something like the following:

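import apache_beam as beam

with beam.Pipeline() as p:
    result = (p
       | "Create input data" >> beam.Create(list_of_urls_to_fetch)
       | "Fetch HTML Content" >> beam.ParDo(CustomDoFn()))

Here CustomDoFn receives each URL as an input element and fetches its HTML content using the library of your choice. Note that a DoFn is applied with beam.ParDo, not beam.Map, and that beam.Create takes the list itself so that each URL becomes a separate element.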

Problem :

I would like to use Apache Beam to read input data from a URL instead of a file. I could not find any built-in method for this. Is there a way to do it?

Comments

Comment posted by AMargheriti

Can you provide some more info on what type of data is sent from the URL?

Comment posted by olx.com.pk/item/…

@AMargheriti Thanks for your comment. For instance this URL (
