Solution 1 :

f.readlines() does read every line: it returns a list with one string per line. You need to remove m to get all lines.
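As a minimal sketch of what readlines() hands back (the in-memory file and the example.org URLs below are placeholders, not taken from the question): the result is a list of strings, so each element has to be handled on its own rather than passing the whole list to a function that expects a single URL.

```python
import io

# Stand-in for the real file; these URLs are placeholders for illustration only.
fake_file = io.StringIO(
    "https://example.org/page1\n"
    "https://example.org/page2\n"
)

lines = fake_file.readlines()   # one string per line, newline included
print(type(lines).__name__)
print(len(lines))

# Handle each line separately, stripping the trailing newline.
urls = [line.strip() for line in lines]
print(urls)
```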

You could also iterate the file directly

with open('simplewiki.xml', 'r') as f:
    for wiki_link in f:
        wiki_link = wiki_link.strip()  # drop the trailing newline
        # ...

But this assumes a standard, line-delimited text file, not actual semi-structured (X)HTML.
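To write each dictionary to its own JSON file, one possible sketch is shown below. It assumes the file holds one URL per line and that each URL returns JSON. The helper names fetch_json and save_each_as_json and the page_00001.json naming scheme are hypothetical choices, not something from the question. The fetch parameter is where the question's own dictionary-building code would go.

```python
import json
import urllib.request
from pathlib import Path


def fetch_json(url):
    """Download one URL and decode its body as JSON (assumes the API returns JSON)."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def save_each_as_json(links_path, out_dir, fetch=fetch_json):
    """Write one numbered JSON file per non-empty line of links_path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with open(links_path, "r", encoding="utf-8") as f:
        for index, line in enumerate(f, start=1):
            wiki_link = line.strip()
            if not wiki_link:
                continue  # skip blank lines
            data = fetch(wiki_link)  # one dictionary per link
            target = out_dir / f"page_{index:05d}.json"
            with open(target, "w", encoding="utf-8") as out:
                json.dump(data, out, ensure_ascii=False, indent=2)
            written.append(target)
    return written
```

Calling save_each_as_json('simplewiki.xml', 'output') would then leave one file per link in the output directory, numbered by the line the link came from. Because each file is opened and closed inside the loop, a failure on one link does not lose the files already written.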

Problem :

I’m trying to build a small tool to scrape from the Wikipedia API.

I have a few links in an XML file (which is saved as wikipedia.html) that I’d like to iterate over (line-by-line) and obtain an output which I can then json.dump.

The wikipedia.html file contains links to the Simple Wikipedia API for certain pages (there are thousands of lines in the xml file, but they’re of an identical format):

What I’d like to do is be able to start off with the first line in the html file, scrape the data from the Wikipedia API and then return the desired contents as a dictionary. I would then like to store this as a json.

Once that’s all been done, I’d then like the script to go to the second line in the html file and do the same thing, and so on (iterate over every line until it reaches the end of the html file).

Currently I can return a dictionary in the format that I want by iterating over every line (the script keeps spitting out dictionaries). However, I’m unsure how to store each dictionary as a json during each iteration and then have the script parse the next line in the .xml file and do the same thing.

For example, if I start my current script, the first line in the html file gives the following dictionary:

I then would like to save this as a json.

I then would like to be able to save the second dictionary that’s outputted from the script as it reads the second line as another separate json and so forth.

If anyone knows how to take each dictionary that’s outputted and save them as separate json files that can be uploaded to GitHub then that’d be amazing, thank you.

Comments

Comment posted by OneCricketeer

XML files should actually be parsed (using XPath, for example), not iterated line by line, but you’re already iterating all the lines, so what’s the issue? You seem to just have a plaintext file since there are no XML tags.
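For a file that really is XML, parsing might look like the sketch below. The document structure, the element names (pages, page, link) and the URLs are invented for illustration, since the question does not show the file's contents.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: element names and URLs are made up for this example.
xml_text = """<pages>
  <page><link>https://example.org/a</link></page>
  <page><link>https://example.org/b</link></page>
</pages>"""

root = ET.fromstring(xml_text)
# ElementTree supports a limited XPath subset; ".//link" matches every <link> descendant.
links = [element.text.strip() for element in root.findall(".//link")]
print(links)
```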

Comment posted by Mark Land

Hmm, what I’m trying to do is output a dictionary from my script, and then save that as a single json file. I then want to be able to do that for all dictionaries that are outputted from the script (but every dictionary should be saved as a separate json).

Comment posted by minimal reproducible example

Please post a sample of the XML file for a minimal reproducible example.

Comment posted by Mark Land

Thanks for the response. I tried doing the first command, but unfortunately ran into this error before I posted this, which I wasn’t sure how to solve:

      File "stack_overflow.py", line 8, in <module>
        with urllib.request.urlopen(wiki_link) as url:
      File "/usr/lib/python3.8/urllib/request.py", line 222, in urlopen
        return opener.open(url, data, timeout)
      File "/usr/lib/python3.8/urllib/request.py", line 515, in open
        req.timeout = timeout
    AttributeError: 'list' object has no attribute 'timeout'

Comment posted by Mark Land

With regards to the second suggestion, I am able to iterate over every line, but I was hoping to be able to convert all these dictionary outputs into separate json files:

Comment posted by simple.wikipedia.org/wiki/April

Such as a single json file for: {‘title’: ‘April’, ‘source’: ‘

Comment posted by minimal reproducible example

If you’re able to get one url working (i.e create a
