Solution 1 :

With MathJax, the equation is initially present in the page in TeX notation. The spans are generated by the MathJax JavaScript to lay out the equation in HTML. Currently, you let MathJax render the equation, grab the rendered markup, and then try to convert it back to the original TeX. It would be more straightforward to read the TeX directly, without the detour through JavaScript rendering.

To achieve that, you just need to disable JavaScript in Selenium. For example, with the Firefox driver this should do the trick:

from selenium.webdriver.firefox.options import Options
from selenium import webdriver

opts = Options()
# Turn off JavaScript so MathJax never runs and the page keeps its original TeX
opts.set_preference("javascript.enabled", False)
driver = webdriver.Firefox(options=opts)

Alternatively, if you need to process the rendered version with JavaScript enabled for some reason, you could read the content of the script element inside the <p>. It contains the full equation, but without the TeX math delimiters:

<p class="q_question">...<script type="math/tex">(2x+5)^2(x-4)</script>...</p>

This way you would not have to remove the spans. You would then need to wrap the equation in TeX math delimiters, \(...\), for the PDF.
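
As a rough illustration of the script-element approach, here is a minimal sketch that uses only the Python standard library. It assumes markup like that shown in the question: rendered output inside spans whose class contains "MathJax", and the TeX source inside a script element of type math/tex. The names MathJaxTexExtractor and html_to_tex_text are made up for this example, and the surrounding text is not escaped for LaTeX, so treat it as a starting point rather than a finished converter.

```python
from html.parser import HTMLParser


class MathJaxTexExtractor(HTMLParser):
    """Collect page text, replacing MathJax output with the original TeX."""

    # Void elements have no closing tag, so they must not change the nesting depth.
    VOID = {"area", "base", "br", "col", "embed", "hr", "img",
            "input", "link", "meta", "source", "track", "wbr"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0   # > 0 while inside a MathJax-rendered span
        self.closing = None   # closing delimiter while inside a math/tex script

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        attrs = dict(attrs)
        script_type = attrs.get("type") or ""
        if self.skip_depth:
            self.skip_depth += 1
        elif tag == "span" and "mathjax" in (attrs.get("class") or "").lower():
            self.skip_depth = 1
        elif tag == "script" and script_type.startswith("math/tex"):
            display = "mode=display" in script_type
            self.parts.append("\\[" if display else "\\(")
            self.closing = "\\]" if display else "\\)"

    def handle_endtag(self, tag):
        if tag in self.VOID:
            return
        if self.skip_depth:
            self.skip_depth -= 1
        elif tag == "script" and self.closing:
            self.parts.append(self.closing)
            self.closing = None

    def handle_data(self, data):
        # Text inside rendered spans is dropped; script content is kept as TeX.
        if not self.skip_depth:
            self.parts.append(data)


def html_to_tex_text(html):
    """Return the text of `html` with each equation wrapped in TeX delimiters."""
    parser = MathJaxTexExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# Shortened version of the markup from the question
sample = (
    '<p class="q_question">Differentiate '
    '<span class="MathJax_Preview"></span>'
    '<span class="mjx-chtml MathJax_CHTML"><span class="mjx-math">'
    '<span class="mjx-char">(</span></span></span>'
    '<script id="MathJax-Element-1" type="math/tex">(2x+5)^2(x-4)</script>'
    ' with respect to x.</p>'
)
print(html_to_tex_text(sample))
# prints: Differentiate \((2x+5)^2(x-4)\) with respect to x.
```

In the scraping loop you would pass question_text_html to html_to_tex_text instead of handing the rendered markup to pypandoc. Characters such as % or & in the surrounding text would still need escaping before the result is used in a LaTeX document.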

Problem :

I have been trying to convert the HTML string question_text_html (a mathematical question written in HTML) in the code below to a LaTeX string using pypandoc, but the converted string keeps including irrelevant fragments like “protecthypertarget{MJX-…}…..”:

import pypandoc
from selenium import webdriver
from selenium.webdriver.common.by import By

# Firefox is an assumption; the original snippet did not show how the driver was created
driver = webdriver.Firefox()
driver.get("https://nigerianscholars.com/past-questions/mathematics/?show_answers=yes")
question_blocks = driver.find_elements(By.CLASS_NAME, 'question_block')
for question_block in question_blocks:
    question_text = question_block.find_element(By.CLASS_NAME, 'question_text')
    # innerHTML here includes the spans that MathJax generated while rendering
    question_text_html = question_text.get_attribute('innerHTML')
    question_latex = pypandoc.convert_text(question_text_html, 'tex', format='html')
    print(f'Question Html is {question_text_html}')
    print(f'Question latex is {question_latex}')

It usually gives:

 Question Html is <html><body><p class="q_question">Differentiate <span class="MathJax_Preview" style="color: inherit;"></span><span class="mjx-chtml MathJax_CHTML" data-mathml='&lt;math &gt;&lt;mo stretchy="false"&gt;(&lt;/mo&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;mi&gt;x&lt;/mi&gt;&lt;mo&gt;+&lt;/mo&gt;&lt;mn&gt;5&lt;/mn&gt;&lt;msup&gt;&lt;mo stretchy="false"&gt;)&lt;/mo&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;/msup&gt;&lt;mo stretchy="false"&gt;(&lt;/mo&gt;&lt;mi&gt;x&lt;/mi&gt;&lt;mo&gt;&amp;#x2212;&lt;/mo&gt;&lt;mn&gt;4&lt;/mn&gt;&lt;mo stretchy="false"&gt;)&lt;/mo&gt;&lt;/math&gt;' id="MathJax-Element-1-Frame" role="presentation" style="font-size: 114%; position: relative;" tabindex="0"><span aria-hidden="true" class="mjx-math" id="MJXc-Node-1"><span class="mjx-mrow" id="MJXc-Node-2"><span class="mjx-mo" id="MJXc-Node-3"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.485em; padding-bottom: 0.572em;">(</span></span><span class="mjx-mn" id="MJXc-Node-4"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.397em; padding-bottom: 0.353em;">2</span></span><span class="mjx-mi" id="MJXc-Node-5"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.221em; padding-bottom: 0.309em;">x</span></span><span class="mjx-mo MJXc-space2" id="MJXc-Node-6"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.309em; padding-bottom: 0.441em;">+</span></span><span class="mjx-mn MJXc-space2" id="MJXc-Node-7"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.397em; padding-bottom: 0.353em;">5</span></span><span class="mjx-msubsup" id="MJXc-Node-8"><span class="mjx-base"><span class="mjx-mo" id="MJXc-Node-9"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.485em; padding-bottom: 0.572em;">)</span></span></span><span class="mjx-sup" style="font-size: 70.7%; vertical-align: 0.513em; padding-left: 0px; padding-right: 0.071em;"><span class="mjx-mn" id="MJXc-Node-10" style=""><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.397em; 
padding-bottom: 0.353em;">2</span></span></span></span><span class="mjx-mo" id="MJXc-Node-11"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.485em; padding-bottom: 0.572em;">(</span></span><span class="mjx-mi" id="MJXc-Node-12"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.221em; padding-bottom: 0.309em;">x</span></span><span class="mjx-mo MJXc-space2" id="MJXc-Node-13"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.309em; padding-bottom: 0.441em;">−</span></span><span class="mjx-mn MJXc-space2" id="MJXc-Node-14"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.397em; padding-bottom: 0.353em;">4</span></span><span class="mjx-mo" id="MJXc-Node-15"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.485em; padding-bottom: 0.572em;">)</span></span></span></span><span class="MJX_Assistive_MathML" role="presentation"><math ><mo stretchy="false">(</mo><mn>2</mn><mi>x</mi><mo>+</mo><mn>5</mn><msup><mo stretchy="false">)</mo><mn>2</mn></msup><mo stretchy="false">(</mo><mi>x</mi><mo>−</mo><mn>4</mn><mo stretchy="false">)</mo></math></span></span><script id="MathJax-Element-1" type="math/tex">(2x+5)^2(x-4)</script> with respect to x.</p></body></html>

Question latex is Differentiate
{}protecthypertarget{MathJax-Element-1-Frame}{}{protecthypertarget{MJXc-Node-1}{}{protecthypertarget{MJXc-Node-2}{}{protecthypertarget{MJXc-Node-3}{}{{(}}protecthypertarget{MJXc-Node-4}{}{{2}}protecthypertarget{MJXc-Node-5}{}{{x}}protecthypertarget{MJXc-Node-6}{}{{+}}protecthypertarget{MJXc-Node-7}{}{{5}}protecthypertarget{MJXc-Node-8}{}{{protecthypertarget{MJXc-Node-9}{}{{)}}}{protecthypertarget{MJXc-Node-10}{}{{2}}}}protecthypertarget{MJXc-Node-11}{}{{(}}protecthypertarget{MJXc-Node-12}{}{{x}}protecthypertarget{MJXc-Node-13}{}{{−}}protecthypertarget{MJXc-Node-14}{}{{4}}protecthypertarget{MJXc-Node-15}{}{{)}}}}{((2x + 5)^{2}(x - 4))}}((2x+5)^2(x-4))
with respect to x.

How can I remove all the “protecthypertarget{MJXc-Node-10}” fragments from the LaTeX, leaving only:

Differentiate {((2x + 5)^{2}(x - 4))}}((2x+5)^2(x-4))
with respect to x.

Comments

Comment posted by tarleb

Could you boil this down to a simpler example? I’m not going to debug a program containing irrelevant details, or to endlessly scroll through vertical text to find out what’s going on. But I’ll be happy to help if it’s clear what the question is.

Comment posted by msughter

Sorry, I have edited the post.

Comment posted by hlg

Now there seems to be something missing in the html output. I assume it has a

Comment posted by msughter

I have posted the full HTML for the question. It's a bit much, but that was the shortest question I could find.

Comment posted by msughter

I also tried removing all the span elements in the equation, but the converter then returns empty LaTeX: ….{}
