My approach (without any external libraries beyond bs4) would be along these lines:
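import re
import bs4

html = '''
#Insert the long HTML provided above
'''
# Parse the HTML so the paragraph elements can be searched
soup = bs4.BeautifulSoup(html, 'html.parser')
els = soup.find_all('p')
for el in els:
    # Collapse runs of spaces and newlines into a single space
    string = re.sub(r'[ \n]+', ' ', el.text).strip()
    print(string)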
Basically, you look for all paragraph elements (which hold the respective text), iterate over them, and clean each one individually with a regex (adjust the pattern to whichever formatting you'd like to have removed). Lastly, the string gets printed.
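Note that taking the text of each paragraph still returns every formula twice, once as rendered Unicode and once as TeX source. A minimal sketch of one way to keep only the TeX, using just the standard library's html.parser, is below. It assumes each formula sits inside an `<mjx-container>` and that its source is stored in `<annotation encoding="application/x-tex">`, as in the HTML from the question. The names `TexTextExtractor`, `extract_text` and `sample` are made up for illustration.

```python
import re
from html.parser import HTMLParser


class TexTextExtractor(HTMLParser):
    """Collect page text, keeping only the TeX annotation of each formula."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.math_depth = 0   # > 0 while inside an <mjx-container>
        self.in_tex = False   # True while inside the x-tex <annotation>

    def handle_starttag(self, tag, attrs):
        if tag == "mjx-container":
            self.math_depth += 1
        elif tag == "annotation" and dict(attrs).get("encoding") == "application/x-tex":
            self.in_tex = True

    def handle_endtag(self, tag):
        if tag == "mjx-container":
            self.math_depth -= 1
        elif tag == "annotation":
            self.in_tex = False

    def handle_data(self, data):
        # Outside formulas keep all text; inside keep only the TeX source.
        if self.math_depth == 0 or self.in_tex:
            self.parts.append(data)


def extract_text(html):
    parser = TexTextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse all whitespace runs into single spaces
    return re.sub(r"\s+", " ", "".join(parser.parts)).strip()


# Shortened, hand-written stand-in for one formula from the question
sample = (
    r'<p>A <span>group</span> <mjx-container><mjx-assistive-mml><math>'
    r'<semantics><mrow><mo>(</mo><mi>G</mi><mo>,</mo><mo>∘</mo><mo>)</mo></mrow>'
    r'<annotation encoding="application/x-tex">(G,\circ)</annotation>'
    r'</semantics></math></mjx-assistive-mml></mjx-container> is a non-empty set.</p>'
)
print(extract_text(sample))
```

The same idea should carry over to bs4: drop or skip everything inside each formula container except the TeX annotation before asking for the text.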
I have a large HTML file of lecture notes that I want to split by the various definitions, theorems, etc. I have already managed to do this; however, when I use the .get_text() function it returns both the Unicode and the LaTeX code for each formula. Is there an (elegant) way of separating these?
Examples:
Raw HTML of a definition:
'''
<div class="ltx_theorem ltx_theorem_definition" id="Ch1.S1.Thmtheorem1">
<h6 class="ltx_title ltx_runin ltx_title_theorem">
<span class="ltx_tag ltx_tag_theorem"><span class="ltx_text ltx_font_bold">Definition 1.1.1</span></span> (Groups: First definition).</h6>
<div class="ltx_para" id="Ch1.S1.Thmtheorem1.p1">
<p class="ltx_p">A <span class="ltx_text ltx_font_bold">group</span> <mjx-container class="MathJax CtxtMenu_Attached_0" ctxtmenu_counter="8" jax="CHTML" role="presentation" style="font-size: 101.1%; position: relative;" tabindex="0"><mjx-math aria-hidden="true" class="ltx_Math MJX-TEX" id="Ch1.S1.Thmtheorem1.p1.m1"><mjx-semantics><mjx-mrow><mjx-mo class="mjx-n"><mjx-c class="mjx-c28"></mjx-c></mjx-mo><mjx-mi class="mjx-i"><mjx-c class="mjx-c1D43A TEX-I"></mjx-c></mjx-mi><mjx-mo class="mjx-n"><mjx-c class="mjx-c2C"></mjx-c></mjx-mo><mjx-mo class="mjx-n" space="2"><mjx-c class="mjx-c2218"></mjx-c></mjx-mo><mjx-mo class="mjx-n"><mjx-c class="mjx-c29"></mjx-c></mjx-mo></mjx-mrow></mjx-semantics></mjx-math><mjx-assistive-mml display="inline" role="presentation" unselectable="on"><math alttext="(G,circ)" class="ltx_Math" display="inline" ><semantics><mrow><mo stretchy="false">(</mo><mi>G</mi><mo>,</mo><mo>∘</mo><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">(G,circ)</annotation></semantics></math></mjx-assistive-mml></mjx-container> is a non-empty set <mjx-container class="MathJax CtxtMenu_Attached_0" ctxtmenu_counter="9" jax="CHTML" role="presentation" style="font-size: 101.1%; position: relative;" tabindex="0"><mjx-math aria-hidden="true" class="ltx_Math MJX-TEX" id="Ch1.S1.Thmtheorem1.p1.m2"><mjx-semantics><mjx-mi class="mjx-i"><mjx-c class="mjx-c1D43A TEX-I"></mjx-c></mjx-mi></mjx-semantics></mjx-math><mjx-assistive-mml display="inline" role="presentation" unselectable="on"><math alttext="G" class="ltx_Math" display="inline" ><semantics><mi>G</mi><annotation encoding="application/x-tex">G</annotation></semantics></math></mjx-assistive-mml></mjx-container> together with a binary operation <mjx-container class="MathJax CtxtMenu_Attached_0" ctxtmenu_counter="10" jax="CHTML" role="presentation" style="font-size: 101.1%; position: relative;" tabindex="0"><mjx-math aria-hidden="true" class="ltx_Math MJX-TEX" 
id="Ch1.S1.Thmtheorem1.p1.m3"><mjx-semantics><mjx-mo class="mjx-n"><mjx-c class="mjx-c2218"></mjx-c></mjx-mo></mjx-semantics></mjx-math><mjx-assistive-mml display="inline" role="presentation" unselectable="on"><math alttext="circ" class="ltx_Math" display="inline" ><semantics><mo>∘</mo><annotation encoding="application/x-tex">circ</annotation></semantics></math></mjx-assistive-mml></mjx-container> – called the <span class="ltx_text ltx_font_bold">“group law”</span> – satisfying:</p>
<ol class="ltx_enumerate" id="Ch1.S1.I1">
<li class="ltx_item" id="Ch1.S1.I1.i1">
<div class="ltx_para" id="Ch1.S1.I1.i1.p1">
<p class="ltx_p"><span class="ltx_text ltx_font_bold">Closure</span>: <mjx-container class="MathJax CtxtMenu_Attached_0" ctxtmenu_counter="11" jax="CHTML" role="presentation" style="font-size: 101.1%; position: relative;" tabindex="0"><mjx-math aria-hidden="true" class="ltx_Math MJX-TEX" id="Ch1.S1.I1.i1.p1.m1"><mjx-semantics><mjx-mrow><mjx-mrow><mjx-mrow><mjx-mo class="mjx-n"><mjx-c class="mjx-c2200"></mjx-c></mjx-mo><mjx-mi class="mjx-i"><mjx-c class="mjx-c1D465 TEX-I"></mjx-c></mjx-mi></mjx-mrow><mjx-mo class="mjx-n"><mjx-c class="mjx-c2C"></mjx-c></mjx-mo><mjx-mi class="mjx-i" space="2"><mjx-c class="mjx-c1D466 TEX-I"></mjx-c></mjx-mi></mjx-mrow><mjx-mo class="mjx-n" space="4"><mjx-c class="mjx-c2208"></mjx-c></mjx-mo><mjx-mi class="mjx-i" space="4"><mjx-c class="mjx-c1D43A TEX-I"></mjx-c></mjx-mi></mjx-mrow></mjx-semantics></mjx-math><mjx-assistive-mml display="inline" role="presentation" unselectable="on"><math alttext="forall x,yin G" class="ltx_Math" display="inline" ><semantics><mrow><mrow><mrow><mo>∀</mo><mi>x</mi></mrow><mo>,</mo><mi>y</mi></mrow><mo>∈</mo><mi>G</mi></mrow><annotation encoding="application/x-tex">forall x,yin G</annotation></semantics></math></mjx-assistive-mml></mjx-container>: <mjx-container class="MathJax CtxtMenu_Attached_0" ctxtmenu_counter="12" jax="CHTML" role="presentation" style="font-size: 101.1%; position: relative;" tabindex="0"><mjx-math aria-hidden="true" class="ltx_Math MJX-TEX" id="Ch1.S1.I1.i1.p1.m2"><mjx-semantics><mjx-mrow><mjx-mrow><mjx-mi class="mjx-i"><mjx-c class="mjx-c1D465 TEX-I"></mjx-c></mjx-mi><mjx-mo class="mjx-n" space="3"><mjx-c class="mjx-c2218"></mjx-c></mjx-mo><mjx-mi class="mjx-i" space="3"><mjx-c class="mjx-c1D466 TEX-I"></mjx-c></mjx-mi></mjx-mrow><mjx-mo class="mjx-n" space="4"><mjx-c class="mjx-c2208"></mjx-c></mjx-mo><mjx-mi class="mjx-i" space="4"><mjx-c class="mjx-c1D43A TEX-I"></mjx-c></mjx-mi></mjx-mrow></mjx-semantics></mjx-math><mjx-assistive-mml display="inline" role="presentation" 
unselectable="on"><math alttext="xcirc yin G" class="ltx_Math" display="inline" ><semantics><mrow><mrow><mi>x</mi><mo>∘</mo><mi>y</mi></mrow><mo>∈</mo><mi>G</mi></mrow><annotation encoding="application/x-tex">xcirc yin G</annotation></semantics></math></mjx-assistive-mml></mjx-container>.</p>
</div>
</li>
<li class="ltx_item" id="Ch1.S1.I1.i2">
<div class="ltx_para" id="Ch1.S1.I1.i2.p1">
<p class="ltx_p"><span class="ltx_text ltx_font_bold">Associativity</span>: <mjx-container class="MathJax CtxtMenu_Attached_0" ctxtmenu_counter="13" jax="CHTML" role="presentation" style="font-size: 101.1%; position: relative;" tabindex="0"><mjx-math aria-hidden="true" class="ltx_Math MJX-TEX" id="Ch1.S1.I1.i2.p1.m1"><mjx-semantics><mjx-mrow><mjx-mrow><mjx-mrow><mjx-mo class="mjx-n"><mjx-c class="mjx-c2200"></mjx-c></mjx-mo><mjx-mi class="mjx-i"><mjx-c class="mjx-c1D465 TEX-I"></mjx-c></mjx-mi></mjx-mrow><mjx-mo class="mjx-n"><mjx-c class="mjx-c2C"></mjx-c></mjx-mo><mjx-mi class="mjx-i" space="2"><mjx-c class="mjx-c1D466 TEX-I"></mjx-c></mjx-mi><mjx-mo class="mjx-n"><mjx-c class="mjx-c2C"></mjx-c></mjx-mo><mjx-mi class="mjx-i" space="2"><mjx-c class="mjx-c1D467 TEX-I"></mjx-c></mjx-mi></mjx-mrow><mjx-mo class="mjx-n" space="4"><mjx-c class="mjx-c2208"></mjx-c></mjx-mo><mjx-mi class="mjx-i" space="4"><mjx-c class="mjx-c1D43A TEX-I"></mjx-c></mjx-mi></mjx-mrow></mjx-semantics></mjx-math><mjx-assistive-mml display="inline" role="presentation" unselectable="on"><math alttext="forall x,y,zin G" class="ltx_Math" display="inline" ><semantics><mrow><mrow><mrow><mo>∀</mo><mi>x</mi></mrow><mo>,</mo><mi>y</mi><mo>,</mo><mi>z</mi></mrow><mo>∈</mo><mi>G</mi></mrow><annotation encoding="application/x-tex">forall x,y,zin G</annotation></semantics></math></mjx-assistive-mml></mjx-container>: <mjx-container class="MathJax CtxtMenu_Attached_0" ctxtmenu_counter="14" jax="CHTML" role="presentation" style="font-size: 101.1%; position: relative;" tabindex="0"><mjx-math aria-hidden="true" class="ltx_Math MJX-TEX" id="Ch1.S1.I1.i2.p1.m2"><mjx-semantics><mjx-mrow><mjx-mrow><mjx-mrow><mjx-mo class="mjx-n"><mjx-c class="mjx-c28"></mjx-c></mjx-mo><mjx-mrow><mjx-mi class="mjx-i"><mjx-c class="mjx-c1D465 TEX-I"></mjx-c></mjx-mi><mjx-mo class="mjx-n" space="3"><mjx-c class="mjx-c2218"></mjx-c></mjx-mo><mjx-mi class="mjx-i" space="3"><mjx-c class="mjx-c1D466 
TEX-I"></mjx-c></mjx-mi></mjx-mrow><mjx-mo class="mjx-n"><mjx-c class="mjx-c29"></mjx-c></mjx-mo></mjx-mrow><mjx-mo class="mjx-n" space="3"><mjx-c class="mjx-c2218"></mjx-c></mjx-mo><mjx-mi class="mjx-i" space="3"><mjx-c class="mjx-c1D467 TEX-I"></mjx-c></mjx-mi></mjx-mrow><mjx-mo class="mjx-n" space="4"><mjx-c class="mjx-c3D"></mjx-c></mjx-mo><mjx-mrow space="4"><mjx-mi class="mjx-i"><mjx-c class="mjx-c1D465 TEX-I"></mjx-c></mjx-mi><mjx-mo class="mjx-n" space="3"><mjx-c class="mjx-c2218"></mjx-c></mjx-mo><mjx-mrow space="3"><mjx-mo class="mjx-n"><mjx-c class="mjx-c28"></mjx-c></mjx-mo><mjx-mrow><mjx-mi class="mjx-i"><mjx-c class="mjx-c1D466 TEX-I"></mjx-c></mjx-mi><mjx-mo class="mjx-n" space="3"><mjx-c class="mjx-c2218"></mjx-c></mjx-mo><mjx-mi class="mjx-i" space="3"><mjx-c class="mjx-c1D467 TEX-I"></mjx-c></mjx-mi></mjx-mrow><mjx-mo class="mjx-n"><mjx-c class="mjx-c29"></mjx-c></mjx-mo></mjx-mrow></mjx-mrow></mjx-mrow></mjx-semantics></mjx-math><mjx-assistive-mml display="inline" role="presentation" unselectable="on"><math alttext="(xcirc y)circ z=xcirc(ycirc z)" class="ltx_Math" display="inline" ><semantics><mrow><mrow><mrow><mo stretchy="false">(</mo><mrow><mi>x</mi><mo>∘</mo><mi>y</mi></mrow><mo stretchy="false">)</mo></mrow><mo>∘</mo><mi>z</mi></mrow><mo>=</mo><mrow><mi>x</mi><mo>∘</mo><mrow><mo stretchy="false">(</mo><mrow><mi>y</mi><mo>∘</mo><mi>z</mi></mrow><mo stretchy="false">)</mo></mrow></mrow></mrow><annotation encoding="application/x-tex">(xcirc y)circ z=xcirc(ycirc z)</annotation></semantics></math></mjx-assistive-mml></mjx-container>.</p>
</div>
</li>
<li class="ltx_item" id="Ch1.S1.I1.i3">
<div class="ltx_para" id="Ch1.S1.I1.i3.p1">
<p class="ltx_p"><span class="ltx_text ltx_font_bold">Existence of identity element</span>: <mjx-container class="MathJax CtxtMenu_Attached_0" ctxtmenu_counter="15" jax="CHTML" role="presentation" style="font-size: 101.3%; position: relative;" tabindex="0"><mjx-math aria-hidden="true" class="ltx_Math MJX-TEX" id="Ch1.S1.I1.i3.p1.m1"><mjx-semantics><mjx-mrow><mjx-mrow><mjx-mo class="mjx-n"><mjx-c class="mjx-c2203"></mjx-c></mjx-mo><mjx-mi class="mjx-i"><mjx-c class="mjx-c1D452 TEX-I"></mjx-c></mjx-mi></mjx-mrow><mjx-mo class="mjx-n" space="4"><mjx-c class="mjx-c2208"></mjx-c></mjx-mo><mjx-mi class="mjx-i" space="4"><mjx-c class="mjx-c1D43A TEX-I"></mjx-c></mjx-mi></mjx-mrow></mjx-semantics></mjx-math><mjx-assistive-mml display="inline" role="presentation" unselectable="on"><math alttext="exists ein G" class="ltx_Math" display="inline" ><semantics><mrow><mrow><mo>∃</mo><mi>e</mi></mrow><mo>∈</mo><mi>G</mi></mrow><annotation encoding="application/x-tex">exists ein G</annotation></semantics></math></mjx-assistive-mml></mjx-container>, <mjx-container class="MathJax CtxtMenu_Attached_0" ctxtmenu_counter="16" jax="CHTML" role="presentation" style="font-size: 101.3%; position: relative;" tabindex="0"><mjx-math aria-hidden="true" class="ltx_Math MJX-TEX" id="Ch1.S1.I1.i3.p1.m2"><mjx-semantics><mjx-mrow><mjx-mrow><mjx-mo class="mjx-n"><mjx-c class="mjx-c2200"></mjx-c></mjx-mo><mjx-mi class="mjx-i"><mjx-c class="mjx-c1D465 TEX-I"></mjx-c></mjx-mi></mjx-mrow><mjx-mo class="mjx-n" space="4"><mjx-c class="mjx-c2208"></mjx-c></mjx-mo><mjx-mi class="mjx-i" space="4"><mjx-c class="mjx-c1D43A TEX-I"></mjx-c></mjx-mi></mjx-mrow></mjx-semantics></mjx-math><mjx-assistive-mml display="inline" role="presentation" unselectable="on"><math alttext="forall xin G" class="ltx_Math" display="inline" ><semantics><mrow><mrow><mo>∀</mo><mi>x</mi></mrow><mo>∈</mo><mi>G</mi></mrow><annotation encoding="application/x-tex">forall xin 
G</annotation></semantics></math></mjx-assistive-mml></mjx-container>: <mjx-container class="MathJax CtxtMenu_Attached_0" ctxtmenu_counter="17" jax="CHTML" role="presentation" style="font-size: 101.3%; position: relative;" tabindex="0"><mjx-math aria-hidden="true" class="ltx_Math MJX-TEX" id="Ch1.S1.I1.i3.p1.m3"><mjx-semantics><mjx-mrow><mjx-mrow><mjx-mi class="mjx-i"><mjx-c class="mjx-c1D452 TEX-I"></mjx-c></mjx-mi><mjx-mo class="mjx-n" space="3"><mjx-c class="mjx-c2218"></mjx-c></mjx-mo><mjx-mi class="mjx-i" space="3"><mjx-c class="mjx-c1D465 TEX-I"></mjx-c></mjx-mi></mjx-mrow><mjx-mo class="mjx-n" space="4"><mjx-c class="mjx-c3D"></mjx-c></mjx-mo><mjx-mi class="mjx-i" space="4"><mjx-c class="mjx-c1D465 TEX-I"></mjx-c></mjx-mi><mjx-mo class="mjx-n" space="4"><mjx-c class="mjx-c3D"></mjx-c></mjx-mo><mjx-mrow space="4"><mjx-mi class="mjx-i"><mjx-c class="mjx-c1D465 TEX-I"></mjx-c></mjx-mi><mjx-mo class="mjx-n" space="3"><mjx-c class="mjx-c2218"></mjx-c></mjx-mo><mjx-mi class="mjx-i" space="3"><mjx-c class="mjx-c1D452 TEX-I"></mjx-c></mjx-mi></mjx-mrow></mjx-mrow></mjx-semantics></mjx-math><mjx-assistive-mml display="inline" role="presentation" unselectable="on"><math alttext="ecirc x=x=xcirc e" class="ltx_Math" display="inline" ><semantics><mrow><mrow><mi>e</mi><mo>∘</mo><mi>x</mi></mrow><mo>=</mo><mi>x</mi><mo>=</mo><mrow><mi>x</mi><mo>∘</mo><mi>e</mi></mrow></mrow><annotation encoding="application/x-tex">ecirc x=x=xcirc e</annotation></semantics></math></mjx-assistive-mml></mjx-container>.</p>
</div>
</li>
<li class="ltx_item" id="Ch1.S1.I1.i4">
<div class="ltx_para" id="Ch1.S1.I1.i4.p1">
<p class="ltx_p"><span class="ltx_text ltx_font_bold">Existence of inverses</span>: <mjx-container class="MathJax CtxtMenu_Attached_0" ctxtmenu_counter="18" jax="CHTML" role="presentation" style="font-size: 101.1%; position: relative;" tabindex="0"><mjx-math aria-hidden="true" class="ltx_Math MJX-TEX" id="Ch1.S1.I1.i4.p1.m1"><mjx-semantics><mjx-mrow><mjx-mrow><mjx-mo class="mjx-n"><mjx-c class="mjx-c2200"></mjx-c></mjx-mo><mjx-mi class="mjx-i"><mjx-c class="mjx-c1D465 TEX-I"></mjx-c></mjx-mi></mjx-mrow><mjx-mo class="mjx-n" space="4"><mjx-c class="mjx-c2208"></mjx-c></mjx-mo><mjx-mi class="mjx-i" space="4"><mjx-c class="mjx-c1D43A TEX-I"></mjx-c></mjx-mi></mjx-mrow></mjx-semantics></mjx-math><mjx-assistive-mml display="inline" role="presentation" unselectable="on"><math alttext="forall xin G" class="ltx_Math" display="inline" ><semantics><mrow><mrow><mo>∀</mo><mi>x</mi></mrow><mo>∈</mo><mi>G</mi></mrow><annotation encoding="application/x-tex">forall xin G</annotation></semantics></math></mjx-assistive-mml></mjx-container>, <mjx-container class="MathJax CtxtMenu_Attached_0" ctxtmenu_counter="19" jax="CHTML" role="presentation" style="font-size: 101.1%; position: relative;" tabindex="0"><mjx-math aria-hidden="true" class="ltx_Math MJX-TEX" id="Ch1.S1.I1.i4.p1.m2"><mjx-semantics><mjx-mrow><mjx-mrow><mjx-mo class="mjx-n"><mjx-c class="mjx-c2203"></mjx-c></mjx-mo><mjx-mi class="mjx-i"><mjx-c class="mjx-c1D466 TEX-I"></mjx-c></mjx-mi></mjx-mrow><mjx-mo class="mjx-n" space="4"><mjx-c class="mjx-c2208"></mjx-c></mjx-mo><mjx-mi class="mjx-i" space="4"><mjx-c class="mjx-c1D43A TEX-I"></mjx-c></mjx-mi></mjx-mrow></mjx-semantics></mjx-math><mjx-assistive-mml display="inline" role="presentation" unselectable="on"><math alttext="exists yin G" class="ltx_Math" display="inline" ><semantics><mrow><mrow><mo>∃</mo><mi>y</mi></mrow><mo>∈</mo><mi>G</mi></mrow><annotation encoding="application/x-tex">exists yin G</annotation></semantics></math></mjx-assistive-mml></mjx-container>: 
<mjx-container class="MathJax CtxtMenu_Attached_0" ctxtmenu_counter="20" jax="CHTML" role="presentation" style="font-size: 101.1%; position: relative;" tabindex="0"><mjx-math aria-hidden="true" class="ltx_Math MJX-TEX" id="Ch1.S1.I1.i4.p1.m3"><mjx-semantics><mjx-mrow><mjx-mrow><mjx-mi class="mjx-i"><mjx-c class="mjx-c1D465 TEX-I"></mjx-c></mjx-mi><mjx-mo class="mjx-n" space="3"><mjx-c class="mjx-c2218"></mjx-c></mjx-mo><mjx-mi class="mjx-i" space="3"><mjx-c class="mjx-c1D466 TEX-I"></mjx-c></mjx-mi></mjx-mrow><mjx-mo class="mjx-n" space="4"><mjx-c class="mjx-c3D"></mjx-c></mjx-mo><mjx-mi class="mjx-i" space="4"><mjx-c class="mjx-c1D452 TEX-I"></mjx-c></mjx-mi><mjx-mo class="mjx-n" space="4"><mjx-c class="mjx-c3D"></mjx-c></mjx-mo><mjx-mrow space="4"><mjx-mi class="mjx-i"><mjx-c class="mjx-c1D466 TEX-I"></mjx-c></mjx-mi><mjx-mo class="mjx-n" space="3"><mjx-c class="mjx-c2218"></mjx-c></mjx-mo><mjx-mi class="mjx-i" space="3"><mjx-c class="mjx-c1D465 TEX-I"></mjx-c></mjx-mi></mjx-mrow></mjx-mrow></mjx-semantics></mjx-math><mjx-assistive-mml display="inline" role="presentation" unselectable="on"><math alttext="xcirc y=e=ycirc x" class="ltx_Math" display="inline" ><semantics><mrow><mrow><mi>x</mi><mo>∘</mo><mi>y</mi></mrow><mo>=</mo><mi>e</mi><mo>=</mo><mrow><mi>y</mi><mo>∘</mo><mi>x</mi></mrow></mrow><annotation encoding="application/x-tex">xcirc y=e=ycirc x</annotation></semantics></math></mjx-assistive-mml></mjx-container>.
</p>
</div>
</li>
</ol>
</div>
</div>
'''
Running .get_text() on this gives:
Definition 1.1.1 (Groups: First definition).
A group (G,∘)(G,circ) is a non-empty set GG together with a binary
operation ∘circ – called the “group law” – satisfying:
Closure: ∀x,y∈Gforall x,yin G: x∘y∈Gxcirc yin G.
Associativity: ∀x,y,z∈Gforall x,y,zin G: (x∘y)∘z=x∘(y∘z)(xcirc
y)circ z=xcirc(ycirc z).
Existence of identity element: ∃e∈Gexists ein G, ∀x∈Gforall xin G:
e∘x=x=x∘eecirc x=x=xcirc e.
Existence of inverses: ∀x∈Gforall xin G, ∃y∈Gexists yin G:
x∘y=e=y∘xxcirc y=e=ycirc x.
Wanted output:
Definition 1.1.1 (Groups: First definition).
A group (G,circ) is a non-empty set G together with a binary
operation circ – called the “group law” – satisfying:
Closure: forall x,yin G: xcirc yin G.
Associativity: forall x,y,zin G: (xcirc
y)circ z=xcirc(ycirc z).
Existence of identity element: exists ein G, forall xin G:
ecirc x=x=xcirc e.
Existence of inverses: forall xin G, exists yin G:
xcirc y=e=ycirc x.
Initially, I saved the text to a file and then edited it manually. I'm looking for a more elegant solution; this is only the second time I've used the bs4 module, and I don't know what to look for in the documentation.
Just to clarify, can you edit your question and add a couple more examples of your expected output?
If you can use lxml instead of beautifulsoup, this can probably be done.