her.esy.fun/src/Scratch/en/blog/2010-02-16-All-but-something-regexp--2-/index.html

162 lines
12 KiB
HTML
Raw Normal View History

2021-04-18 10:23:24 +00:00
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>YBlog - Pragmatic Regular Expression Exclude (2)</title>
<meta name="keywords" content="regexp, regular expression" />
<link rel="shortcut icon" type="image/x-icon" href="../../../../Scratch/img/favicon.ico" />
2021-05-25 20:25:47 +00:00
<link rel="stylesheet" type="text/css" href="/css/y.css" />
<link rel="stylesheet" type="text/css" href="/css/legacy.css" />
<link rel="alternate" type="application/rss+xml" title="RSS" href="/rss.xml" />
2021-04-18 10:23:24 +00:00
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<link rel="apple-touch-icon" href="../../../../Scratch/img/about/FlatAvatar@2x.png" />
<!--[if lt IE 9]>
<script src="http://ie7-js.googlecode.com/svn/version/2.1(beta4)/IE9.js"></script>
<![endif]-->
<!-- IndieAuth -->
<link href="https://twitter.com/yogsototh" rel="me">
<link href="https://github.com/yogsototh" rel="me">
<link href="mailto:yann.esposito@gmail.com" rel="me">
<link rel="pgpkey" href="../../../../pubkey.txt">
</head>
<body lang="en" class="article">
<div id="content">
<div id="header">
<div id="choix">
<span id="choixlang">
<a href="../../../../Scratch/fr/blog/2010-02-16-All-but-something-regexp--2-/">French</a>
</span>
<span class="tomenu"><a href="#navigation">↓ Menu ↓</a></span>
<span class="flush"></span>
</div>
</div>
<div id="titre">
<h1>Pragmatic Regular Expression Exclude (2)</h1>
</div>
<div class="flush"></div>
<div id="afterheader" class="article">
<div class="corps">
<p>In my <a href="../../../../Scratch/en/blog/2010-02-15-All-but-something-regexp">previous post</a> I had given some trick to match all except something. On the same idea, the trick to match the smallest possible string. Say you want to match the string between a and b, for example, you want to match:</p>
<pre>
a.....<strong class="blue">a......b</strong>..b..a....<strong class="blue">a....b</strong>...
</pre>
<p>Here are two common errors and a solution:</p>
<pre>
/a.*b/
<strong class="red">a.....a......b..b..a....a....b</strong>...
</pre>
<p>The first error is to use the <em>evil</em> <code>.*</code>. Because you will match from the first to the last.</p>
<pre>
/a.*?b/
<strong class="red">a.....a......b</strong>..b..<strong class="red">a....a....b</strong>...
</pre>
<p>The next natural way, is to change the <em>greediness</em>. But it is not enough as you will match from the first <code>a</code> to the first <code>b</code>. Then a simple constatation is that our matching string shouldnt contain any <code>a</code> nor <code>b</code>. Which lead to the last elegant solution.</p>
<pre>
/a[^ab]*b/
a.....<strong class="blue">a......b</strong>..b..a....<strong class="blue">a....b</strong>...
</pre>
<p>Until now, that was, easy. Now, just pass at the case you need to match not between <code>a</code> and <code>b</code>, but between strings. For example:</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode html"><code class="sourceCode html"><a class="sourceLine" id="cb1-1" title="1"><span class="kw">&lt;li&gt;</span>...<span class="kw">&lt;li&gt;</span></a></code></pre></div>
<p>This is a bit difficult. You need to match</p>
<div class="sourceCode" id="cb2"><pre class="sourceCode html"><code class="sourceCode html"><a class="sourceLine" id="cb2-1" title="1"><span class="kw">&lt;li&gt;</span>[anything not containing <span class="kw">&lt;li&gt;</span>]<span class="kw">&lt;/li&gt;</span></a></code></pre></div>
<p>The first method would be to use the same reasoning as in my <a href="../../../../Scratch/en/blog/2010-02-15-All-but-something-regexp">previous post</a>. Here is a first try:</p>
<div class="sourceCode" id="cb3"><pre class="sourceCode perl"><code class="sourceCode perl"><a class="sourceLine" id="cb3-1" title="1">&lt;li&gt;([^&lt;]|&lt;[^l]|&lt;l[^i]|&lt;li[^&gt;])<span class="dt">*&lt;</span><span class="kw">/</span><span class="ot">li&gt;</span></a></code></pre></div>
<p>But what about the following string:</p>
<div class="sourceCode" id="cb4"><pre class="sourceCode html"><code class="sourceCode html"><a class="sourceLine" id="cb4-1" title="1"><span class="kw">&lt;li&gt;</span>...<span class="kw">&lt;li</span><span class="er">&lt;/li</span><span class="kw">&gt;</span></a></code></pre></div>
<p>That string should not match. This is why if we really want to match it correctly<sup><a href="#note1"></a></sup> we need to add:</p>
<div class="sourceCode" id="cb5"><pre class="sourceCode perl"><code class="sourceCode perl"><a class="sourceLine" id="cb5-1" title="1">&lt;li&gt;([^&lt;]|&lt;[^l]|&lt;l[^i]|&lt;li[^&gt;])*(|&lt;|&lt;l|&lt;li)&lt;/li&gt;</a></code></pre></div>
<p>Yes a bit complicated. But what if the string I wanted to match was even longer?</p>
<p>Here is the algorithm way to handle this easily. You reduce the problem to the first one letter matching:</p>
<div class="sourceCode" id="cb6"><pre class="sourceCode perl"><code class="sourceCode perl"><a class="sourceLine" id="cb6-1" title="1"><span class="co"># transform a simple randomly choosen character</span></a>
<a class="sourceLine" id="cb6-2" title="2"><span class="co"># to an unique ID</span></a>
<a class="sourceLine" id="cb6-3" title="3"><span class="co"># (you should verify the identifier is REALLY unique)</span></a>
<a class="sourceLine" id="cb6-4" title="4"><span class="co"># beware the unique ID must not contain the</span></a>
<a class="sourceLine" id="cb6-5" title="5"><span class="co"># choosen character</span></a>
<a class="sourceLine" id="cb6-6" title="6"><span class="kw">s/</span><span class="ot">X</span><span class="kw">/</span><span class="st">_was_x_</span><span class="kw">/g</span></a>
<a class="sourceLine" id="cb6-7" title="7"><span class="kw">s/</span><span class="ot">Y</span><span class="kw">/</span><span class="st">_was_y_</span><span class="kw">/g</span></a>
<a class="sourceLine" id="cb6-8" title="8"></a>
<a class="sourceLine" id="cb6-9" title="9"><span class="co"># transform the long string in this simple character</span></a>
<a class="sourceLine" id="cb6-10" title="10"><span class="kw">s/</span><span class="ot">&lt;li&gt;</span><span class="kw">/</span><span class="st">X</span><span class="kw">/g</span></a>
<a class="sourceLine" id="cb6-11" title="11"><span class="kw">s/</span><span class="ot">&lt;\/li&gt;</span><span class="kw">/</span><span class="st">Y</span><span class="kw">/g</span></a>
<a class="sourceLine" id="cb6-12" title="12"></a>
<a class="sourceLine" id="cb6-13" title="13"><span class="co"># use the first method</span></a>
<a class="sourceLine" id="cb6-14" title="14"><span class="kw">s/</span><span class="ot">X</span><span class="ch">([^</span><span class="bn">X</span><span class="ch">]*)</span><span class="ot">Y</span><span class="kw">//g</span></a>
<a class="sourceLine" id="cb6-15" title="15"></a>
<a class="sourceLine" id="cb6-16" title="16"><span class="co"># retransform choosen letter by string</span></a>
<a class="sourceLine" id="cb6-17" title="17"><span class="kw">s/</span><span class="ot">X</span><span class="kw">/</span><span class="st">&lt;li&gt;</span><span class="kw">/g</span></a>
<a class="sourceLine" id="cb6-18" title="18"><span class="kw">s/</span><span class="ot">Y</span><span class="kw">/</span><span class="st">&lt;\/li&gt;</span><span class="kw">/g</span></a>
<a class="sourceLine" id="cb6-19" title="19"></a>
<a class="sourceLine" id="cb6-20" title="20"><span class="co"># retransform the choosen character back</span></a>
<a class="sourceLine" id="cb6-21" title="21"><span class="kw">s/</span><span class="ot">_was_x_</span><span class="kw">/</span><span class="st">X</span><span class="kw">/g</span></a>
<a class="sourceLine" id="cb6-22" title="22"><span class="kw">s/</span><span class="ot">_was_y_</span><span class="kw">/</span><span class="st">Y</span><span class="kw">/g</span></a></code></pre></div>
<p>And it works in only 9 lines for any beginning and ending string. This solution should look less <em>I AM THE GREAT REGEXP M45T3R, URAN00B</em>, but is more convenient in my humble opinion. Further more, using this last solution prove you master regexp, because you know it is difficult to manage such problems with only a regexp.</p>
<hr />
<p><small><a name="note1"><sup></sup></a> I know I used an HTML syntax example, but in my real life usage, I needed to match between <code>en:</code> and <code>::</code>. And sometimes the string could finish with <code>e::</code>.</small></p>
</div>
<div id="afterarticle">
<div id="social">
2021-05-25 20:25:47 +00:00
<a href="/rss.xml" target="_blank" rel="noopener noreferrer nofollow" class="social">RSS</a>
2021-04-18 10:23:24 +00:00
·
<a href="https://twitter.com/home?status=http%3A%2F%2Fyannesposito.com/Scratch/en/blog/2010-02-16-All-but-something-regexp--2-/%20via%20@yogsototh" target="_blank" rel="noopener noreferrer nofollow" class="social">Tweet</a>
·
<a href="http://www.facebook.com/sharer/sharer.php?u=http%3A%2F%2Fyannesposito.com/Scratch/en/blog/2010-02-16-All-but-something-regexp--2-/" target="_blank" rel="noopener noreferrer nofollow" class="social">FB</a>
<br />
<a class="message" href="../../../../Scratch/en/blog/Social-link-the-right-way/">These social sharing links preserve your privacy</a>
</div>
<div id="navigation">
<a href="../../../../">Home</a>
<span class="sep">¦</span>
<a href="../../../../Scratch/en/blog">Blog</a>
<span class="sep">¦</span>
<a href="../../../../Scratch/en/softwares">Softwares</a>
<span class="sep">¦</span>
<a href="../../../../Scratch/en/about">About</a>
</div>
<div id="totop"><a href="#header">↑ Top ↑</a></div>
<div id="bottom">
<div>
Published on 2010-02-16
</div>
<div>
<a href="https://twitter.com/yogsototh">Follow @yogsototh</a>
</div>
<div>
<a rel="license" href="http://creativecommons.org/licenses/by/3.0/deed.en_US">Yann Esposito©</a>
</div>
<div>
Done with
<a href="http://www.vim.org" target="_blank" rel="noopener noreferrer nofollow"><strike>Vim</strike></a>
<a href="http://spacemacs.org" target="_blank" rel="noopener noreferrer nofollow">spacemacs</a>
<span class="pala">&amp;</span>
<a href="http://nanoc.ws" target="_blank" rel="noopener noreferrer nofollow"><strike>nanoc</strike></a>
<a href="http://jaspervdj.be/hakyll" target="_blank" rel="noopener noreferrer nofollow">Hakyll</a>
</div>
<hr />
<div style="max-width: 100%">
<a href="https://cardanohub.org">
<img src="../../../../Scratch/img/ada-logo.png" class="simple" style="height: 16px;
border-radius: 50%;
vertical-align:middle;
display:inline-block;" />
ADA:
</a>
<code style="display:inline-block;
word-wrap:break-word;
text-align: left;
vertical-align: top;
max-width: 85%;">
DdzFFzCqrhtAvdkmATx5Fm8NPJViDy85ZBw13p4XcNzVzvQg8e3vWLXq23JQWFxPEXK6Kvhaxxe7oJt4VMYHxpA2vtCFiP8fziohN6Yp
</code>
</div>
</div>
</div>
</div>
</div>
</body>
</html>