her.esy.fun/src/Scratch/en/blog/2010-02-16-All-but-something-regexp--2-/index.html
Yann Esposito (Yogsototh) 03610908ce
Old site match new style
2021-05-25 22:25:47 +02:00

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>YBlog - Pragmatic Regular Expression Exclude (2)</title>
<meta name="keywords" content="regexp, regular expression" />
<link rel="shortcut icon" type="image/x-icon" href="../../../../Scratch/img/favicon.ico" />
<link rel="stylesheet" type="text/css" href="/css/y.css" />
<link rel="stylesheet" type="text/css" href="/css/legacy.css" />
<link rel="alternate" type="application/rss+xml" title="RSS" href="/rss.xml" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<link rel="apple-touch-icon" href="../../../../Scratch/img/about/FlatAvatar@2x.png" />
<!--[if lt IE 9]>
<script src="http://ie7-js.googlecode.com/svn/version/2.1(beta4)/IE9.js"></script>
<![endif]-->
<!-- IndieAuth -->
<link href="https://twitter.com/yogsototh" rel="me">
<link href="https://github.com/yogsototh" rel="me">
<link href="mailto:yann.esposito@gmail.com" rel="me">
<link rel="pgpkey" href="../../../../pubkey.txt">
</head>
<body lang="en" class="article">
<div id="content">
<div id="header">
<div id="choix">
<span id="choixlang">
<a href="../../../../Scratch/fr/blog/2010-02-16-All-but-something-regexp--2-/">French</a>
</span>
<span class="tomenu"><a href="#navigation">↓ Menu ↓</a></span>
<span class="flush"></span>
</div>
</div>
<div id="titre">
<h1>Pragmatic Regular Expression Exclude (2)</h1>
</div>
<div class="flush"></div>
<div id="afterheader" class="article">
<div class="corps">
<p>In my <a href="../../../../Scratch/en/blog/2010-02-15-All-but-something-regexp">previous post</a> I gave a trick to match everything except something. In the same spirit, here is a trick to match the smallest possible string. Say you want to match the text between an <code>a</code> and a <code>b</code>; for example, you want to match:</p>
<pre>
a.....<strong class="blue">a......b</strong>..b..a....<strong class="blue">a....b</strong>...
</pre>
<p>Here are two common errors and a solution:</p>
<pre>
/a.*b/
<strong class="red">a.....a......b..b..a....a....b</strong>...
</pre>
<p>The first error is to use the <em>evil</em> <code>.*</code>. Because it is greedy, you will match from the first <code>a</code> to the last <code>b</code>.</p>
<pre>
/a.*?b/
<strong class="red">a.....a......b</strong>..b..<strong class="red">a....a....b</strong>...
</pre>
<p>The next natural idea is to change the <em>greediness</em>. But that is not enough, as you will match from the first <code>a</code> to the first <code>b</code>. A simple observation is that the text between the two ends should not contain any <code>a</code> or <code>b</code>, which leads to the last, elegant solution.</p>
<pre>
/a[^ab]*b/
a.....<strong class="blue">a......b</strong>..b..a....<strong class="blue">a....b</strong>...
</pre>
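<p>As a quick check, here is a minimal sketch that runs the three patterns above on the example string. Python is used only for illustration (the patterns themselves are not tied to a language), and the variable names are hypothetical.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">import re

# The example string used above.
text = "a.....a......b..b..a....a....b..."

# Greedy: one match, from the first a to the last b.
greedy = re.findall(r"a.*b", text)

# Lazy: each match still starts at the first available a.
lazy = re.findall(r"a.*?b", text)

# Exclusion class: no a and no b allowed between the two ends.
exclusive = re.findall(r"a[^ab]*b", text)

print(greedy)     # ['a.....a......b..b..a....a....b']
print(lazy)       # ['a.....a......b', 'a....a....b']
print(exclusive)  # ['a......b', 'a....b']
</code></pre></div>
<p>Only the last pattern returns the two short strings highlighted in blue above.</p>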
<p>Until now, that was easy. Now consider the case where you need to match not between the characters <code>a</code> and <code>b</code>, but between strings. For example:</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode html"><code class="sourceCode html"><a class="sourceLine" id="cb1-1" title="1"><span class="kw">&lt;li&gt;</span>...<span class="kw">&lt;/li&gt;</span></a></code></pre></div>
<p>This is a bit more difficult. You need to match:</p>
<div class="sourceCode" id="cb2"><pre class="sourceCode html"><code class="sourceCode html"><a class="sourceLine" id="cb2-1" title="1"><span class="kw">&lt;li&gt;</span>[anything not containing <span class="kw">&lt;li&gt;</span>]<span class="kw">&lt;/li&gt;</span></a></code></pre></div>
<p>The first method would be to use the same reasoning as in my <a href="../../../../Scratch/en/blog/2010-02-15-All-but-something-regexp">previous post</a>. Here is a first try:</p>
<div class="sourceCode" id="cb3"><pre class="sourceCode perl"><code class="sourceCode perl"><a class="sourceLine" id="cb3-1" title="1">&lt;li&gt;([^&lt;]|&lt;[^l]|&lt;l[^i]|&lt;li[^&gt;])<span class="dt">*&lt;</span><span class="kw">/</span><span class="ot">li&gt;</span></a></code></pre></div>
<p>But what about the following string:</p>
<div class="sourceCode" id="cb4"><pre class="sourceCode html"><code class="sourceCode html"><a class="sourceLine" id="cb4-1" title="1"><span class="kw">&lt;li&gt;</span>...<span class="kw">&lt;li</span><span class="er">&lt;/li</span><span class="kw">&gt;</span></a></code></pre></div>
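<p>That string should match, since the text between the tags contains no <code>&lt;li&gt;</code>, but the first try rejects it: the trailing <code>&lt;li</code> is consumed together with the <code>&lt;</code> that starts the closing tag. This is why, if we really want to match it correctly<sup><a href="#note1"></a></sup>, we need to allow a partial prefix just before the closing tag:</p>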
<div class="sourceCode" id="cb5"><pre class="sourceCode perl"><code class="sourceCode perl"><a class="sourceLine" id="cb5-1" title="1">&lt;li&gt;([^&lt;]|&lt;[^l]|&lt;l[^i]|&lt;li[^&gt;])*(|&lt;|&lt;l|&lt;li)&lt;/li&gt;</a></code></pre></div>
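<p>To see the difference between the two attempts, here is a small sketch that ports both patterns to the delimiters mentioned in the footnote (<code>en:</code> and <code>::</code>). It is written in Python purely for illustration; the helper <code>matched</code> and the sample strings are hypothetical.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">import re

# Same construction as above, with start "en:" and end "::".
first_try = re.compile(r"en:([^e]|e[^n]|en[^:])*::")
with_tail = re.compile(r"en:([^e]|e[^n]|en[^:])*(|e|en)::")


def matched(pattern, text):
    """Return the matched substring, or None when nothing matches."""
    m = pattern.search(text)
    return m.group(0) if m else None


plain = "en:hello::"
ends_with_e = "en:value::"   # content ends with a prefix of "en:"
fooled = "en:a een:b::"      # repeated prefix character before an inner "en:"

print(matched(first_try, plain))        # en:hello::
print(matched(first_try, ends_with_e))  # None
print(matched(with_tail, ends_with_e))  # en:value::
print(matched(with_tail, fooled))       # en:a een:b::
</code></pre></div>
<p>The last line shows that even the longer pattern is only an approximation: a repeated prefix character, as in <code>een:</code>, lets an inner <code>en:</code> slip through.</p>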
<p>Yes, it is a bit complicated. And what if the string I wanted to match were even longer?</p>
<p>Here is an algorithmic way to handle this easily: reduce the problem to the first one, the single-letter matching:</p>
<div class="sourceCode" id="cb6"><pre class="sourceCode perl"><code class="sourceCode perl"><a class="sourceLine" id="cb6-1" title="1"><span class="co"># replace each randomly chosen single character</span></a>
<a class="sourceLine" id="cb6-2" title="2"><span class="co"># with a unique ID</span></a>
<a class="sourceLine" id="cb6-3" title="3"><span class="co"># (you should verify the identifier is REALLY unique)</span></a>
<a class="sourceLine" id="cb6-4" title="4"><span class="co"># beware: the unique ID must not contain the</span></a>
<a class="sourceLine" id="cb6-5" title="5"><span class="co"># chosen character</span></a>
<a class="sourceLine" id="cb6-6" title="6"><span class="kw">s/</span><span class="ot">X</span><span class="kw">/</span><span class="st">_was_x_</span><span class="kw">/g</span></a>
<a class="sourceLine" id="cb6-7" title="7"><span class="kw">s/</span><span class="ot">Y</span><span class="kw">/</span><span class="st">_was_y_</span><span class="kw">/g</span></a>
<a class="sourceLine" id="cb6-8" title="8"></a>
<a class="sourceLine" id="cb6-9" title="9"><span class="co"># replace each long string with its single character</span></a>
<a class="sourceLine" id="cb6-10" title="10"><span class="kw">s/</span><span class="ot">&lt;li&gt;</span><span class="kw">/</span><span class="st">X</span><span class="kw">/g</span></a>
<a class="sourceLine" id="cb6-11" title="11"><span class="kw">s/</span><span class="ot">&lt;\/li&gt;</span><span class="kw">/</span><span class="st">Y</span><span class="kw">/g</span></a>
<a class="sourceLine" id="cb6-12" title="12"></a>
<a class="sourceLine" id="cb6-13" title="13"><span class="co"># use the first method (exclude both characters)</span></a>
<a class="sourceLine" id="cb6-14" title="14"><span class="kw">s/</span><span class="ot">X</span><span class="ch">([^</span><span class="bn">XY</span><span class="ch">]*)</span><span class="ot">Y</span><span class="kw">//g</span></a>
<a class="sourceLine" id="cb6-15" title="15"></a>
<a class="sourceLine" id="cb6-16" title="16"><span class="co"># turn the chosen letters back into the strings</span></a>
<a class="sourceLine" id="cb6-17" title="17"><span class="kw">s/</span><span class="ot">X</span><span class="kw">/</span><span class="st">&lt;li&gt;</span><span class="kw">/g</span></a>
<a class="sourceLine" id="cb6-18" title="18"><span class="kw">s/</span><span class="ot">Y</span><span class="kw">/</span><span class="st">&lt;\/li&gt;</span><span class="kw">/g</span></a>
<a class="sourceLine" id="cb6-19" title="19"></a>
<a class="sourceLine" id="cb6-20" title="20"><span class="co"># restore the originally chosen characters</span></a>
<a class="sourceLine" id="cb6-21" title="21"><span class="kw">s/</span><span class="ot">_was_x_</span><span class="kw">/</span><span class="st">X</span><span class="kw">/g</span></a>
<a class="sourceLine" id="cb6-22" title="22"><span class="kw">s/</span><span class="ot">_was_y_</span><span class="kw">/</span><span class="st">Y</span><span class="kw">/g</span></a></code></pre></div>
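<p>The same steps can be wrapped in one function. The following is only a sketch in Python, under a few assumptions: the name <code>remove_between</code> is hypothetical, the delimiters themselves contain neither <code>X</code> nor <code>Y</code>, and the middle step excludes both placeholders, mirroring <code>a[^ab]*b</code> from the beginning of this post.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">import re

ESC_X = "_was_x_"
ESC_Y = "_was_y_"


def remove_between(text, start, end):
    """Delete every shortest start...end span, using X and Y as placeholders."""
    # The unique IDs must really be unique: refuse to run otherwise.
    if ESC_X in text or ESC_Y in text:
        raise ValueError("escape identifiers already present in the text")
    # Protect the literal X and Y already present in the text.
    text = text.replace("X", ESC_X).replace("Y", ESC_Y)
    # Replace the long delimiters with the single characters.
    text = text.replace(start, "X").replace(end, "Y")
    # Use the first method: nothing between X and Y may be an X or a Y.
    text = re.sub(r"X[^XY]*Y", "", text)
    # Turn the remaining placeholders back into the delimiters.
    text = text.replace("X", start).replace("Y", end)
    # Restore the original X and Y characters.
    return text.replace(ESC_X, "X").replace(ESC_Y, "Y")


sample = "pre|en:one e::|mid X Y|en:two::|post"
print(remove_between(sample, "en:", "::"))  # pre||mid X Y||post
</code></pre></div>
<p>An opening delimiter with no closing one before the next opening delimiter is left untouched, so <code>remove_between("en:a en:b::", "en:", "::")</code> keeps the leading <code>en:a </code>.</p>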
<p>And it works in only 9 lines for any beginning and ending string. This solution may look less <em>I AM THE GREAT REGEXP M45T3R, URAN00B</em>, but it is more convenient in my humble opinion. Furthermore, using this last solution proves you master regexps, because you know how difficult it is to handle such problems with a single regexp.</p>
<hr />
<p><small><a name="note1"><sup></sup></a> I know I used an HTML syntax example, but in my real-life usage I needed to match between <code>en:</code> and <code>::</code>, and sometimes the string could end with <code>e::</code>.</small></p>
</div>
<div id="afterarticle">
<div id="social">
<a href="/rss.xml" target="_blank" rel="noopener noreferrer nofollow" class="social">RSS</a>
·
<a href="https://twitter.com/home?status=http%3A%2F%2Fyannesposito.com/Scratch/en/blog/2010-02-16-All-but-something-regexp--2-/%20via%20@yogsototh" target="_blank" rel="noopener noreferrer nofollow" class="social">Tweet</a>
·
<a href="http://www.facebook.com/sharer/sharer.php?u=http%3A%2F%2Fyannesposito.com/Scratch/en/blog/2010-02-16-All-but-something-regexp--2-/" target="_blank" rel="noopener noreferrer nofollow" class="social">FB</a>
<br />
<a class="message" href="../../../../Scratch/en/blog/Social-link-the-right-way/">These social sharing links preserve your privacy</a>
</div>
<div id="navigation">
<a href="../../../../">Home</a>
<span class="sep">¦</span>
<a href="../../../../Scratch/en/blog">Blog</a>
<span class="sep">¦</span>
<a href="../../../../Scratch/en/softwares">Softwares</a>
<span class="sep">¦</span>
<a href="../../../../Scratch/en/about">About</a>
</div>
<div id="totop"><a href="#header">↑ Top ↑</a></div>
<div id="bottom">
<div>
Published on 2010-02-16
</div>
<div>
<a href="https://twitter.com/yogsototh">Follow @yogsototh</a>
</div>
<div>
<a rel="license" href="http://creativecommons.org/licenses/by/3.0/deed.en_US">Yann Esposito©</a>
</div>
<div>
Done with
<a href="http://www.vim.org" target="_blank" rel="noopener noreferrer nofollow"><strike>Vim</strike></a>
<a href="http://spacemacs.org" target="_blank" rel="noopener noreferrer nofollow">spacemacs</a>
<span class="pala">&amp;</span>
<a href="http://nanoc.ws" target="_blank" rel="noopener noreferrer nofollow"><strike>nanoc</strike></a>
<a href="http://jaspervdj.be/hakyll" target="_blank" rel="noopener noreferrer nofollow">Hakyll</a>
</div>
<hr />
<div style="max-width: 100%">
<a href="https://cardanohub.org">
<img src="../../../../Scratch/img/ada-logo.png" class="simple" style="height: 16px;
border-radius: 50%;
vertical-align:middle;
display:inline-block;" />
ADA:
</a>
<code style="display:inline-block;
word-wrap:break-word;
text-align: left;
vertical-align: top;
max-width: 85%;">
DdzFFzCqrhtAvdkmATx5Fm8NPJViDy85ZBw13p4XcNzVzvQg8e3vWLXq23JQWFxPEXK6Kvhaxxe7oJt4VMYHxpA2vtCFiP8fziohN6Yp
</code>
</div>
</div>
</div>
</div>
</div>
</body>
</html>