After a warning: searching the source text of an entire WordPress website for a word


Yesterday I once again had a problem that looked unsolvable at first glance: a customer had received a warning from a lawyer stating that he is no longer allowed to use a certain word on his website.

Have you received a warning because you used a word several times on your WordPress blog that you are not allowed to use? Then I have instructions for you on how to solve the problem.


Now you might think: no problem, WordPress has a search function for pages and posts. That is true, but unfortunately this search does not cover sliders, meta tags and similar content that plugins manage in the database. Content that is hard-coded in the sidebar, in the footer or in any of the theme's PHP files is not covered either.

Important note: I am not offering legal advice of any kind. For that, please contact a lawyer. My guide is only a technical solution for finding a specific word in a WordPress blog that you want to remove or change.

Google search only partially helpful

A Google search with a query à la "site:www.sir-apfelot.de bad word" is unfortunately of little help, as it only returns indexed data that is at least a few days old. And there is no kind of Google search that guarantees you have not overlooked anything after making changes to the website. For that you would have to wait until the Googlebot has re-indexed all pages and then search again.

If you want to search a website with Google, you work with the "site:" operator. Since I can't name the customer's website, I searched the Sir Apfelot blog for "warning" here.


Unfortunately, letters from attorneys rarely leave you enough time to get Google to re-index the site and then check the pages again with a Google search. So another way of searching the website has to be found.

The second challenge: if you make the same mistake again after receiving a warning, it usually gets really expensive. For this reason I have to make doubly sure that the term is not overlooked somewhere on a subpage.

Offline reader SiteSucker loads all sub-pages onto the Mac

My plan was to first download the entire website, including all subpages, to my Mac and then let BBEdit search through it with its multi-file search.

However, if a website is based on a CMS like WordPress, you cannot simply download the pages via FTP, as these are dynamically built from template files and content from the database.

Help comes from a free tool called "SiteSucker" (App Store link), which runs on the Mac and lets you save a complete website with all subpages, graphics and other files. SiteSucker was originally intended as a kind of offline reader: you load the content of a website onto your Mac, and the program rewrites the URLs so that you can browse the site locally without a connection. That used to be useful on vacation, but nowadays you have WiFi almost everywhere and rarely need such programs.

For my purpose, SiteSucker is the perfect tool, because I want the entire website to be available locally so that I can search through the source code.


Sensible settings in SiteSucker

Since many WordPress sites are now equipped with security plugins that block automated requests, you should configure SiteSucker to always wait 3 to 5 seconds between page loads so that no plugin is triggered. Many hosters also run server-side firewalls that detect when a bot sends several requests per second.

If you let SiteSucker loose on a website without restrictions, there is a good chance that your own IP will be blocked and you will not be able to access the website for a few minutes.

For the delay in SiteSucker, I would recommend a value between 2 and 5 seconds so that no IP block is triggered by the hoster's firewall.
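The same idea of pausing between requests can be sketched in a few lines of Python, in case you ever fetch a list of pages with a script instead. This is only a minimal sketch and says nothing about how SiteSucker works internally; the names `fetch_url` and `fetch_politely` and the default of 3 seconds are assumptions made for this example.

```python
import time
import urllib.request
from typing import Callable, Dict, Iterable


def fetch_url(url: str) -> str:
    """Download one page and return its HTML source as text."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_politely(
    urls: Iterable[str],
    delay_seconds: float = 3.0,  # assumed default, taken from the 2 to 5 second range
    fetch: Callable[[str], str] = fetch_url,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Fetch the URLs one after another, pausing before every request except the first."""
    pages: Dict[str, str] = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait so a firewall does not see a burst of requests
        pages[url] = fetch(url)
    return pages
```

Calling `fetch_politely(["https://example.com/", "https://example.com/about/"])` would load the two pages with a 3 second pause in between. The `fetch` and `sleep` parameters exist only so the function can be tried out without touching the network.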


You can also limit the requests by excluding - in my case - files of the type JS, JPEG, GIF, PNG and CSS. This should actually be very easy to do via the settings, but it didn't work for me.

In the end I used a few regular expressions, which can also be used to exclude files and URLs. If you want to do that too, you can find the screenshot with the necessary entries here:

Files with certain extensions in the file name can be filtered out very easily with regular expressions.


Now click the start button and watch SiteSucker work its way through the website and save all files one by one. You can also quickly see whether the regular expressions (regex) are working correctly, because all files are displayed live in the list. If you see JPG files there, something in the filter did not work.

A list shows how the individual SiteSucker processes work and which files are loaded.
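To get a feeling for what such an exclusion pattern does, here is a minimal sketch in Python. The pattern is an assumption written for this example and is not copied from the SiteSucker screenshot; the regex dialect SiteSucker accepts may differ in details. The function name `should_skip` is hypothetical as well.

```python
import re

# Hypothetical pattern: matches URLs that end in one of the extensions
# to skip, optionally followed by a query string or fragment.
EXCLUDED = re.compile(r"\.(?:js|jpe?g|gif|png|css)(?:[?#].*)?$", re.IGNORECASE)


def should_skip(url: str) -> bool:
    """Return True if the URL points to a file type that is not needed for the text search."""
    return EXCLUDED.search(url) is not None
```

With this pattern, `should_skip("https://example.com/style.css?ver=5.8")` is true, while a normal page URL such as `https://example.com/about/` is kept. Trying a pattern on a handful of URLs like this is a quick check before starting a long download.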


Searching the website folder with BBEdit

Once SiteSucker is done, you have a folder with all the HTML files in which the "bad word" could be hiding. I searched these with BBEdit's multi-file search, because it opens the files and looks for the word in the source text.

I couldn't test whether Spotlight would work here too, since my Spotlight has been acting up for weeks (a clean install is due in the next few days!). But it worked with BBEdit without any problems, and I think that, in principle, Spotlight can also handle HTML content. The only question is whether it would also find words in the source text (image tags, etc.).

With BBEdit's multi-file search function, entire folders can be searched for occurrences of a word. Important: Do not check Case Sensitive.
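If BBEdit is not at hand, a case-insensitive search across a downloaded folder can be sketched in Python roughly like this. It is a minimal sketch, not a replacement for BBEdit: the function name `search_folder` is hypothetical, and the code assumes the saved pages have file names containing `.htm` and are readable as UTF-8.

```python
import re
from pathlib import Path
from typing import List, Tuple


def search_folder(folder: str, term: str) -> List[Tuple[str, int, str]]:
    """Return (file path, line number, line) for every line containing the term, ignoring case."""
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    hits: List[Tuple[str, int, str]] = []
    # Walk all subfolders and look at every file whose name contains ".htm"
    for path in sorted(Path(folder).rglob("*.htm*")):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                hits.append((str(path), number, line.strip()))
    return hits
```

Because the whole source line is searched, occurrences inside attributes such as `alt` or `title` are found as well, which is exactly the point of searching the source text instead of the rendered page.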


When searching in HTML files you have to keep in mind that you will miss words with umlauts on some websites because the HTML special character may have been used for the corresponding umlaut (see SelfHTML).

An example: instead of "Schmörebröt", the source text may contain "Schm&ouml;rebr&ouml;t".

If you know that, you can adjust the search accordingly and should then find all occurrences. With this list of hits I then went to the WordPress admin to clean up all the pages.
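Instead of searching for every possible spelling, a script can decode the HTML entities first and then search once. The following minimal sketch uses `html.unescape` from the Python standard library; the function name `contains_term` is made up for this example.

```python
import html
import re


def contains_term(source: str, term: str) -> bool:
    """Check for the term after decoding HTML entities, ignoring case."""
    decoded = html.unescape(source)  # turns "&ouml;" or "&#246;" into "ö"
    return re.search(re.escape(term), decoded, re.IGNORECASE) is not None
```

This way `contains_term("<p>Schm&ouml;rebr&ouml;t</p>", "Schmörebröt")` reports a hit, whether the page uses named entities, numeric entities or the plain umlaut.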

The results of BBEdit then serve as a working basis to revise the corresponding pages in WordPress.


Possible pitfalls: image file names and image content

I only noticed two further potential problem areas later: the word in question can also be hidden in image file names or even in the image itself. The customer couldn't tell me whether merely naming a file with the problematic term would be enough to cause trouble again, so we decided to defuse everything.

I also searched for files with the corresponding name locally, in the "Uploads" folder of WordPress. Because of my defective Spotlight, I used the tool Find Any File by Thomas Tempelmann for this, which completed the job in a fraction of a second.

Text that is embedded in a graphic and cannot be found using the text search is just as likely to draw a warning as "real" text.
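A file name search like this can also be sketched with Python's standard library. This is only an illustration of the idea and not how Find Any File works; the function name `find_files_by_name` is hypothetical, and the folder is assumed to be a local copy of the uploads directory.

```python
from pathlib import Path
from typing import List


def find_files_by_name(folder: str, term: str) -> List[str]:
    """Return paths of all files below the folder whose file name contains the term, ignoring case."""
    needle = term.casefold()
    return sorted(
        str(path)
        for path in Path(folder).rglob("*")
        if path.is_file() and needle in path.name.casefold()
    )
```

Since WordPress stores generated thumbnails next to the original upload under similar names, a search like this usually lists those smaller versions too.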


The last open issue is the search in image content, meaning photographs, banners and the like in which the word has been incorporated by image editing. These occurrences must be removed as well. However, no tool I am familiar with helps here, and you simply have to page through the graphics by hand in Preview.

Under certain circumstances, the Google image search can also be helpful if you want to check the photos and graphics of a single website.


I then "cleaned up" the problematic graphics and photographs in Photoshop and uploaded them to the server again via FTP. Since I didn't want to revise every thumbnail, I only revised and replaced the "large" image versions and then had all thumbnails regenerated with the plugin "Force Regenerate Thumbnails" by Pedro Elsner.

Conclusion: a lot of work, handled with reasonably manageable effort

On the whole, everything could be resolved in a reasonable time despite the many subpages and graphics. If you are struggling with similar problems and don't know how you can solve a certain task semi-automatically, write a short comment or email me directly. Maybe I can help you!

 

 


10 comments

  1. Enc says:

    Why not just search the MySQL database for the term with phpMyAdmin? At least with self-hosted WordPress sites that shouldn't be a problem ... or am I missing something?

    • Sir Apfelot says:

      In principle, this is also an approach, but you won't find any image files that have been uploaded but are no longer linked in a post. The Google image search and the opposing lawyer may still find them. ;-)
      And I have some customers who still have themes where my predecessors programmed menus, small message boxes or changing headers directly into the theme code. Those would slip through the net that way.

  2. Kenneth says:

    Thanks for the hint with the picture files.

    It would also have been interesting to know which word the warning was about (but you are probably no longer allowed to write it). There may still be a danger in the comments.

    I wish the warning lawyer three months of constipation, his clients three months of diarrhea and flatulence.

    • Sir Apfelot says:

      Hello Kenneth! That was something very special: someone who installs insulated sandwich profile sheets as a roof is not allowed to speak of "roofing", since this term is reserved for roofers who also lay roof tiles. Don't ask me for details. :D
      And yes, theoretically there is also a danger in comments, but on the one hand my way of searching also finds mentions in comments, and on the other hand I always switch off the comment function on company websites. So in that respect there was no danger in the current case. And as for your wishes: yes, I wish him that along with you! :D

  3. Peter says:

    I actually find the reference to the database also useful when it comes to WordPress. Theoretically it is certainly possible that text strings are "drawn" from it before / for the delivery of the page (s), which then only appear in the source text of the pages as they are displayed by the visitor's browser, right? (Which are not in any PHP files of the WordPress installation on the server.)

    I recently had the case that, bypassing WordFence, some hack smuggled malware-suspicious URLs into the "Description" fields of the site graphics. This information should ONLY be in the database for the time being (i.e. not written into the image file) and from there be inserted on the server into the page source text that the visitor receives. Or am I wrong?

    • Sir Apfelot says:

      Hi Peter! Yes, of course you can also search the database, but there you will also get hits in revisions and in other places. In my opinion, the "Description" field is just a piece of text in the media library that helps you manage it. It is not displayed in the frontend with a picture. For this reason, malware URLs should not work there either. But if a hacker can place such URLs anywhere, caution is always advised. After all, he has somehow already gained access to your database or at least part of it. In such cases I always check whether a new user has suddenly appeared in the "Users" area. That has happened to me several times.

  4. Peter says:

    Indeed: Caution is advised ... However: there were actually no new users who were not set up by myself and the existing ones (except me) all only have the subscriber role, in which they are not allowed to have access to such things. There is also no comment function active on the site. The HOW of this code infiltration is therefore a great mystery to me. Maybe via the contact form?

    Somehow the URL in question must have made it into the frontend, even if not immediately visible. At least Chrome was able to detect its presence on all pages that contained one of the affected graphics and (though not for all visitors to the page) issue a corresponding warning. On the other hand, Firefox strangely did not produce any matches when I searched for the URL in the source text as it arrived there. Very strange indeed!

    I had only marginally noticed the "Description" field for pictures in the media library before and never used it (i.e. it was always left blank). However, if I understood it correctly during my research, the content should be readable in the frontend if the picture links to itself and you then open it with a click. The "Description" text should then appear on the attachment page, displayed together with the picture (or something like that ...).

    • Sir Apfelot says:

      The malware usually comes in via some kind of plugin. Funnily enough, now and then even via security holes in Wordfence, although the plugin is supposed to protect against exactly that. I've read about it several times, and on two or three sites I was lucky enough to be able to trace the hack back to the plugin by the date of the file changes. I haven't used Wordfence since then. :D

      As for the description text: yes, maybe some function uses it. But when the image is shown large, most themes display the caption for the image. You never stop learning, though. Maybe I am wrong!

  5. Alex says:

    Interesting article, thank you very much. Was the work really successful and did the opposing lawyer keep quiet afterwards, or was there a follow-up?
    How much effort did it take until everything was really cleaned up?

    • Sir Apfelot says:

      Hello Alex! Yes, it was actually successful. Nothing came back from the lawyer, but I'm sure they checked. The second offense is always much more lucrative for the other side. I can no longer estimate the effort exactly, but I think it was easily half a day or more.
