Create a Solr filter that replaces diacritics

August 28th, 2007 by Sebastian Mitroi in Java, General

Some languages, such as Romanian, use special characters with diacritics (often called accent marks). It is usually helpful to strip these marks when you build a search index with Solr, because you want a search to find both “proprietăți” and “proprietati”. One way to do this in Solr is to write a custom filter.
First, register the filter in the schema.xml configuration file:

<fieldtype name="text_st" class="solr.TextField" positionIncrementGap="100">
        <analyzer>
                <tokenizer class="solr.StandardTokenizerFactory"/>
                <!-- ... some other filters, for example the lower case filter -->
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="ro.tremend.solr.diacritics.DiacriticsFilterFactory"/>
        </analyzer>
</fieldtype>


Then create three small classes and a properties file. First, the filter factory for Solr, DiacriticsFilterFactory:

package ro.tremend.solr.diacritics;

import org.apache.lucene.analysis.TokenStream;
import org.apache.solr.analysis.BaseTokenFilterFactory;

/**
 * Create a Solr Filter Factory for diacritics
 * @author Sebastian
 */
public class DiacriticsFilterFactory extends BaseTokenFilterFactory {
	public TokenStream create(TokenStream input) {
		return new DiacriticsFilter(input);
	}
}

Next, create the filter class, DiacriticsFilter:

package ro.tremend.solr.diacritics;

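import java.io.IOException;

import org.apache.lucene.analysis.*;

/**
 * Create the diacritics filter
 * @author Sebastian
 */
public final class DiacriticsFilter extends TokenFilter {
	public DiacriticsFilter(TokenStream in) {
		super(in);
	}

	public final Token next() throws IOException {
		// Pull the next token from the wrapped stream (Lucene 2.x Token API)
		Token t = input.next();

		if (t == null)
			return null;

		// Presumably the step lost from this listing: strip diacritics from the term text
		t.setTermText(DiacriticsUtils.replaceDiacritics(t.termText()));
		return t;
	}
}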

Finally, the class that does the actual work, DiacriticsUtils:

package ro.tremend.solr.diacritics;

import java.util.HashMap;
import java.util.Map;
import java.util.MissingResourceException;
import java.util.ResourceBundle;
import java.util.Set;

/**
 * Replace Romanian characters
 * @author Sebastian
 */
public class DiacriticsUtils {
	private static Map<String, String> diacritics = new HashMap<String, String>();

	static {
		// Get diacritics from the "diacritics" resource bundle
		try {
			ResourceBundle resource = ResourceBundle.getBundle("diacritics");
			Set<String> keySet = resource.keySet();
			for (String key : keySet) {
				diacritics.put(key, resource.getString(key));
			}
		} catch (MissingResourceException e) {
			// Bundle not found: the map stays empty and no character is replaced
		}
	}

	/**
	 * Replace all diacritics in a string
	 * @param s the string
	 * @return the string without diacritics
	 */
	public static String replaceDiacritics(String s) {
		for (String key : diacritics.keySet()) {
			// replace() treats the key literally; replaceAll() would parse it as a regex
			s = s.replace(key, diacritics.get(key));
		}
		return s;
	}

	public static Map<String, String> getDiacritics() {
		return diacritics;
	}
}

This class needs a properties file with the diacritics you want to replace:

... define all your language specific characters
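
The entries of the original file are not shown here, so the following is only a sketch of what such mappings could look like. It assumes that each key is a single accented character written as a Unicode escape (which keeps the file valid under the ISO-8859-1 encoding traditionally used for .properties files) and that each value is its plain replacement. The five lowercase Romanian letters and the class name DiacriticsPropertiesDemo are illustrative assumptions, not the author's actual file.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class DiacriticsPropertiesDemo {
    // Hypothetical content of diacritics.properties, one "key=value" entry per line.
    // The doubled backslash is Java string escaping; the file itself would hold a single one.
    static final String SAMPLE =
            "\\u0103=a\n" +   // a with breve
            "\\u00e2=a\n" +   // a with circumflex
            "\\u00ee=i\n" +   // i with circumflex
            "\\u0219=s\n" +   // s with comma below
            "\\u021b=t\n";    // t with comma below

    public static Properties load() {
        Properties props = new Properties();
        try {
            // Properties.load decodes the Unicode escapes into real characters
            props.load(new StringReader(SAMPLE));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = load();
        for (String key : props.stringPropertyNames()) {
            System.out.println(key + " -> " + props.getProperty(key));
        }
    }
}
```

A real file would probably also need the uppercase letters, and possibly the older cedilla forms of s and t (U+015F and U+0163), which still appear in a lot of Romanian text.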


Now the index will not contain diacritics, but you have to remove the diacritics from the query too. To do that just write this:

textToFind = DiacriticsUtils.replaceDiacritics(textToFind);
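
To illustrate why the same replacement has to run on both the indexed text and the query, here is a minimal, self-contained sketch. It does not use the DiacriticsUtils class from this post: the class name QueryNormalizationDemo and the small hard-coded mapping table are assumptions standing in for the properties file, so that the example runs without Solr or Lucene on the classpath.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class QueryNormalizationDemo {
    // Hypothetical stand-in for the mappings loaded from the properties file
    private static final Map<String, String> DIACRITICS = new LinkedHashMap<String, String>();

    static {
        DIACRITICS.put("\u0103", "a"); // a with breve
        DIACRITICS.put("\u00e2", "a"); // a with circumflex
        DIACRITICS.put("\u00ee", "i"); // i with circumflex
        DIACRITICS.put("\u0219", "s"); // s with comma below
        DIACRITICS.put("\u021b", "t"); // t with comma below
    }

    // Same idea as replaceDiacritics in the post: literal replacement of every mapped character
    public static String replaceDiacritics(String s) {
        for (Map.Entry<String, String> e : DIACRITICS.entrySet()) {
            s = s.replace(e.getKey(), e.getValue());
        }
        return s;
    }

    public static void main(String[] args) {
        // What the index-time filter would store for the word with diacritics
        String indexedTerm = replaceDiacritics("propriet\u0103\u021bi");

        // Two ways a user might type the same query
        String queryWithMarks = replaceDiacritics("propriet\u0103\u021bi");
        String queryWithoutMarks = replaceDiacritics("proprietati");

        // After normalization both spellings equal the indexed term
        System.out.println(indexedTerm.equals(queryWithMarks));    // prints true
        System.out.println(indexedTerm.equals(queryWithoutMarks)); // prints true
    }
}
```

In an application, the call would sit right before the query string is handed to Solr, so that whatever the user typed is reduced to the same form the filter wrote into the index.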


I hope this will help.

8 Responses

  1. Hoss Says:

    Unless you have a very specific need to define a custom list of diacritics to remove, the ISOLatin1AccentFilter that comes with Lucene (and the ISOLatin1AccentFilterFactory that comes with Solr) should solve this problem for you without any custom code.

    (ISOLatin1AccentFilter has had a lot of speed improvements in the trunk which should make it into the next version of Solr as well … see LUCENE-871 in apache Jira for more info)

  2. chetan Says:


    Great example.
    I am new to Solr and am trying to understand where I should apply the last step mentioned:
    “Now the index will not contain diacritics, but you have to remove the diacritics from the query too. To do that just write this:

    textToFind = DiacriticsUtils.replaceDiacritics(textToFind);”

    I am building an application where I need to replace an incoming string with another one specific to our taxonomy.
    Please advise.

  3. Catalin Says:

    Even nicer would be to go beyond this accents filter and also use the Romanian stemmer filter from Snowball.

    For example, you search for:
    masini (Romanian word for cars)
    and you also get results for:
    masina (Romanian word for car)


    Keep up the good work.

  4. Alina Says:

    You can also replace diacritics with the base characters using java.text.Normalizer.
    This is useful when you don’t know which diacritics can appear. This code will always extract the base character.

    This is the code:
    // requires: import java.text.Normalizer;
    public String toBase(String sText){
        int iSize = sText.length(), i = 0;
        String sAux = "";

        for (i=0; i < iSize; i++){
            String sLetter = new String(new char[]{sText.charAt(i)});

            // NFD splits an accented letter into its base letter followed by combining marks
            sLetter = Normalizer.normalize(sLetter, Normalizer.Form.NFD);

            // keep only the first character, which is the base letter
            char cLetter = sLetter.charAt(0);
            sAux += "" + cLetter;
        }
        return sAux;
    }

  5. Sebastian Says:

    Thanks for the suggestion.

  6. heba Says:

    I’m trying to add a factory in Solr for tokenizing Arabic text, but I
    receive an error (shown at the end of this message). Can you help me solve this problem, please?

    Here is my code:

    package org.apache.solr.analysis;

    import java.io.Reader;

    import gpl.pierrick.brihaye.aramorph.lucene.ArabicTokenizer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.solr.analysis.BaseTokenizerFactory;

    public class ArabicTokenizerFactory extends BaseTokenizerFactory {
        public TokenStream create(Reader input) {
            return new ArabicTokenizer(input);
        }
    }

    Thanks in advance

    HTTP Status 500 – Severe errors in solr configuration. Check your log
    files for more detailed information on what may be wrong. If you want
    solr to continue after configuration errors, change:
    false in
    java.lang.VerifyError: (class:
    org/apache/solr/analysis/ArabicTokenizerFactory, method: create
    signature: (Ljava/io/Reader;)Lorg/apache/lucene/analysis/TokenStream;)
    Wrong return type in function at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Unknown Source) at
    org.apache.solr.core.Config.findClass( at
    org.apache.solr.core.Config.newInstance( at
    :631) at
    org.apache.solr.schema.IndexSchema.readAnalyzer( at
    org.apache.solr.schema.IndexSchema.access$000( at
    org.apache.solr.schema.IndexSchema$1.create( at
    org.apache.solr.schema.IndexSchema$1.create( at
    org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoad at
    org.apache.solr.schema.IndexSchema.readSchema( at
    org.apache.solr.schema.IndexSchema.( at
    org.apache.solr.core.SolrCore.( at
    org.apache.solr.core.SolrCore.getSolrCore( at
    68) at
    org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFi at
    org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(Applicatio at
    org.apache.catalina.core.ApplicationFilterConfig.(ApplicationFilte at
    a:3635) at
    ) at
    va:760) at
    at org.apache.catalina.core.StandardHost.addChild(
    626) at
    :553) at
    at org.apache.catalina.startup.HostConfig.start( at
    1) at
    org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSu at
    org.apache.catalina.core.ContainerBase.start( at
    org.apache.catalina.core.StandardHost.start( at
    org.apache.catalina.core.ContainerBase.start( at
    at org.apache.catalina.startup.Catalina.start( at
    sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
    sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at
    sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at
    java.lang.reflect.Method.invoke(Unknown Source) at
    org.apache.catalina.startup.Bootstrap.start( at

  7. PeWu Says:


    I have successfully used this approach (ICU library):


  8. spl chars ~`!@#$%^&*()_+{}|:;"'?/ Says:

    FilterFactoryspl chars ~`!@#$%^&*()_+{}|:;”‘?/

    test message ¢ £
