You are here

README.txt in Feed Import 7.2

Same filename and directory in other branches
  1. 8 README.txt
  2. 7.3 README.txt
  3. 7 README.txt
FEED IMPORT

Project page: http://drupal.org/project/feed_import
Examples: http://drupal.org/node/1360374

------------------------------
Features
------------------------------

  -easy to use interface
  -alternative xpaths support and default value
  -ignore field & skip item import
  -multi value fields support
  -pre-filters & filters
  -some usefull provided filters
  -auto-import/delete at cron
  -import in specified time interval
  -import/export feed configuration
  -reports
  -add taxonomy terms to field (can add new terms)
  -process HTML pages
  -process CSV files
  -process JSON string
  -add image to field (used as filter)
  -custom settings on each feed process function
  -do not save info about imported items (usually used for one-time import)

------------------------------
About Feed Import
------------------------------

Feed Import module allows you to import content from XML/HTML/CSV files into
entities (like node, user, ...) using XPATH to fetch whatever you need.
You can create a new feed using php code in your module or you can use the
provided UI (recommended). If you have regular imports you can enable import
to run at cron. Now Feed Import provides six methods to process files:
    Normal  - loads the XML file with simplexml_load_file() and parses
              it's content. This method isn't good for huge files because
              needs very much memory.
    Chunked - gets chunks from XML file and recompose each item. This is a good
              method to import huge xml files. This cannot be used if parent
              xpath is based on properties. You cannot use parent xpath like
              //category[@type="new"]/items
              or
              //category/items[@posted="today"]
              but you can use xpaths like
              //category/items.
              Fields xpaths are normally.
    Reader  - reades xml file node by node and imports it. The parent xpath is
              limited to one attribute (the complex is something like
              //tagname[@attribute="value for this attribute"]).
    HTML    - converts HTML document to xml and then is imported like a normal
              xml file.
    CSV     - loads content line by line and imports it.
    JSON    - converts json to xml before importing it.

------------------------------
How Feed Import works
------------------------------

Step 1: Downloading xml file and creating items

  -if we selected processXML function for processing this feed then all
   xml file is loaded. We apply parent xpath, we create entity objects and we
   should have all items in an array.
  -if we selected processXMLChunked function for processing then xml file is
   read in chunks. When we have an item we create the SimpleXMLElement object
   and we create entity object. We delete from memory content read so far and we
   repeat process until all xml content is processed.
  -if we selected processXmlReader then xml is read node by node and imported.
  -if we selected processHTMLPage function then HTML is converted to XML and
   imported like processXML.
  -if we selected processCSV function then file is read line by line and
   imported.
  -if we selected processJSON then json will be converted to xml and imported.
  -if we selected another process function then we should take a look at that
   function

Step 2: Creating entities

Well this step is contained in Step 1 to create entity objects from
SimpleXMLElement objects using feed info:
We generate an unique hash for item using unique xpath from feed. Then for each
field in feed we apply xpaths until one xpath passes pre-filter. If there is an
xpath that passed we take the value and filter it. If filtered value is empty
(or isn't a value) we use default action/value. In this mode we can have
alternative xpaths. Example:

<Friends>
  <Friend type="bestfriend">Jerry</Friend>
  <Friend type="normal">Tom</Friend>
</Friends>

Here we can use the following xpaths to take friend name:
Friends/Friend[@type="bestfriend"]
Friends/Friend[@type="normal"]

If bestfriend is missing then we go to normal friend. If normal friend is
missing too, we can specify a default value like "Forever alone".

Step 3: Saving/Updating entities

First we get the IDs of generated hashes to see if we need to create a new
entity or just to update it.
For each object filled with data earlier we check the hash:
  -if hash is in IDs list then we check if entity data changed to see if we have
   to save changes or just to update the expire time.
  -if hash isn't in list then we create a new entity and hash needs to be
   inserted in database.
If is a one-time import then none of the items info is saved. This can produce
duplicate content.

Feed Import can add multiple values to fields which support this. For example
above we need only one xpath:
Friends/Friend
and both Tom and Jerry will be inserted in field values, which is great.
Now you can specify not only the column value but the entire array of values by
returning an object from filter functions.

Expire time is used to automatically delete entities (at cron) if they are
missing from feed for more than X seconds.
Expire time is updated for each item in feed. For performance reasons we make a
query for X items at once to update or insert.

------------------------------
Using Feed Import UI
------------------------------

First, navigate to admin/config/services/feed_import. You can change global
settings using "Settings" link. To add a new feed click "Add new feed" link and
fill the form with desired data. After you saved feed click "Edit" link from
operations column. Now at the bottom is a fieldset with XPATH settings. Add
XPATH for required item parent and unique id (you can now save feed). To add a
new field choose one from "Add new field" select and click "Add selected field"
button. A fieldset with field settings appeared and you can enter xpath(s) and
default action/value. If you wish you can add another field and when you are
done click "Save feed" submit button.
Check if all fields are ok. If you want to (pre)filter values select
"Edit (pre)filter" tab. You can see a fieldset for each selected field. Click
"Add new filter" button for desired field to add a new filter. Enter unique
filter name per field (this can be anything that let you quickly identify
filter), enter function name (any php function, even static functions
ClassName::functionName) and enter parameters for function, one per line.
To send field value as parameter enter [field] in line. There are some static
filter functions in feed_import_filter.inc.php file >> class FeedImportFilter
that you can use. Please take a look. I'll add more soon.
If you want to change [field] with somenthing else go to Settings.
You can add/remove any filters you want but don't forget to click "Save filters"
submit button to save all.
Now you can enable feed and test it.

------------------------------
Feed Import API
------------------------------

If you want, you can use your own function to parse content. To do that you have
to implement hook_feed_import_process_info() which returns an array keyed by
function alias and with value of another array containing following keys:
  function => Function name for processing. If function is a static member of a
              class then value is an array containing class name and function
              name. This function gets as parameter an array with feed info.
  settings => An array containing settings for process function keyed by setting
              name. A setting is an array with following keys and values:
                title       =>  This is displayed like textfield title.
                description =>  This is displayed as textfield description.
                default     =>  This is default SCALAR value for setting.
              If function doesn't need user settings then this is must be an
              empty array.
  validate => A function name (like "function" key above) which will validate
              all settings. Function gets as parameters: setting name, value and
              default value. If setting value isn't valid then function must
              return default value else return value. This function is called
              for every setting so you may have to use a switch statement.
              If you don't want to use this set it to NULL.
  info     => Some text describing process function, settings and others.

Please note that in process function EVERY possible exception MUST BE CAUGHT!

Example:
function hook_feed_import_process_info() {
  return array(
    'processFeedSuperFast' => array(
      'function' => 'php_process_function_name',
      'settings' => array(
        'my_setting' => array(
          'title' => t('Setting title'),
          'description' => t('My setting description'),
          'default' => 128,
        ),
        'other_setting' => array(
          'title' => t('The other setting'),
          'description' => t('Description for setting'),
          'default' => 'abcd',
        ),
        // Other settings...
      ),
      'validate' => 'php_process_function_validate',
      'info' => t('About this process function.'),
    ),
    // Other functions ...
  );
}

Every function is called with a parameter containing feed info and must return
an array of objects (stdClass). For example above we will have:

/**
 * Process feed function.
 */
function php_process_function_name(array $feed) {
  $items = array();
  // We can use settings like this:
  $my_setting = $feed['xpath']['settings']['my_setting'];
  $other_setting = $feed['xpath']['settings']['other_setting'];

  // ...
  // Here process feed items.
  // ...
  return $items;
}

/**
 * Callback for validate.
 */
function php_process_function_validate($name, $value, $default) {
  switch ($name) {
    case 'my_setting':
      if ($value < 10) {
        // Well, if you want you can use drupal_set_message('msg', 'warning') to
        // tell user that his value isn't valid
        drupal_set_message(t('Value must be at least 10!'), 'warning');
        return $default;
      }
      break;
    case 'other_setting':
      if ($value = 'xyz') {
        $value = 'test';
      }
      break;
  }
  return $value;
}

Please check source code for a good example.

------------------------------
Feed info structure
------------------------------

Feed info is an array containing all info about feeds: name, url, xpath keyed
by feed name.
A feed is an array containing the following keys:

id => This is feed unique id

machine_name => This is feed unique machine name

name => This is feed name

enabled => Shows if feed is enabled or not. Enabled feeds are processed at cron
           if import at cron option is activated from settings page.

url => URL to xml file. To avoid problems use an absolute url.

time => This contains feed items lifetime. If 0 then items are kept forever else
        items will be deleted after this time is elapse and they don't exist
        in xml file anymore. On each import existing items will be rescheduled.

entity_info => This is an array containing two elements

  #entity => Entity name like node, user, ...

  #table_pk => This is entity's table index. For node is nid, for
               user si uid, ...


xpath => This is an array containing xpath info and fields

  #root => This is XPATH to parent item. Every xpath query will run in
           this context.

  #uniq => This is XPATH (relative to #root xpath) to a unique value
           which identify the item. Resulted value is used to create a
           hash for item so this must be unique per item.

  #process_function => This is function alias used to process xml file.
                       See documentation above about process functions.

  #settings => This is an array containing settings for process function keyed
               by setting name.

  #items => This is an array containing xpath for fields and filters keyed by
            field name.
    [field_name] => An array containing info about field, xpath, filters

      #field => This is field name

      #column => This is column in field table. For body is value, for taxonomy
                 is tid and so on. If this field is a column in entity field
                 then this must be NULL.

      #xpath => This is an array containig xpaths for this field. Xpaths are
                used from first to last until one passes pre-filter functions.
                All xpaths are relative to #root.

      #default_value => This is default value for field if none of xpaths passes
                        pre-filter functions. This is used only for
                        default_value and default_value_filtered actions.

      #default_action => Can be one of (see FeedImport::getDefaultActions()):
          default_value           -field will have this value
          default_value_filtered  -field will have this value after was filtered
          ignore_field            -field will have no value
          skip_item               -item will not be imported

      #filter => An array containing filters info keyed by filter name
        [filter_name] => An array containing filter function and params
          #function => This is function name. Can also be a public static
                       function from a class with value ClassName::functionName
          #params => An array of parameters which #function recives. You can use
                     [field] (this value can be changed from settings page) to
                     send current field value as parameter.

      #pre_filter => Same as filter, but these functions are used to pre-filter
                     values to see if we have to choose an alternative xpath.


To see a real feed array, first create some feeds using UI and then you can use
code below to print its structure:
$feeds = FeedImport::loadFeeds();
drupal_set_message('<pre>' . print_r($feeds, TRUE) . '</pre>');

If you want to modify feed just before import process you can implement
hook_feed_import_feed_info_alter(&$feed) hook.

File

README.txt
View source
  1. FEED IMPORT
  2. Project page: http://drupal.org/project/feed_import
  3. Examples: http://drupal.org/node/1360374
  4. ------------------------------
  5. Features
  6. ------------------------------
  7. -easy to use interface
  8. -alternative xpaths support and default value
  9. -ignore field & skip item import
  10. -multi value fields support
  11. -pre-filters & filters
  12. -some usefull provided filters
  13. -auto-import/delete at cron
  14. -import in specified time interval
  15. -import/export feed configuration
  16. -reports
  17. -add taxonomy terms to field (can add new terms)
  18. -process HTML pages
  19. -process CSV files
  20. -process JSON string
  21. -add image to field (used as filter)
  22. -custom settings on each feed process function
  23. -do not save info about imported items (usually used for one-time import)
  24. ------------------------------
  25. About Feed Import
  26. ------------------------------
  27. Feed Import module allows you to import content from XML/HTML/CSV files into
  28. entities (like node, user, ...) using XPATH to fetch whatever you need.
  29. You can create a new feed using php code in your module or you can use the
  30. provided UI (recommended). If you have regular imports you can enable import
  31. to run at cron. Now Feed Import provides six methods to process files:
  32. Normal - loads the XML file with simplexml_load_file() and parses
  33. it's content. This method isn't good for huge files because
  34. needs very much memory.
  35. Chunked - gets chunks from XML file and recompose each item. This is a good
  36. method to import huge xml files. This cannot be used if parent
  37. xpath is based on properties. You cannot use parent xpath like
  38. //category[@type="new"]/items
  39. or
  40. //category/items[@posted="today"]
  41. but you can use xpaths like
  42. //category/items.
  43. Fields xpaths are normally.
  44. Reader - reades xml file node by node and imports it. The parent xpath is
  45. limited to one attribute (the complex is something like
  46. //tagname[@attribute="value for this attribute"]).
  47. HTML - converts HTML document to xml and then is imported like a normal
  48. xml file.
  49. CSV - loads content line by line and imports it.
  50. JSON - converts json to xml before importing it.
  51. ------------------------------
  52. How Feed Import works
  53. ------------------------------
  54. Step 1: Downloading xml file and creating items
  55. -if we selected processXML function for processing this feed then all
  56. xml file is loaded. We apply parent xpath, we create entity objects and we
  57. should have all items in an array.
  58. -if we selected processXMLChunked function for processing then xml file is
  59. read in chunks. When we have an item we create the SimpleXMLElement object
  60. and we create entity object. We delete from memory content read so far and we
  61. repeat process until all xml content is processed.
  62. -if we selected processXmlReader then xml is read node by node and imported.
  63. -if we selected processHTMLPage function then HTML is converted to XML and
  64. imported like processXML.
  65. -if we selected processCSV function then file is read line by line and
  66. imported.
  67. -if we selected processJSON then json will be converted to xml and imported.
  68. -if we selected another process function then we should take a look at that
  69. function
  70. Step 2: Creating entities
  71. Well this step is contained in Step 1 to create entity objects from
  72. SimpleXMLElement objects using feed info:
  73. We generate an unique hash for item using unique xpath from feed. Then for each
  74. field in feed we apply xpaths until one xpath passes pre-filter. If there is an
  75. xpath that passed we take the value and filter it. If filtered value is empty
  76. (or isn't a value) we use default action/value. In this mode we can have
  77. alternative xpaths. Example:
  78. Jerry
  79. Tom
  80. Here we can use the following xpaths to take friend name:
  81. Friends/Friend[@type="bestfriend"]
  82. Friends/Friend[@type="normal"]
  83. If bestfriend is missing then we go to normal friend. If normal friend is
  84. missing too, we can specify a default value like "Forever alone".
  85. Step 3: Saving/Updating entities
  86. First we get the IDs of generated hashes to see if we need to create a new
  87. entity or just to update it.
  88. For each object filled with data earlier we check the hash:
  89. -if hash is in IDs list then we check if entity data changed to see if we have
  90. to save changes or just to update the expire time.
  91. -if hash isn't in list then we create a new entity and hash needs to be
  92. inserted in database.
  93. If is a one-time import then none of the items info is saved. This can produce
  94. duplicate content.
  95. Feed Import can add multiple values to fields which support this. For example
  96. above we need only one xpath:
  97. Friends/Friend
  98. and both Tom and Jerry will be inserted in field values, which is great.
  99. Now you can specify not only the column value but the entire array of values by
  100. returning an object from filter functions.
  101. Expire time is used to automatically delete entities (at cron) if they are
  102. missing from feed for more than X seconds.
  103. Expire time is updated for each item in feed. For performance reasons we make a
  104. query for X items at once to update or insert.
  105. ------------------------------
  106. Using Feed Import UI
  107. ------------------------------
  108. First, navigate to admin/config/services/feed_import. You can change global
  109. settings using "Settings" link. To add a new feed click "Add new feed" link and
  110. fill the form with desired data. After you saved feed click "Edit" link from
  111. operations column. Now at the bottom is a fieldset with XPATH settings. Add
  112. XPATH for required item parent and unique id (you can now save feed). To add a
  113. new field choose one from "Add new field" select and click "Add selected field"
  114. button. A fieldset with field settings appeared and you can enter xpath(s) and
  115. default action/value. If you wish you can add another field and when you are
  116. done click "Save feed" submit button.
  117. Check if all fields are ok. If you want to (pre)filter values select
  118. "Edit (pre)filter" tab. You can see a fieldset for each selected field. Click
  119. "Add new filter" button for desired field to add a new filter. Enter unique
  120. filter name per field (this can be anything that let you quickly identify
  121. filter), enter function name (any php function, even static functions
  122. ClassName::functionName) and enter parameters for function, one per line.
  123. To send field value as parameter enter [field] in line. There are some static
  124. filter functions in feed_import_filter.inc.php file >> class FeedImportFilter
  125. that you can use. Please take a look. I'll add more soon.
  126. If you want to change [field] with somenthing else go to Settings.
  127. You can add/remove any filters you want but don't forget to click "Save filters"
  128. submit button to save all.
  129. Now you can enable feed and test it.
  130. ------------------------------
  131. Feed Import API
  132. ------------------------------
  133. If you want, you can use your own function to parse content. To do that you have
  134. to implement hook_feed_import_process_info() which returns an array keyed by
  135. function alias and with value of another array containing following keys:
  136. function => Function name for processing. If function is a static member of a
  137. class then value is an array containing class name and function
  138. name. This function gets as parameter an array with feed info.
  139. settings => An array containing settings for process function keyed by setting
  140. name. A setting is an array with following keys and values:
  141. title => This is displayed like textfield title.
  142. description => This is displayed as textfield description.
  143. default => This is default SCALAR value for setting.
  144. If function doesn't need user settings then this is must be an
  145. empty array.
  146. validate => A function name (like "function" key above) which will validate
  147. all settings. Function gets as parameters: setting name, value and
  148. default value. If setting value isn't valid then function must
  149. return default value else return value. This function is called
  150. for every setting so you may have to use a switch statement.
  151. If you don't want to use this set it to NULL.
  152. info => Some text describing process function, settings and others.
  153. Please note that in process function EVERY possible exception MUST BE CAUGHT!
  154. Example:
  155. function hook_feed_import_process_info() {
  156. return array(
  157. 'processFeedSuperFast' => array(
  158. 'function' => 'php_process_function_name',
  159. 'settings' => array(
  160. 'my_setting' => array(
  161. 'title' => t('Setting title'),
  162. 'description' => t('My setting description'),
  163. 'default' => 128,
  164. ),
  165. 'other_setting' => array(
  166. 'title' => t('The other setting'),
  167. 'description' => t('Description for setting'),
  168. 'default' => 'abcd',
  169. ),
  170. // Other settings...
  171. ),
  172. 'validate' => 'php_process_function_validate',
  173. 'info' => t('About this process function.'),
  174. ),
  175. // Other functions ...
  176. );
  177. }
  178. Every function is called with a parameter containing feed info and must return
  179. an array of objects (stdClass). For example above we will have:
  180. /**
  181. * Process feed function.
  182. */
  183. function php_process_function_name(array $feed) {
  184. $items = array();
  185. // We can use settings like this:
  186. $my_setting = $feed['xpath']['settings']['my_setting'];
  187. $other_setting = $feed['xpath']['settings']['other_setting'];
  188. // ...
  189. // Here process feed items.
  190. // ...
  191. return $items;
  192. }
  193. /**
  194. * Callback for validate.
  195. */
  196. function php_process_function_validate($name, $value, $default) {
  197. switch ($name) {
  198. case 'my_setting':
  199. if ($value < 10) {
  200. // Well, if you want you can use drupal_set_message('msg', 'warning') to
  201. // tell user that his value isn't valid
  202. drupal_set_message(t('Value must be at least 10!'), 'warning');
  203. return $default;
  204. }
  205. break;
  206. case 'other_setting':
  207. if ($value = 'xyz') {
  208. $value = 'test';
  209. }
  210. break;
  211. }
  212. return $value;
  213. }
  214. Please check source code for a good example.
  215. ------------------------------
  216. Feed info structure
  217. ------------------------------
  218. Feed info is an array containing all info about feeds: name, url, xpath keyed
  219. by feed name.
  220. A feed is an array containing the following keys:
  221. id => This is feed unique id
  222. machine_name => This is feed unique machine name
  223. name => This is feed name
  224. enabled => Shows if feed is enabled or not. Enabled feeds are processed at cron
  225. if import at cron option is activated from settings page.
  226. url => URL to xml file. To avoid problems use an absolute url.
  227. time => This contains feed items lifetime. If 0 then items are kept forever else
  228. items will be deleted after this time is elapse and they don't exist
  229. in xml file anymore. On each import existing items will be rescheduled.
  230. entity_info => This is an array containing two elements
  231. #entity => Entity name like node, user, ...
  232. #table_pk => This is entity's table index. For node is nid, for
  233. user si uid, ...
  234. xpath => This is an array containing xpath info and fields
  235. #root => This is XPATH to parent item. Every xpath query will run in
  236. this context.
  237. #uniq => This is XPATH (relative to #root xpath) to a unique value
  238. which identify the item. Resulted value is used to create a
  239. hash for item so this must be unique per item.
  240. #process_function => This is function alias used to process xml file.
  241. See documentation above about process functions.
  242. #settings => This is an array containing settings for process function keyed
  243. by setting name.
  244. #items => This is an array containing xpath for fields and filters keyed by
  245. field name.
  246. [field_name] => An array containing info about field, xpath, filters
  247. #field => This is field name
  248. #column => This is column in field table. For body is value, for taxonomy
  249. is tid and so on. If this field is a column in entity field
  250. then this must be NULL.
  251. #xpath => This is an array containig xpaths for this field. Xpaths are
  252. used from first to last until one passes pre-filter functions.
  253. All xpaths are relative to #root.
  254. #default_value => This is default value for field if none of xpaths passes
  255. pre-filter functions. This is used only for
  256. default_value and default_value_filtered actions.
  257. #default_action => Can be one of (see FeedImport::getDefaultActions()):
  258. default_value -field will have this value
  259. default_value_filtered -field will have this value after was filtered
  260. ignore_field -field will have no value
  261. skip_item -item will not be imported
  262. #filter => An array containing filters info keyed by filter name
  263. [filter_name] => An array containing filter function and params
  264. #function => This is function name. Can also be a public static
  265. function from a class with value ClassName::functionName
  266. #params => An array of parameters which #function recives. You can use
  267. [field] (this value can be changed from settings page) to
  268. send current field value as parameter.
  269. #pre_filter => Same as filter, but these functions are used to pre-filter
  270. values to see if we have to choose an alternative xpath.
  271. To see a real feed array, first create some feeds using UI and then you can use
  272. code below to print its structure:
  273. $feeds = FeedImport::loadFeeds();
  274. drupal_set_message('
    ' . print_r($feeds, TRUE) . '
    ');
  275. If you want to modify feed just before import process you can implement
  276. hook_feed_import_feed_info_alter(&$feed) hook.